MegaContext Storage Format

This document describes the on-disk storage format for the MegaContext Tree, including binary layouts, headers, compression strategies, and access patterns.


Overview

The MegaContext Tree persists to disk as a set of binary files with deterministic layout:

  • LOD0.ctx - Raw token IDs (vocabulary indices)
  • LOD1.ctx - Level 1 gist vectors
  • LOD2.ctx - Level 2 gist vectors
  • metadata.json - Tree metadata and configuration

All files use a common binary format with self-describing headers.


Binary File Format

File Header (64 bytes)

Every .ctx file begins with a 64-byte header:

Offset  Field            Type      Size    Meaning
------  -----            ----      ----    -------
0       magic            uint32    4       0x4D434354 ("MCCT")
4       version          uint16    2       Format revision (start at 1)
6       level            uint16    2       0 (LOD0), 1 (LOD1), or 2 (LOD2)
8       block_size       uint16    2       K = 32
10      embedding_dim    uint16    2       Base model embedding width d
12      dtype_code       uint16    2       0=uint32, 1=fp16, 2=bf16, 3=fp32
14      num_entries      uint64    8       Total entries in this file
22      model_name       char[32]  32      "SmolLM3-3B", etc.
54      reserved         10 bytes  10      Zeroed for future use

Magic number: 0x4D434354 (“MCCT” in ASCII) identifies MegaContext tree files

Version: Format revision for backward compatibility

  • Version 1: Initial POC format
  • Future versions may add compression, encryption, etc.

Dtype codes:

  • 0 = uint32 (token IDs in LOD0.ctx)
  • 1 = fp16 (half-precision floats for gists)
  • 2 = bf16 (brain-float 16 for gists)
  • 3 = fp32 (full precision, rarely used due to size)
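
The header can be packed and parsed with Python's struct module. A sketch assuming little-endian byte order (the format description above does not pin down endianness); "<" disables struct padding so num_entries lands at offset 14 exactly as specified:

```python
import struct

HEADER_FORMAT = "<IHHHHHQ32s10s"               # matches the field table above
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)   # 64 bytes
MAGIC = 0x4D434354                             # "MCCT"

def pack_header(level, embedding_dim, dtype_code, num_entries, model_name):
    return struct.pack(
        HEADER_FORMAT,
        MAGIC,
        1,             # version 1: initial POC format
        level,         # 0 (LOD0), 1 (LOD1), or 2 (LOD2)
        32,            # block_size K
        embedding_dim,
        dtype_code,    # 0=uint32, 1=fp16, 2=bf16, 3=fp32
        num_entries,
        model_name.encode("ascii"),  # struct NUL-pads char[32]
        b"",                          # reserved: zero-filled by struct
    )

def unpack_header(raw):
    magic, version, level, block_size, dim, dtype_code, n, name, _ = \
        struct.unpack(HEADER_FORMAT, raw)
    assert magic == MAGIC, "not a MegaContext .ctx file"
    return {"version": version, "level": level, "block_size": block_size,
            "embedding_dim": dim, "dtype_code": dtype_code,
            "num_entries": n, "model_name": name.rstrip(b"\x00").decode("ascii")}
```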

LOD0.ctx: Token Storage

Content: Raw token IDs from the base model’s vocabulary

Layout:

[64-byte header]
[token_0] [token_1] [token_2] ... [token_N-1]

Entry format: Each token is a uint32 (4 bytes)

Offset calculation:

def get_l0_token_offset(block_index: int, token_index: int) -> int:
    """Get byte offset for a specific token."""
    header_size = 64
    block_size = 32  # K
    absolute_token = block_index * block_size + token_index
    return header_size + (absolute_token * 4)  # 4 bytes per uint32

Example:

# Block 10, token 5
offset = 64 + ((10 * 32 + 5) * 4)
       = 64 + (325 * 4)
       = 64 + 1300
       = 1364 bytes

Access pattern: Sequential reads for ingestion, random access for Working Context assembly


LOD1.ctx / LOD2.ctx: Gist Storage

Content: Compressed gist embeddings generated by GistNet

Layout:

[64-byte header]
[gist_0: d-dimensional vector]
[gist_1: d-dimensional vector]
...
[gist_N-1: d-dimensional vector]

Entry format: Each gist is a d-dimensional vector (typically d=2048 or d=4096)

  • Stored as fp16 (2 bytes per dimension) or bf16 in POC
  • Future: quantized formats (int8, int4) for further compression

Offset calculation:

def get_gist_offset(gist_index: int, embedding_dim: int, dtype_bytes: int) -> int:
    """Get byte offset for a specific gist."""
    header_size = 64
    gist_size_bytes = embedding_dim * dtype_bytes  # e.g., 2048 * 2 = 4096 for fp16
    return header_size + (gist_index * gist_size_bytes)

Example (d=2048, fp16):

# Gist 42 in LOD1.ctx
embedding_dim = 2048
dtype_bytes = 2  # fp16
offset = 64 + (42 * 2048 * 2)
       = 64 + (42 * 4096)
       = 64 + 172032
       = 172096 bytes

Access pattern:

  • Sequential writes during GistNet compression
  • Random reads during Working Context assembly
  • Cache-friendly due to fixed stride

Deterministic Offsets

The storage format enables O(1) offset calculation without external indexes:

Benefits

  1. No index overhead: File offsets computed from (level, block_id, embedding_dim)
  2. Memory-mapped I/O: Can mmap() files for zero-copy access
  3. Partial loading: Read only needed gists, not entire file
  4. Concurrent access: Multiple readers without coordination

Implementation

import numpy as np

class MegaContextTreeStorage:
    def __init__(self, base_path: str, embedding_dim: int):
        self.base_path = base_path
        self.embedding_dim = embedding_dim
        # "rb+" allows in-place updates (e.g. gist refresh) as well as reads
        self.l0_file = open(f"{base_path}/LOD0.ctx", "rb+")
        self.l1_file = open(f"{base_path}/LOD1.ctx", "rb+")
        self.l2_file = open(f"{base_path}/LOD2.ctx", "rb+")

    def read_l0_block(self, block_id: int) -> np.ndarray:
        """Read one block of 32 token IDs (K = 32)."""
        offset = 64 + (block_id * 32 * 4)  # header + block stride
        self.l0_file.seek(offset)
        return np.fromfile(self.l0_file, dtype=np.uint32, count=32)

    def read_l1_gist(self, gist_id: int) -> np.ndarray:
        """Read one LOD1 gist vector (stored as fp16)."""
        offset = 64 + (gist_id * self.embedding_dim * 2)  # 2 bytes per dim
        self.l1_file.seek(offset)
        return np.fromfile(self.l1_file, dtype=np.float16, count=self.embedding_dim)
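
A minimal round-trip sketch: write a tiny LOD0.ctx (header plus uint32 tokens), then read one block back with the same offset arithmetic as read_l0_block. The write-side helper is hypothetical; the POC's actual writer may differ:

```python
import struct
import numpy as np

def write_l0_file(path: str, token_ids) -> None:
    """Write a minimal LOD0.ctx: 64-byte header (partially filled) + uint32 tokens."""
    header = bytearray(64)  # model_name/reserved fields left zeroed for brevity
    struct.pack_into("<IHHHHHQ", header, 0,
                     0x4D434354, 1, 0, 32, 2048, 0, len(token_ids))
    with open(path, "wb") as f:
        f.write(header)
        f.write(np.asarray(token_ids, dtype=np.uint32).tobytes())

def read_block(path: str, block_id: int) -> np.ndarray:
    """Read one 32-token block using the deterministic offset formula."""
    with open(path, "rb") as f:
        f.seek(64 + block_id * 32 * 4)
        return np.frombuffer(f.read(32 * 4), dtype=np.uint32)
```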

Metadata File

Format: JSON for human readability in POC

Location: {base_path}/metadata.json

Content:

{
  "version": 1,
  "created_at": "2025-01-15T10:30:00Z",
  "model_name": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
  "embedding_dim": 2048,
  "block_size": 32,
  "levels": {
    "LOD0": {
      "num_blocks": 31250,
      "num_tokens": 1000000,
      "file_size_bytes": 4000064
    },
    "LOD1": {
      "num_gists": 31250,
      "file_size_bytes": 128000064
    },
    "LOD2": {
      "num_gists": 976,
      "file_size_bytes": 3997760
    }
  },
  "gistnet_checkpoint": "checkpoints/gistnet_epoch10.pt",
  "ingestion_complete": true,
  "last_modified": "2025-01-15T12:45:00Z"
}
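
The counts and sizes in this example are internally consistent; a quick sanity check (a sketch, assuming the field names shown above):

```python
def check_metadata(meta: dict) -> None:
    """Assert that per-level file sizes follow from entry counts (64-byte headers)."""
    K = meta["block_size"]
    d = meta["embedding_dim"]
    l0 = meta["levels"]["LOD0"]
    assert l0["num_blocks"] == l0["num_tokens"] // K
    assert l0["file_size_bytes"] == 64 + l0["num_tokens"] * 4           # uint32 tokens
    for name in ("LOD1", "LOD2"):
        lvl = meta["levels"][name]
        assert lvl["file_size_bytes"] == 64 + lvl["num_gists"] * d * 2  # fp16 gists
```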

Storage Size Analysis

Example: 1M Token Corpus

Level     Entries         Entry Size            Total Size
-----     -------         ----------            ----------
LOD0      1M tokens       4 bytes               4 MB
LOD1      31,250 gists    4 KB (2048 × fp16)    128 MB
LOD2      976 gists       4 KB                  4 MB
Metadata  1 file          ~2 KB                 2 KB
Total                                           136 MB

Tree overhead: LOD1+LOD2 add ~132 MB on top of the 4 MB of raw token IDs. That looks large next to the tokens themselves, but it is only ~3.2% of the ~4 GB required to store a full fp16 embedding (2048 × 2 bytes) for every token.

Scaling

Corpus Size   LOD0 Size   Tree Size (fp16)   Tree Size (int8)
-----------   ---------   ----------------   ----------------
10k tokens    40 KB       1.3 MB             650 KB
100k tokens   400 KB      13 MB              6.5 MB
1M tokens     4 MB        136 MB             68 MB
10M tokens    40 MB       1.36 GB            680 MB
100M tokens   400 MB      13.6 GB            6.8 GB
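
The per-level sizes behind these rows can be reproduced with a small calculator (a sketch; headers and metadata ignored, gist counts rounded down as in the 1M-token example):

```python
def tree_sizes(num_tokens: int, embedding_dim: int = 2048,
               K: int = 32, bytes_per_dim: int = 2) -> dict:
    """Approximate bytes per level for a corpus of num_tokens (fp16 by default)."""
    n1 = num_tokens // K          # LOD1 gists: one per K-token block
    n2 = n1 // K                  # LOD2 gists: one per K LOD1 gists
    gist = embedding_dim * bytes_per_dim
    return {"LOD0": num_tokens * 4, "LOD1": n1 * gist, "LOD2": n2 * gist}
```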

See Performance Sketch for detailed storage analysis.


Compression Strategies

Block Compression (Future)

Apply per-file compression:

[64-byte header]
[compressed payload]

Algorithms:

  • zstd (level 3): ~2–3× compression, fast decompression
  • LZ4: ~1.5–2× compression, very fast
  • brotli (level 6): ~3–4× compression, slower

Trade-offs:

  • Loses deterministic offsets (requires index)
  • Adds decompression latency (~5–10 ms for 1 MB)
  • Best for cold storage, not hot Working Context assembly
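
The header-plaintext/payload-compressed layout can be sketched as below. zlib stands in here only because zstd, LZ4, and brotli are not in the Python standard library; the principle is identical:

```python
import zlib

def compress_payload(raw_file_bytes: bytes) -> bytes:
    """Keep the 64-byte header plaintext; compress everything after it."""
    header, payload = raw_file_bytes[:64], raw_file_bytes[64:]
    return header + zlib.compress(payload, 3)

def decompress_payload(compressed_bytes: bytes) -> bytes:
    """Inverse of compress_payload: header is untouched, payload inflated."""
    header, payload = compressed_bytes[:64], compressed_bytes[64:]
    return header + zlib.decompress(payload)
```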

Quantization

Reduce gist precision:

Format   Bits/dim   Compression     Quality Loss
------   --------   -----------     ------------
fp32     32         1× (baseline)   None
fp16     16         2×              Negligible
bf16     16         2×              Slightly worse than fp16
int8     8          4×              ~0.5% ΔNLL increase
int4     4          8×              ~2% ΔNLL increase

POC: Uses fp16 for good quality/size balance

Future: Hybrid approach—fp16 for recent gists, int8 for old ones
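
For illustration, one way int8 quantization of a gist could look: symmetric per-vector scaling, with the max absolute value mapped to 127. This is a sketch of the general technique, not the scheme the POC uses (the POC stores fp16):

```python
import numpy as np

def quantize_int8(gist: np.ndarray):
    """Symmetric per-vector int8 quantization; returns codes and a scale factor."""
    scale = max(float(np.abs(gist).max()), 1e-12) / 127.0
    q = np.clip(np.round(gist / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate fp32 gist from int8 codes."""
    return q.astype(np.float32) * scale
```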


Memory-Mapped I/O

Benefits

  • Zero-copy access: No read() syscalls, direct memory access
  • Lazy loading: OS pages in data only when accessed
  • Shared memory: Multiple processes can share same mapping

Implementation

import mmap
import numpy as np

class MMapTreeStorage:
    def __init__(self, base_path: str):
        self.l0_file = open(f"{base_path}/LOD0.ctx", "rb")
        self.l0_mmap = mmap.mmap(
            self.l0_file.fileno(),
            length=0,  # map the entire file
            access=mmap.ACCESS_READ
        )
        # Similar for LOD1, LOD2

    def read_l0_block(self, block_id: int) -> np.ndarray:
        offset = 64 + (block_id * 32 * 4)
        token_bytes = self.l0_mmap[offset:offset+128]  # 32 tokens × 4 bytes
        return np.frombuffer(token_bytes, dtype=np.uint32)

When to use:

  • Large trees (>1 GB) that don’t fit in RAM
  • Random access patterns (Working Context assembly)
  • Not needed for POC (RAM-resident)

Checkpointing & Versioning

Snapshots

Periodically save complete tree state:

checkpoints/
  2025-01-15_10-30-00/
    LOD0.ctx
    LOD1.ctx
    LOD2.ctx
    metadata.json

Use cases:

  • Rollback after corrupted ingestion
  • A/B testing different GistNet checkpoints
  • Reproducible experiments
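
A snapshot along these lines is a directory copy into a timestamped folder (snapshot_tree and its arguments are hypothetical names, matching the layout above):

```python
import shutil
from datetime import datetime, timezone

def snapshot_tree(base_path: str, checkpoints_dir: str) -> str:
    """Copy the live tree (all .ctx files + metadata.json) into a timestamped dir."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    dest = f"{checkpoints_dir}/{stamp}"
    shutil.copytree(base_path, dest)  # fails if dest exists: snapshots are immutable
    return dest
```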

Incremental Updates

When GistNet is retrained:

  1. Read old gist from LOD1.ctx
  2. Recompute with new GistNet
  3. Write updated gist to same offset
  4. Update gistnet_checkpoint in metadata

Versioning: Track which GistNet checkpoint generated each gist
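
Steps 2-3 amount to an in-place rewrite at the gist's deterministic offset. A sketch (refresh_gist and new_gistnet are hypothetical; any callable producing a d-dimensional vector stands in for the retrained GistNet):

```python
import numpy as np

def refresh_gist(l1_path: str, gist_id: int, embedding_dim: int,
                 new_gistnet, block_tokens) -> None:
    """Recompute one LOD1 gist and overwrite it at the same offset (fp16)."""
    offset = 64 + gist_id * embedding_dim * 2
    new_gist = np.asarray(new_gistnet(block_tokens), dtype=np.float16)
    with open(l1_path, "r+b") as f:  # r+b: update in place without truncating
        f.seek(offset)
        f.write(new_gist.tobytes())
```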


Future Enhancements

Distributed Storage

Shard tree across multiple nodes:

node_0: LOD0.ctx [blocks 0-10000]
node_1: LOD0.ctx [blocks 10001-20000]
...

Benefits: Parallel assembly, higher throughput

Tiered Storage

Hot tier (SSD):   Recent 10% of tree
Warm tier (HDD):  Middle 50% of tree
Cold tier (S3):   Oldest 40% of tree

See MegaCuration for pruning strategies.

Encryption

Add per-file encryption for sensitive data:

[64-byte header (plaintext)]
[encrypted payload with AES-256-GCM]

Summary

The MegaContext storage format provides:

  1. Deterministic offsets - O(1) random access without indexes
  2. Self-describing - Headers enable format evolution
  3. Compact - gist layers cost ~3% of full per-token fp16 embedding storage
  4. Efficient - Memory-mapped I/O for large trees
  5. Portable - Binary format works across platforms

This design enables the MegaContext Tree to scale from kilobytes to terabytes while maintaining constant-time access patterns.

See MegaContext Tree for tree structure and POC Architecture for implementation details.