MegaContext Storage Format
This document describes the on-disk storage format for the MegaContext Tree, including binary layouts, headers, compression strategies, and access patterns.
Overview
The MegaContext Tree persists to disk as a set of binary files with deterministic layout:
- LOD0.ctx - Raw token IDs (vocabulary indices)
- LOD1.ctx - Level 1 gist vectors
- LOD2.ctx - Level 2 gist vectors
- metadata.json - Tree metadata and configuration
All files use a common binary format with self-describing headers.
Binary File Format
File Header (64 bytes)
Every .ctx file begins with a 64-byte header:
Offset Field Type Size Meaning
------ ----- ---- ---- -------
0 magic uint32 4 0x4D434354 ("MCCT")
4 version uint16 2 Format revision (start at 1)
6 level uint16 2 0 (LOD0), 1 (LOD1), or 2 (LOD2)
8 block_size uint16 2 K = 32
10 embedding_dim uint16 2 Base model embedding width d
12 dtype_code uint16 2 0=uint32, 1=fp16, 2=bf16, 3=fp32
14 num_entries uint64 8 Total entries in this file
22 model_name char[32] 32 "SmolLM3-3B", etc.
54 reserved 10 bytes 10 Zeroed for future use
Magic number: 0x4D434354 (“MCCT” in ASCII) identifies MegaContext tree files
Version: Format revision for backward compatibility
- Version 1: Initial POC format
- Future versions may add compression, encryption, etc.
Dtype codes:
- 0 = uint32 (token IDs in LOD0.ctx)
- 1 = fp16 (half-precision floats for gists)
- 2 = bf16 (brain-float 16 for gists)
- 3 = fp32 (full precision, rarely used due to size)
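Assuming little-endian byte order (the table above does not specify endianness), the header can be packed and parsed with Python's `struct` module. `pack_header` and `parse_header` are illustrative names, not part of any existing API:

```python
import struct

# "<" = little-endian, no padding: 4+2+2+2+2+2+8+32+10 = 64 bytes
HEADER_FMT = "<IHHHHHQ32s10s"
MAGIC = 0x4D434354  # "MCCT"

def pack_header(level, block_size, embedding_dim, dtype_code,
                num_entries, model_name):
    """Build the 64-byte .ctx header described in the table above."""
    return struct.pack(HEADER_FMT, MAGIC, 1, level, block_size,
                       embedding_dim, dtype_code, num_entries,
                       model_name.encode("ascii"),  # struct null-pads to 32
                       b"\x00" * 10)                # reserved, zeroed

def parse_header(buf: bytes) -> dict:
    """Validate the magic number and return header fields."""
    (magic, version, level, block_size, embedding_dim,
     dtype_code, num_entries, name, _reserved) = struct.unpack(HEADER_FMT, buf)
    if magic != MAGIC:
        raise ValueError("not a MegaContext .ctx file")
    return {"version": version, "level": level, "block_size": block_size,
            "embedding_dim": embedding_dim, "dtype_code": dtype_code,
            "num_entries": num_entries,
            "model_name": name.rstrip(b"\x00").decode("ascii")}
```

A round trip through `pack_header` and `parse_header` also doubles as a format sanity check during ingestion.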
LOD0.ctx: Token Storage
Content: Raw token IDs from the base model’s vocabulary
Layout:
[64-byte header]
[token_0] [token_1] [token_2] ... [token_N-1]
Entry format: Each token is a uint32 (4 bytes)
Offset calculation:
def get_l0_token_offset(block_index: int, token_index: int) -> int:
    """Get byte offset for a specific token."""
    header_size = 64
    block_size = 32  # K
    absolute_token = block_index * block_size + token_index
    return header_size + (absolute_token * 4)  # 4 bytes per uint32
Example:
# Block 10, token 5
offset = 64 + ((10 * 32 + 5) * 4)
= 64 + (325 * 4)
= 64 + 1300
       = 1364 bytes
Access pattern: Sequential reads for ingestion, random access for Working Context assembly
LOD1.ctx / LOD2.ctx: Gist Storage
Content: Compressed gist embeddings generated by GistNet
Layout:
[64-byte header]
[gist_0: d-dimensional vector]
[gist_1: d-dimensional vector]
...
[gist_N-1: d-dimensional vector]
Entry format: Each gist is a d-dimensional vector (typically d=2048 or d=4096)
- Stored as fp16 (2 bytes per dimension) or bf16 in POC
- Future: quantized formats (int8, int4) for further compression
Offset calculation:
def get_gist_offset(gist_index: int, embedding_dim: int, dtype_bytes: int) -> int:
    """Get byte offset for a specific gist."""
    header_size = 64
    gist_size_bytes = embedding_dim * dtype_bytes  # e.g., 2048 * 2 = 4096 for fp16
    return header_size + (gist_index * gist_size_bytes)
Example (d=2048, fp16):
# Gist 42 in LOD1.ctx
embedding_dim = 2048
dtype_bytes = 2 # fp16
offset = 64 + (42 * 2048 * 2)
= 64 + (42 * 4096)
= 64 + 172032
       = 172096 bytes
Access pattern:
- Sequential writes during GistNet compression
- Random reads during Working Context assembly
- Cache-friendly due to fixed stride
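Because every gist occupies the same fixed stride, a contiguous run of gists can be fetched with one seek and one read. A minimal sketch using numpy and assuming fp16 storage (`read_gist_range` is an illustrative name):

```python
import numpy as np

def read_gist_range(f, start: int, count: int,
                    embedding_dim: int = 2048) -> np.ndarray:
    """Read `count` consecutive fp16 gists in a single seek+read.

    Exploits the fixed stride: gist i begins at byte 64 + i * d * 2.
    """
    stride = embedding_dim * 2  # fp16 = 2 bytes per dimension
    f.seek(64 + start * stride)
    flat = np.fromfile(f, dtype=np.float16, count=count * embedding_dim)
    return flat.reshape(count, embedding_dim)
```

Working Context assembly can batch adjacent gist IDs into range reads to cut syscall count.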
Deterministic Offsets
The storage format enables O(1) offset calculation without external indexes:
Benefits
- No index overhead: File offsets computed from (level, block_id, embedding_dim)
- Memory-mapped I/O: Can mmap() files for zero-copy access
- Partial loading: Read only needed gists, not entire file
- Concurrent access: Multiple readers without coordination
Implementation
import numpy as np

class MegaContextTreeStorage:
    def __init__(self, base_path: str, embedding_dim: int):
        self.base_path = base_path
        self.embedding_dim = embedding_dim
        self.l0_file = open(f"{base_path}/LOD0.ctx", "rb+")
        self.l1_file = open(f"{base_path}/LOD1.ctx", "rb+")
        self.l2_file = open(f"{base_path}/LOD2.ctx", "rb+")

    def read_l0_block(self, block_id: int) -> np.ndarray:
        """Read 32 token IDs."""
        offset = 64 + (block_id * 32 * 4)
        self.l0_file.seek(offset)
        return np.fromfile(self.l0_file, dtype=np.uint32, count=32)

    def read_l1_gist(self, gist_id: int) -> np.ndarray:
        """Read LOD1 gist vector."""
        offset = 64 + (gist_id * self.embedding_dim * 2)  # fp16
        self.l1_file.seek(offset)
        return np.fromfile(self.l1_file, dtype=np.float16, count=self.embedding_dim)
Metadata File
Format: JSON for human readability in POC
Location: {base_path}/metadata.json
Content:
{
"version": 1,
"created_at": "2025-01-15T10:30:00Z",
"model_name": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
"embedding_dim": 2048,
"block_size": 32,
"levels": {
"LOD0": {
"num_blocks": 31250,
"num_tokens": 1000000,
"file_size_bytes": 4000064
},
"LOD1": {
"num_gists": 31250,
"file_size_bytes": 128000064
},
"LOD2": {
"num_gists": 976,
"file_size_bytes": 3997760
}
},
"gistnet_checkpoint": "checkpoints/gistnet_epoch10.pt",
"ingestion_complete": true,
"last_modified": "2025-01-15T12:45:00Z"
}
Storage Size Analysis
Example: 1M Token Corpus
| Level | Entries | Entry Size | Total Size |
|---|---|---|---|
| LOD0 | 1M tokens | 4 bytes | 4 MB |
| LOD1 | 31,250 gists | 4 KB (2048 × fp16) | 128 MB |
| LOD2 | 976 gists | 4 KB | 4 MB |
| Metadata | 1 file | ~2 KB | 2 KB |
| Total | | | 136 MB |
Tree overhead: LOD1+LOD2 add 132 MB on top of the 4 MB of raw tokens (LOD0). For comparison, full per-token fp16 embeddings would cost 1M × 2048 × 2 B ≈ 4 GB, so the gist levels are only ~3.2% of that
Scaling
| Corpus Size | LOD0 Size | Tree Size (fp16) | Tree Size (int8) |
|---|---|---|---|
| 10k tokens | 40 KB | 1.3 MB | 650 KB |
| 100k tokens | 400 KB | 13 MB | 6.5 MB |
| 1M tokens | 4 MB | 136 MB | 68 MB |
| 10M tokens | 40 MB | 1.36 GB | 680 MB |
| 100M tokens | 400 MB | 13.6 GB | 6.8 GB |
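The table above can be reproduced with a small sizing helper. This is a sketch (`tree_sizes` is an illustrative name); it uses ceiling division for gist counts, so LOD2 figures may differ by one from floor-based counts elsewhere in this document:

```python
def tree_sizes(num_tokens: int, embedding_dim: int = 2048,
               block_size: int = 32, gist_bytes: int = 2) -> dict:
    """Estimate per-level payload sizes in bytes (64-byte headers excluded)."""
    lod0 = num_tokens * 4                 # uint32 token IDs
    n1 = -(-num_tokens // block_size)     # ceil(num_tokens / K) LOD1 gists
    n2 = -(-n1 // block_size)             # ceil(n1 / K) LOD2 gists
    gist = embedding_dim * gist_bytes     # bytes per gist vector (fp16 default)
    return {"LOD0": lod0, "LOD1": n1 * gist, "LOD2": n2 * gist}
```

Halving `gist_bytes` to 1 reproduces the int8 column.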
See Performance Sketch for detailed storage analysis.
Compression Strategies
Block Compression (Future)
Apply per-file compression:
[64-byte header]
[compressed payload]
Algorithms:
- zstd (level 3): ~2–3× compression, fast decompression
- LZ4: ~1.5–2× compression, very fast
- brotli (level 6): ~3–4× compression, slower
Trade-offs:
- Loses deterministic offsets (requires index)
- Adds decompression latency (~5–10 ms for 1 MB)
- Best for cold storage, not hot Working Context assembly
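None of these codecs ship in the Python standard library, so the sketch below uses `zlib` as a stand-in to illustrate the layout of a plaintext header followed by a compressed payload; swapping in a `zstandard` or `lz4` binding would follow the same shape:

```python
import zlib

HEADER_SIZE = 64

def compress_payload(raw: bytes, level: int = 6) -> bytes:
    """Keep the 64-byte header as plaintext, compress everything after it."""
    header, payload = raw[:HEADER_SIZE], raw[HEADER_SIZE:]
    return header + zlib.compress(payload, level)

def decompress_payload(blob: bytes) -> bytes:
    """Inverse: pass the header through, decompress the rest."""
    header, payload = blob[:HEADER_SIZE], blob[HEADER_SIZE:]
    return header + zlib.decompress(payload)
```

Keeping the header uncompressed means a file can still be identified and type-checked without decompressing anything.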
Quantization
Reduce gist precision:
| Format | Bits/dim | Compression | Quality Loss |
|---|---|---|---|
| fp32 | 32 | 1× (baseline) | None |
| fp16 | 16 | 2× | Negligible |
| bf16 | 16 | 2× | Slightly worse than fp16 |
| int8 | 8 | 4× | ~0.5% ΔNLL increase |
| int4 | 4 | 8× | ~2% ΔNLL increase |
POC: Uses fp16 for good quality/size balance
Future: Hybrid approach—fp16 for recent gists, int8 for old ones
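One possible int8 scheme, sketched here for illustration (symmetric per-vector scaling; a production format might instead use per-channel or learned scales):

```python
import numpy as np

def quantize_int8(gist: np.ndarray):
    """Symmetric per-vector quantization: one float scale per gist."""
    scale = float(np.abs(gist).max()) / 127.0 or 1e-8  # guard all-zero vectors
    q = np.clip(np.round(gist / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate fp32 gist from int8 codes."""
    return q.astype(np.float32) * scale
```

Storage drops to 1 byte per dimension plus one scale per gist, matching the 4× row in the table.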
Memory-Mapped I/O
Benefits
- Zero-copy access: No read() syscalls, direct memory access
- Lazy loading: OS pages in data only when accessed
- Shared memory: Multiple processes can share same mapping
Implementation
import mmap

import numpy as np

class MMapTreeStorage:
    def __init__(self, base_path: str):
        self.l0_file = open(f"{base_path}/LOD0.ctx", "rb")
        self.l0_mmap = mmap.mmap(
            self.l0_file.fileno(),
            length=0,  # entire file
            access=mmap.ACCESS_READ
        )
        # Similar for LOD1, LOD2

    def read_l0_block(self, block_id: int) -> np.ndarray:
        offset = 64 + (block_id * 32 * 4)
        token_bytes = self.l0_mmap[offset:offset + 128]  # 32 tokens × 4 bytes
        return np.frombuffer(token_bytes, dtype=np.uint32)
When to use:
- Large trees (>1 GB) that don’t fit in RAM
- Random access patterns (Working Context assembly)
- Not needed for POC (RAM-resident)
Checkpointing & Versioning
Snapshots
Periodically save complete tree state:
checkpoints/
  2025-01-15_10-30-00/
    LOD0.ctx
    LOD1.ctx
    LOD2.ctx
    metadata.json
Use cases:
- Rollback after corrupted ingestion
- A/B testing different GistNet checkpoints
- Reproducible experiments
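Since each file is self-contained, a snapshot is just a timestamped copy. A standard-library sketch (`snapshot_tree` is an illustrative name):

```python
import pathlib
import shutil
from datetime import datetime, timezone

TREE_FILES = ("LOD0.ctx", "LOD1.ctx", "LOD2.ctx", "metadata.json")

def snapshot_tree(base_path: str,
                  checkpoints_dir: str = "checkpoints") -> pathlib.Path:
    """Copy the four tree files into checkpoints/<timestamp>/."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    dst = pathlib.Path(checkpoints_dir) / stamp
    dst.mkdir(parents=True, exist_ok=True)
    for name in TREE_FILES:
        shutil.copy2(pathlib.Path(base_path) / name, dst / name)
    return dst
```

`copy2` preserves timestamps, which helps when auditing which snapshot a file came from.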
Incremental Updates
When GistNet is retrained:
- Read the old gist from LOD1.ctx
- Recompute it with the new GistNet
- Write the updated gist to the same offset
- Update gistnet_checkpoint in metadata
Versioning: Track which GistNet checkpoint generated each gist
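Because offsets are fixed, a recomputed gist can be written back in place without touching its neighbors. A sketch assuming fp16 storage (`rewrite_gist` is an illustrative name):

```python
import numpy as np

def rewrite_gist(l1_path: str, gist_id: int, new_gist: np.ndarray,
                 embedding_dim: int = 2048) -> None:
    """Overwrite one gist at its deterministic offset, in place."""
    assert new_gist.shape == (embedding_dim,)
    with open(l1_path, "rb+") as f:  # rb+: modify without truncating the file
        f.seek(64 + gist_id * embedding_dim * 2)
        f.write(new_gist.astype(np.float16).tobytes())
```

In-place rewrites keep all other offsets valid, so readers need no re-indexing after a GistNet refresh.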
Future Enhancements
Distributed Storage
Shard tree across multiple nodes:
node_0: LOD0.ctx [blocks 0-10000]
node_1: LOD0.ctx [blocks 10001-20000]
...
Benefits: Parallel assembly, higher throughput
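Routing a block to its shard under a contiguous range layout is a one-liner; this sketch assumes half-open ranges of equal size (`blocks_per_node` is a hypothetical parameter):

```python
def node_for_block(block_id: int, blocks_per_node: int = 10_000) -> int:
    """Contiguous range sharding: node i holds blocks [i*N, (i+1)*N)."""
    return block_id // blocks_per_node
```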
Tiered Storage
Hot tier (SSD): Recent 10% of tree
Warm tier (HDD): Next 50% of tree
Cold tier (S3): Oldest 40% of tree
See MegaCuration for pruning strategies.
Encryption
Add per-file encryption for sensitive data:
[64-byte header (plaintext)]
[encrypted payload with AES-256-GCM]
Summary
The MegaContext storage format provides:
- Deterministic offsets - O(1) random access without indexes
- Self-describing - Headers enable format evolution
- Compact - gist levels cost ~3.2% of full per-token fp16 embeddings
- Efficient - Memory-mapped I/O for large trees
- Portable - Binary format works across platforms
This design enables the MegaContext Tree to scale from kilobytes to terabytes while maintaining constant-time access patterns.
See MegaContext Tree for tree structure and POC Architecture for implementation details.