MegaContext Storage Format

This document describes the on-disk storage format for the MegaContext Tree, including binary layouts, headers, compression strategies, and access patterns.


Overview

The MegaContext Tree persists to disk as a set of binary files with deterministic layout:

  • LOD0.ctx - Raw token IDs (vocabulary indices)
  • LOD1.ctx - Level 1 gist vectors
  • LOD2.ctx - Level 2 gist vectors
  • metadata.json - Tree metadata and configuration

All files use a common binary format with self-describing headers.


Binary File Format

File Header (64 bytes)

Every .ctx file begins with a 64-byte header:

Offset  Field            Type      Size    Meaning
------  -----            ----      ----    -------
0       magic            uint32    4       0x4D434354 ("MCCT")
4       version          uint16    2       Format revision (start at 1)
6       level            uint16    2       0 (LOD0), 1 (LOD1), or 2 (LOD2)
8       block_size       uint16    2       K = 32
10      embedding_dim    uint16    2       Base model embedding width d
12      dtype_code       uint16    2       0=uint32, 1=fp16, 2=bf16, 3=fp32
14      num_entries      uint64    8       Total entries in this file
22      model_name       char[32]  32      "SmolLM3-3B", etc.
54      reserved         10 bytes  10      Zeroed for future use

Magic number: 0x4D434354 (“MCCT” in ASCII) identifies MegaContext tree files

Version: Format revision for backward compatibility

  • Version 1: Initial POC format
  • Future versions may add compression, encryption, etc.

Dtype codes:

  • 0 = uint32 (token IDs in LOD0.ctx)
  • 1 = fp16 (half-precision floats for gists)
  • 2 = bf16 (brain-float 16 for gists)
  • 3 = fp32 (full precision, rarely used due to size)
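
The header can be packed and parsed with Python's struct module. A sketch assuming little-endian byte order (the format description above does not pin down endianness); "<" disables struct padding so num_entries lands at offset 14 exactly as specified:

```python
import struct

HEADER_FORMAT = "<IHHHHHQ32s10s"               # matches the field table above
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)   # 64 bytes
MAGIC = 0x4D434354                             # "MCCT"

def pack_header(level, embedding_dim, dtype_code, num_entries, model_name):
    return struct.pack(
        HEADER_FORMAT,
        MAGIC,
        1,             # version 1: initial POC format
        level,         # 0 (LOD0), 1 (LOD1), or 2 (LOD2)
        32,            # block_size K
        embedding_dim,
        dtype_code,    # 0=uint32, 1=fp16, 2=bf16, 3=fp32
        num_entries,
        model_name.encode("ascii"),  # struct NUL-pads char[32]
        b"",                          # reserved: zero-filled by struct
    )

def unpack_header(raw):
    magic, version, level, block_size, dim, dtype_code, n, name, _ = \
        struct.unpack(HEADER_FORMAT, raw)
    assert magic == MAGIC, "not a MegaContext .ctx file"
    return {"version": version, "level": level, "block_size": block_size,
            "embedding_dim": dim, "dtype_code": dtype_code,
            "num_entries": n, "model_name": name.rstrip(b"\x00").decode("ascii")}
```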

LOD0.ctx: Token Storage

Content: Raw token IDs from the base model’s vocabulary

Layout:

[64-byte header]
[token_0] [token_1] [token_2] ... [token_N-1]

Entry format: Each token is a uint32 (4 bytes)

Offset calculation:

def get_l0_token_offset(block_index: int, token_index: int) -> int:
    """Get byte offset for a specific token."""
    header_size = 64
    block_size = 32  # K
    absolute_token = block_index * block_size + token_index
    return header_size + (absolute_token * 4)  # 4 bytes per uint32

Example:

# Block 10, token 5
offset = 64 + ((10 * 32 + 5) * 4)
       = 64 + (325 * 4)
       = 64 + 1300
       = 1364 bytes

Access pattern: Sequential reads for ingestion, random access for Working Context assembly


LOD1.ctx / LOD2.ctx: Gist Storage

Content: Compressed gist embeddings generated by GistNet

Layout:

[64-byte header]
[gist_0: d-dimensional vector]
[gist_1: d-dimensional vector]
...
[gist_N-1: d-dimensional vector]

Entry format: Each gist is a d-dimensional vector (typically d=2048 or d=4096)

  • Stored as fp16 (2 bytes per dimension) or bf16 in POC
  • Future: quantized formats (int8, int4) for further compression

Offset calculation:

def get_gist_offset(gist_index: int, embedding_dim: int, dtype_bytes: int) -> int:
    """Get byte offset for a specific gist."""
    header_size = 64
    gist_size_bytes = embedding_dim * dtype_bytes  # e.g., 2048 * 2 = 4096 for fp16
    return header_size + (gist_index * gist_size_bytes)

Example (d=2048, fp16):

# Gist 42 in LOD1.ctx
embedding_dim = 2048
dtype_bytes = 2  # fp16
offset = 64 + (42 * 2048 * 2)
       = 64 + (42 * 4096)
       = 64 + 172032
       = 172096 bytes

Access pattern:

  • Sequential writes during GistNet compression
  • Random reads during Working Context assembly
  • Cache-friendly due to fixed stride

Deterministic Offsets

The storage format enables O(1) offset calculation without external indexes:

Benefits

  1. No index overhead: File offsets computed from (level, block_id, embedding_dim)
  2. Memory-mapped I/O: Can mmap() files for zero-copy access
  3. Partial loading: Read only needed gists, not entire file
  4. Concurrent access: Multiple readers without coordination

Implementation

import numpy as np

class MegaContextTreeStorage:
    def __init__(self, base_path: str, embedding_dim: int):
        self.base_path = base_path
        self.embedding_dim = embedding_dim
        # "rb+" allows in-place updates (e.g. gist refresh) as well as reads
        self.l0_file = open(f"{base_path}/LOD0.ctx", "rb+")
        self.l1_file = open(f"{base_path}/LOD1.ctx", "rb+")
        self.l2_file = open(f"{base_path}/LOD2.ctx", "rb+")

    def read_l0_block(self, block_id: int) -> np.ndarray:
        """Read one block of 32 token IDs (K = 32)."""
        offset = 64 + (block_id * 32 * 4)  # header + block stride
        self.l0_file.seek(offset)
        return np.fromfile(self.l0_file, dtype=np.uint32, count=32)

    def read_l1_gist(self, gist_id: int) -> np.ndarray:
        """Read one LOD1 gist vector (stored as fp16)."""
        offset = 64 + (gist_id * self.embedding_dim * 2)  # 2 bytes per dim
        self.l1_file.seek(offset)
        return np.fromfile(self.l1_file, dtype=np.float16, count=self.embedding_dim)
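
A minimal round-trip sketch: write a tiny LOD0.ctx (header plus uint32 tokens), then read one block back with the same offset arithmetic as read_l0_block. The write-side helper is hypothetical; the POC's actual writer may differ:

```python
import struct
import numpy as np

def write_l0_file(path: str, token_ids) -> None:
    """Write a minimal LOD0.ctx: 64-byte header (partially filled) + uint32 tokens."""
    header = bytearray(64)  # model_name/reserved fields left zeroed for brevity
    struct.pack_into("<IHHHHHQ", header, 0,
                     0x4D434354, 1, 0, 32, 2048, 0, len(token_ids))
    with open(path, "wb") as f:
        f.write(header)
        f.write(np.asarray(token_ids, dtype=np.uint32).tobytes())

def read_block(path: str, block_id: int) -> np.ndarray:
    """Read one 32-token block using the deterministic offset formula."""
    with open(path, "rb") as f:
        f.seek(64 + block_id * 32 * 4)
        return np.frombuffer(f.read(32 * 4), dtype=np.uint32)
```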

Metadata File

Format: JSON for human readability in POC

Location: {base_path}/metadata.json

Content:

{
  "version": 1,
  "created_at": "2025-01-15T10:30:00Z",
  "model_name": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
  "embedding_dim": 2048,
  "block_size": 32,
  "levels": {
    "LOD0": {
      "num_blocks": 31250,
      "num_tokens": 1000000,
      "file_size_bytes": 4000064
    },
    "LOD1": {
      "num_gists": 31250,
      "file_size_bytes": 128000064
    },
    "LOD2": {
      "num_gists": 976,
      "file_size_bytes": 3997760
    }
  },
  "gistnet_checkpoint": "checkpoints/gistnet_epoch10.pt",
  "ingestion_complete": true,
  "last_modified": "2025-01-15T12:45:00Z"
}
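
The counts and sizes in this example are internally consistent; a quick sanity check (a sketch, assuming the field names shown above):

```python
def check_metadata(meta: dict) -> None:
    """Assert that per-level file sizes follow from entry counts (64-byte headers)."""
    K = meta["block_size"]
    d = meta["embedding_dim"]
    l0 = meta["levels"]["LOD0"]
    assert l0["num_blocks"] == l0["num_tokens"] // K
    assert l0["file_size_bytes"] == 64 + l0["num_tokens"] * 4           # uint32 tokens
    for name in ("LOD1", "LOD2"):
        lvl = meta["levels"][name]
        assert lvl["file_size_bytes"] == 64 + lvl["num_gists"] * d * 2  # fp16 gists
```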

Storage Size Analysis

Example: 1M Token Corpus

Level     Entries         Entry Size            Total Size
-----     -------         ----------            ----------
LOD0      1M tokens       4 bytes               4 MB
LOD1      31,250 gists    4 KB (2048 × fp16)    128 MB
LOD2      976 gists       4 KB                  4 MB
Metadata  1 file          ~2 KB                 2 KB
Total                                           136 MB

Tree overhead: LOD1+LOD2 add ~132 MB on top of the 4 MB of raw token IDs. That looks large next to the tokens themselves, but it is only ~3.2% of the ~4 GB required to store a full fp16 embedding (2048 × 2 bytes) for every token.

Scaling

Corpus Size   LOD0 Size   Tree Size (fp16)   Tree Size (int8)
-----------   ---------   ----------------   ----------------
10k tokens    40 KB       1.3 MB             650 KB
100k tokens   400 KB      13 MB              6.5 MB
1M tokens     4 MB        136 MB             68 MB
10M tokens    40 MB       1.36 GB            680 MB
100M tokens   400 MB      13.6 GB            6.8 GB
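
The per-level sizes behind these rows can be reproduced with a small calculator (a sketch; headers and metadata ignored, gist counts rounded down as in the 1M-token example):

```python
def tree_sizes(num_tokens: int, embedding_dim: int = 2048,
               K: int = 32, bytes_per_dim: int = 2) -> dict:
    """Approximate bytes per level for a corpus of num_tokens (fp16 by default)."""
    n1 = num_tokens // K          # LOD1 gists: one per K-token block
    n2 = n1 // K                  # LOD2 gists: one per K LOD1 gists
    gist = embedding_dim * bytes_per_dim
    return {"LOD0": num_tokens * 4, "LOD1": n1 * gist, "LOD2": n2 * gist}
```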

See Performance Sketch for detailed storage analysis.


Compression Strategies

Block Compression (Future)

Apply per-file compression:

[64-byte header]
[compressed payload]

Algorithms:

  • zstd (level 3): ~2–3× compression, fast decompression
  • LZ4: ~1.5–2× compression, very fast
  • brotli (level 6): ~3–4× compression, slower

Trade-offs:

  • Loses deterministic offsets (requires index)
  • Adds decompression latency (~5–10 ms for 1 MB)
  • Best for cold storage, not hot Working Context assembly
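
The header-plaintext/payload-compressed layout can be sketched as below. zlib stands in here only because zstd, LZ4, and brotli are not in the Python standard library; the principle is identical:

```python
import zlib

def compress_payload(raw_file_bytes: bytes) -> bytes:
    """Keep the 64-byte header plaintext; compress everything after it."""
    header, payload = raw_file_bytes[:64], raw_file_bytes[64:]
    return header + zlib.compress(payload, 3)

def decompress_payload(compressed_bytes: bytes) -> bytes:
    """Inverse of compress_payload: header is untouched, payload inflated."""
    header, payload = compressed_bytes[:64], compressed_bytes[64:]
    return header + zlib.decompress(payload)
```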

Quantization

Reduce gist precision:

Format   Bits/dim   Compression     Quality Loss
------   --------   -----------     ------------
fp32     32         1× (baseline)   None
fp16     16         2×              Negligible
bf16     16         2×              Slightly worse than fp16
int8     8          4×              ~0.5% ΔNLL increase
int4     4          8×              ~2% ΔNLL increase

POC: Uses fp16 for good quality/size balance

Future: Hybrid approach—fp16 for recent gists, int8 for old ones
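
For illustration, one way int8 quantization of a gist could look: symmetric per-vector scaling, with the max absolute value mapped to 127. This is a sketch of the general technique, not the scheme the POC uses (the POC stores fp16):

```python
import numpy as np

def quantize_int8(gist: np.ndarray):
    """Symmetric per-vector int8 quantization; returns codes and a scale factor."""
    scale = max(float(np.abs(gist).max()), 1e-12) / 127.0
    q = np.clip(np.round(gist / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate fp32 gist from int8 codes."""
    return q.astype(np.float32) * scale
```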


Memory-Mapped I/O

Benefits

  • Zero-copy access: No read() syscalls, direct memory access
  • Lazy loading: OS pages in data only when accessed
  • Shared memory: Multiple processes can share same mapping

Implementation

import mmap
import numpy as np

class MMapTreeStorage:
    def __init__(self, base_path: str):
        self.l0_file = open(f"{base_path}/LOD0.ctx", "rb")
        self.l0_mmap = mmap.mmap(
            self.l0_file.fileno(),
            length=0,  # map the entire file
            access=mmap.ACCESS_READ
        )
        # Similar for LOD1, LOD2

    def read_l0_block(self, block_id: int) -> np.ndarray:
        offset = 64 + (block_id * 32 * 4)
        token_bytes = self.l0_mmap[offset:offset+128]  # 32 tokens × 4 bytes
        return np.frombuffer(token_bytes, dtype=np.uint32)

When to use:

  • Large trees (>1 GB) that don’t fit in RAM
  • Random access patterns (Working Context assembly)
  • Not needed for POC (RAM-resident)

Checkpointing & Versioning

Snapshots

Periodically save complete tree state:

checkpoints/
  2025-01-15_10-30-00/
    LOD0.ctx
    LOD1.ctx
    LOD2.ctx
    metadata.json

Use cases:

  • Rollback after corrupted ingestion
  • A/B testing different GistNet checkpoints
  • Reproducible experiments
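
A snapshot along these lines is a directory copy into a timestamped folder (snapshot_tree and its arguments are hypothetical names, matching the layout above):

```python
import shutil
from datetime import datetime, timezone

def snapshot_tree(base_path: str, checkpoints_dir: str) -> str:
    """Copy the live tree (all .ctx files + metadata.json) into a timestamped dir."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    dest = f"{checkpoints_dir}/{stamp}"
    shutil.copytree(base_path, dest)  # fails if dest exists: snapshots are immutable
    return dest
```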

Incremental Updates

When GistNet is retrained:

  1. Read old gist from LOD1.ctx
  2. Recompute with new GistNet
  3. Write updated gist to same offset
  4. Update gistnet_checkpoint in metadata

Versioning: Track which GistNet checkpoint generated each gist
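
Steps 2-3 amount to an in-place rewrite at the gist's deterministic offset. A sketch (refresh_gist and new_gistnet are hypothetical; any callable producing a d-dimensional vector stands in for the retrained GistNet):

```python
import numpy as np

def refresh_gist(l1_path: str, gist_id: int, embedding_dim: int,
                 new_gistnet, block_tokens) -> None:
    """Recompute one LOD1 gist and overwrite it at the same offset (fp16)."""
    offset = 64 + gist_id * embedding_dim * 2
    new_gist = np.asarray(new_gistnet(block_tokens), dtype=np.float16)
    with open(l1_path, "r+b") as f:  # r+b: update in place without truncating
        f.seek(offset)
        f.write(new_gist.tobytes())
```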


Future Enhancements

Distributed Storage

Shard tree across multiple nodes:

node_0: LOD0.ctx [blocks 0-10000]
node_1: LOD0.ctx [blocks 10001-20000]
...

Benefits: Parallel assembly, higher throughput

Tiered Storage

Hot tier (SSD):   Recent 10% of tree
Warm tier (HDD):  Middle 50% of tree
Cold tier (S3):   Oldest 40% of tree

See MegaCuration for pruning strategies.

Encryption

Add per-file encryption for sensitive data:

[64-byte header (plaintext)]
[encrypted payload with AES-256-GCM]

Summary

The MegaContext storage format provides:

  1. Deterministic offsets - O(1) random access without indexes
  2. Self-describing - Headers enable format evolution
  3. Compact - gist layers cost ~3% of full per-token fp16 embedding storage
  4. Efficient - Memory-mapped I/O for large trees
  5. Portable - Binary format works across platforms

This design enables the MegaContext Tree to scale from kilobytes to terabytes while maintaining constant-time access patterns.

See MegaContext Tree for tree structure and POC Architecture for implementation details.