Architecture Details: The Two-Context System

MegaContext virtualizes context by pairing a disk-backed gist tree called the MegaContext Tree with a budgeted working context governed by GistNet, LensNet, and the Focus Allocator. This two-context architecture separates concerns between long-term storage and active processing.

Implementation note: The current notebook prototype lives under src/megacontext/ and implements only the minimal pieces (gistnet, basic runtime wrappers). The full production stack will move into the nanochat fork per the MegaContext PRD Index; reference those PRDs when you need the plan-of-record contracts.

MegaContext separates a model’s context into a MegaContext Tree (stored on disk) and a Working Context (held on GPU). A learned GistNet model builds the MegaContext Tree as a hierarchy of gists [5, 7]. The Working Context condenses the MegaContext Tree into a fixed-size mix of tokens and gists used for inference.

To adapt the level of detail dynamically, a learned LensNet model [2, 3] continuously and incrementally refocuses the MegaContext Tree onto the Working Context, giving the model effectively infinite memory at constant compute with automatic context management.



Table of Contents

  1. Why Two Contexts?
  2. The Two-Context Architecture Explained
  3. Detailed Context Comparison
  4. How the Contexts Interact
  5. Data Flow Between Contexts
  6. Why This Architecture Enables System Properties
  7. Core Components
  8. Runtime Lifecycle
  9. Key Terms & Invariants
  10. Document Roadmap

Why Two Contexts?

The Fundamental Problem

Large language models face an inherent trade-off between memory capacity and computational efficiency:

1. Fixed Context Windows: Traditional LLMs have a fixed context window (e.g., 4k, 8k, 32k tokens) [12]. Once you exceed this limit, you must either:

  • Truncate old information (losing history)
  • Use sliding windows (losing distant context) [12]
  • Compress everything equally (losing important details)

2. Uniform Attention Cost: Standard transformer attention has O(n²) complexity, where n is the context length. Every token attends to every other token at equal computational cost, regardless of relevance [17].

3. Static Representation: Once text is processed, its representation is fixed. You cannot dynamically adjust the level of detail based on changing relevance as the conversation evolves [9, 11].

The Two-Context Solution

MegaContext solves these problems by separating concerns into two complementary contexts:

MegaContext Tree: The “Hard Drive” of Memory [1]

  • Purpose: Store the complete history indefinitely
  • Storage: Disk-backed (RAM for POC), hierarchical structure [1]
  • Capacity: Effectively unlimited (millions to billions of tokens)
  • Access Pattern: Random access, multi-resolution
  • Cost Model: Storage cost only, no computation per token

Working Context: The “RAM” of Active Memory

  • Purpose: Provide the relevant subset for immediate inference [9]
  • Storage: GPU memory, flat sequence [14]
  • Capacity: Fixed budget (8k-32k tokens)
  • Access Pattern: Sequential processing (left-to-right)
  • Cost Model: Full attention cost during inference [17]

Why This Separation Is Necessary

1. Scalability: You cannot fit millions of tokens in GPU memory or process them with O(n²) attention [17] in real-time.

2. Efficiency: Most historical context is not relevant for the current task. Processing everything equally is wasteful [9].

3. Adaptability: Relevance changes over time. Something unimportant earlier may become critical later. The system needs to dynamically refocus.

4. Practicality: Consumer-grade applications at 100M+ context lengths require sub-linear memory and compute scaling.

The Key Insight

The two-context architecture recognizes that there are fundamentally different requirements for:

  • Long-term storage (complete, persistent, multi-resolution)
  • Active processing (focused, fixed-size, high-detail where needed)

By separating these concerns, MegaContext can optimize each independently while maintaining a coherent view of the entire interaction history.


The Two-Context Architecture Explained

How They Work Together

  1. GistNet compresses incoming tokens into the MegaContext Tree hierarchy
  2. LensNet + Focus Allocator selects which parts of the tree to load into Working Context
  3. Base LLM operates only on the Working Context (remains frozen, unmodified)
  4. As new tokens are generated, they flow back into the MegaContext Tree
  5. The cycle repeats, continuously refocusing the Working Context

Detailed Context Comparison

MegaContext Tree vs. Working Context

| Aspect | MegaContext Tree | Working Context |
| --- | --- | --- |
| Purpose | Long-term storage of complete history | Active processing window for inference |
| Storage Location | Disk (RAM in POC) | GPU memory |
| Capacity | Effectively unlimited (millions to billions of tokens) | Fixed budget: 8k-32k tokens |
| Structure | Hierarchical tree (LOD0→LOD1→LOD2→LOD3…) | Flat, contiguous sequence |
| Content | All tokens + all gists at all levels | Mixed: selected tokens and gists |
| Granularity | Multi-resolution (32:1 compression per level) | Variable per entry (LOD0, LOD1, LOD2, etc.) |
| Access Pattern | Random access to any node | Sequential processing (left-to-right) |
| Mutability | Append-only (grows monotonically) | Dynamic (refocused continuously) |
| Temporal Coverage | Complete: every moment since conversation start | Selective: contiguous but variable detail |
| Computational Cost | No inference cost (storage only) | Full attention cost during decode |
| Update Frequency | Block-aligned (every 32 tokens) | Every decode step (via refocus) |
| Persistence | Permanent (survives across sessions) | Ephemeral (rebuilt each step) |
| Visibility to Base LLM | Invisible (never seen directly) | Fully visible (only thing the LLM sees) |
| Data Format | Tree nodes with parent/child pointers | Embedding sequence (4096-dim vectors) |
| Indexing | Tree coordinates (level, position) | Linear array (0 to W_max) |
| Compression Method | Hierarchical gisting via GistNet | No compression (but entries may be gists) |
| Detail Control | Implicit (by level) | Explicit (selected by LensNet/FA) |
| Memory Overhead | ~1.5-2× raw tokens (tree structure) | Exactly W_max embeddings |
| Latency | Disk I/O (negligible for RAM) | Zero (already in GPU) |
| Parallelism | Can build gists in parallel | Sequential attention |
| Failure Mode | Disk full (rare at GB scales) | Budget exceeded (handled by FA) |
| Optimization Target | Minimize ΔNLL (compression loss) | Maximize task performance |

Detailed Breakdown

MegaContext Tree Structure

The MegaContext Tree is a hierarchical compression of the complete history:

Level 3:  ●─────────●─────────●        (each covers 32,768 tokens)
          │         │         │
Level 2:  ●──●──●──●──●──●──●──●      (each covers 1,024 tokens)
          │  │  │  │  │  │  │  │
Level 1:  ●●●●●●●●●●●●●●●●●●●●●●●●    (each covers 32 tokens)
          │││││││││││││││││││││││
Level 0:  [32][32][32][32][32][32]... (raw token blocks)
  • LOD0: Raw token blocks (32 tokens each)
  • LOD1: Each gist summarizes one LOD0 block (32 tokens → 1 gist)
  • LOD2: Each gist summarizes 32 LOD1 gists (1,024 tokens → 1 gist)
  • LOD3: Each gist summarizes 32 LOD2 gists (32,768 tokens → 1 gist)

Key Properties:

  • Each node has at most 32 children
  • Compression ratio: 32:1 per level
  • Tree depth grows logarithmically: depth = ⌈log₃₂(n)⌉
  • Total storage: ~1.5-2× raw tokens (due to redundancy)
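Under these properties, per-level coverage and tree depth follow directly. A minimal sketch (branching factor 32; the function names are illustrative, not part of the system):

```python
import math

BRANCHING = 32  # children per node; also the LOD0 block size

def level_coverage(level):
    """Tokens covered by one node at a given LOD (LOD1 = 32, LOD2 = 1,024, ...)."""
    return BRANCHING ** level

def tree_depth(n_tokens):
    """Gist levels needed to cover n_tokens: ceil(log32(n))."""
    return max(1, math.ceil(math.log(n_tokens, BRANCHING)))
```

For example, a 1M-token history needs only 4 levels and a 1B-token history only 6, matching the logarithmic-depth property above.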

Working Context Structure

The Working Context is a contiguous sequence mixing different levels of detail:

Position: [0   ][1   ][2   ][3   ][4   ][5   ][6   ][7   ]
Content:  [LOD3][LOD2][LOD2][LOD1][LOD1][LOD0][LOD0][LOD0]
Cost:     [1   ][1   ][1   ][1   ][1   ][32  ][32  ][32  ]
Timeline:  distant past ────────────────────────► present
          |---Distant---| |---Mid---| |------Recent------|
           (low detail)     (mid)       (high detail)

Key Properties:

  • Each entry covers exactly one time interval (no gaps, no overlaps)
  • Entries can be at different levels (LOD0, LOD1, LOD2, etc.)
  • Total token cost ≤ W_max (enforced by Focus Allocator)
  • Temporally contiguous (left-to-right = past-to-present)
  • Recent content typically at higher detail (LOD0)
  • Distant content typically at lower detail (LOD2, LOD3)

How the Contexts Interact

Three Types of Operations

1. Write: Tokens → MegaContext Tree (via GistNet)

New tokens (from user input or model generation) are written to the MegaContext Tree:

Incoming tokens → LOD0 buffer (32 tokens) → GistNet → LOD1 gist
                                              ↓
                  LOD1 buffer (32 gists) → GistNet → LOD2 gist
                                              ↓
                  LOD2 buffer (32 gists) → GistNet → LOD3 gist

Process:

  1. Buffer incoming tokens until 32 are collected
  2. GistNet compresses the 32-token block into a single LOD1 gist
  3. Store both the LOD0 block and LOD1 gist in the tree
  4. When 32 LOD1 gists accumulate, compress to LOD2
  5. Repeat hierarchically up the tree

Triggering:

  • Happens automatically as tokens arrive
  • Block-aligned (every 32 tokens)
  • Independent of Working Context state
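The write path above can be sketched end-to-end. In this sketch the class name is illustrative and `fake_gist` (simple averaging) stands in for GistNet; the buffering and carry-up logic mirrors the process described:

```python
from collections import defaultdict

BLOCK = 32  # tokens per LOD0 block; 32:1 compression per level

def fake_gist(children):
    # Stand-in for GistNet: averages child embeddings (here, scalars).
    return sum(children) / len(children)

class MegaContextTree:
    def __init__(self):
        self.levels = defaultdict(list)   # level -> stored nodes
        self.buffers = defaultdict(list)  # level -> pending children

    def append_token(self, embedding):
        self.levels[0].append(embedding)
        self.buffers[0].append(embedding)
        level = 0
        # Whenever a buffer fills, emit a gist and propagate it upward.
        while len(self.buffers[level]) == BLOCK:
            gist = fake_gist(self.buffers[level])
            self.buffers[level] = []
            self.levels[level + 1].append(gist)
            self.buffers[level + 1].append(gist)
            level += 1
```

Feeding 1,024 tokens produces 32 LOD1 gists and one LOD2 gist, independent of any Working Context state.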

2. Read: MegaContext Tree → Working Context (via LensNet + Focus Allocator)

The Working Context is assembled by selecting entries from the MegaContext Tree:

MegaContext Tree (select nodes) → Working Context Assembly → [LOD0][LOD1][LOD0][LOD2]...
                                         ↑
                              LensNet + Focus Allocator
                              (decides what to include)

Process:

  1. LensNet scores each current Working Context entry for “focus value”
    • Positive score = expand to higher detail
    • Negative score = collapse to lower detail
  2. Focus Allocator applies scores while maintaining invariants:
    • Contiguity: no temporal gaps
    • Budget: total cost ≤ W_max
    • Block-alignment: changes respect 32-token boundaries
  3. Requested entries are fetched from MegaContext Tree
  4. Working Context is rebuilt with new mix of detail levels

Triggering:

  • Every decode step (or every N steps)
  • Before each LLM inference pass
  • Adaptive based on LensNet scores

3. Update: Refocusing the Working Context

The Working Context is continuously updated to reflect changing relevance:

Old Working Context → LensNet → Focus Scores → Focus Allocator → New Working Context
         ↑                                                              ↓
         └──────────────────────── (feed to LLM) ────────────────────┘

Example Refocus Cycle:

Step T:   [LOD0][LOD0][LOD1][LOD1][LOD2][LOD0][LOD0]  (current WC)
          ↓
LensNet:  [ 0  ][ 0  ][ -1 ][ -2 ][ +3 ][ 0  ][ 0  ]  (focus scores)
          ↓
FA:        keep  keep  collapse collapse expand keep  keep
          ↓
Step T+1: [LOD0][LOD0][LOD2][LOD2][LOD1][LOD0][LOD0]  (updated WC)
                      ^^^^^^^^^^^^^^^^^^
                      (detail changed)
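Per entry, the score-to-level mapping can be sketched as follows. This is a simplified stand-in for the Focus Allocator's per-entry decision: it ignores budget and contiguity, and the function name is illustrative:

```python
def apply_focus(lod, score, max_lod=3):
    """Map a LensNet score to a level-of-detail change for one entry:
    positive -> expand (lower LOD = more detail), negative -> collapse,
    zero -> keep. LOD0 cannot expand; max_lod cannot collapse further."""
    if score > 0:
        return max(lod - 1, 0)
    if score < 0:
        return min(lod + 1, max_lod)
    return lod
```

So a +3 on an LOD2 entry yields LOD1, while a -1 on an LOD1 entry yields LOD2, one level per refocus step.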

Why This Matters:

  • Relevance changes as conversation evolves
  • Something mentioned briefly 10k tokens ago might suddenly become crucial
  • System can “zoom in” on newly relevant regions
  • Or “zoom out” on distractors to save budget for important content

Data Flow Between Contexts

Complete Data Flow Diagram

┌────────────────────────────────────────────────────────────────────┐
│                         INFERENCE CYCLE                             │
└────────────────────────────────────────────────────────────────────┘

  User Input / Generated Tokens
         │
         ▼
  ┌─────────────┐
  │ Token Buffer│  (accumulate 32 tokens)
  └──────┬──────┘
         │ (every 32 tokens)
         ▼
  ┌─────────────┐
  │   GistNet   │  (compress 32→1)
  └──────┬──────┘
         │
         ▼
  ┌────────────────────────┐
  │   MegaContext Tree     │  (append LOD0 block + LOD1 gist)
  │  ┌────┐ ┌────┐ ┌────┐  │
  │  │LOD3│─│LOD2│─│LOD1│  │
  │  └────┘ └─┬──┘ └─┬──┘  │
  │           │      │     │
  │       ┌───┴──────┴──┐  │
  │       │ LOD0 Blocks │  │
  │       └─────────────┘  │
  └───────────┬────────────┘
              │ (read selective entries)
              ▼
  ┌──────────────────────────┐
  │     Working Context      │
  │ [LOD0][LOD1][LOD0][LOD2] │  ◄─────┐
  └───────────┬──────────────┘        │
              │                       │
              ▼                       │ (refocus)
  ┌──────────────────────────┐        │
  │         LensNet          │        │
  │    (score relevance)     │        │
  └───────────┬──────────────┘        │
              │                       │
              ▼                       │
  ┌──────────────────────────┐        │
  │     Focus Allocator      │────────┘
  │    (expand/collapse)     │
  └───────────┬──────────────┘
              │
              ▼
  ┌──────────────────────────┐
  │    Frozen Base LLM       │  (inference)
  │    (e.g., Llama)         │
  └───────────┬──────────────┘
              │
              ▼
    Next Token(s) ──► (loop back to token buffer)

Step-by-Step Data Flow

Phase 1: Token Ingestion

1. User types: "What did we discuss about machine learning?"
   └─> Buffer: ["What", "did", "we", "discuss", "about", ...]

2. Buffer fills to 32 tokens
   └─> GistNet input: 32 token embeddings [e₁, e₂, ..., e₃₂]

3. GistNet compresses
   └─> LOD1 gist: single embedding [g₁]

4. Write to MegaContext Tree
   ├─> LOD0 node: [e₁, e₂, ..., e₃₂] (32 embeddings)
   └─> LOD1 node: [g₁] (1 embedding, parent of LOD0)

5. Update tree metadata
   ├─> ΔNLL: compression loss metric
   ├─> Timestamps: token positions
   └─> Parent/child pointers

Phase 2: Working Context Assembly

1. LensNet reads current Working Context
   └─> Input: [WC₁, WC₂, ..., WCₙ] + [tail gists from MC Tree]

2. LensNet computes focus scores
   └─> Output: [score₁, score₂, ..., scoreₙ]
        └─> score > 0: expand (more detail)
        └─> score < 0: collapse (less detail)

3. Focus Allocator processes scores
   ├─> For each positive score:
   │   ├─> Fetch children from MegaContext Tree
   │   ├─> Replace the gist with its children (e.g., an LOD2 gist with 32 LOD1 gists, or an LOD1 gist with its 32-token LOD0 block)
   │   └─> Check budget: cost ≤ W_max?
   │
   └─> For each negative score:
       ├─> Find parent in MegaContext Tree
       ├─> Replace the children with their parent gist (e.g., a 32-token LOD0 block with its LOD1 gist)
       └─> Frees budget for other expansions

4. New Working Context assembled
   └─> [mix of LOD0, LOD1, LOD2, LOD3 entries]
        ├─> Contiguous in time (no gaps)
        └─> Within budget (total cost ≤ W_max)

Phase 3: Inference

1. Working Context fed to base LLM
   └─> Input: sequence of embeddings
        ├─> LOD0 entries: raw token embeddings
        └─> LOD1/LOD2/LOD3 entries: gist embeddings
        (LLM cannot distinguish - same embedding dimension)

2. LLM runs attention
   └─> Full O(n²) attention over Working Context only
        └─> n = total embedding count ≤ W_max (each LOD0 entry contributes 32 embeddings, each gist 1)

3. LLM generates next token
   └─> Output: new token embedding [e_new]

4. Token loops back to Phase 1
   └─> Added to buffer, eventually compressed to tree

Data Flow Properties

1. Unidirectional Write Path:

  • Tokens → MegaContext Tree (via GistNet)
  • Tree is append-only, never modified

2. Bidirectional Read Path:

  • MegaContext Tree → Working Context (fetch entries)
  • Working Context → LensNet (compute scores)
  • Scores → Focus Allocator → Updated Working Context

3. Isolation:

  • Base LLM never sees MegaContext Tree directly
  • Base LLM only operates on Working Context
  • GistNet never sees Working Context
  • LensNet never modifies MegaContext Tree

4. Cycle Time:

  • GistNet: O(32) tokens to trigger
  • LensNet: O(1) decode step (or every N steps)
  • Base LLM: O(1) token generation

Why This Architecture Enables System Properties

1. Unbounded Context Length

How: MegaContext Tree stores complete history on disk with logarithmic depth.

Math:

  • Tree depth = ⌈log₃₂(n)⌉
  • For 1M tokens: depth = 4 levels
  • For 1B tokens: depth = 6 levels
  • Storage: ~1.5n embeddings (linear)

Why Two Contexts Are Essential:

  • Cannot store 1B tokens in GPU (≈16TB of embeddings at 16KB each)
  • Cannot process 1B tokens with O(n²) attention (would take hours per token)
  • Disk storage is cheap and scales linearly
  • Working Context stays fixed size regardless of total history

2. Constant Compute Cost

How: Working Context has fixed budget W_max; base LLM complexity is O(W_max²).

Math:

  • Attention cost: O(W_max²) = O(1) for fixed W_max
  • Example: W_max = 32k tokens → a fixed attention cost per decode step (on the order of 10¹⁰ FLOPs at Llama scale)
  • Independent of total history length (could be 1M or 1B tokens)
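The constant-cost claim can be made concrete with a rough per-token FLOP count. The model shape below (4096-dim, 32 layers) is an illustrative Llama-7B-like assumption, and this counts only attention score and value math under KV caching, not MLPs or projections:

```python
def decode_attention_flops(ctx_len, d_model=4096, n_layers=32):
    """Rough per-token attention FLOPs with a KV cache: per layer,
    ~2*ctx_len*d_model multiply-adds for the QK^T scores and the same
    again for the attention-weighted value sum."""
    per_layer = 2 * (2 * ctx_len * d_model)
    return n_layers * per_layer
```

The argument is the Working Context length, never the history length, so the result is identical whether the MegaContext Tree holds 1M or 1B tokens.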

Why Two Contexts Are Essential:

  • Base LLM only sees Working Context (W_max tokens)
  • MegaContext Tree is outside the inference path
  • No matter how much history accumulates, inference cost stays constant

3. Dynamic Focus/Defocus

How: LensNet scores relevance; Focus Allocator swaps detail levels.

Example:

T=0:  "My cat's name is Fluffy. [9500 tokens about other topics]"
      Working Context: [LOD3 gist] (low detail)

T=9500: "What was my cat's name?"
      LensNet detects query, scores LOD3 gist highly
      Focus Allocator: LOD3 → LOD2 → LOD1 → LOD0
      Working Context: [LOD0 tokens: "My cat's name is Fluffy"]

Why Two Contexts Are Essential:

  • MegaContext Tree preserves all detail at all levels (lossless traversal)
  • Working Context can swap between levels without re-encoding
  • One-way compression (e.g., RAG summaries) cannot “zoom back in”
  • Static context windows cannot adjust detail post-hoc

4. Lossy-Yet-Restorable Compression

How: Gists compress 32→1 (lossy) but original tokens remain in tree (restorable).

Compression Cascade:

32 tokens → 1 LOD1 gist (97% compression, small ΔNLL)
32 LOD1 gists → 1 LOD2 gist (97% compression, medium ΔNLL)
32 LOD2 gists → 1 LOD3 gist (97% compression, higher ΔNLL)

Restoration:

Need more detail? Traverse tree:
LOD3 gist → fetch 32 LOD2 children → fetch 32×32 LOD1 children → fetch 32×32×32 LOD0 tokens
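A recursive descent over stored nodes recovers the original tokens. The dict-based node layout here is an illustrative sketch, not the actual storage format:

```python
def restore_tokens(node):
    """Expand a tree node back to the raw LOD0 tokens it covers."""
    if node["level"] == 0:
        return list(node["tokens"])      # leaf: the original token block
    out = []
    for child in node["children"]:       # depth-first, left to right
        out.extend(restore_tokens(child))
    return out
```

Because every level of the tree is retained, this traversal is always available: compression is lossy in the Working Context but never destroys the underlying data.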

Why Two Contexts Are Essential:

  • MegaContext Tree stores both compressed (gists) and original (tokens)
  • Working Context can dynamically choose which representation to use
  • Trade-off: budget (use gist) vs. fidelity (use tokens)
  • Not possible with single context (must choose one representation)

5. Sub-Linear Memory Scaling

How: MegaContext Tree in cheap disk/RAM; Working Context in expensive GPU RAM.

Memory Breakdown:

MegaContext Tree:  ~1.5n embeddings × 16KB each ≈ 24n KB (disk/RAM)
Working Context:   W_max embeddings × 16KB each = constant (GPU)
LensNet + GistNet: Small models (~10-100M params = 40-400MB GPU)

Example (1B tokens):
- MC Tree: 1.5B × 16KB ≈ 24TB (disk-backed, streamed; never GPU-resident)
- WC: 32k × 16KB = 512MB (GPU) ✓ affordable
- Total GPU: ~1GB (leaves 23GB of a 24GB card for the base LLM)
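The split can be computed directly as a back-of-envelope sketch (assuming 4096-dim fp32 embeddings, as elsewhere in this document; the function name is illustrative):

```python
EMB_BYTES = 4096 * 4  # one 4096-dim fp32 embedding = 16 KB

def storage_split(n_tokens, w_max):
    """Bytes for the disk-backed tree (~1.5 embeddings per token, per
    the overhead estimate above) vs. the GPU-resident Working Context
    (exactly w_max embeddings)."""
    tree_bytes = int(1.5 * n_tokens) * EMB_BYTES
    wc_bytes = w_max * EMB_BYTES
    return tree_bytes, wc_bytes
```

Only `wc_bytes` scales with the budget; `tree_bytes` grows linearly with history but lives entirely off-GPU.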

Why Two Contexts Are Essential:

  • GPU memory is 10-100× more expensive than RAM
  • Cannot afford to keep all history in GPU
  • Disk/RAM storage scales to TBs for pennies
  • Working Context uses GPU efficiently (only what’s needed)

6. No Retraining of Base Model

How: Base LLM remains frozen; operates on same embedding space.

Architecture:

Base LLM (frozen)
     ↑
     │ (same embeddings)
     │
Working Context ← mix of tokens + gists
                      ↑
                    GistNet (learned)
                      ↑
                  Raw Tokens

Why Two Contexts Are Essential:

  • GistNet learns to produce embeddings that “look like” base model tokens
  • Base LLM cannot tell the difference between LOD0 tokens and LOD1/LOD2/LOD3 gists
  • Working Context is the “adapter layer” - provides abstraction
  • MegaContext Tree is GistNet’s domain - invisible to base model
  • Separation allows independent optimization of each component

7. Multi-Resolution Access

How: Tree structure provides access at any granularity (LOD0, LOD1, LOD2, LOD3).

Access Patterns:

Coarse scan:  Read LOD3 gists (1 per 32k tokens) → fast overview
Medium scan:  Read LOD2 gists (1 per 1k tokens) → section-level
Fine scan:    Read LOD1 gists (1 per 32 tokens) → paragraph-level
Full detail:  Read LOD0 tokens (all 32 per block) → word-level

Example Use Case:

Query: "Find all discussions about Python optimization"

1. Scan all LOD3 gists (~30,500 in a 1B-token history)
2. Identify 10 relevant LOD3 regions
3. Scan their LOD2 children (10 × 32 = 320 gists)
4. Identify 5 most relevant LOD2 regions
5. Expand to LOD0 for detailed reading (5 × 1,024 tokens = 5,120 tokens)

Total cost: ~30,500 + 320 + 5,120 ≈ 36,000 tokens (vs. 1B tokens for a full scan)
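This coarse-to-fine descent can be sketched as a simple beam search. The dict node layout and the `relevance` callback (a LensNet-like scorer) are illustrative stand-ins:

```python
def hierarchical_search(roots, relevance, keep):
    """Keep the `keep` most relevant nodes at each level and descend
    into their children until only raw LOD0 blocks remain."""
    frontier = list(roots)
    while frontier and frontier[0]["level"] > 0:
        ranked = sorted(frontier, key=relevance, reverse=True)[:keep]
        frontier = [c for node in ranked for c in node["children"]]
    return frontier  # LOD0 blocks to read at full detail
```

Work per level is bounded by `keep × 32` children, which is what keeps the total scan cost tiny relative to the full history.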

Why Two Contexts Are Essential:

  • MegaContext Tree provides multi-resolution storage
  • Working Context provides multi-resolution representation
  • Can query coarsely, then zoom in selectively
  • Not possible with flat context or RAG (fixed retrieval granularity)

Core Components

1. MegaContext Tree

Purpose: Persistent, hierarchical storage of complete conversation history.

Key Responsibilities:

  • Store all tokens (LOD0) and all gists (LOD1, LOD2, LOD3, …)
  • Maintain parent-child relationships
  • Support random access at any level
  • Track metadata (ΔNLL, timestamps, etc.)
  • Persist across sessions

See: MegaContext Tree, Storage Format, Tree Operations

2. Working Context

Purpose: Active, budget-constrained window for LLM inference.

Key Responsibilities:

  • Maintain contiguous temporal coverage
  • Mix tokens and gists optimally
  • Stay within token budget (W_max)
  • Provide embedding sequence to base LLM
  • Update continuously via refocusing

See: Working Context, Working Context Assembly, Working Context Refocusing

3. GistNet

Purpose: Learned compression model that builds the tree hierarchy.

Key Responsibilities:

  • Compress 32 tokens → 1 gist (LOD0 → LOD1)
  • Compress 32 gists → 1 gist (LOD1 → LOD2, LOD2 → LOD3, …)
  • Minimize ΔNLL (compression loss)
  • Align gists with base model embedding space
  • Train via self-supervised learning

Architecture:

  • Input: 32 embeddings (4096-dim each)
  • Output: 1 embedding (4096-dim)
  • Model: Transformer encoder (6-12 layers, 512-2048 hidden dim) [2]
  • Training: Minimize perplexity of next-token prediction [21]

See: GistNet, GistNet Architecture Details, GistNet Training

4. LensNet

Purpose: Learned scoring model that determines what to focus on.

Key Responsibilities:

  • Score each Working Context entry for relevance
  • Predict which entries should be expanded/collapsed
  • Adapt to task dynamics (queries, continuations, etc.)
  • Balance exploration vs. exploitation
  • Train via reinforcement learning or task supervision

Architecture:

  • Input: Working Context + tail gists (context representation)
  • Output: Focus scores (one per entry)
  • Model: Transformer encoder (4-8 layers, 256-1024 hidden dim)
  • Training: Maximize task reward (e.g., downstream NLL)

See: LensNet, LensNet Scoring, LensNet Training

5. Focus Allocator

Purpose: Deterministic algorithm that applies LensNet scores to refocus Working Context.

Key Responsibilities:

  • Enforce contiguity (no temporal gaps)
  • Enforce budget (total cost ≤ W_max)
  • Expand high-scoring entries (fetch children)
  • Collapse low-scoring entries (replace with parent)
  • Handle edge cases (boundary conditions, buffer limits)

Algorithm:

def focus_allocator(working_context, scores, budget):
    # Collapse low-scoring entries first so their savings can fund expansions
    collapses = [(i, score) for i, score in enumerate(scores) if score < 0]
    collapses.sort(key=lambda x: x[1])  # most negative first
    for i, score in collapses:
        if should_collapse(working_context[i]):
            budget += collapse_savings(working_context[i])
            working_context[i] = collapse(working_context[i])

    # Greedily expand the highest-scoring entries while budget allows
    expansions = [(i, score) for i, score in enumerate(scores) if score > 0]
    expansions.sort(key=lambda x: x[1], reverse=True)
    for i, score in expansions:
        cost = expansion_cost(working_context[i])  # cost measured before expanding
        if cost <= budget and can_expand(working_context[i]):
            working_context[i] = expand(working_context[i])
            budget -= cost

    return working_context

See: Focus Allocator, Focus Allocator Strategies

6. Base LLM

Purpose: Frozen language model that performs inference.

Key Characteristics:

  • Unchanged: No modifications to architecture or weights
  • Embeddings: Operates on same embedding space as training
  • Input: Working Context (mix of tokens and gists)
  • Output: Next token probabilities
  • Oblivious: Cannot distinguish tokens from gists

Examples: Llama, GPT, Claude (frozen, no finetuning)


Runtime Lifecycle

System Initialization

1. Load base LLM (frozen weights)
2. Load GistNet (pre-trained weights)
3. Load LensNet (pre-trained weights)
4. Initialize MegaContext Tree (empty or from checkpoint)
5. Initialize Working Context (empty)
6. Ready for first token

Token Processing Loop

LOOP (for each new token):

    1. TOKEN ARRIVAL
       ├─> User input or model generation
       └─> Add to buffer

    2. TREE UPDATE (every 32 tokens)
       ├─> GistNet: compress 32 tokens → 1 LOD1 gist
       ├─> Write LOD0 block + LOD1 gist to tree
       └─> Recursively compress LOD1→LOD2, LOD2→LOD3, etc.

    3. REFOCUS (every decode step or every N steps)
       ├─> LensNet: score Working Context entries
       ├─> Focus Allocator: apply scores
       │   ├─> Expand high-score entries (fetch children)
       │   └─> Collapse low-score entries (replace with parent)
       └─> Rebuild Working Context with new entries

    4. INFERENCE
       ├─> Feed Working Context to base LLM
       ├─> LLM generates next token
       └─> Loop back to step 1

END LOOP
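The loop above reduces to a small skeleton. Here `tree`, `refocus`, and `llm` are caller-supplied stand-ins for GistNet ingestion, LensNet + Focus Allocator, and the frozen base model; the function name is illustrative:

```python
def token_loop(tokens, tree, refocus, llm, block=32):
    """Skeleton of the per-token runtime lifecycle."""
    buffer, wc = [], []
    for tok in tokens:
        buffer.append(tok)                 # 1. token arrival
        if len(buffer) == block:           # 2. tree update (every 32 tokens)
            tree.append(list(buffer))
            buffer.clear()
        wc = refocus(wc, tree)             # 3. refocus the Working Context
        _ = llm(wc)                        # 4. inference sees only the WC
    return wc
```

Note that the base model is called with `wc` alone, so the tree can grow without bound while the inference input stays bounded.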

Example Execution Trace

T=0: User: "Tell me about Paris"
     └─> Buffer: ["Tell", "me", "about", "Paris"]
     └─> Working Context: [LOD0: "Tell", "me", "about", "Paris"]
     └─> LLM: "Paris is the capital..."

T=32: Buffer full → GistNet compresses
     └─> MC Tree: [LOD0: 32 tokens], [LOD1: gist_1]
     └─> Working Context: [LOD0: recent 32 tokens]

T=1000: User: "What about London?"
     └─> LensNet scores Paris discussion (low relevance)
     └─> Focus Allocator: collapse LOD0 → LOD1
     └─> Working Context: [LOD1: gist_Paris], [LOD0: recent tokens]
     └─> More budget available for new London discussion

T=1050: User: "Compare Paris and London"
     └─> LensNet scores Paris gist (high relevance)
     └─> Focus Allocator: expand LOD1 → LOD0
     └─> Working Context: [LOD0: Paris details], [LOD0: London details]
     └─> LLM can compare with full context

Key Terms & Invariants

Key Terms

  • LOD0: Raw token blocks (32 tokens each)
  • LOD1/LOD2/LOD3: Gist levels (each compresses 32 children)
  • Gist: Single embedding that summarizes 32 child embeddings
  • Entry: One item in Working Context (can be LOD0, LOD1, LOD2, or LOD3)
  • Cost: Working Context footprint in token-equivalents (LOD0=32, LOD1=1, LOD2=1, LOD3=1)
  • Budget (W_max): Maximum token cost for Working Context
  • ΔNLL: Compression loss (increase in perplexity due to gisting)
  • Focus: Expand entry to higher detail (replace gist with children)
  • Defocus: Collapse entry to lower detail (replace children with gist)
  • Contiguity: Working Context covers time without gaps
  • Refocus: Update Working Context by applying focus/defocus operations

Core Invariants

MegaContext Tree Invariants:

  1. Append-only: Nodes are never deleted or modified
  2. Complete: All LOD0 blocks are stored (no truncation)
  3. Hierarchical: Each non-leaf node has ≤32 children
  4. Aligned: LOD0 blocks start at multiples of 32
  5. Redundant: Both compressed (gists) and original (tokens) are stored

Working Context Invariants:

  1. Contiguous: Covers [start_pos, end_pos] without gaps
  2. Budgeted: ∑(entry_cost) ≤ W_max
  3. Mixed: Entries can be at any level (LOD0, LOD1, LOD2, LOD3)
  4. Temporal: Left-to-right = past-to-present
  5. Aligned: Each entry covers exactly one tree node’s time span
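Invariants 1 and 2 can be checked mechanically. In this sketch, entries are illustrative `(start, end, lod)` spans with cost 32 for LOD0 and 1 for gists, per the cost model above:

```python
def check_working_context(entries, w_max):
    """Assert contiguity and budget for (start, end, lod) entries;
    returns the total token-equivalent cost."""
    total = 0
    for idx, (start, end, lod) in enumerate(entries):
        total += 32 if lod == 0 else 1
        if idx + 1 < len(entries):
            # Each span must end exactly where the next begins (no gaps).
            assert end == entries[idx + 1][0], "contiguity violated"
    assert total <= w_max, "budget exceeded"
    return total
```

A gap between spans or a total above W_max trips the corresponding assertion, mirroring the Focus Allocator's hard constraints.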

System Invariants:

  1. Isolation: Base LLM never accesses MegaContext Tree directly
  2. Constant Compute: Inference cost = O(W_max²), independent of history
  3. Lossless Paths: Can always traverse tree to restore original tokens
  4. Embedding Consistency: Gists are in same embedding space as tokens
  5. No Retraining: Base LLM weights are frozen, never updated

See: Invariants for complete details


Document Roadmap

This document is the canonical reference for understanding MegaContext’s two-context architecture. For deeper dives into specific aspects:

  • Architecture Deep Dives
  • Component Details
  • Operations
  • Training & Optimization
  • Comparisons & Context


Summary

The two-context architecture is the foundation of MegaContext’s ability to provide unbounded memory at constant compute:

  1. MegaContext Tree stores the complete history hierarchically on disk
  2. Working Context provides a fixed-size, dynamically-refocused view for inference
  3. GistNet builds the tree by compressing tokens into gists
  4. LensNet + Focus Allocator adapts the Working Context to changing relevance
  5. Base LLM operates unchanged on the Working Context

This separation enables:

  • ✓ Unbounded context length (millions to billions of tokens)
  • ✓ Constant compute cost (O(W_max²) regardless of history)
  • ✓ Dynamic focus/defocus (zoom in on relevant regions)
  • ✓ Lossy-yet-restorable compression (gists + original tokens)
  • ✓ Sub-linear memory scaling (disk for tree, GPU for working set)
  • ✓ No retraining of base model (frozen weights)
  • ✓ Multi-resolution access (coarse scan → fine detail)

The key insight: By separating long-term storage from active processing, MegaContext can optimize each independently while maintaining a coherent view of the entire interaction history. This is not possible with a single-context architecture.


References

  1. MegaTexture (Carmack, 2007) — Analysis — Virtual texturing system that inspired the core hierarchical streaming architecture
  2. Perceiver (Jaegle et al., 2021) — Analysis — Latent cross-attention bottleneck architecture
  3. Perceiver IO (Jaegle et al., 2021) — Analysis — Query-based decoding for arbitrary structured outputs
  4. Gist Tokens (Mu et al., 2023) — Analysis — Learned prompt compression via attention masking
  5. Compressive Transformer (Rae et al., 2019) — Analysis — Long-term compressed memory for transformers
  6. RAG (Lewis et al., 2020) — Analysis — Retrieval-augmented generation baseline
  7. Memorizing Transformers (Wu et al., 2022) — Analysis — kNN-augmented approximate retrieval
  8. Transformer-XL (Dai et al., 2019) — Analysis — Segment-level recurrence and relative positional encoding
  9. RoPE (Su et al., 2021) — Analysis — Rotary position embeddings used throughout MegaContext
  10. Flash Attention (Dao et al., 2022) — Analysis — IO-aware exact attention algorithm
  11. Knowledge Distillation (Hinton et al., 2015) — Analysis — Teacher-student framework for GistNet training

See Related Work for the complete bibliography of all research papers referenced throughout the documentation.


This document is the definitive guide to MegaContext’s two-context architecture. All other documentation should reference this page for architectural fundamentals.