Architecture Details: The Two-Context System

MegaContext virtualizes context by pairing a disk-backed gist tree called the MegaContext Tree with a budgeted working context governed by GistNet, LensNet, and the Focus Allocator. This two-context architecture (1) separates concerns between long-term storage and active processing.

Implementation note: The current notebook prototype lives under src/megacontext/ and implements only the minimal pieces (gistnet, basic runtime wrappers). The full production stack will move into the nanochat fork per the MegaContext PRD Index; reference those PRDs when you need the plan-of-record contracts.

MegaContext separates a model’s context into a MegaContext Tree (stored on disk) and a Working Context (on GPU). A learned GistNet model builds the MegaContext Tree as a hierarchy of gists [5, 7]. The Working Context compresses the MegaContext Tree into a fixed-size mix of tokens and gists that is used for inference.

To adapt the level of detail dynamically, a learned LensNet model [2, 3] continuously and incrementally refocuses the MegaContext Tree onto the Working Context, giving the model effectively infinite memory at constant compute with automatic context management.

Dual contexts: MegaContext Tree vs. Working Context.
Compression: GistNet builds hierarchical gists aligned with base embeddings.
Focus/Defocus: LensNet scores working entries; Focus Allocator adjusts detail. Advanced focus layouts (multi-head, staging) are outlined in Multi-headed Focus.
See also: Runtime Loop for execution, POC Architecture for interfaces.

Why Two Contexts?
The Two-Context Architecture Explained
Detailed Context Comparison
How the Contexts Interact
Data Flow Between Contexts
Why This Architecture Enables System Properties
Core Components
Runtime Lifecycle
Key Terms & Invariants
Document Roadmap

Why Two Contexts?

The Fundamental Problem

Large language models face an inherent trade-off between memory capacity and computational efficiency:

1. Fixed Context Windows: Traditional LLMs have a fixed context window (e.g., 4k, 8k, 32k tokens) [12]. Once you exceed this limit, you must either:
   - Truncate old information (losing history)
   - Use sliding windows (losing distant context) [12]
   - Compress everything equally (losing important details)

2. Uniform Attention Cost: Standard transformer attention has O(n²) complexity, where n is the context length. Every token attends to every other token with equal computational cost, regardless of relevance [17].

3. Static Representation: Once text is processed, its representation is fixed. You cannot dynamically adjust the level of detail based on changing relevance as the conversation evolves [9, 11].

The Two-Context Solution

MegaContext solves these problems by separating concerns into two complementary contexts:

MegaContext Tree: The “Hard Drive” of Memory [1]

Purpose: Store the complete history indefinitely
Storage: Disk-backed (RAM for POC), hierarchical structure [1]
Capacity: Effectively unlimited (millions to billions of tokens)
Access Pattern: Random access, multi-resolution
Cost Model: Storage cost only, no computation per token

Working Context: The “RAM” of Active Memory

Purpose: Provide the relevant subset for immediate inference [9]
Storage: GPU memory, flat sequence [14]
Capacity: Fixed budget (8k-32k tokens)
Access Pattern: Sequential processing (left-to-right)
Cost Model: Full attention cost during inference [17]

Why This Separation Is Necessary

1. Scalability: You cannot fit millions of tokens in GPU memory or process them with O(n²) attention [17] in real-time.

2. Efficiency: Most historical context is not relevant for the current task. Processing everything equally is wasteful [9].

3. Adaptability: Relevance changes over time. Something unimportant earlier may become critical later. The system needs to dynamically refocus.

4. Practicality: Consumer-grade applications at 100M+ context lengths require sub-linear memory and compute scaling.

The Key Insight

The two-context architecture recognizes that there are fundamentally different requirements for:

Long-term storage (complete, persistent, multi-resolution)
Active processing (focused, fixed-size, high-detail where needed)

By separating these concerns, MegaContext can optimize each independently while maintaining a coherent view of the entire interaction history.

The Two-Context Architecture Explained

How They Work Together

GistNet compresses incoming tokens into the MegaContext Tree hierarchy
LensNet + Focus Allocator selects which parts of the tree to load into Working Context
Base LLM operates only on the Working Context (remains frozen, unmodified)
As new tokens are generated, they flow back into the MegaContext Tree
The cycle repeats, continuously refocusing the Working Context

Detailed Context Comparison

MegaContext Tree vs. Working Context

Aspect	MegaContext Tree	Working Context
Purpose	Long-term storage of complete history	Active processing window for inference
Storage Location	Disk (RAM in POC)	GPU memory
Capacity	Effectively unlimited (millions-billions of tokens)	Fixed budget: 8k-32k tokens
Structure	Hierarchical tree (LOD0→LOD1→LOD2→LOD3…)	Flat, contiguous sequence
Content	All tokens + all gists at all levels	Mixed: selected tokens and gists
Granularity	Multi-resolution (32:1 compression per level)	Variable per entry (LOD0, LOD1, LOD2, etc.)
Access Pattern	Random access to any node	Sequential processing (left-to-right)
Mutability	Append-only (grows monotonically)	Dynamic (refocused continuously)
Temporal Coverage	Complete: every moment since conversation start	Selective: contiguous but variable detail
Computational Cost	No inference cost (storage only)	Full attention cost during decode
Update Frequency	Block-aligned (every 32 tokens)	Every decode step (via refocus)
Persistence	Permanent (survives across sessions)	Ephemeral (rebuilt each step)
Visibility to Base LLM	Invisible (never seen directly)	Fully visible (only thing LLM sees)
Data Format	Tree nodes with parent/child pointers	Embedding sequence (4096-dim vectors)
Indexing	Tree coordinates (level, position)	Linear array (0 to W_max)
Compression Method	Hierarchical gisting via GistNet	No compression (but entries may be gists)
Detail Control	Implicit (by level)	Explicit (selected by LensNet/FA)
Memory Overhead	~1.5-2x of raw tokens (tree structure)	Exactly W_max embeddings
Latency	Disk I/O (negligible for RAM)	Zero (already in GPU)
Parallelism	Can build gists in parallel	Sequential attention
Failure Mode	Disk full (rare at GB scales)	Budget exceeded (handled by FA)
Optimization Target	Minimize ΔNLL (compression loss)	Maximize task performance

Detailed Breakdown

MegaContext Tree Structure

The MegaContext Tree is a hierarchical compression of the complete history:

Level 3:  ●─────────●─────────●        (each covers 32,768 tokens)
          │         │         │
Level 2:  ●──●──●──●──●──●──●──●      (each covers 1,024 tokens)
          │  │  │  │  │  │  │  │
Level 1:  ●●●●●●●●●●●●●●●●●●●●●●●●    (each covers 32 tokens)
          │││││││││││││││││││││││
Level 0:  [32][32][32][32][32][32]... (raw token blocks)

LOD0: Raw token blocks (32 tokens each)
LOD1: Each gist summarizes one LOD0 block (32 tokens → 1 gist)
LOD2: Each gist summarizes 32 LOD1 gists (1,024 tokens → 1 gist)
LOD3: Each gist summarizes 32 LOD2 gists (32,768 tokens → 1 gist)

Key Properties:

Each node has at most 32 children
Compression ratio: 32:1 per level
Tree depth grows logarithmically: depth = ⌈log₃₂(n)⌉
Total storage: ~1.5-2× raw tokens (due to redundancy)
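
The level arithmetic above can be checked with a short sketch. It assumes a fixed fan-out of 32, with an LOD0 block and an LOD1 gist each covering 32 tokens and every higher level covering 32 times more, as in the diagram. The names BLOCK, span_tokens, and tree_depth are hypothetical and are not taken from the MegaContext codebase.

```python
BLOCK = 32  # tokens per LOD0 block; also the fan-out of every gist level


def span_tokens(level: int) -> int:
    """Tokens covered by one node at `level`.

    An LOD0 block and the LOD1 gist that summarizes it both cover 32 tokens;
    each level above that covers 32 times more.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    return BLOCK ** max(level, 1)


def tree_depth(n_tokens: int) -> int:
    """Smallest d with 32**d >= n_tokens, i.e. ceil(log32(n)), using integers only."""
    depth, capacity = 0, 1
    while capacity < n_tokens:
        capacity *= BLOCK
        depth += 1
    return depth


for level in range(4):
    print(f"LOD{level}: {span_tokens(level):,} tokens per node")
print("depth for 1M tokens:", tree_depth(1_000_000))
print("depth for 1B tokens:", tree_depth(1_000_000_000))
```

Integer arithmetic is used for the depth so that exact powers of 32 are not affected by floating-point rounding in a logarithm.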

Working Context Structure

The Working Context is a contiguous sequence mixing different levels of detail:

Position: [0        ][1        ][2        ][3        ][4        ][5        ][6        ][7        ][8        ]
Content:  [LOD2     ][LOD2     ][LOD1     ][LOD1     ][LOD0     ][LOD1     ][LOD0     ][LOD0     ][LOD0     ]
Cost:     [1        ][1        ][1        ][1        ][32       ][1        ][32       ][32       ][32       ]
Timeline: [0-1023   ][1024-2047][2048-2079][2080-2111][2112-2143][2144-2175][2176-2207][2208-2239][2240-2271]
          |------------- Distant Context ------------||------- Mid --------||-------- Recent Context -------|
          (low detail)                                (mixed detail)        (high detail)

Key Properties:

Each entry covers exactly one time interval (no gaps, no overlaps)
Entries can be at different levels (LOD0, LOD1, LOD2, etc.)
Total token cost ≤ W_max (enforced by Focus Allocator)
Temporally contiguous (left-to-right = past-to-present)
Recent content typically at higher detail (LOD0)
Distant content typically at lower detail (LOD2, LOD3)
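
One way to make these properties concrete is a small validator. This is a minimal sketch under stated assumptions: entries are identified by a hypothetical (level, start) pair, an LOD0 entry is one 32-token block costing 32 slots, and any gist costs 1 slot. Entry and check_working_context are illustrative names, not the project's API.

```python
from dataclasses import dataclass

BLOCK = 32  # tokens per LOD0 block


@dataclass(frozen=True)
class Entry:
    level: int  # 0 = raw token block, >= 1 = gist
    start: int  # first token position covered (inclusive)

    @property
    def span(self) -> int:
        # LOD0 blocks and LOD1 gists cover 32 tokens; each level above covers 32x more.
        return BLOCK ** max(self.level, 1)

    @property
    def end(self) -> int:
        return self.start + self.span  # exclusive

    @property
    def cost(self) -> int:
        # Embedding slots used in the Working Context.
        return BLOCK if self.level == 0 else 1


def check_working_context(entries, w_max: int) -> int:
    """Raise ValueError on a violated invariant; otherwise return the total cost."""
    for prev, cur in zip(entries, entries[1:]):
        if cur.start != prev.end:
            raise ValueError(f"gap or overlap between {prev} and {cur}")
    for entry in entries:
        if entry.start % entry.span != 0:
            raise ValueError(f"{entry} is not aligned to a tree node")
    total = sum(entry.cost for entry in entries)
    if total > w_max:
        raise ValueError(f"cost {total} exceeds budget {w_max}")
    return total


wc = [Entry(2, 0), Entry(1, 1024), Entry(0, 1056), Entry(0, 1088)]
print("total cost:", check_working_context(wc, w_max=8192))
```

The check treats contiguity, node alignment, and the budget as hard errors, which mirrors how the properties are stated here as invariants rather than preferences.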

How the Contexts Interact

Three Types of Operations

1. Write: Tokens → MegaContext Tree (via GistNet)

New tokens (from user input or model generation) are written to the MegaContext Tree:

Incoming tokens → LOD0 buffer (32 tokens) → GistNet → LOD1 gist
                                              ↓
                  LOD1 buffer (32 gists) → GistNet → LOD2 gist
                                              ↓
                  LOD2 buffer (32 gists) → GistNet → LOD3 gist

Process:

Buffer incoming tokens until 32 are collected
GistNet compresses the 32-token block into a single LOD1 gist
Store both the LOD0 block and LOD1 gist in the tree
When 32 LOD1 gists accumulate, compress to LOD2
Repeat hierarchically up the tree

Triggering:

Happens automatically as tokens arrive
Block-aligned (every 32 tokens)
Independent of Working Context state
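
The write path can be sketched as a cascade of buffers. This is an illustration only: mean_gist is a stand-in that averages child embeddings and is not how the learned GistNet works, and TreeWriter is a hypothetical in-memory structure rather than the disk-backed tree.

```python
BLOCK = 32  # tokens per block and gists per parent


def mean_gist(vectors):
    """Stand-in for GistNet: element-wise mean of the child embeddings."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


class TreeWriter:
    def __init__(self, gist_fn=mean_gist):
        self.gist_fn = gist_fn
        self.levels = [[]]  # levels[0]: LOD0 blocks, levels[k]: LOD-k gists
        self.pending = []   # token embeddings that do not yet fill a block

    def append(self, embedding):
        self.pending.append(embedding)
        if len(self.pending) == BLOCK:
            block, self.pending = self.pending, []
            self.levels[0].append(block)             # store the raw block
            self._add_gist(1, self.gist_fn(block))   # and its LOD1 gist

    def _add_gist(self, level, gist):
        if level == len(self.levels):
            self.levels.append([])
        self.levels[level].append(gist)
        # Every 32nd gist completes a group, which is summarized one level up.
        if len(self.levels[level]) % BLOCK == 0:
            group = self.levels[level][-BLOCK:]
            self._add_gist(level + 1, self.gist_fn(group))


writer = TreeWriter()
for t in range(BLOCK * BLOCK):       # 1,024 tokens
    writer.append([float(t), 1.0])   # toy 2-dimensional embeddings
print([len(level) for level in writer.levels])
```

Nothing in the writer ever removes or rewrites a stored node, which matches the append-only behavior described for the tree.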

2. Read: MegaContext Tree → Working Context (via LensNet + Focus Allocator)

The Working Context is assembled by selecting entries from the MegaContext Tree:

MegaContext Tree (select nodes) → Working Context Assembly → [LOD0][LOD1][LOD0][LOD2]...
                                         ↑
                              LensNet + Focus Allocator
                              (decides what to include)

Process:

LensNet scores each current Working Context entry for “focus value”
- Positive score = expand to higher detail
- Negative score = collapse to lower detail
Focus Allocator applies scores while maintaining invariants:
- Contiguity: no temporal gaps
- Budget: total cost ≤ W_max
- Block-alignment: changes respect 32-token boundaries
Requested entries are fetched from MegaContext Tree
Working Context is rebuilt with new mix of detail levels

Triggering:

Every decode step (or every N steps)
Before each LLM inference pass
Adaptive based on LensNet scores

3. Update: Refocusing the Working Context

The Working Context is continuously updated to reflect changing relevance:

Old Working Context → LensNet → Focus Scores → Focus Allocator → New Working Context
         ↑                                                              ↓
         └──────────────────────── (feed to LLM) ────────────────────┘

Example Refocus Cycle:

Step T:   [LOD0][LOD0][LOD1][LOD1][LOD2][LOD0][LOD0]  (current WC)
          ↓
LensNet:  [ 0  ][ 0  ][ +2 ][ -2 ][ +3 ][ 0  ][ 0  ]  (focus scores)
          ↓
FA:       keep keep expand collapse expand keep keep
          ↓
Step T+1: [LOD0][LOD0][LOD0][LOD2][LOD1][LOD0][LOD0]  (updated WC)
                      ^^^^^^^^^^^^^^^^^^
                       (detail changed)

Why This Matters:

Relevance changes as conversation evolves
Something mentioned briefly 10k tokens ago might suddenly become crucial
System can “zoom in” on newly relevant regions
Or “zoom out” on distractors to save budget for important content

Data Flow Between Contexts

Complete Data Flow Diagram

┌────────────────────────────────────────────────────────────────────┐
│                         INFERENCE CYCLE                             │
└────────────────────────────────────────────────────────────────────┘

  User Input / Generated Tokens
         │
         ▼
  ┌─────────────┐
  │ Token Buffer│  (accumulate 32 tokens)
  └──────┬──────┘
         │ (every 32 tokens)
         ▼
  ┌─────────────┐
  │   GistNet   │  (compress 32→1)
  └──────┬──────┘
         │
         ▼
  ┌────────────────────────────┐
  │  MegaContext Tree          │  (append LOD0 block + LOD1 gist)
  │  ┌────┐ ┌────┐ ┌────┐      │
  │  │LOD3│─│LOD2│─│LOD1│      │
  │  └────┘ └─┬──┘ └─┬──┘      │
  │           │      │         │
  │         ┌─┴──────┴──┐      │
  │         │LOD0 Blocks│      │
  │         └───────────┘      │
  └──────────┬─────────────────┘
             │ (read selective entries)
             ▼
  ┌────────────────────────────┐
  │  Working Context           │
  │  [LOD0][LOD1][LOD0][LOD2]  │  ◄─────┐
  └──────────┬─────────────────┘        │
             │                          │
             ▼                          │ (refocus)
  ┌────────────────────────────┐        │
  │      LensNet               │        │
  │  (score relevance)         │        │
  └──────────┬─────────────────┘        │
             │                          │
             ▼                          │
  ┌────────────────────────────┐        │
  │  Focus Allocator           │────────┘
  │ (expand/collapse)          │
  └──────────┬─────────────────┘
             │
             ▼
  ┌────────────────────────────┐
  │   Frozen Base LLM          │  (inference)
  │   (e.g., Llama)            │
  └──────────┬─────────────────┘
             │
             ▼
    Next Token(s) ─────┘ (loop back to buffer)

Step-by-Step Data Flow

Phase 1: Token Ingestion

1. User types: "What did we discuss about machine learning?"
   └─> Buffer: ["What", "did", "we", "discuss", "about", ...]

2. Buffer fills to 32 tokens
   └─> GistNet input: 32 token embeddings [e₁, e₂, ..., e₃₂]

3. GistNet compresses
   └─> LOD1 gist: single embedding [g₁]

4. Write to MegaContext Tree
   ├─> LOD0 node: [e₁, e₂, ..., e₃₂] (32 embeddings)
   └─> LOD1 node: [g₁] (1 embedding, parent of LOD0)

5. Update tree metadata
   ├─> ΔNLL: compression loss metric
   ├─> Timestamps: token positions
   └─> Parent/child pointers

Phase 2: Working Context Assembly

1. LensNet reads current Working Context
   └─> Input: [WC₁, WC₂, ..., WCₙ] + [tail gists from MC Tree]

2. LensNet computes focus scores
   └─> Output: [score₁, score₂, ..., scoreₙ]
        └─> score > 0: expand (more detail)
        └─> score < 0: collapse (less detail)

3. Focus Allocator processes scores
   ├─> For each positive score:
   │   ├─> Fetch children from MegaContext Tree
   │   ├─> Replace LOD1 gist with its 32 LOD0 tokens
   │   └─> Check budget: cost ≤ W_max?
   │
   └─> For each negative score:
       ├─> Find parent in MegaContext Tree
       ├─> Replace the 32 LOD0 tokens with 1 LOD1 gist
       └─> Frees budget for other expansions

4. New Working Context assembled
   └─> [mix of LOD0, LOD1, LOD2, LOD3 entries]
        ├─> Contiguous in time (no gaps)
        └─> Within budget (total cost ≤ W_max)

Phase 3: Inference

1. Working Context fed to base LLM
   └─> Input: sequence of embeddings
        ├─> LOD0 entries: raw token embeddings
        └─> LOD1/LOD2/LOD3 entries: gist embeddings
        (LLM cannot distinguish - same embedding dimension)

2. LLM runs attention
   └─> Full O(n²) attention over Working Context only
        └─> n ≤ W_max embeddings (about 256-1024 entries if most are 32-token LOD0 blocks)

3. LLM generates next token
   └─> Output: new token embedding [e_new]

4. Token loops back to Phase 1
   └─> Added to buffer, eventually compressed to tree

Data Flow Properties

1. Unidirectional Write Path:

Tokens → MegaContext Tree (via GistNet)
Tree is append-only, never modified

2. Bidirectional Read Path:

MegaContext Tree → Working Context (fetch entries)
Working Context → LensNet (compute scores)
Scores → Focus Allocator → Updated Working Context

3. Isolation:

Base LLM never sees MegaContext Tree directly
Base LLM only operates on Working Context
GistNet never sees Working Context
LensNet never modifies MegaContext Tree

4. Cycle Time:

GistNet: runs once per completed 32-token block
LensNet: runs once per decode step (or once every N steps)
Base LLM: runs once per generated token

Why This Architecture Enables System Properties

1. Unbounded Context Length ✓

How: MegaContext Tree stores complete history on disk with logarithmic depth.

Math:

Tree depth = ⌈log₃₂(n)⌉
For 1M tokens: depth = 4 levels
For 1B tokens: depth = 6 levels
Storage: ~1.5n embeddings (linear)

Why Two Contexts Are Essential:

Cannot store 1B tokens in GPU (would require ~16TB at 16KB per embedding)
Cannot process 1B tokens with O(n²) attention (would take hours per token)
Disk storage is cheap and scales linearly
Working Context stays fixed size regardless of total history

2. Constant Compute Cost ✓

How: Working Context has fixed budget W_max; base LLM complexity is O(W_max²).

Math:

Attention cost: O(W_max²) = O(1) for fixed W_max
Example: W_max = 32k tokens → W_max² ≈ 1.07B query-key pairs per full attention pass (per layer and head)
Independent of total history length (could be 1M or 1B tokens)

Why Two Contexts Are Essential:

Base LLM only sees Working Context (W_max tokens)
MegaContext Tree is outside the inference path
No matter how much history accumulates, inference cost stays constant
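
A rough way to see the constant-cost argument is to count query-key pairs. This sketch assumes dense, unmasked self-attention and ignores the head dimension, layer count, and causal masking (which would roughly halve the count). attention_pairs is an illustrative helper, not part of any library.

```python
def attention_pairs(n_positions: int) -> int:
    """Query-key pairs scored in one full self-attention pass (per layer, per head)."""
    return n_positions * n_positions


W_MAX = 32 * 1024  # Working Context budget used in the example above

fixed = attention_pairs(W_MAX)  # attending over the Working Context only
for history in (10**6, 10**9):
    full = attention_pairs(history)  # attending over the entire history instead
    print(f"history={history:>13,}  fixed={fixed:,}  full/fixed={full / fixed:,.0f}x")
```

The fixed term does not depend on the history length at all, which is the sense in which O(W_max²) is treated as constant.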

3. Dynamic Focus/Defocus ✓

How: LensNet scores relevance; Focus Allocator swaps detail levels.

Example:

T=0:  "My cat's name is Fluffy. [9500 tokens about other topics]"
      Working Context: [LOD2 gist] (low detail)

T=9500: "What was my cat's name?"
      LensNet detects query, scores LOD2 gist highly
      Focus Allocator: LOD2 → LOD1 → LOD0
      Working Context: [LOD0 tokens: "My cat's name is Fluffy"]

Why Two Contexts Are Essential:

MegaContext Tree preserves all detail at all levels (lossless traversal)
Working Context can swap between levels without re-encoding
One-way compression (e.g., RAG summaries) cannot “zoom back in”
Static context windows cannot adjust detail post-hoc

4. Lossy-Yet-Restorable Compression ✓

How: Gists compress 32→1 (lossy) but original tokens remain in tree (restorable).

Compression Cascade:

32 tokens → 1 LOD1 gist (97% compression, small ΔNLL)
32 LOD1 gists → 1 LOD2 gist (97% compression, medium ΔNLL)
32 LOD2 gists → 1 LOD3 gist (97% compression, higher ΔNLL)

Restoration:

Need more detail? Traverse tree:
LOD3 gist → fetch 32 LOD2 children → fetch 32×32 LOD1 children → fetch 32×32×32 LOD0 tokens
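
The traversal can be sketched with plain tree coordinates. The (level, index) addressing below is an assumption for illustration: it takes an LOD1 gist to have exactly one child (its 32-token LOD0 block) and every higher gist to have 32 children. children and leaf_blocks are hypothetical helper names.

```python
BLOCK = 32  # fan-out of every gist level above LOD1


def children(level: int, index: int):
    """Coordinates of the nodes directly below gist (level, index)."""
    if level <= 0:
        raise ValueError("LOD0 blocks are leaves")
    if level == 1:
        return [(0, index)]  # an LOD1 gist summarizes exactly one 32-token block
    return [(level - 1, BLOCK * index + k) for k in range(BLOCK)]


def leaf_blocks(level: int, index: int):
    """Indices of all LOD0 blocks under (level, index), found by walking down the tree."""
    if level == 0:
        return [index]
    blocks = []
    for child_level, child_index in children(level, index):
        blocks.extend(leaf_blocks(child_level, child_index))
    return blocks


blocks = leaf_blocks(3, 0)
print(len(children(3, 0)), "LOD2 children")
print(len(blocks), "LOD0 blocks,", len(blocks) * BLOCK, "tokens")
```

Because the walk only follows stored parent-to-child links, restoring detail never requires re-running the compressor.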

Why Two Contexts Are Essential:

MegaContext Tree stores both compressed (gists) and original (tokens)
Working Context can dynamically choose which representation to use
Trade-off: budget (use gist) vs. fidelity (use tokens)
Not possible with single context (must choose one representation)

5. Sub-Linear Memory Scaling ✓

How: MegaContext Tree in cheap disk/RAM; Working Context in expensive GPU RAM.

Memory Breakdown:

MegaContext Tree:  ~1.5n embeddings × 16KB each ≈ 24n KB (disk/RAM)
Working Context:   W_max embeddings × 16KB each = constant (GPU)
LensNet + GistNet: Small models (~10-100M params = 40-400MB GPU)

Example (1B tokens):
- MC Tree: 1.5B × 16KB ≈ 24TB (disk; beyond typical RAM at this scale)
- WC: 32k × 16KB = 512MB (GPU) ✓ affordable
- Total GPU: ~1GB (leaves ~23GB of a 24GB GPU for the base LLM)
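
The memory figures can be recomputed directly. This sketch assumes 4096-dimensional fp32 embeddings (which gives the 16 KB per embedding used here) and takes the ~1.5× tree overhead as given rather than deriving it. The function names are illustrative.

```python
EMBED_DIM = 4096
BYTES_PER_VALUE = 4                                 # assumption: fp32 values
BYTES_PER_EMBEDDING = EMBED_DIM * BYTES_PER_VALUE   # 16,384 bytes = 16 KiB

MIB = 1024 ** 2
TIB = 1024 ** 4


def tree_bytes(n_tokens: int, overhead: float = 1.5) -> int:
    """Approximate tree size, taking the ~1.5x embedding overhead as an input."""
    return int(overhead * n_tokens) * BYTES_PER_EMBEDDING


def working_context_bytes(w_max: int) -> int:
    """GPU bytes for a full Working Context of w_max embeddings."""
    return w_max * BYTES_PER_EMBEDDING


print(f"tree for 1B tokens:    {tree_bytes(10**9) / TIB:.1f} TiB")
print(f"working context (32k): {working_context_bytes(32 * 1024) / MIB:.0f} MiB")
```

Storing embeddings in a 16-bit format would halve both numbers, but the ratio between tree storage and Working Context storage would stay the same.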

Why Two Contexts Are Essential:

GPU memory is 10-100× more expensive than RAM
Cannot afford to keep all history in GPU
Disk/RAM storage scales to TBs for pennies
Working Context uses GPU efficiently (only what’s needed)

6. No Retraining of Base Model ✓

How: Base LLM remains frozen; operates on same embedding space.

Architecture:

Base LLM (frozen)
     ↑
     │ (same embeddings)
     │
Working Context ← mix of tokens + gists
                      ↑
                    GistNet (learned)
                      ↑
                  Raw Tokens

Why Two Contexts Are Essential:

GistNet learns to produce embeddings that “look like” base model tokens
Base LLM cannot tell the difference between LOD0 tokens and LOD1/LOD2/LOD3 gists
Working Context is the “adapter layer” - provides abstraction
MegaContext Tree is GistNet’s domain - invisible to base model
Separation allows independent optimization of each component

7. Multi-Resolution Access ✓

How: Tree structure provides access at any granularity (LOD0, LOD1, LOD2, LOD3).

Access Patterns:

Coarse scan:  Read LOD3 gists (1 per 32k tokens) → fast overview
Medium scan:  Read LOD2 gists (1 per 1k tokens) → section-level
Fine scan:    Read LOD1 gists (1 per 32 tokens) → sentence-level
Full detail:  Read LOD0 tokens (all 32 tokens) → word-level

Example Use Case:

Query: "Find all discussions about Python optimization"

1. Scan all LOD3 gists (~1000 in a ~32M-token history, at 32,768 tokens per LOD3 gist) → 1000 gists
2. Identify 10 relevant LOD3 regions
3. Scan their LOD2 children (10 × 32 = 320 gists)
4. Identify 5 most relevant LOD2 regions
5. Expand to LOD0 for detailed reading (5 × 1024 tokens = 5120 tokens)

Total cost: 1000 + 320 + 5120 = 6440 tokens (vs. ~32M tokens for full scan)
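
The cost accounting in this example can be written as a small function. It is a sketch that assumes a three-stage search (scan LOD3, scan LOD2 under the selected LOD3 gists, then read raw tokens under the selected LOD2 gists) with a fan-out of 32. coarse_to_fine_cost is a hypothetical helper, and how regions are judged relevant is out of scope here.

```python
BLOCK = 32  # fan-out per level and tokens per LOD0 block


def coarse_to_fine_cost(top_gists: int, top_hits: int, mid_hits: int) -> dict:
    """Embeddings read by a three-stage coarse-to-fine search."""
    if top_hits > top_gists or mid_hits > top_hits * BLOCK:
        raise ValueError("cannot select more regions than were scanned")
    scan_top = top_gists                 # one embedding per LOD3 gist
    scan_mid = top_hits * BLOCK          # 32 LOD2 children per selected LOD3 gist
    read_raw = mid_hits * BLOCK * BLOCK  # 1,024 raw tokens under each selected LOD2 gist
    return {
        "scan_top": scan_top,
        "scan_mid": scan_mid,
        "read_raw": read_raw,
        "total": scan_top + scan_mid + read_raw,
    }


print(coarse_to_fine_cost(top_gists=1000, top_hits=10, mid_hits=5))
```

The raw-token stage dominates the total once the coarse scan is short, so the number of regions expanded to LOD0 is the main knob on cost.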

Why Two Contexts Are Essential:

MegaContext Tree provides multi-resolution storage
Working Context provides multi-resolution representation
Can query coarsely, then zoom in selectively
Not possible with flat context or RAG (fixed retrieval granularity)

Core Components

1. MegaContext Tree

Purpose: Persistent, hierarchical storage of complete conversation history.

Key Responsibilities:

Store all tokens (LOD0) and all gists (LOD1, LOD2, LOD3, …)
Maintain parent-child relationships
Support random access at any level
Track metadata (ΔNLL, timestamps, etc.)
Persist across sessions

See: MegaContext Tree, Storage Format, Tree Operations

2. Working Context

Purpose: Active, budget-constrained window for LLM inference.

Key Responsibilities:

Maintain contiguous temporal coverage
Mix tokens and gists optimally
Stay within token budget (W_max)
Provide embedding sequence to base LLM
Update continuously via refocusing

See: Working Context, Working Context Assembly, Working Context Refocusing

3. GistNet

Purpose: Learned compression model that builds the tree hierarchy.

Key Responsibilities:

Compress 32 tokens → 1 gist (LOD0 → LOD1)
Compress 32 gists → 1 gist (LOD1 → LOD2, LOD2 → LOD3, …)
Minimize ΔNLL (compression loss)
Align gists with base model embedding space
Train via self-supervised learning

Architecture:

Input: 32 embeddings (4096-dim each)
Output: 1 embedding (4096-dim)
Model: Transformer encoder (6-12 layers, 512-2048 hidden dim) [2]
Training: Minimize perplexity of next-token prediction [21]

See: GistNet, GistNet Architecture Details, GistNet Training

4. LensNet

Purpose: Learned scoring model that determines what to focus on.

Key Responsibilities:

Score each Working Context entry for relevance
Predict which entries should be expanded/collapsed
Adapt to task dynamics (queries, continuations, etc.)
Balance exploration vs. exploitation
Train via reinforcement learning or task supervision

Architecture:

Input: Working Context + tail gists (context representation)
Output: Focus scores (one per entry)
Model: Transformer encoder (4-8 layers, 256-1024 hidden dim)
Training: Maximize task reward (e.g., downstream NLL)

See: LensNet, LensNet Scoring, LensNet Training

5. Focus Allocator

Purpose: Deterministic algorithm that applies LensNet scores to refocus Working Context.

Key Responsibilities:

Enforce contiguity (no temporal gaps)
Enforce budget (total cost ≤ W_max)
Expand high-scoring entries (fetch children)
Collapse low-scoring entries (replace with parent)
Handle edge cases (boundary conditions, buffer limits)

Algorithm:

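def focus_allocator(working_context, scores, budget):
    """Apply LensNet scores in place and return the refocused Working Context.

    `budget` is the remaining headroom (W_max minus the current total cost).
    The helpers (should_collapse, collapse, collapse_savings, can_expand,
    expand, expansion_cost) are assumed to be defined elsewhere.
    """
    # Collapse low-scoring entries first so the freed budget can fund expansions.
    collapses = [(i, score) for i, score in enumerate(scores) if score < 0]
    collapses.sort(key=lambda x: x[1])  # most negative first
    for i, _ in collapses:
        entry = working_context[i]
        if should_collapse(entry):
            budget += collapse_savings(entry)  # measured on the original entry
            working_context[i] = collapse(entry)

    # Greedily expand the highest-scoring entries while the budget allows.
    expansions = [(i, score) for i, score in enumerate(scores) if score > 0]
    expansions.sort(key=lambda x: x[1], reverse=True)
    for i, _ in expansions:
        entry = working_context[i]
        if can_expand(entry, budget):
            budget -= expansion_cost(entry)  # measured before the entry is replaced
            working_context[i] = expand(entry)

    return working_context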
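
For a version that runs end to end, the sketch below restricts the problem to its simplest case. It assumes every entry covers one 32-token block and may only switch between LOD0 (cost 32) and LOD1 (cost 1), so contiguity and alignment hold automatically. The refocus function and its level encoding are illustrative and are not the project's allocator.

```python
BLOCK = 32
COST = {0: BLOCK, 1: 1}  # slots used by a raw block vs. its LOD1 gist


def refocus(levels, scores, w_max):
    """Return new per-block levels (0 or 1) after applying scores within w_max."""
    levels = list(levels)  # work on a copy

    # 1. Collapse negatively scored raw blocks to their gists.
    for i, score in enumerate(scores):
        if score < 0 and levels[i] == 0:
            levels[i] = 1
    used = sum(COST[level] for level in levels)

    # 2. Expand positively scored gists, best score first, while the budget allows.
    extra = COST[0] - COST[1]
    for i in sorted(range(len(levels)), key=lambda i: -scores[i]):
        if scores[i] > 0 and levels[i] == 1 and used + extra <= w_max:
            levels[i] = 0
            used += extra

    if used > w_max:
        raise ValueError("over budget even after collapsing")
    return levels


before = [1, 1, 0, 0]
scores = [2.0, 0.5, -1.0, 0.0]
after = refocus(before, scores, w_max=70)
print(before, "->", after)
```

In this run the second block also has a positive score but stays a gist because expanding it would exceed the budget, which shows the greedy, score-ordered behavior.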

See: Focus Allocator, Focus Allocator Strategies

6. Base LLM

Purpose: Frozen language model that performs inference.

Key Characteristics:

Unchanged: No modifications to architecture or weights
Embeddings: Operates on same embedding space as training
Input: Working Context (mix of tokens and gists)
Output: Next token probabilities
Oblivious: Cannot distinguish tokens from gists

Examples: Llama, GPT, Claude (frozen, no finetuning)

Runtime Lifecycle

System Initialization

1. Load base LLM (frozen weights)
2. Load GistNet (pre-trained weights)
3. Load LensNet (pre-trained weights)
4. Initialize MegaContext Tree (empty or from checkpoint)
5. Initialize Working Context (empty)
6. Ready for first token

Token Processing Loop

LOOP (for each new token):

    1. TOKEN ARRIVAL
       ├─> User input or model generation
       └─> Add to buffer

    2. TREE UPDATE (every 32 tokens)
       ├─> GistNet: compress 32 tokens → 1 LOD1 gist
       ├─> Write LOD0 block + LOD1 gist to tree
       └─> Recursively compress LOD1→LOD2, LOD2→LOD3, etc.

    3. REFOCUS (every decode step or every N steps)
       ├─> LensNet: score Working Context entries
       ├─> Focus Allocator: apply scores
       │   ├─> Expand high-score entries (fetch children)
       │   └─> Collapse low-score entries (replace with parent)
       └─> Rebuild Working Context with new entries

    4. INFERENCE
       ├─> Feed Working Context to base LLM
       ├─> LLM generates next token
       └─> Loop back to step 1

END LOOP
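
The loop can be sketched in code with the learned components passed in as callables. This is a schematic under loud assumptions: token ids stand in for embeddings, and next_token, refocus, and write_block are hypothetical hooks for the base LLM, LensNet plus Focus Allocator, and GistNet plus tree write respectively.

```python
BLOCK = 32


def run(prompt, next_token, refocus, write_block, steps, refocus_every=1):
    """Drive the arrival, tree update, refocus, and inference steps for `steps` tokens."""
    buffer = []     # tokens waiting to complete a 32-token block
    working = []    # stand-in Working Context (token ids rather than embeddings)
    generated = []

    def ingest(token):                   # step 1: token arrival
        buffer.append(token)
        working.append(token)
        if len(buffer) == BLOCK:         # step 2: tree update every 32 tokens
            write_block(list(buffer))
            buffer.clear()

    for token in prompt:
        ingest(token)
    for step in range(steps):
        if step % refocus_every == 0:    # step 3: refocus
            working[:] = refocus(working)
        token = next_token(working)      # step 4: inference sees only the Working Context
        generated.append(token)
        ingest(token)                    # loop back to step 1
    return generated


blocks = []
out = run(
    prompt=list(range(40)),
    next_token=lambda wc: len(wc),       # toy model: emits the Working Context length
    refocus=lambda wc: wc[-16:],         # toy policy: keep only the last 16 items
    write_block=blocks.append,
    steps=30,
)
print(out[:5], len(blocks))
```

The toy model only ever sees 16 items however long the history grows, while the block writer keeps receiving every completed 32-token block.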

Example Execution Trace

T=0: User: "Tell me about Paris"
     └─> Buffer: ["Tell", "me", "about", "Paris"]
     └─> Working Context: [LOD0: "Tell", "me", "about", "Paris"]
     └─> LLM: "Paris is the capital..."

T=32: Buffer full → GistNet compresses
     └─> MC Tree: [LOD0: 32 tokens], [LOD1: gist_1]
     └─> Working Context: [LOD0: recent 32 tokens]

T=1000: User: "What about London?"
     └─> LensNet scores Paris discussion (low relevance)
     └─> Focus Allocator: collapse LOD0 → LOD1
     └─> Working Context: [LOD1: gist_Paris], [LOD0: recent tokens]
     └─> More budget available for new London discussion

T=1050: User: "Compare Paris and London"
     └─> LensNet scores Paris gist (high relevance)
     └─> Focus Allocator: expand LOD1 → LOD0
     └─> Working Context: [LOD0: Paris details], [LOD0: London details]
     └─> LLM can compare with full context

Key Terms & Invariants

Key Terms

LOD0: Raw token blocks (32 tokens each)
LOD1/LOD2/LOD3: Gist levels (each compresses 32 children)
Gist: Single embedding that summarizes 32 child embeddings
Entry: One item in Working Context (can be LOD0, LOD1, LOD2, or LOD3)
Cost: Number of embedding slots an entry occupies in the Working Context (LOD0=32, LOD1=1, LOD2=1, LOD3=1)
Budget (W_max): Maximum token cost for Working Context
ΔNLL: Compression loss (increase in the base model's negative log-likelihood due to gisting)
Focus: Expand entry to higher detail (replace gist with children)
Defocus: Collapse entry to lower detail (replace children with gist)
Contiguity: Working Context covers time without gaps
Refocus: Update Working Context by applying focus/defocus operations

Core Invariants

MegaContext Tree Invariants:

Append-only: Nodes are never deleted or modified
Complete: All LOD0 blocks are stored (no truncation)
Hierarchical: Each non-leaf node has ≤32 children
Aligned: LOD0 blocks start at multiples of 32
Redundant: Both compressed (gists) and original (tokens) are stored

Working Context Invariants:

Contiguous: Covers [start_pos, end_pos] without gaps
Budgeted: ∑(entry_cost) ≤ W_max
Mixed: Entries can be at any level (LOD0, LOD1, LOD2, LOD3)
Temporal: Left-to-right = past-to-present
Aligned: Each entry covers exactly one tree node’s time span

System Invariants:

Isolation: Base LLM never accesses MegaContext Tree directly
Constant Compute: Inference cost = O(W_max²), independent of history
Lossless Paths: Can always traverse tree to restore original tokens
Embedding Consistency: Gists are in same embedding space as tokens
No Retraining: Base LLM weights are frozen, never updated

See: Invariants for complete details

Document Roadmap

This document is the canonical reference for understanding MegaContext’s two-context architecture. For deeper dives into specific aspects:

Architecture Deep Dives

MegaContext Tree - Complete tree structure and storage
Working Context - Budget management and assembly
POC Implementation - Current implementation details
System Properties - Formal property proofs

Component Details

GistNet - Compression model design
GistNet Architecture Details - Network structure
GistNet Training - Training procedures
LensNet - Focus scoring model
LensNet Scoring - Scoring mechanisms
LensNet Training - Training procedures
Focus Allocator - Refocusing algorithm
Focus Allocator Strategies - Allocation policies

Operations

Tree Operations - Build, query, traverse
Working Context Assembly - Initial assembly
Working Context Refocusing - Dynamic updates
Node Metadata - Metadata tracking

Training & Optimization

MegaContext End-to-End Training - Joint training strategy
Telemetry - Metrics and monitoring

Comparisons & Context

How MegaContext Works - Introductory overview
MegaTexture Analogy - Visual intuition
Comparisons - vs. RAG, sparse attention, etc.
Related Work - Academic context

Summary

The two-context architecture is the foundation of MegaContext’s ability to provide unbounded memory at constant compute:

MegaContext Tree stores the complete history hierarchically on disk
Working Context provides a fixed-size, dynamically-refocused view for inference
GistNet builds the tree by compressing tokens into gists
LensNet + Focus Allocator adapts the Working Context to changing relevance
Base LLM operates unchanged on the Working Context

This separation enables:

✓ Unbounded context length (millions to billions of tokens)
✓ Constant compute cost (O(W_max²) regardless of history)
✓ Dynamic focus/defocus (zoom in on relevant regions)
✓ Lossy-yet-restorable compression (gists + original tokens)
✓ Sub-linear memory scaling (disk for tree, GPU for working set)
✓ No retraining of base model (frozen weights)
✓ Multi-resolution access (coarse scan → fine detail)

The key insight: By separating long-term storage from active processing, MegaContext can optimize each independently while maintaining a coherent view of the entire interaction history. This is not possible with a single-context architecture.

References

MegaTexture (Carmack, 2007) — Analysis — Virtual texturing system that inspired the core hierarchical streaming architecture
Perceiver (Jaegle et al., 2021) — Analysis — Latent cross-attention bottleneck architecture
Perceiver IO (Jaegle et al., 2021) — Analysis — Query-based decoding for arbitrary structured outputs
Gist Tokens (Mu et al., 2023) — Analysis — Learned prompt compression via attention masking
Compressive Transformer (Rae et al., 2019) — Analysis — Long-term compressed memory for transformers
RAG (Lewis et al., 2020) — Analysis — Retrieval-augmented generation baseline
Memorizing Transformers (Wu et al., 2022) — Analysis — kNN-augmented approximate retrieval
Transformer-XL (Dai et al., 2019) — Analysis — Segment-level recurrence and relative positional encoding
RoPE (Su et al., 2021) — Analysis — Rotary position embeddings used throughout MegaContext
Flash Attention (Dao et al., 2022) — Analysis — IO-aware exact attention algorithm
Knowledge Distillation (Hinton et al., 2015) — Analysis — Teacher-student framework for GistNet training

See Related Work for the complete bibliography of all research papers referenced throughout the documentation.

This document is the definitive guide to MegaContext’s two-context architecture. All other documentation should reference this page for architectural fundamentals.
