POC Implementation Guide

Status: Historical reference for the notebook-era prototype. Use Training & Operations and TODO.md for the nanochat workflow; this page remains for context when porting legacy assumptions.

This document consolidates ALL POC-specific implementation details, parameters, configurations, and constraints. When implementing POC components, this is the single source of truth for:

Parameter values
Simplifications vs full vision
Module configurations
Technology stack
Testing requirements

See MegaContext PRD Index for the active roadmap, POC Scope for historical guardrails, and Migration Plan - Nanochat Integration for the nanochat-specific scaffolding.

PRD alignment & nanochat integration

Role of this note: captures legacy POC shortcuts (e.g., Lightning loops, manual storage formats) so we can reference them while porting features into the nanochat fork described in Migration Plan - Nanochat Integration.
Mapping to PRDs:
- MegaContext End-to-End Training replaces the alternating loop; use these parameters when wiring the nanochat trainer.
- MegaAttention Training consumes the working-context hierarchy and LOD budgets defined below.
- MegaPrediction Training pulls its wLOD readouts, ΔNLL horizons, and gist losses directly from the sections that follow.
Nanochat hooks: wherever this note mentions notebooks/ or Lightning scripts, assume we now implement the same logic inside nanochat/train.py, nanochat/model.py, or CLI helpers; keep the configuration tables for reference when setting nanochat defaults.

Global POC Parameters

Core Configuration

Parameter	Value	Notes
K (Block size)	32 tokens	Fixed for POC
W_max (Working Context budget)	8,192 tokens	Configurable via YAML; future: 16k–32k
H (ΔNLL Horizon)	64 tokens	For GistNet evaluation
N_diff (Max focus changes per step)	4 actions	Expand/collapse limit per refocus
Cooldown steps	2 iterations	Min time before block can flip actions

Hierarchy Configuration

Level	Compression	Coverage	Token Cost in WC
LOD0	1:1 (no compression)	1 token	32 tokens per block
LOD1	32:1	32 LOD0 tokens	1 token per gist
LOD2	1024:1	1,024 LOD0 tokens	1 token per gist

POC Limitation: Two gist levels only (LOD1 and the LOD2 root) above raw LOD0 tokens; sufficient for moderate contexts.

Total compression: 32² = 1024× (with two layers)

Focus Thresholds

Threshold	Value	Purpose
τ_expand	0.20	Minimum signed score for expansion
τ_collapse	0.20	Symmetric collapse threshold

Update Cadence

Refocus frequency: Every K=32 tokens
LensNet scoring: Once per refocus cycle
GistNet compression: Inline during ingest (synchronous)

POC Simplifications & Constraints

What’s Frozen in POC

Base LLM: No fine-tuning during initial loop; LoRA is follow-up work
GistNet checkpoint: Gists frozen to initial checkpoint during demo runs (no retraining)
Hierarchy depth: Fixed at two gist levels (LOD1 and the LOD2 root) above LOD0
Block size: K=32 hardcoded (no variable-length blocks)
Storage: RAM-resident (no disk I/O or memory-mapping in POC)

What’s Simplified in POC

Synchronous updates: Ingest → refocus → decode happens inline (no background workers)
No streaming: Entire Working Context resides in GPU memory (no paging)
Simple initial focus: May use recency bias before LensNet is trained
Fixed thresholds: τ_expand and τ_collapse hardcoded at 0.2 (no adaptive tuning)
Single base model: Not multi-model or MoE
Toy corpus: Project docs instead of large-scale datasets

Deferred Features

Post-POC enhancements (see Future Plan):

Disk-backed storage with memory-mapped files
LOD3+ hierarchy levels for billion-token contexts
Incremental tree updates (rebuild only affected subtrees)
Provenance tracking per node
Soft deletes / pruning tiers (see MegaCuration)
Version management for multiple GistNet checkpoints
Differentiable focus router (learned Focus Allocator)
KV-cache reuse across refocus steps
Multi-head contexts with different focus policies
Attention biasing for task-specific guidance

Technology Stack

Core Dependencies

Component	Technology	Version	Purpose
Python	Python	3.11	Core language
PyTorch	PyTorch	≥2.2	Tensor operations
Transformers	HuggingFace	Latest	Base model interface
FlashAttention	FlashAttention 2	Latest	Efficient attention
Environment	`uv`	Latest	Dependency management
Logging	Weights & Biases	Latest	Metrics tracking

Key Commands

# Setup environment
uv venv
uv sync
 
# Run tests
uv run pytest --maxfail=1 --disable-warnings
 
# Legacy notebook workflow (deprecated)
uv run jupyter lab  # run notebooks/megacontext.ipynb and customise phases
uv run python -m tools.decode_demo --config configs/SampleText_TinyGPT2.yaml

Base Model Configuration

Primary choice: HuggingFaceTB/SmolLM3-3B

Precision: bf16 (bfloat16)
GPU requirement: 24–48 GB (for model + working context + training)
Alternative: Qwen/Qwen3-1.7B

Loading:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,  # bf16 precision, as listed above
    device_map="auto",           # place weights on the available device(s)
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

Module-Specific Configurations

GistNet Parameters

See GistNet for architecture overview, GistNet Architecture Details for layer specs, GistNet Training for training details.

Architecture

Parameter	POC Value	Notes
Window size	32 tokens	Fixed
Slot queries	2 shared learned queries (Q₁, Q₂)
Layers per 32→1 block	2 self + 2 cross-attention
Refinement stack	32→1→32→1	Two-stage compression
Embedding dim	Same as base LLM (e.g., 4096)	Must match
Internal hidden width	512	Bottleneck
Attention heads	8
RoPE	Applied to tokens only, slots omit it
Activation	GELU
Norm	Pre-LayerNorm
Parameters	~0.5M per layer	~1M total
Runtime	<1 ms per 32-token span	On NVIDIA L4, bf16
Output	Single `g_final` vector per span	Dimension = embedding_dim

Training Loss

# Primary: substitutability (or ΔNLL@H over H=64 tokens)
Loss_subst = KL(P_base || P_replaced)

# Optional: contrastive term (prevent collapse). Penalizes pairs of distinct
# gists (i ≠ j) whose cosine similarity exceeds the margin (≈ 0.2).
Loss_contrast = max(0, cosine_sim(g_i, g_j) - margin)

# Total
Loss = Loss_subst + 0.05 * Loss_contrast
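The ΔNLL@H alternative mentioned above can be sketched in plain Python. This is a minimal illustration, assuming ΔNLL@H means the increase in mean per-token negative log-likelihood over the next H tokens when a span is replaced by its gist; the function name and the list-of-log-probabilities input format are hypothetical, not taken from the project code.

```python
import math

def delta_nll_at_h(base_logprobs, replaced_logprobs, horizon=64):
    """Mean per-token NLL increase over the first `horizon` target tokens.

    Both inputs hold the log-probability each model run assigned to the
    ground-truth token at every position (assumed input format).
    """
    if len(base_logprobs) != len(replaced_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    h = min(horizon, len(base_logprobs))
    if h == 0:
        raise ValueError("need at least one token")
    nll_base = -sum(base_logprobs[:h]) / h
    nll_replaced = -sum(replaced_logprobs[:h]) / h
    return nll_replaced - nll_base

# Toy example: the gist-replaced run is slightly less confident everywhere.
base = [math.log(0.5)] * 64
replaced = [math.log(0.4)] * 64
delta = delta_nll_at_h(base, replaced)  # equals log(0.5 / 0.4)
```

A positive value means the gist-replaced context predicts the continuation worse than the original tokens did.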

Training Configuration

Parameter	Value
Dataset	Long-form text (4k–16k tokens), chunked into 32-token spans
Teacher	Frozen base LLM for ΔNLL@H computation
Optimizer	AdamW
Learning rate	1e-4
Scheduler	Cosine decay
Precision	bf16
Curriculum	Start with contiguous text, then structured data (lists, code, tables)

Hierarchy

Two 32→1 layers stacked hierarchically
Lower layer runs on token embeddings
Upper layer operates on lower-layer gist outputs
Result: 32² = 1024 tokens per LOD2 gist
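The stacking described above can be illustrated with a toy sketch. Mean pooling stands in for the learned 32→1 compressor here; it is a placeholder chosen only so the example runs, not how GistNet works, and the helper names are hypothetical.

```python
from statistics import fmean

K = 32  # block size from the Core Configuration table

def mean_pool(vectors):
    """Placeholder compressor: averages the vectors (NOT the real GistNet)."""
    dim = len(vectors[0])
    return [fmean(v[d] for v in vectors) for d in range(dim)]

def compress_level(vectors, compress=mean_pool, k=K):
    """Compress each complete run of k vectors into one; skip the partial tail."""
    full = len(vectors) - len(vectors) % k
    return [compress(vectors[i:i + k]) for i in range(0, full, k)]

# 2,048 toy "token embeddings" of dimension 4
lod0 = [[float(i)] * 4 for i in range(2048)]
lod1 = compress_level(lod0)  # lower layer: token embeddings -> LOD1 gists
lod2 = compress_level(lod1)  # upper layer: LOD1 gists -> LOD2 gists

print(len(lod0), len(lod1), len(lod2))  # 2048 64 2
```

Each LOD2 vector is derived from 32 LOD1 gists, which in turn summarize 32 tokens each, giving the 1,024-token coverage per LOD2 gist.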

LensNet Parameters

See LensNet for overview, LensNet Training for training, LensNet Scoring for score computation.

Architecture

Parameter	POC Value	Notes
Input embeddings	≈8k entries	Mixed LOD0/LOD1/LOD2 from Working Context
Conditioning gists	6 total	LOD2 root + 5 latest LOD1 gists
Down-projection width	512 (d_lens)	Bottleneck dimension
Attention heads	8
Stacks	1–3 dual cross-attention blocks
Update cadence	Every K=32 tokens
Output	Signed focus score u_i per entry	Range: [-1, +1]
Runtime	<3 ms per update @ 8k tokens	On NVIDIA L4
Parameters	≈100k–200k total	Tiny auxiliary network

Complexity

O(N × K × d_lens) per forward pass (here K is the number of conditioning gists, not the block size)
With N ≈ 8k, K = 6, d_lens = 512 → ~25M multiply-adds
Negligible compared to base model decode
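For reference, the arithmetic behind the ~25M figure, taking N as exactly 8,192 (an assumption; the text only says ≈8k):

```python
N = 8192      # Working Context entries scored per update
K_COND = 6    # conditioning gists (the 6 in the estimate above)
D_LENS = 512  # down-projection width

multiply_adds = N * K_COND * D_LENS
print(f"{multiply_adds:,} multiply-adds")  # 25,165,824
```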

Training Loss

# 1. Regression on signed utility targets
L_reg = MSE(predictions, targets)
 
# 2. Ranking loss for ordered pairs
L_rank = softplus_ranking_loss(score_pairs)
 
# 3. Budget regularizer (zero-sum preference)
L_budget = ((P - N) / (P + N))²  # P = sum of positive scores, N = sum of |negative scores|
 
# 4. Illegality penalties
L_illegal = α * illegal_expand_penalty + β * illegal_collapse_penalty  # α, β ≈ 0.3
 
# Total
L_total = L_reg + 0.5 * L_rank + 0.1 * L_budget + L_illegal
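A plain-Python sketch of the budget regularizer and the loss weighting may help make the terms concrete. It assumes P is the sum of positive scores and N is the sum of the magnitudes of negative scores, and it assumes the ranking term is the usual pairwise softplus (logistic) loss; neither detail is confirmed by this page.

```python
import math

def budget_regularizer(scores):
    """((P - N) / (P + N))^2; zero when expand and collapse mass balance."""
    p = sum(s for s in scores if s > 0)
    n = sum(-s for s in scores if s < 0)
    if p + n == 0:
        return 0.0
    return ((p - n) / (p + n)) ** 2

def softplus_ranking_loss(score_pairs):
    """Mean softplus(lower - higher) over (should_rank_higher, should_rank_lower) pairs."""
    def softplus(x):
        # Numerically stable log(1 + exp(x))
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return sum(softplus(lo - hi) for hi, lo in score_pairs) / len(score_pairs)

def total_loss(l_reg, l_rank, l_budget, l_illegal):
    """Weights taken from the L_total line above."""
    return l_reg + 0.5 * l_rank + 0.1 * l_budget + l_illegal

balanced = budget_regularizer([0.5, -0.5, 0.2, -0.2])  # 0.0
skewed = budget_regularizer([0.6, 0.2, -0.2])          # about 0.36
```

The regularizer only looks at the balance between positive and negative score mass, so it is indifferent to which entries carry that mass.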

Conditioning Inputs

Input	Shape	Purpose
`context`	[N, d]	Working Context entry embeddings (≈8k)
`tail_gists`	[6, d]	LOD2 root + 5 latest LOD1 gists
`levels`	[N]	0/1/2 markers for legality masking
`span_width`	[N]	LOD0 tokens represented per entry
`distance_to_cursor`	[N]	Block distance from decode cursor

Focus Allocator Parameters

See Focus Allocator for algorithm, Focus Allocator Strategies for variations.

Runtime Configuration

Parameter	Default	Notes
τ_expand	0.20	Min score magnitude for expansion
τ_collapse	0.20	Symmetric collapse threshold
N_diff	4	Max expand/collapse actions per iteration
cooldown_steps	2	Min iterations before block can flip actions
lens_update_interval	32 tokens (K)	LensNet runs once per block
tail_gist_window	5 LOD1 + current LOD2	Conditioning set for LensNet

Constraints

Block alignment: Every WC entry covers a whole number of 32-token blocks at a single LOD (one block for LOD0 and LOD1 entries, 32 blocks for an LOD2 gist)
Action budget: Apply at most N_diff=4 operations per iteration
Positional alignment: Gists reuse absolute token indices for RoPE; occupy central token index of span
Legality: LOD0 blocks can’t expand further; LOD2 gists can’t collapse higher (in POC)
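The legality and cooldown constraints could be checked along these lines. This is a sketch under assumptions: the `Entry` fields and the `step` argument are hypothetical, and the cooldown is simplified to block any action on an entry that changed within the last two iterations, rather than only a reversal.

```python
from dataclasses import dataclass

MAX_LEVEL = 2       # LOD2 is the POC root level
COOLDOWN_STEPS = 2  # from the runtime configuration table

@dataclass
class Entry:
    level: int                       # 0, 1, or 2
    last_action_step: int = -10**9   # hypothetical bookkeeping field

def off_cooldown(entry, step):
    """True once at least COOLDOWN_STEPS iterations have passed since the last action."""
    return step - entry.last_action_step >= COOLDOWN_STEPS

def can_expand(entry, step):
    """Expansion lowers the LOD number (LOD2→LOD1, LOD1→LOD0); LOD0 cannot expand."""
    return entry.level > 0 and off_cooldown(entry, step)

def can_collapse(entry, step):
    """Collapse raises the LOD number (LOD0→LOD1, LOD1→LOD2); LOD2 cannot collapse."""
    return entry.level < MAX_LEVEL and off_cooldown(entry, step)
```

A legality mask for LensNet targets can then be built by evaluating both predicates for every Working Context entry.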

Greedy Algorithm

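def focus_allocator_step(working_context, policy_scores, W_max,
                         tau_expand=0.20, tau_collapse=0.20, n_diff=4):
    # can_expand, can_collapse, expand, collapse, and
    # current_budget_allows_expansion are assumed to be defined elsewhere;
    # W_max is presumably enforced inside current_budget_allows_expansion.

    # 1. Collect candidates that clear the thresholds and legality checks
    scored = list(zip(policy_scores, working_context))
    expand_queue = [(score, entry) for score, entry in scored
                    if score > tau_expand and can_expand(entry)]
    collapse_queue = [(score, entry) for score, entry in scored
                      if score < -tau_collapse and can_collapse(entry)]

    # 2. Sort on score only, so ties never fall back to comparing entries
    expand_queue.sort(key=lambda pair: pair[0], reverse=True)  # Highest first
    collapse_queue.sort(key=lambda pair: pair[0])              # Most negative first

    # 3. Apply at most n_diff operations
    actions_taken = 0
    while actions_taken < n_diff and (expand_queue or collapse_queue):
        progressed = False

        # Prioritize expansions if budget available
        if expand_queue and current_budget_allows_expansion():
            score, entry = expand_queue.pop(0)
            expand(entry)  # LOD1→LOD0 or LOD2→LOD1
            actions_taken += 1
            progressed = True

        # Balance with collapses, re-checking the action limit first
        if collapse_queue and actions_taken < n_diff:
            score, entry = collapse_queue.pop(0)
            collapse(entry)  # LOD0→LOD1 or LOD1→LOD2
            actions_taken += 1
            progressed = True

        # Stop when nothing could be applied (e.g., expansions blocked by the
        # budget and no collapses left); otherwise the loop would never exit.
        if not progressed:
            break

    return working_context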

Working Context Parameters

See Working Context for overview, Working Context Assembly for materialization, Working Context Refocusing for focus changes.

Budget

Parameter	POC Value	Future
W_max	8,192 tokens	16k–32k

Entry Costs

LOD0 block (32 tokens): 32 tokens
LOD1 gist: 1 token (saves 31)
LOD2 gist: 1 token (saves 1023)

Budget Invariant

assert sum(entry.cost for entry in working_context) <= W_max

See Invariants for all system invariants.
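A small worked example of the entry costs and the budget invariant, using an illustrative mix of entries (the mix itself is made up):

```python
W_MAX = 8192
K = 32
ENTRY_COST = {0: K, 1: 1, 2: 1}      # Working Context tokens consumed per entry
ENTRY_SPAN = {0: K, 1: K, 2: K * K}  # LOD0 tokens represented per entry

def working_context_cost(levels):
    """Total token cost for a sequence of entry levels (0, 1, or 2)."""
    return sum(ENTRY_COST[level] for level in levels)

def covered_tokens(levels):
    """Number of original tokens the same entries stand for."""
    return sum(ENTRY_SPAN[level] for level in levels)

# 500 LOD2 gists for distant history, 1,000 LOD1 gists, 200 full-detail blocks
levels = [2] * 500 + [1] * 1000 + [0] * 200
cost = working_context_cost(levels)  # 500 + 1000 + 6400 = 7900
coverage = covered_tokens(levels)    # 512000 + 32000 + 6400 = 550400

assert cost <= W_MAX
print(f"{cost} WC tokens represent {coverage:,} original tokens")
```

In this example 7,900 Working Context tokens stand in for 550,400 original tokens, which is the point of mixing levels of detail.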

Refocus Cycle

Every K=32 tokens:
  1. Decode K tokens using current WC
  2. Ingest new tokens to MegaContext Tree
  3. LensNet scores all WC entries
  4. Focus Allocator applies up to N_diff=4 operations
  5. Repeat with updated WC
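The cycle above can be written as a small driver loop. The four callables are hypothetical stand-ins for the base model decode, tree ingest, LensNet scoring, and Focus Allocator; only the control flow is meant to be illustrative.

```python
K = 32
N_DIFF = 4

def run_refocus_loop(num_tokens, decode_block, ingest, score_entries, apply_focus):
    """Drive decode → ingest → score → refocus, one K-token block at a time."""
    cycles = 0
    for _ in range(num_tokens // K):
        new_tokens = decode_block(K)   # 1. decode K tokens with the current WC
        ingest(new_tokens)             # 2. append them to the MegaContext Tree
        scores = score_entries()       # 3. LensNet scores every WC entry
        apply_focus(scores, N_DIFF)    # 4. allocator applies up to N_DIFF operations
        cycles += 1                    # 5. the next iteration sees the updated WC
    return cycles

# Stub callables that only record what happened
tree = []
log = []
cycles = run_refocus_loop(
    num_tokens=96,
    decode_block=lambda k: list(range(len(tree), len(tree) + k)),
    ingest=tree.extend,
    score_entries=lambda: [0.0] * (len(tree) // K),
    apply_focus=lambda scores, limit: log.append((len(scores), limit)),
)
```

Everything runs inline, matching the synchronous POC design with no background workers.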

MegaContext Tree Parameters

See MegaContext Tree for overview, Storage Format for binary layout, Tree Operations for APIs.

Tree Structure

Level	Compression	Coverage	Entry Type
LOD0	1:1	1 token	Token ID (uint32)
LOD1	32:1	32 tokens	Gist vector (fp16)
LOD2	1024:1	1,024 tokens	Gist vector (fp16)

Tree Properties

Fixed branching factor: 32 children per node
Perfect alignment: Node boundaries align with 32-token blocks
Append-only: Historical nodes immutable (except gist refresh)
Balanced growth: Depth grows as log₃₂(N)
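The branching factor and depth growth can be checked with integer arithmetic. A sketch with hypothetical helper names; it counts only complete nodes and ignores any partially filled block at the tail of the stream.

```python
K = 32

def nodes_per_level(num_tokens, levels=2, k=K):
    """Complete nodes at each level for a stream of num_tokens tokens."""
    counts = {0: num_tokens}
    for level in range(1, levels + 1):
        counts[level] = num_tokens // k ** level
    return counts

def depth_for(num_tokens, k=K):
    """Smallest number of gist levels d with k**d >= num_tokens (about log_k N)."""
    depth, capacity = 0, 1
    while capacity < num_tokens:
        capacity *= k
        depth += 1
    return depth

print(nodes_per_level(100_000))  # {0: 100000, 1: 3125, 2: 97}
print(depth_for(1024))           # 2
print(depth_for(1_000_000_000))  # 6
```

Two gist levels cover a single 1,024-token root span; longer streams in the POC simply accumulate more LOD2 nodes rather than adding levels.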

Storage Layout

See Storage Format for complete details.

Files:

LOD0.ctx - Raw token IDs (uint32)
LOD1.ctx - LOD1 gist vectors (fp16, dimension = embedding_dim)
LOD2.ctx - LOD2 gist vectors (fp16, dimension = embedding_dim)
metadata.json - Tree metadata and configuration

Header (64 bytes):

Offset	Field	Type	Value
0	magic	uint32	0x4D434354 (“MCCT”)
4	version	uint16	1 (POC)
6	level	uint16	0, 1, or 2
8	block_size	uint16	32
10	embedding_dim	uint16	Base model dimension
12	dtype_code	uint16	0=uint32, 1=fp16, 2=bf16
14	model_name	char[32]	UTF-8 null-terminated
46	reserved	18 bytes	Zeroed
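The 64-byte header can be packed and parsed with Python's `struct` module. This sketch assumes little-endian byte order with no padding, which the table does not specify; with that choice the magic appears on disk as the bytes "TCCM", whereas big-endian would read "MCCT". The function names are hypothetical.

```python
import struct

# uint32 magic, five uint16 fields, char[32] model_name, 18 reserved bytes
HEADER_FORMAT = "<IHHHHH32s18s"  # "<" = little-endian (assumption), no alignment
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 64
MAGIC = 0x4D434354

def pack_header(level, embedding_dim, dtype_code, model_name,
                block_size=32, version=1):
    name = model_name.encode("utf-8")
    if len(name) > 31:
        raise ValueError("model_name must fit in 31 bytes plus a null terminator")
    # struct pads the "32s" and "18s" fields with zero bytes
    return struct.pack(HEADER_FORMAT, MAGIC, version, level, block_size,
                       embedding_dim, dtype_code, name, b"")

def unpack_header(raw):
    (magic, version, level, block_size,
     embedding_dim, dtype_code, name, _reserved) = struct.unpack(
        HEADER_FORMAT, raw[:HEADER_SIZE])
    if magic != MAGIC:
        raise ValueError(f"bad magic: {magic:#x}")
    return {
        "version": version,
        "level": level,
        "block_size": block_size,
        "embedding_dim": embedding_dim,
        "dtype_code": dtype_code,
        "model_name": name.split(b"\0", 1)[0].decode("utf-8"),
    }

raw = pack_header(level=1, embedding_dim=4096, dtype_code=1,
                  model_name="HuggingFaceTB/SmolLM3-3B")
fields = unpack_header(raw)
```

Validating the magic and version on load is a cheap way to catch a file written by a different format revision.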

POC Simplification: RAM-resident (no disk I/O or memory-mapping yet)

Sample Configuration File

File: configs/Gutenberg_SmolLM3.yaml

name: Gutenberg_SmolLM3
description: Gutenberg subset with SmolLM3-3B base model and two-stage GistNet training.
 
dataset:
  dataset_name: gutenberg_sample
  tokenizer: HuggingFaceTB/SmolLM2-360M-Instruct
  block_size: 32
  context_tokens: 512
  context_stride: 512
  horizon: 32
  teacher_model: HuggingFaceTB/SmolLM2-360M-Instruct
  splits:
    train:
      source: ../data/raw/gutenberg/**/*.txt
      output_path: ../data/gutenberg_sample/train.arrow
 
base_model:
  name: HuggingFaceTB/SmolLM3-3B
  torch_dtype: bfloat16
  run_name: poc_smollm3_l4
 
gistnet:
  model:
    hidden_size: auto
    block_size: 32
    num_heads: 16
    mlp_ratio: 4.0
  training:
    batch_size: 8
    precision: bf16-mixed
    phases:
      - name: pooling-pretrain
        objective: pooling_mse
        max_steps: 2000
        window_tokens: 512
        lr: 0.001
      - name: delta-finetune
        objective: delta_nll
        max_steps: 1000
        window_tokens: 512
        lr: 0.0005
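Before launching a run, a few cross-field checks on a parsed config can catch block-size mismatches early. The sketch below uses a plain dict standing in for the parsed YAML (only the fields it inspects are reproduced), and the specific checks are illustrative assumptions rather than rules the project enforces.

```python
import copy

config = {
    "dataset": {"block_size": 32, "context_tokens": 512, "context_stride": 512},
    "gistnet": {
        "model": {"block_size": 32, "num_heads": 16},
        "training": {"phases": [
            {"name": "pooling-pretrain", "window_tokens": 512},
            {"name": "delta-finetune", "window_tokens": 512},
        ]},
    },
}

def check_config(cfg):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    dataset = cfg["dataset"]
    gist_model = cfg["gistnet"]["model"]
    if dataset["block_size"] != gist_model["block_size"]:
        problems.append("dataset.block_size and gistnet.model.block_size differ")
    if dataset["context_tokens"] % dataset["block_size"] != 0:
        problems.append("context_tokens is not a whole number of blocks")
    for phase in cfg["gistnet"]["training"]["phases"]:
        if phase["window_tokens"] % dataset["block_size"] != 0:
            problems.append(f"phase {phase['name']}: window_tokens is not block-aligned")
    return problems

print(check_config(config))  # []
```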

Testing Requirements

Determinism

All POC tests must be deterministic:

Seeded RNG: Fixed random seeds for reproducibility
Deterministic blocks: 32-token blocks from dataset prep
Round-trip tests: Tree persistence and recovery
Synthetic streams: Deterministic test inputs
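A deterministic synthetic stream can be produced with a privately seeded generator. This is a stdlib-only sketch with hypothetical helper names; tests that touch PyTorch would additionally need to seed it (for example with `torch.manual_seed`).

```python
import random

def synthetic_stream(num_tokens, vocab_size=1000, seed=0):
    """Reproducible pseudo-random token IDs; identical for identical arguments."""
    rng = random.Random(seed)  # private generator, unaffected by global RNG state
    return [rng.randrange(vocab_size) for _ in range(num_tokens)]

def to_blocks(tokens, k=32):
    """Split into complete k-token blocks, dropping any partial tail."""
    full = len(tokens) - len(tokens) % k
    return [tokens[i:i + k] for i in range(0, full, k)]

stream_a = synthetic_stream(100, seed=7)
stream_b = synthetic_stream(100, seed=7)
blocks = to_blocks(stream_a)
```

Using a local `random.Random` instance keeps the stream stable even if other tests reseed or consume the global generator.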

Smoke Tests (CI-friendly)

# Dataset tooling
test_dataset_prep_deterministic()
 
# Base model loading
test_base_model_loads()
 
# Tensor shapes
test_gistnet_output_shapes()
test_lensnet_output_shapes()
 
# Budget calculations
test_working_context_budget_calculation()
 
# Legality masks
test_focus_allocator_legality_masks()

Unit Tests

# Tree operations
test_megacontext_tree_ingest()
test_megacontext_tree_persistence()
 
# Focus allocator
test_focus_allocator_greedy_algorithm()
test_focus_allocator_cooldown()
test_focus_allocator_edge_cases()
 
# LensNet
test_lensnet_conditioning_inputs()
test_lensnet_score_computation()
 
# GistNet
test_gistnet_determinism()
test_gistnet_loss_computation()
test_gistnet_substitutability()  # ≤5% ΔNLL threshold

Integration Tests

# End-to-end
test_poc_loop_with_synthetic_stream()
test_budget_invariants_maintained()
test_focus_reallocation_logging()
 
# Dataset prep
test_dataset_prep_on_sample_corpus()
 
# Base model integration
test_base_model_forward_passes()

Evaluation Metrics

Metric	Target	How to Measure
ΔNLL@H	≤0.1	Compare base vs gist-replaced predictions
Overhead	≤5%	Latency with vs without MegaContext
Swap rate	0.1–0.3 actions/block	Log focus changes per iteration
Budget compliance	100%	Assert invariants never violated
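The overhead and swap-rate rows reduce to simple ratios. A minimal sketch, assuming overhead is the fractional latency increase and swap rate is the mean number of focus actions per 32-token block; the function names and sample numbers are illustrative.

```python
def relative_overhead(latency_with, latency_without):
    """Fractional slowdown from enabling MegaContext (0.05 means 5%)."""
    if latency_without <= 0:
        raise ValueError("baseline latency must be positive")
    return (latency_with - latency_without) / latency_without

def swap_rate(actions_per_block):
    """Mean number of expand/collapse actions per refocus iteration."""
    return sum(actions_per_block) / len(actions_per_block)

def meets_targets(delta_nll, overhead, rate):
    """Check measured values against the targets in the table above."""
    return delta_nll <= 0.1 and overhead <= 0.05 and 0.1 <= rate <= 0.3

overhead = relative_overhead(latency_with=10.4, latency_without=10.0)  # about 0.04
rate = swap_rate([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])                       # 0.2
ok = meets_targets(delta_nll=0.08, overhead=overhead, rate=rate)
```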

Design Principles

Tensor-First Philosophy

Keep gist-side components tensor-first
Prefer thin Python wrappers around PyTorch modules
Persist MegaContext structures as contiguous LOD0/LOD1/LOD2 tensors
Mirror on-disk layouts instead of dense Python object graphs

Curriculum Training

Masked-attention curriculum (per Gist Token paper)
Progressively shrink working window during training
Balance context richness against storage/compute budgets

Working Context Management

MegaContext wrapper:

Owns contiguous LOD0/LOD1/LOD2 buffers
Encapsulates offsets/parent pointers
Provides iterators for enumerating legal window-sized views

WorkingContext wrapper:

Provides views for token embeddings vs gist embeddings
Utilities to materialize KV-cache keys for chosen slices
Combinator utilities for span replacement with specific gist levels

Key Invariants

See Invariants for comprehensive list. POC must maintain:

Budget Invariant: sum(entry_costs) ≤ W_max
Contiguity Invariant: entry[i].end_token == entry[i+1].start_token
Block Alignment Invariant: All boundaries align with K=32 blocks
Level Consistency Invariant: Each LOD0 entry covers 32 uncompressed tokens, each LOD1 gist covers 32 tokens, and each LOD2 gist covers 1,024 tokens
RoPE Invariant: Gists use central position index; LOD0 uses actual positions
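These invariants lend themselves to a single checker. The sketch below uses a hypothetical `Entry` record with an inclusive start and exclusive end, and it assumes the central index of a span is `start + span // 2`; the exact rounding is not specified on this page.

```python
from dataclasses import dataclass

K = 32
W_MAX = 8192
SPAN = {0: K, 1: K, 2: K * K}  # tokens covered per entry (level consistency)
COST = {0: K, 1: 1, 2: 1}      # Working Context tokens consumed per entry

@dataclass(frozen=True)
class Entry:
    level: int
    start_token: int  # inclusive

    @property
    def end_token(self):  # exclusive
        return self.start_token + SPAN[self.level]

def gist_rope_position(entry):
    """Central token index of a gist's span; LOD0 tokens keep their own positions."""
    if entry.level == 0:
        raise ValueError("LOD0 tokens use their actual positions")
    return entry.start_token + SPAN[entry.level] // 2

def check_invariants(entries, w_max=W_MAX):
    assert sum(COST[e.level] for e in entries) <= w_max, "budget"
    for e in entries:
        assert e.start_token % K == 0, "block alignment"
    for a, b in zip(entries, entries[1:]):
        assert a.end_token == b.start_token, "contiguity"
    return True

entries = [Entry(2, 0), Entry(1, 1024), Entry(0, 1056)]
valid = check_invariants(entries)
```

Running such a check after every Focus Allocator step is one way to make budget compliance an asserted property rather than an assumed one.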

What Success Looks Like

A successful POC demonstrates:

GistNet compresses 32 tokens → 1 gist with ΔNLL@H ≤ 0.1
LensNet predicts relevance and guides focus changes
Focus Allocator maintains invariants while adapting LOD
Working Context stays within budget while handling dynamic context
MegaContext Tree grows unboundedly while access stays constant-time
End-to-end system achieves <5% overhead vs frozen base model
Reproducible demos show focus adapting to changing queries

Related Pages

MegaContext PRD Index - Current milestone roadmap and phases
POC Scope - Capability guardrails and what’s excluded
POC Architecture - Module responsibilities and interfaces
Training & Operations - Training cadence and telemetry
Architecture Details - Full system design
System Properties - Core properties (constant compute, etc.)

This is the single source of truth for POC implementation details. When component files say “see POC Implementation for details,” this is where to look.
