POC Implementation Guide

Status: Historical reference for the notebook-era prototype. Use Training & Operations and TODO.md for the nanochat workflow; this page remains for context when porting legacy assumptions.

This document consolidates ALL POC-specific implementation details, parameters, configurations, and constraints. When implementing POC components, this is the single source of truth for:

  • Parameter values
  • Simplifications vs full vision
  • Module configurations
  • Technology stack
  • Testing requirements

See MegaContext PRD Index for the active roadmap, POC Scope for historical guardrails, and Migration Plan - Nanochat Integration for the nanochat-specific scaffolding.


PRD alignment & nanochat integration

  • Role of this note: captures legacy POC shortcuts (e.g., Lightning loops, manual storage formats) so we can reference them while porting features into the nanochat fork described in Migration Plan - Nanochat Integration.
  • Mapping to PRDs:
  • Nanochat hooks: wherever this note mentions notebooks/ or Lightning scripts, assume we now implement the same logic inside nanochat/train.py, nanochat/model.py, or CLI helpers; keep the configuration tables for reference when setting nanochat defaults.

Global POC Parameters

Core Configuration

| Parameter | Value | Notes |
| --- | --- | --- |
| K (Block size) | 32 tokens | Fixed for POC |
| W_max (Working Context budget) | 8,192 tokens | Configurable via YAML; future: 16k–32k |
| H (ΔNLL Horizon) | 64 tokens | For GistNet evaluation |
| N_diff (Max focus changes per step) | 4 actions | Expand/collapse limit per refocus |
| Cooldown steps | 2 iterations | Min time before block can flip actions |
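For reference, the parameters above can be captured in a single config object. The dataclass below is an illustrative sketch; `POCConfig` and `blocks_in_budget` are names we introduce here, not part of the codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class POCConfig:
    """Global POC parameters (illustrative grouping of the table above)."""
    block_size: int = 32        # K: tokens per block
    w_max: int = 8192           # Working Context budget in tokens
    horizon: int = 64           # H: ΔNLL evaluation horizon for GistNet
    n_diff: int = 4             # max expand/collapse actions per refocus
    cooldown_steps: int = 2     # min iterations before a block can flip actions
    tau_expand: float = 0.20
    tau_collapse: float = 0.20

    def blocks_in_budget(self) -> int:
        # Upper bound on fully expanded (LOD0) blocks the budget can hold.
        return self.w_max // self.block_size
```

With the defaults, an 8,192-token budget holds at most 256 fully expanded blocks.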

Hierarchy Configuration

| Level | Compression | Coverage | Token Cost in WC |
| --- | --- | --- | --- |
| LOD0 | 1:1 (no compression) | 1 token | 32 tokens per block |
| LOD1 | 32:1 | 32 LOD0 tokens | 1 token per gist |
| LOD2 | 1024:1 | 1,024 LOD0 tokens | 1 token per gist |

POC Limitation: Two gist levels only above raw LOD0 (LOD1 and the LOD2 root) - sufficient for moderate contexts.

Total compression: 32² = 1024× (with two layers)

Focus Thresholds

| Threshold | Value | Purpose |
| --- | --- | --- |
| τ_expand | 0.20 | Minimum signed score for expansion |
| τ_collapse | 0.20 | Symmetric collapse threshold |

Update Cadence

  • Refocus frequency: Every K=32 tokens
  • LensNet scoring: Once per refocus cycle
  • GistNet compression: Inline during ingest (synchronous)

POC Simplifications & Constraints

What’s Frozen in POC

  1. Base LLM: No fine-tuning during initial loop; LoRA is follow-up work
  2. GistNet checkpoint: Gists frozen to initial checkpoint during demo runs (no retraining)
  3. Hierarchy depth: Fixed at two gist levels above LOD0 (LOD1 plus the LOD2 root)
  4. Block size: K=32 hardcoded (no variable-length blocks)
  5. Storage: RAM-resident (no disk I/O or memory-mapping in POC)

What’s Simplified in POC

  1. Synchronous updates: Ingest → refocus → decode happens inline (no background workers)
  2. No streaming: Entire Working Context resides in GPU memory (no paging)
  3. Simple initial focus: May use recency bias before LensNet is trained
  4. Fixed thresholds: τ_expand and τ_collapse hardcoded at 0.2 (no adaptive)
  5. Single base model: Not multi-model or MoE
  6. Toy corpus: Project docs instead of large-scale datasets

Deferred Features

Post-POC enhancements (see Future Plan):

  1. Disk-backed storage with memory-mapped files
  2. LOD3+ hierarchy levels for billion-token contexts
  3. Incremental tree updates (rebuild only affected subtrees)
  4. Provenance tracking per node
  5. Soft deletes / pruning tiers (see MegaCuration)
  6. Version management for multiple GistNet checkpoints
  7. Differentiable focus router (learned Focus Allocator)
  8. KV-cache reuse across refocus steps
  9. Multi-head contexts with different focus policies
  10. Attention biasing for task-specific guidance

Technology Stack

Core Dependencies

| Component | Technology | Version | Purpose |
| --- | --- | --- | --- |
| Python | Python | 3.11 | Core language |
| PyTorch | PyTorch | ≥2.2 | Tensor operations |
| Transformers | HuggingFace | Latest | Base model interface |
| FlashAttention | FlashAttention 2 | Latest | Efficient attention |
| Environment | uv | Latest | Dependency management |
| Logging | Weights & Biases | Latest | Metrics tracking |

Key Commands

# Setup environment
uv venv
uv sync
 
# Run tests
uv run pytest --maxfail=1 --disable-warnings
 
# Legacy notebook workflow (deprecated)
uv run jupyter lab  # run notebooks/megacontext.ipynb and customise phases
uv run python -m tools.decode_demo --config configs/SampleText_TinyGPT2.yaml

Base Model Configuration

Primary choice: HuggingFaceTB/SmolLM3-3B

  • Precision: bf16 (bfloat16)
  • GPU requirement: 24–48 GB (for model + working context + training)
  • Alternative: Qwen/Qwen3-1.7B

Loading:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

Module-Specific Configurations

GistNet Parameters

See GistNet for architecture overview, GistNet Architecture Details for layer specs, GistNet Training for training details.

Architecture

| Parameter | POC Value | Notes |
| --- | --- | --- |
| Window size | 32 tokens | Fixed |
| Slot queries | 2 | Shared learned queries (Q₁, Q₂) |
| Layers per 32→1 block | 2 self + 2 cross-attention | |
| Refinement stack | 32→1→32→1 | Two-stage compression |
| Embedding dim | Same as base LLM (e.g., 4096) | Must match |
| Internal hidden width | 512 | Bottleneck |
| Attention heads | 8 | |
| RoPE | Applied to tokens only; slots omit it | |
| Activation | GELU | |
| Norm | Pre-LayerNorm | |
| Parameters | ~0.5M per layer | ~1M total |
| Runtime | <1 ms per 32-token span | On NVIDIA L4, bf16 |
| Output | Single g_final vector per span | Dimension = embedding_dim |

Training Loss

# Primary: Substitutability
Loss_subst = KL(P_base || P_replaced)  # Or ΔNLL@H over H=64 tokens
 
# Optional: Contrastive (prevent collapse)
Loss_contrast = max(0, margin - cosine_sim(g_i, g_j))  # margin ≈ 0.2
 
# Total
Loss = Loss_subst + 0.05 * Loss_contrast
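A hedged PyTorch rendering of the objective above; `substitutability_loss` and `contrastive_loss` are illustrative names, and the contrastive term mirrors the formula exactly as written.

```python
import torch
import torch.nn.functional as F

def substitutability_loss(base_logits, replaced_logits):
    # KL(P_base || P_replaced): how much next-token predictions shift
    # when a span is replaced by its gist.
    log_p_base = F.log_softmax(base_logits, dim=-1)
    log_p_repl = F.log_softmax(replaced_logits, dim=-1)
    return F.kl_div(log_p_repl, log_p_base, log_target=True, reduction="batchmean")

def contrastive_loss(g_i, g_j, margin=0.2):
    # Hinge on cosine similarity between gists of different spans,
    # following the formula above verbatim.
    return torch.clamp(margin - F.cosine_similarity(g_i, g_j, dim=-1), min=0).mean()

def gistnet_loss(base_logits, replaced_logits, g_i, g_j):
    return substitutability_loss(base_logits, replaced_logits) \
        + 0.05 * contrastive_loss(g_i, g_j)
```

When the replaced context reproduces the base distribution exactly, the substitutability term vanishes.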

Training Configuration

| Parameter | Value |
| --- | --- |
| Dataset | Long-form text (4k–16k tokens), chunked into 32-token spans |
| Teacher | Frozen base LLM for ΔNLL@H computation |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| Scheduler | Cosine decay |
| Precision | bf16 |
| Curriculum | Start with contiguous text, then structured data (lists, code, tables) |

Hierarchy

  • Two 32→1 layers stacked hierarchically
  • Lower layer runs on token embeddings
  • Upper layer operates on lower-layer gist outputs
  • Result: 32² = 1024 tokens per LOD2 gist
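Shape-wise, the stacking works as below; mean-pooling stands in for the trained 32→1 GistNet layers, so only the tensor shapes (not the values) are meaningful here.

```python
import torch

def pool_32_to_1(x):
    # Stand-in for one trained 32→1 GistNet layer: [n*32, d] → [n, d].
    n, d = x.shape[0] // 32, x.shape[1]
    return x.view(n, 32, d).mean(dim=1)

# 1,024 token embeddings → 32 LOD1 gists → 1 LOD2 gist
tokens = torch.randn(1024, 4096)   # lower layer input: token embeddings
lod1 = pool_32_to_1(tokens)        # [32, 4096] — one gist per 32-token block
lod2 = pool_32_to_1(lod1)          # [1, 4096]  — root gist covering 1,024 tokens
```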

LensNet Parameters

See LensNet for overview, LensNet Training for training, LensNet Scoring for score computation.

Architecture

| Parameter | POC Value | Notes |
| --- | --- | --- |
| Input embeddings | ≈8k entries | Mixed LOD0/LOD1/LOD2 from Working Context |
| Conditioning gists | 6 total | LOD2 root + 5 latest LOD1 gists |
| Down-projection width | 512 (d_lens) | Bottleneck dimension |
| Attention heads | 8 | |
| Stacks | 1–3 dual cross-attention blocks | |
| Update cadence | Every K=32 tokens | |
| Output | Signed focus score u_i per entry | Range: [-1, +1] |
| Runtime | <3 ms per update @ 8k tokens | On NVIDIA L4 |
| Parameters | ≈100k–200k total | Tiny auxiliary network |

Complexity

  • O(N × K × d_lens) per forward pass
  • With N ≈ 8k, K = 6, d_lens = 512 → ~25M multiply-adds
  • Negligible compared to base model decode
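The multiply-add estimate is simple arithmetic:

```python
# Cross-attention cost scales with N entries × K conditioning gists × d_lens.
N, K, d_lens = 8192, 6, 512
madds = N * K * d_lens  # 25,165,824 ≈ 25M multiply-adds per forward pass
print(f"{madds:,} multiply-adds")
```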

Training Loss

# 1. Regression on signed utility targets
L_reg = MSE(predictions, targets)
 
# 2. Ranking loss for ordered pairs
L_rank = softplus_ranking_loss(score_pairs)
 
# 3. Budget regularizer (zero-sum preference)
L_budget = ((P - N) / (P + N))²  # P=positive scores, N=negative
 
# 4. Illegality penalties
L_illegal = α * illegal_expand_penalty + β * illegal_collapse_penalty  # α, β ≈ 0.3
 
# Total
L_total = L_reg + 0.5 * L_rank + 0.1 * L_budget + L_illegal
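An illustrative PyTorch rendering of the combined objective. Helper names and the `pair_margins` input (score of the better-ranked entry minus the worse one, per ordered pair) are our assumptions, not the shipped API.

```python
import torch
import torch.nn.functional as F

def budget_regularizer(scores):
    # Zero-sum preference: penalize imbalance between positive and
    # negative score mass, per the ((P - N) / (P + N))² formula above.
    P = scores.clamp(min=0).sum()
    N = (-scores).clamp(min=0).sum()
    return ((P - N) / (P + N + 1e-8)) ** 2

def lensnet_loss(pred, target, pair_margins, illegal_penalty):
    l_reg = F.mse_loss(pred, target)
    # Softplus ranking: near zero when the better-ranked score
    # already exceeds the worse one by a wide margin.
    l_rank = F.softplus(-pair_margins).mean()
    l_budget = budget_regularizer(pred)
    return l_reg + 0.5 * l_rank + 0.1 * l_budget + illegal_penalty
```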

Conditioning Inputs

| Input | Shape | Purpose |
| --- | --- | --- |
| context | [N, d] | Working Context entry embeddings (≈8k) |
| tail_gists | [6, d] | LOD2 root + 5 latest LOD1 gists |
| levels | [N] | 0/1/2 markers for legality masking |
| span_width | [N] | LOD0 tokens represented per entry |
| distance_to_cursor | [N] | Block distance from decode cursor |

Focus Allocator Parameters

See Focus Allocator for algorithm, Focus Allocator Strategies for variations.

Runtime Configuration

| Parameter | Default | Notes |
| --- | --- | --- |
| τ_expand | 0.20 | Min score magnitude for expansion |
| τ_collapse | 0.20 | Symmetric collapse threshold |
| N_diff | 4 | Max expand/collapse actions per iteration |
| cooldown_steps | 2 | Min iterations before block can flip actions |
| lens_update_interval | 32 tokens (K) | LensNet runs once per block |
| tail_gist_window | 5 LOD1 + current LOD2 | Conditioning set for LensNet |

Constraints

  1. Block alignment: Every WC entry covers exactly one full 32-token block at a single LOD
  2. Action budget: Apply at most N_diff=4 operations per iteration
  3. Positional alignment: Gists reuse absolute token indices for RoPE; occupy central token index of span
  4. Legality: LOD0 blocks can’t expand further; LOD2 gists can’t collapse higher (in POC)
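The legality rules in constraint 4 reduce to two small predicates. This sketch keys them off the LOD level alone; the real checks may also consult cooldowns and the token budget.

```python
def can_expand(level: int) -> bool:
    # LOD0 is already raw tokens; only gists (LOD1, LOD2) can expand.
    return level in (1, 2)

def can_collapse(level: int) -> bool:
    # LOD2 is the POC root; only LOD0 and LOD1 can collapse further.
    return level in (0, 1)
```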

Greedy Algorithm

def focus_allocator_step(working_context, policy_scores, W_max):
    # 1. Collect candidates that clear the thresholds and pass legality checks
    expand_queue = [(score, entry) for entry, score in zip(working_context, policy_scores)
                    if score > τ_expand and can_expand(entry)]
    collapse_queue = [(score, entry) for entry, score in zip(working_context, policy_scores)
                      if score < -τ_collapse and can_collapse(entry)]
 
    # 2. Sort by score only (avoid comparing entries on ties)
    expand_queue.sort(key=lambda pair: pair[0], reverse=True)  # Highest first
    collapse_queue.sort(key=lambda pair: pair[0])              # Most negative first
 
    # 3. Apply at most N_diff operations
    actions_taken = 0
    while actions_taken < N_diff and (expand_queue or collapse_queue):
        progressed = False
 
        # Prioritize expansions while the token budget allows them
        if expand_queue and current_budget_allows_expansion():
            score, entry = expand_queue.pop(0)
            expand(entry)  # LOD1→LOD0 or LOD2→LOD1
            actions_taken += 1
            progressed = True
 
        # Balance with collapses, respecting the remaining action budget
        if collapse_queue and actions_taken < N_diff:
            score, entry = collapse_queue.pop(0)
            collapse(entry)  # LOD0→LOD1 or LOD1→LOD2
            actions_taken += 1
            progressed = True
 
        if not progressed:
            break  # Expansions blocked by budget and nothing left to collapse
 
    return working_context

Working Context Parameters

See Working Context for overview, Working Context Assembly for materialization, Working Context Refocusing for focus changes.

Budget

| Parameter | POC Value | Future |
| --- | --- | --- |
| W_max | 8,192 tokens | 16k–32k |

Entry Costs

  • LOD0 block (32 tokens): 32 tokens
  • LOD1 gist: 1 token (saves 31)
  • LOD2 gist: 1 token (saves 1023)
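These costs reduce to a pair of helpers (names are illustrative):

```python
def entry_cost(level: int) -> int:
    # LOD0 entries hold 32 raw tokens; any gist occupies a single token slot.
    return 32 if level == 0 else 1

def tokens_saved(level: int) -> int:
    # Coverage in LOD0 tokens minus the entry's cost in the Working Context.
    coverage = {0: 32, 1: 32, 2: 1024}[level]
    return coverage - entry_cost(level)
```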

Budget Invariant

assert sum(entry.cost for entry in working_context) <= W_max

See Invariants for all system invariants.

Refocus Cycle

Every K=32 tokens:
  1. Decode K tokens using current WC
  2. Ingest new tokens to MegaContext Tree
  3. LensNet scores all WC entries
  4. Focus Allocator applies up to N_diff=4 operations
  5. Repeat with updated WC
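The cycle above can be sketched as one function over pluggable components; the four callables are stand-ins for the base model, MegaContext Tree, LensNet, and Focus Allocator, and their signatures are our assumptions.

```python
def refocus_cycle(wc, decode, ingest, score, allocate, k=32):
    """One iteration of the per-block refocus cycle (illustrative)."""
    tokens = decode(wc, k)       # 1. decode K tokens using the current WC
    ingest(tokens)               # 2. append the new block to the MegaContext Tree
    scores = score(wc)           # 3. LensNet scores every WC entry
    return allocate(wc, scores)  # 4. apply up to N_diff expand/collapse ops
```

A caller would loop this over the input stream, threading the updated Working Context through each iteration.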

MegaContext Tree Parameters

See MegaContext Tree for overview, Storage Format for binary layout, Tree Operations for APIs.

Tree Structure

| Level | Compression | Coverage | Entry Type |
| --- | --- | --- | --- |
| LOD0 | 1:1 | 1 token | Token ID (uint32) |
| LOD1 | 32:1 | 32 tokens | Gist vector (fp16) |
| LOD2 | 1024:1 | 1,024 tokens | Gist vector (fp16) |

Tree Properties

  • Fixed branching factor: 32 children per node
  • Perfect alignment: Node boundaries align with 32-token blocks
  • Append-only: Historical nodes immutable (except gist refresh)
  • Balanced growth: Depth grows as log₃₂(N)
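The depth bound is a simple integer-arithmetic sketch: each extra level multiplies the leaf capacity by the branching factor.

```python
def tree_depth(n_tokens: int, branching: int = 32) -> int:
    # Levels of 32-way grouping needed so one root covers all tokens:
    # depth grows as ceil(log_32(N)).
    depth, capacity = 1, branching
    while capacity < n_tokens:
        capacity *= branching
        depth += 1
    return depth
```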

Storage Layout

See Storage Format for complete details.

Files:

  • LOD0.ctx - Raw token IDs (uint32)
  • LOD1.ctx - LOD1 gist vectors (fp16, dimension = embedding_dim)
  • LOD2.ctx - LOD2 gist vectors (fp16, dimension = embedding_dim)
  • metadata.json - Tree metadata and configuration

Header (64 bytes):

| Offset | Field | Type | Value |
| --- | --- | --- | --- |
| 0 | magic | uint32 | 0x4D434354 (“MCCT”) |
| 4 | version | uint16 | 1 (POC) |
| 6 | level | uint16 | 0, 1, or 2 |
| 8 | block_size | uint16 | 32 |
| 10 | embedding_dim | uint16 | Base model dimension |
| 12 | dtype_code | uint16 | 0=uint32, 1=fp16, 2=bf16 |
| 14 | model_name | char[32] | UTF-8 null-terminated |
| 46 | reserved | 18 bytes | Zeroed |
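Assuming little-endian layout (the spec above does not state byte order), the header can be packed with Python's struct module; `pack_header` is an illustrative helper, not the shipped API.

```python
import struct

# magic(u32), version/level/block_size/embedding_dim/dtype_code (5 × u16),
# model_name (32 bytes), reserved (18 zero bytes) → 64 bytes total.
HEADER_FMT = "<I5H32s18x"

def pack_header(level, embedding_dim, dtype_code, model_name, block_size=32):
    return struct.pack(
        HEADER_FMT,
        0x4D434354,                       # magic "MCCT"
        1,                                # version (POC)
        level,
        block_size,
        embedding_dim,
        dtype_code,
        model_name.encode("utf-8")[:31],  # struct zero-pads to 32 bytes
    )
```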

POC Simplification: RAM-resident (no disk I/O or memory-mapping yet)


Sample Configuration File

File: configs/Gutenberg_SmolLM3.yaml

name: Gutenberg_SmolLM3
description: Gutenberg subset with SmolLM3-3B base model and two-stage GistNet training.
 
dataset:
  dataset_name: gutenberg_sample
  tokenizer: HuggingFaceTB/SmolLM2-360M-Instruct
  block_size: 32
  context_tokens: 512
  context_stride: 512
  horizon: 32
  teacher_model: HuggingFaceTB/SmolLM2-360M-Instruct
  splits:
    train:
      source: ../data/raw/gutenberg/**/*.txt
      output_path: ../data/gutenberg_sample/train.arrow
 
base_model:
  name: HuggingFaceTB/SmolLM3-3B
  torch_dtype: bfloat16
  run_name: poc_smollm3_l4
 
gistnet:
  model:
    hidden_size: auto
    block_size: 32
    num_heads: 16
    mlp_ratio: 4.0
  training:
    batch_size: 8
    precision: bf16-mixed
    phases:
      - name: pooling-pretrain
        objective: pooling_mse
        max_steps: 2000
        window_tokens: 512
        lr: 0.001
      - name: delta-finetune
        objective: delta_nll
        max_steps: 1000
        window_tokens: 512
        lr: 0.0005

Testing Requirements

Determinism

All POC tests must be deterministic:

  • Seeded RNG: Fixed random seeds for reproducibility
  • Deterministic blocks: 32-token blocks from dataset prep
  • Round-trip tests: Tree persistence and recovery
  • Synthetic streams: Deterministic test inputs

Smoke Tests (CI-friendly)

# Dataset tooling
test_dataset_prep_deterministic()
 
# Base model loading
test_base_model_loads()
 
# Tensor shapes
test_gistnet_output_shapes()
test_lensnet_output_shapes()
 
# Budget calculations
test_working_context_budget_calculation()
 
# Legality masks
test_focus_allocator_legality_masks()

Unit Tests

# Tree operations
test_megacontext_tree_ingest()
test_megacontext_tree_persistence()
 
# Focus allocator
test_focus_allocator_greedy_algorithm()
test_focus_allocator_cooldown()
test_focus_allocator_edge_cases()
 
# LensNet
test_lensnet_conditioning_inputs()
test_lensnet_score_computation()
 
# GistNet
test_gistnet_determinism()
test_gistnet_loss_computation()
test_gistnet_substitutability()  # ≤5% ΔNLL threshold

Integration Tests

# End-to-end
test_poc_loop_with_synthetic_stream()
test_budget_invariants_maintained()
test_focus_reallocation_logging()
 
# Dataset prep
test_dataset_prep_on_sample_corpus()
 
# Base model integration
test_base_model_forward_passes()

Evaluation Metrics

| Metric | Target | How to Measure |
| --- | --- | --- |
| ΔNLL@H | ≤0.1 | Compare base vs gist-replaced predictions |
| Overhead | ≤5% | Latency with vs without MegaContext |
| Swap rate | 0.1–0.3 actions/block | Log focus changes per iteration |
| Budget compliance | 100% | Assert invariants never violated |

Design Principles

Tensor-First Philosophy

  • Keep gist-side components tensor-first
  • Prefer thin Python wrappers around PyTorch modules
  • Persist MegaContext structures as contiguous LOD0/LOD1/LOD2 tensors
  • Mirror on-disk layouts instead of dense Python object graphs

Curriculum Training

  • Masked-attention curriculum (per Gist Token paper)
  • Progressively shrink working window during training
  • Balance context richness against storage/compute budgets

Working Context Management

MegaContext wrapper:

  • Owns contiguous LOD0/LOD1/LOD2 buffers
  • Encapsulates offsets/parent pointers
  • Provides iterators for enumerating legal window-sized views

WorkingContext wrapper:

  • Provides views for token embeddings vs gist embeddings
  • Utilities to materialize KV-cache keys for chosen slices
  • Combinator utilities for span replacement with specific gist levels

Key Invariants

See Invariants for comprehensive list. POC must maintain:

  1. Budget Invariant: sum(entry_costs) ≤ W_max
  2. Contiguity Invariant: entry[i].end_token == entry[i+1].start_token
  3. Block Alignment Invariant: All boundaries align with K=32 blocks
  4. Level Consistency Invariant: an LOD0 entry holds 32 raw tokens; an LOD1 gist covers 32 tokens; an LOD2 gist covers 1,024 tokens
  5. RoPE Invariant: Gists use central position index; LOD0 uses actual positions
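A minimal validator sketch for invariants 1–3, assuming Working Context entries are represented as (start_token, end_token, lod) tuples; the representation and `check_invariants` name are ours.

```python
def check_invariants(entries, w_max=8192, k=32):
    """entries: list of (start_token, end_token, lod) tuples in WC order."""
    # 1. Budget: LOD0 entries cost 32 tokens, gists cost 1.
    cost = sum(k if lod == 0 else 1 for _, _, lod in entries)
    assert cost <= w_max, "budget invariant violated"
    # 2. Contiguity: each entry starts where the previous one ends.
    for (_, e, _), (s2, _, _) in zip(entries, entries[1:]):
        assert e == s2, "contiguity invariant violated"
    # 3. Block alignment: all boundaries fall on K=32 block edges.
    for s, e, _ in entries:
        assert s % k == 0 and e % k == 0, "block alignment invariant violated"
    return cost
```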

What Success Looks Like

A successful POC demonstrates:

  1. GistNet compresses 32 tokens → 1 gist with ΔNLL@H ≤ 0.1
  2. LensNet predicts relevance and guides focus changes
  3. Focus Allocator maintains invariants while adapting LOD
  4. Working Context stays within budget while handling dynamic context
  5. MegaContext Tree grows unboundedly while access stays constant-time
  6. End-to-end system achieves <5% overhead vs frozen base model
  7. Reproducible demos show focus adapting to changing queries


This is the single source of truth for POC implementation details. When component files say “see POC Implementation for details,” this is where to look.