Status: Archived POC plan for historical reference. The nanochat workflow and POR requirements now live in Training & Operations + MegaContext PRD Index; use this document only for context.
Proof-of-concept milestone covering repository readiness through demo documentation.
- Phase 0 readies tooling and docs for reproducibility.
- Phase 1 stands up the frozen runtime skeleton.
- Phase 2 delivers minimal GistNet compression with evaluation.
- Phase 3 integrates LensNet, focus allocator, and the end-to-end loop.
- Phase 4 ships the demo benchmark and documentation artifacts.
Details
Canonical plan for the MegaContext proof-of-concept milestone.
This milestone isolates the minimum “hot path” required to demonstrate MegaContext end-to-end on a single base model. The goal is to show that hierarchical gists plus dynamic focus can keep a frozen LLM within a fixed working window while retaining task-relevant history. We aim for clarity, determinism, and reproducibility rather than exhaustive optimization.
Phase 0 — Repository Readiness
Goal: Ensure contributors can reproduce the prototype quickly.
- Task 0.1: Verify the `uv` bootstrap script provisions the environment, installs dependencies, and documents canonical commands (`uv venv`, `uv run pytest`, `uv run python -m …`).
- Task 0.2: Refresh `README.md` to reference the POC plan, describe prerequisites (Python 3.11, GPU requirements), and outline the expected proof demo.
- Task 0.3: Confirm smoke tests for dataset tooling and base model stubs run in CI.
Phase 1 — Base Runtime Skeleton
Goal: Stand up the frozen base-model runtime and data pipelines used later in the loop.
- Task 1.1: Finalize `tools/prepare_dataset.py` and associated configs to produce deterministic 32-token blocks from a toy corpus (e.g., project docs); see the block-preparation sketch after this list.
- Task 1.2: Implement `src/runtime/base_model.py` targeting `HuggingFaceTB/SmolLM3-3B` (bf16) with forward helpers used across the prototype.
- Task 1.3: Provide `src/runtime/working_context.py` (token pass-through version) and unit tests covering tensor shapes, legality masks, and budget calculations.
- Task 1.4: Add a CLI demo (`uv run python -m tools.decode_demo --config configs/SampleText_TinyGPT2.yaml`) that streams a short prompt to confirm logits generation.
- Task 1.5: Define a modality-tagging schema (text vs image placeholder) in dataset outputs and `WorkingContext` metadata, mirroring the expectations in Multimodal MegaContext even if Phase 1 only exercises text.
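To make the Task 1.1 block layout concrete, here is a minimal sketch of deterministic 32-token chunking with the Task 1.5 modality tag attached; the function name and record fields are illustrative, not the actual `tools/prepare_dataset.py` interface.

```python
# Minimal sketch of deterministic 32-token block preparation (illustrative names).
from transformers import AutoTokenizer

BLOCK_TOKENS = 32  # LOD0 granularity used throughout the POC

def prepare_blocks(text: str, model_id: str = "HuggingFaceTB/SmolLM3-3B") -> list[dict]:
    """Tokenize a toy corpus and cut it into deterministic 32-token blocks."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    usable = len(ids) - (len(ids) % BLOCK_TOKENS)  # drop the ragged tail
    blocks = []
    for start in range(0, usable, BLOCK_TOKENS):
        blocks.append(
            {
                "block_id": start // BLOCK_TOKENS,
                "token_ids": ids[start : start + BLOCK_TOKENS],
                "modality": "text",  # Task 1.5 modality tag; image placeholder comes later
            }
        )
    return blocks
```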
Exit criteria: Dataset prep works on a sample corpus, base model forwards succeed, smoke tests pass under CI.
Phase 2 — Minimal Gist Compression
Goal: Train and validate a single-level (32→1) gist model sufficient to replace segments without catastrophic degradation.
- Design directive: keep gist-side components tensor-first. Prefer thin Python wrappers around PyTorch modules and persist MegaContext structures as contiguous LOD0/LOD1/LOD2 tensors that mirror on-disk layouts instead of dense Python object graphs.
- Task 2.1: Implement `src/gistnet/blocks.py` and `src/gistnet/model.py` with RoPE-enabled self-attention [1], shared slot queries [2], and residual MLPs outputting the base embedding dimension (see the module sketch after this list).
- Task 2.2: Extend dataset tooling to emit paired `(tokens, gist_tokens)` batches over horizon `H=64`; cache teacher embeddings for repeatability.
- Task 2.3: Build a Lightning-based trainer (`megacontext.gistnet.lightning`) with a masked-attention curriculum (per Gist Token paper) and W&B logging of ΔNLL@H.
- Task 2.4: Add unit tests for determinism (seeded RNG) plus a smoke eval comparing base vs gist-replaced loss with ≤5% degradation on the toy corpus.
- Task 2.5: Document gist training steps and expected metrics in `notebooks/megacontext.ipynb` (Jupyter notebook with Lightning workflow).
- Task 2.6: Revise dataset preparation to output 4k-token context slices with cached teacher hidden states and future horizons, along with stride/layout metadata described in the schema sketch (gists/logits generated later during training).
- Task 2.7: Introduce `MegaContext`/`WorkingContext` tensor wrappers that hydrate from dataset slices plus the current GistNet, manage contiguous LOD0/LOD1/LOD2 buffers and offsets, and provide iterators for enumerating all legal `W_l` windows and gist substitution patterns.
- Task 2.8: Upgrade the training loop to batch working-context windows, run prediction through the base model (replaying cached teacher hidden states/logits where applicable), and optimize ΔNLL/logit agreement with a curriculum that progressively shrinks `W_l`.
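For orientation, a minimal sketch of the 32→1 compression idea behind Task 2.1 follows. It uses shared slot queries [2] and a residual MLP but omits RoPE and the masked-attention curriculum; the class name, hidden size, and head count are illustrative rather than the real `src/gistnet/model.py` API.

```python
# Hedged sketch of a single-level 32 -> 1 gist compressor (illustrative names).
import torch
from torch import nn

class GistPooler(nn.Module):
    def __init__(self, d_model: int = 2048, n_heads: int = 8, n_slots: int = 1):
        super().__init__()
        # Shared slot queries [2]: the same learned queries compress every block.
        self.slot_queries = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Residual MLP keeps the output in the base model's embedding space.
        self.mlp = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, block_embeddings: torch.Tensor) -> torch.Tensor:
        """block_embeddings: (batch, 32, d_model) -> gist: (batch, n_slots, d_model)."""
        batch = block_embeddings.shape[0]
        queries = self.slot_queries.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.cross_attn(queries, block_embeddings, block_embeddings)
        return pooled + self.mlp(pooled)

# Usage: one gist embedding per 32-token block (d_model should match the base model).
gist = GistPooler()(torch.randn(4, 32, 2048))  # -> (4, 1, 2048)
```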
Progress (current): core GistNet modules, dataset tooling with teacher caches, trainer scaffold, and notebook documentation are implemented; outstanding work covers curriculum training, ΔNLL smoke evals, and logging.
Notes: Tail→gist substitution is an intentional waypoint that lets us validate the compressor before expanding to full-tree summarisation. GistNet distills pooled teacher hidden states so we can regenerate contextualised gists without the teacher at runtime; naive mean-embedding pooling remains a baseline. Retain the pooling-MSE training path long term so we can compare (a) pure pooling loss, (b) pooling pretrain + ΔNLL fine-tune, and (c) ΔNLL-from-scratch across model variants—the Lightning trainer in megacontext.gistnet.lightning supports multi-phase schedules so we can automate those comparisons. ΔNLL stays the gold-standard substitutability check—we audit with it now and plan to fine-tune against it once the end-to-end loop is wired. The dataset continues to cache a wider teacher horizon so targets stay context-rich, balanced against storage/compute budgets.
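As a concrete reference for the ΔNLL@H audit described above, here is a hedged sketch that scores the same horizon tokens under the original context and under a gist-substituted context. It assumes a frozen Hugging Face causal LM that accepts `inputs_embeds`; the helper names are illustrative.

```python
# Hedged sketch of the ΔNLL@H substitutability check (illustrative helper names).
import torch
import torch.nn.functional as F

@torch.no_grad()
def horizon_nll(model, context_embeds: torch.Tensor, horizon_ids: torch.Tensor) -> float:
    """context_embeds: (1, C, d); horizon_ids: (1, H) tokens to be predicted."""
    horizon_embeds = model.get_input_embeddings()(horizon_ids)
    inputs = torch.cat([context_embeds, horizon_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Logits at positions C-1 .. C+H-2 predict the H horizon tokens.
    start = context_embeds.shape[1] - 1
    pred = logits[:, start : start + horizon_ids.shape[1], :]
    return F.cross_entropy(pred.transpose(1, 2), horizon_ids).item()

def delta_nll(model, token_context, gist_context, horizon_ids) -> float:
    """Positive values mean the gist-substituted context predicts the horizon worse."""
    return horizon_nll(model, gist_context, horizon_ids) - horizon_nll(
        model, token_context, horizon_ids
    )
```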
Upcoming extensions under evaluation:
- Treat the current pooled hidden-state regression as a stepping stone. Long-term we want to phase it out in favour of training directly on prediction fidelity.
- Expand dataset prep to emit 4k-token contexts plus 32–64 token horizons, caching the teacher’s final hidden states so later stages can reconstruct rollouts and ΔNLL without re-running the model.
- Proposed schema sketch: store `context_tokens` (4k LOD0 ids), `context_mask`, cached teacher hidden stacks (one per block), and `future_tokens` for the horizon, alongside metadata that captures tokenizer, stride, and window offsets. Training will materialize the MegaContext hierarchy on the fly with the latest GistNet checkpoints, so no gists/logits are persisted in the dataset.
- Generate hierarchical LOD1/LOD2 gists for each slice at training time so experiments always reflect the current compressor.
- Construct batched working-context windows of width `W_l` by sliding across each prepared MegaContext (optionally ordered for KV-cache reuse), run the base model forward with those contexts, and compare the horizon rollout in latent/logit space to the teacher outputs to maximize gist substitutability.
- Prototype a curriculum that gradually shrinks `W_l` during training to push the model toward using higher-level, lossy summaries.
- Introduce lightweight `MegaContext`/`WorkingContext` tensor wrappers that own contiguous LOD0/LOD1/LOD2 buffers, expose combinator utilities (e.g., enumerate all legal `W_l`-sized windows, replace spans with specific gist levels), and surface batching hooks the trainer can call without hand-rolling slicing logic. The `MegaContext` helper should encapsulate offsets/parent pointers and emit iterator handles (or precomputed index tensors) that the trainer can batch together; `WorkingContext` should provide views for token embeddings vs gist embeddings, plus utilities to materialize KV-cache keys for a chosen slice (sketched below).
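A minimal sketch of the window-enumeration idea from the wrapper item above; the names, legality rules, and the tail/prefix policy are illustrative assumptions, not the final `MegaContext`/`WorkingContext` API.

```python
# Hedged sketch: enumerate legal working-context windows over contiguous LOD buffers.
from dataclasses import dataclass
import torch

BLOCK = 32  # LOD0 tokens per LOD1 gist

@dataclass
class MegaContextBuffers:
    lod0: torch.Tensor  # (n_tokens, d) raw token embeddings
    lod1: torch.Tensor  # (n_tokens // BLOCK, d) one gist per 32-token block

    def working_windows(self, w_l: int):
        """Yield working contexts with token-equivalent cost <= w_l.

        Each window keeps the most recent `tail_blocks` raw-token blocks and covers
        as much of the earlier prefix as the remaining budget allows with LOD1 gists
        (one slot each), so total cost never exceeds w_l.
        """
        n_blocks = self.lod1.shape[0]
        for tail_blocks in range(1, min(n_blocks, w_l // BLOCK) + 1):
            tail_tokens = tail_blocks * BLOCK
            gist_budget = w_l - tail_tokens                     # slots left for gists
            prefix_blocks = min(n_blocks - tail_blocks, gist_budget)
            start_block = n_blocks - tail_blocks - prefix_blocks
            gists = self.lod1[start_block : n_blocks - tail_blocks]
            tokens = self.lod0[(n_blocks - tail_blocks) * BLOCK : n_blocks * BLOCK]
            yield torch.cat([gists, tokens], dim=0)             # (<= w_l, d)

# Usage: enumerate candidate windows for a training batch.
mc = MegaContextBuffers(lod0=torch.randn(4096, 2048), lod1=torch.randn(128, 2048))
windows = list(mc.working_windows(w_l=256))
```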
Exit criteria: Gist checkpoints reproduce ΔNLL targets, deterministic tests pass, and documentation explains the compression pipeline.
Phase 3 — LensNet, Focus Allocator, and Runtime Loop
Goal: Integrate MegaContext Tree storage with dynamic focus so a streaming run respects a fixed working budget.
- Task 3.1: Implement `src/megacontext/memory/tree.py` with ingest/update APIs for 32-token blocks, node metadata (span id, level, offsets), and persistence to `{LOD0,LOD1,LOD2}.ctx`; include round-trip tests.
- Task 3.2: Finish `src/runtime/working_context.py` to tile LOD0 tokens and LOD1 gists contiguously, reporting token-equivalent costs.
- Task 3.3: Build `src/lensnet/model.py` with dual cross-attention (working entries ↔ gist cache) and a signed focus score head; create `src/lensnet/dataloader.py` to replay working-context snapshots.
- Task 3.4: Implement `src/runtime/focus_allocator.py` with greedy expand/collapse respecting contiguity, cooldowns, and budget hysteresis. Include Slot-Attention-style normalised competition to keep allocation bounded (see the allocator sketch after this list).
- Task 3.5: Assemble `src/runtime/engine.py` that ingests streams, updates the gist tree, queries LensNet, applies focus adjustments, and decodes via the base model.
- Task 3.6: Provide unit tests across components (tree ingest, allocator edge cases, LensNet mask handling) plus an integration test with a deterministic synthetic stream.
- Task 3.7: Ship `tools/run_poc_loop` (invoked as `uv run python -m tools.run_poc_loop --config configs/Gutenberg_SmolLM3.yaml`) to demonstrate a short session that expands/collapses spans while respecting the budget.
- Task 3.8: Add the positional adapter described in Positional Encoding (NTK-scaled RoPE [1], LOD-aware Gaussian damping, and lightweight ALiBi LoRA [3]), along with regression tests that verify ΔNLL stability for long-range, teleported spans.
- Task 3.9: Stub an image-ingest path that converts ViT patch embeddings into no-op gists and flows modality metadata through the runtime (consume-only); align interfaces with Multimodal MegaContext while keeping the demo text-only.
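As referenced in Task 3.4, here is a hedged sketch of one greedy expand/collapse pass. It assumes a flat list of working-context entries carrying signed LensNet scores, collapses hysteresis and cooldowns into a single counter, and uses illustrative names rather than the real `src/runtime/focus_allocator.py` API.

```python
# Hedged sketch of one greedy focus-allocation step (illustrative names and rules).
from dataclasses import dataclass

@dataclass
class Entry:
    span_id: int
    level: int        # 0 = raw tokens, 1 = LOD1 gist, 2 = LOD2 gist
    cost: int         # token-equivalent width in the working context
    score: float      # signed LensNet focus score (positive = wants more detail)
    cooldown: int = 0 # steps before this span may move again

def greedy_focus_step(entries: list[Entry], budget: int) -> list[tuple[str, int]]:
    """Return (action, span_id) decisions for one allocator step."""
    actions: list[tuple[str, int]] = []
    used = sum(e.cost for e in entries)

    # Expand the most positive gists while the budget allows (one level = 32x cost).
    for e in sorted(entries, key=lambda e: e.score, reverse=True):
        if e.level == 0 or e.score <= 0 or e.cooldown > 0:
            continue
        expanded_cost = e.cost * 32
        if used - e.cost + expanded_cost <= budget:
            actions.append(("expand", e.span_id))
            used += expanded_cost - e.cost
            e.level, e.cost, e.cooldown = e.level - 1, expanded_cost, 4

    # Collapse the most negative raw-token spans until we are back under budget.
    for e in sorted(entries, key=lambda e: e.score):
        if used <= budget:
            break
        if e.level == 0 and e.score < 0 and e.cooldown == 0:
            collapsed_cost = max(1, e.cost // 32)
            actions.append(("collapse", e.span_id))
            used -= e.cost - collapsed_cost
            e.level, e.cost, e.cooldown = 1, collapsed_cost, 4
    return actions
```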
Exit criteria: End-to-end loop runs on the demo corpus, logs focus actions, maintains budget invariants, and tests cover core behavior.
Phase 4 — Proof Demo & Documentation
Goal: Capture evidence that MegaContext works and explain the prototype clearly.
- Task 4.1: Create a minimal benchmark script comparing (a) a baseline LLM run and (b) a MegaContext-enabled run on a single synthetic task; report loss, swap rate, and latency (see the harness sketch after this list).
- Task 4.2: Record a short walkthrough (structured log or screenshots) showing focus reallocations in the demo run.
- Task 4.3: Update `README.md` with a POC Architecture summary: Architecture Details sketch, command sequence, expected outcomes, troubleshooting.
- Task 4.4: Ensure `docs/poc_results.md` captures metrics, figures, and lessons learned leading into the next milestone per the Research Papers plan and Future Plan.
- Task 4.5: Document multimodal readiness: summarize the placeholder ingest path, positional expectations, and next-step gaps referencing Multimodal MegaContext and Positional Encoding.
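To pin down what the Task 4.1 script should report, a hedged sketch of the comparison harness follows; `run_session` and the metric fields are hypothetical placeholders for whatever entry point the engine ends up exposing.

```python
# Hedged sketch of the benchmark harness: same metrics for baseline and MegaContext runs.
import time
from dataclasses import asdict, dataclass

@dataclass
class RunMetrics:
    label: str         # "baseline" or "megacontext"
    mean_nll: float    # loss over the synthetic task's horizon tokens
    swap_rate: float   # expand/collapse actions per decoded token (0 for baseline)
    latency_s: float   # wall-clock time for the full session

def benchmark(run_session, stream, label: str) -> RunMetrics:
    start = time.perf_counter()
    nll, swaps, tokens = run_session(stream)  # hypothetical engine entry point
    return RunMetrics(
        label=label,
        mean_nll=nll,
        swap_rate=swaps / max(tokens, 1),
        latency_s=time.perf_counter() - start,
    )

def summarize(runs: list[RunMetrics]) -> list[dict]:
    """Rows the demo script can dump (e.g. as JSON) into docs/poc_results.md."""
    return [asdict(r) for r in runs]
```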
Exit criteria: Demo artifacts prove that MegaContext can retain context beyond the Working Context window with manageable complexity per the POC Scope, unlocking the Training & Operations and Performance Sketch goals for the PAPER milestone.
References
- [1] RoPE (Su et al., 2021) — Analysis — Rotary position embeddings used throughout MegaContext
- [2] Slot Attention (Locatello et al., 2020) — Analysis — Object-centric iterative attention mechanism
- [3] LoRA (Hu et al., 2021) — Analysis — Low-rank adaptation used in GistNet/LensNet training
See Related Work for the complete bibliography of all research papers referenced throughout the documentation.