The runtime loop ingests tokens into the MegaContext Tree, scores the working window via LensNet, applies the Focus Allocator, and decodes with the frozen base LLM while logging telemetry.
- Stages: ingest → gist update → focus scoring → allocation → decode → telemetry.
- Budget: the Working Context remains within `W_max` using block-aligned actions.
- Demo goals: a planned `tools/run_poc_loop` script will showcase expansion/collapse within limits once the runtime loop lands (tracked in MegaContext End-to-End Training).
- Telemetry: swap rates, ΔNLL, and latency feed the pruning and training loops.
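The staged pipeline above can be sketched as a single step function. This is a minimal illustration, not the project's actual API: the component classes, method names (`append`, `refresh`, `score`, `apply`, `decode`, `working_window`), and the `W_MAX` value are all hypothetical stand-ins for GistNet, LensNet, the Focus Allocator, and the frozen base model.

```python
from dataclasses import dataclass, field

BLOCK = 32      # tokens per block (per the ingest stage)
W_MAX = 4096    # working-window budget W_max (illustrative value)

@dataclass
class RuntimeLoop:
    """Minimal sketch of the six-stage loop; all component objects are stubs."""
    tree: object        # MegaContext Tree
    gistnet: object
    lensnet: object
    allocator: object
    base_model: object
    telemetry: list = field(default_factory=list)

    def step(self, tokens):
        # Ingest: chunk incoming tokens into 32-token blocks and append them.
        blocks = [tokens[i:i + BLOCK] for i in range(0, len(tokens), BLOCK)]
        for block in blocks:
            self.tree.append(block)
        # Gist update: keep hierarchical summaries in sync with new tokens.
        self.gistnet.refresh(self.tree)
        # Focus scoring: signed scores over the working window.
        scores = self.lensnet.score(self.tree.working_window())
        # Allocation: expand/collapse spans while respecting the budget.
        swaps = self.allocator.apply(scores, budget=W_MAX)
        # Decode with the frozen base model.
        logits = self.base_model.decode(self.tree.working_window())
        # Telemetry: log swap events for pruning and training loops.
        self.telemetry.append({"swaps": swaps})
        return logits
```

The loop itself stays trivial; all learning lives in GistNet and LensNet, while the base model remains frozen.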
Details
The streaming runtime keeps a frozen base LLM within a fixed working window while preserving the full MegaContext history.
End-to-end flow
- Ingest & chunk: incoming text is tokenized into 32-token blocks and appended to the MegaContext gist tree (`src/megacontext/memory/tree.py`).
- Gist updates: GistNet generates or refreshes hierarchical summaries so higher-level nodes stay in sync with new tokens.
- Focus scoring: LensNet evaluates the working window, producing signed scores that suggest which spans deserve finer or coarser detail.
- Allocation step: the Focus Allocator expands high-score spans to raw tokens or collapses low-score spans to gists, ensuring the window stays within `W_max`.
- Decode: the Working Context feeds the frozen base model (`src/runtime/base_model.py`), producing next-token predictions or downstream logits.
- Telemetry: swap events, ΔNLL comparisons, and latency stats are logged for analysis and future tuning.
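The allocation step is the only stage that touches the budget, so it is worth making concrete. The sketch below is an assumption-laden illustration of a greedy, budget-bounded allocator: the span representation, cost model (a gist occupies one summary slot, a raw span occupies `blocks * 32` tokens), and function signature are all hypothetical, but the core invariant matches the text: collapse negatively scored spans, then expand the highest-scoring gists only while the window stays within `W_max`.

```python
def allocate(spans, scores, w_max, block=32):
    """Greedy, budget-bounded focus allocation (illustrative, not the real API).

    spans:  span_id -> {"state": "gist" | "raw", "blocks": n}
    scores: span_id -> signed LensNet score (positive = wants raw detail,
            negative = fine to summarize)
    Returns the token cost of the window after allocation (<= w_max).
    """
    def cost(s):
        return s["blocks"] * block if s["state"] == "raw" else 1

    # Collapse negatively scored raw spans first to free budget.
    for sid in sorted(spans, key=lambda s: scores[s]):
        if scores[sid] < 0 and spans[sid]["state"] == "raw":
            spans[sid]["state"] = "gist"

    used = sum(cost(s) for s in spans.values())

    # Expand the highest-scoring gists while the window fits within w_max.
    for sid in sorted(spans, key=lambda s: scores[s], reverse=True):
        s = spans[sid]
        if scores[sid] > 0 and s["state"] == "gist":
            extra = s["blocks"] * block - 1  # raw cost minus the freed gist slot
            if used + extra <= w_max:
                s["state"] = "raw"
                used += extra
    return used
```

Because actions are block-aligned, expansion is all-or-nothing per span: a span that would overflow the budget simply stays at gist resolution rather than being partially expanded.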
Demo targets
- Near-term goal: a dedicated `tools/run_poc_loop` entry point (TBD) will stream a synthetic session, showing expansion/collapse while maintaining budget invariants (MegaContext End-to-End Training, POC Architecture). Until then, use the nanochat trainer or unit tests to exercise individual components.
- Research milestone: benchmarking harnesses compare MegaContext runs against baselines and track swap rate, loss, and latency (Research Papers sequence, see Paper 2).
Focus heuristics
- Greedy but bounded: Hysteresis and cooldowns prevent thrashing when spans hover near the decision boundary.
- Multi-scale awareness: Bundles of raw tokens plus their parent gists let the allocator choose hybrid representations.
- Telemetry-driven evolution: Access counts and ΔNLL sensitivity inform pruning strategies explored in Training & Operations Track B/D of Future Plan.
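The hysteresis-and-cooldown heuristic from the first bullet can be captured in a small gate. This is a sketch under assumed parameters, not the project's implementation: the threshold `hi`, the `cooldown` length, and the `decide` interface are invented for illustration. A span only swaps when its score clears a symmetric dead band, and after any swap it is frozen for a few steps, which is what prevents thrashing near the decision boundary.

```python
class FocusGate:
    """Hysteresis + cooldown guard against swap thrashing (illustrative)."""

    def __init__(self, hi=0.5, cooldown=3):
        self.hi = hi              # dead-band half-width: |score| must exceed this
        self.cooldown = cooldown  # steps a span stays frozen after a swap
        self.frozen = {}          # span_id -> cooldown steps remaining

    def decide(self, span_id, score, state):
        """Return 'expand', 'collapse', or 'hold' for one span."""
        left = self.frozen.get(span_id, 0)
        if left > 0:
            # Still cooling down from a recent swap: ignore the score.
            self.frozen[span_id] = left - 1
            return "hold"
        if state == "gist" and score > self.hi:
            self.frozen[span_id] = self.cooldown
            return "expand"
        if state == "raw" and score < -self.hi:
            self.frozen[span_id] = self.cooldown
            return "collapse"
        return "hold"  # score inside the dead band: no action
```

A span whose score hovers around zero never crosses the band, and a span that just expanded cannot immediately collapse even if its score flips sign, so oscillation is bounded by the cooldown period.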