Cognitive-Core MegaContext — Training and Evaluation Plan
Status: Plan of record (POR). Use this PRD for Cognitive Core requirements; earlier POC/vision docs are reference only.
Overview
The Cognitive Core MegaContext Large Language Model (C²MC-LLM) extends the MegaContext framework to train a model that relies on a persistent, composable set of MegaContexts rather than memorizing facts in its weights. This document defines the training, inference, and evaluation procedures for C²MC-LLM, aligning with the end-to-end MegaContext + MegaPrediction ecosystem.
1. Motivation
Most LLMs embed knowledge directly into weights, which is:
- costly to update,
- prone to factual drift,
- memory-inefficient (facts occupy dense model parameters).
C²MC-LLM externalizes knowledge and long-term context into persistent MegaContexts (MCs). It enables:
- Instant knowledge updates (e.g., via Core Knowledge MC updates).
- Smaller, reasoning-focused models (“cognitive core”).
- Deep integration of retrieval and memory, eliminating tool-calling overhead while remaining compatible with RAG pipelines.
Once deployed, a C²MC-LLM ships with a Core Knowledge MegaContext (CK-MC) and expects to operate with it. However, this CK-MC is only one component of a larger composite MegaContext, which may include:
- CK-MC — permanent factual and procedural knowledge.
- Session-MC — ephemeral working memory (conversations, tasks).
- Recent-Events MC — continuously updated summaries of news or sensor feeds.
- Domain/Proprietary MCs — organizational or application-specific datasets.
- File Watcher MC — indexed, live-updating summaries of local or cloud storage.
Each of these MCs can be dynamically combined by the Focus Allocator and LensNet into a single Working Context (WC), creating a unified memory substrate analogous to a deeply integrated RAG system — without tool calls.
2. Architecture Components
| Component | Role |
|---|---|
| GistNet | Summarizes token spans into multi-level gists (LOD1, LOD2). |
| LensNet | Scores which spans to expand or compress; trained via ΔNLL supervision. |
| MegaContexts (MCs) | Persistent hierarchical trees of gists (CK, Session, etc.). |
| Focus Allocator | Selects and merges spans from multiple MCs into a fixed-size Working Context (WC). |
| Base Model (Cognitive Core) | Trained transformer performing next-token and LOD-level predictions (MegaPrediction). |
3. Core Knowledge MegaContext (CK-MC)
- Dataset: a curated factual corpus (a "MegaCuration", ~100M tokens) representing stable, high-value world knowledge.
- Processing:
  - Chunk the corpus into 32-token spans.
  - Summarize spans via a pretrained GistNet into a multi-level hierarchy (LOD1, LOD2, etc.).
  - Optionally prune into a sparse tree (anchors at LOD1/LOD2).
- Initial CK Working Context (CKWC):
  - LensNet + Focus Allocator produce a CKWC of size CL (e.g., 4k–8k tokens).
  - The CKWC acts as a "starter memory" for any prompt.
  - It is refocused per query by LensNet during inference.
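As a concrete sketch of the processing step above, the snippet below chunks a token stream into 32-token spans and pools span embeddings into a two-level gist hierarchy. Mean-pooling stands in for the pretrained GistNet, and the `FANOUT` of 4 is an illustrative assumption, not a spec value.

```python
# Toy sketch of CK-MC construction: 32-token spans -> LOD1/LOD2 hierarchy.
# Mean-pooling is a stand-in for GistNet; FANOUT is a hypothetical branching factor.
from typing import List

SPAN = 32    # LOD0 span size, per the plan
FANOUT = 4   # assumed number of LOD1 gists summarized into one LOD2 gist

def chunk_spans(tokens: List[int], span: int = SPAN) -> List[List[int]]:
    """Split a token stream into fixed-size spans (the last span may be short)."""
    return [tokens[i:i + span] for i in range(0, len(tokens), span)]

def pool(vectors: List[List[float]]) -> List[float]:
    """Mean-pool a group of vectors -- a stand-in for a GistNet summary."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def build_hierarchy(span_embeddings: List[List[float]], fanout: int = FANOUT):
    """Produce LOD1 (one gist per span) and LOD2 (one gist per `fanout` spans)."""
    lod1 = span_embeddings  # identity here; GistNet output in practice
    lod2 = [pool(lod1[i:i + fanout]) for i in range(0, len(lod1), fanout)]
    return lod1, lod2
```

Pruning to a sparse tree (Section 7) would then keep only the LOD1/LOD2 gists as anchors.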
4. Training Initialization
| Module | Source | Frozen? |
|---|---|---|
| Token Embeddings | MC-LLM | ✗ |
| GistNet | MC-LLM | ✗ (finetune slowly) |
| LensNet | MC-LLM | ✗ (finetune slowly) |
| Transformer Trunk | Random | ✗ |
| Prediction Heads (LOD0/LOD1) | Random | ✗ |
Rationale: the pretrained GistNet and LensNet preserve hierarchical compression/focus semantics, while the randomly initialized trunk starts with no memorized facts, forcing it to rely on external MCs.
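The initialization table maps naturally onto optimizer parameter groups. The sketch below is a minimal illustration; `base_lr` and the `slow_mult` finetune factor are hypothetical values, since the plan only says "finetune slowly".

```python
# Hypothetical parameter-group layout reflecting the table above:
# pretrained GistNet/LensNet finetune slowly, trunk and heads train at full rate.
def param_groups(base_lr: float = 3e-4, slow_mult: float = 0.1):
    """Return per-module learning-rate groups; all modules remain trainable."""
    return [
        {"modules": ["token_embeddings"], "lr": base_lr},
        {"modules": ["gistnet", "lensnet"], "lr": base_lr * slow_mult},
        {"modules": ["trunk", "head_lod0", "head_lod1"], "lr": base_lr},
    ]
```

These dicts could be passed to any framework's optimizer that supports per-group learning rates.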
5. Training Loop
Each batch samples a full context C and builds a composite working context (WC) mixing CK-MC and Session-MC spans within a shared token-equivalent budget (C2).
Step-by-Step
```text
Full training context C
├── Load CKWC (size CL) and session context C
├── Allocate joint CK+Session WC of total length C2
│   ├── CK spans = summarized (LOD1/LOD2)
│   ├── Session spans = LOD0 + summaries
│   └── Joint [[Focus Allocator]] selects mix under C2 budget
├── Evaluate base model ([[MegaPrediction]])
│   ├── LOD0 NLL (next-token prediction)
│   ├── LOD1/LOD2 regression (gist targets)
│   └── Backprop through Base + [[GistNet]]
├── [[LensNet]] refinement (ΔNLL supervision)
│   ├── Identify ideal WC(s) via regularized argmin
│   ├── Derive target focus scores
│   └── Backprop through [[LensNet]]
└── CK Reliance Margin (drop-test loss)
    ├── Compare L_full vs. L_drop (CK removed)
    └── Encourage L_drop > L_full + margin m
```
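The drop-test step can be written as a hinge penalty that vanishes once the CK-ablated loss exceeds the full loss by the margin m. A minimal sketch, with the default margin as an illustrative value:

```python
def ck_reliance_margin(l_full: float, l_drop: float, m: float = 0.2) -> float:
    """Hinge penalty encouraging L_drop > L_full + m.

    Zero when removing CK already hurts by at least the margin m;
    positive (and worth minimizing) when the model ignores the CK-MC.
    """
    return max(0.0, m - (l_drop - l_full))
```

Adding this term to the training loss pushes the trunk to actually consult CK spans rather than route around them.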
6. Composite Focus Allocator
Objective
Choose the optimal combination of spans from all available MCs (CK, Session, etc.) under a fixed working context budget C2.
Each span i has:
- Cost w_i (token-equivalent length)
- Predicted benefit v_i (ΔNLL reduction estimate)
- Penalty terms (hysteresis, redundancy, legality)

Maximize Σ_i (v_i − penalty_i) subject to Σ_i w_i ≤ C2.
Implementation Details
- v_i: estimated from LensNet scores + ΔNLL logs (EMA).
- MC diversity: penalize over-selection from a single MC.
- Session minimum share: ≥50% of C2 budget reserved for active context.
- Edits per iteration: up to 4 WC deltas (expand/compress spans).
- Dedup: remove identical WCs from different edit sequences.
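A minimal greedy approximation of this budgeted selection, honoring the session minimum share and a simple diversity penalty from the list above. The two-phase scheme, value-density ordering, and `diversity_tau` are assumptions; a production allocator would derive `value` from LensNet scores and ΔNLL EMAs and apply the full penalty set.

```python
# Greedy knapsack-style sketch of the Composite Focus Allocator (assumed scheme).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Span:
    mc: str        # source MegaContext ("session", "ck", ...)
    cost: int      # token-equivalent length w_i
    value: float   # predicted ΔNLL reduction v_i

def select_spans(spans: List[Span], budget: int,
                 session_min_frac: float = 0.5,
                 diversity_tau: float = 0.05) -> Tuple[List[Span], int]:
    """Pick spans by value density under the C2 budget, reserving a session share."""
    chosen_idx, used, per_mc_cost = set(), 0, {}
    session_target = int(budget * session_min_frac)

    def add(i: int, s: Span):
        nonlocal used
        chosen_idx.add(i)
        used += s.cost
        per_mc_cost[s.mc] = per_mc_cost.get(s.mc, 0) + s.cost

    order = sorted(range(len(spans)),
                   key=lambda i: spans[i].value / spans[i].cost, reverse=True)
    # Phase 1: fill the reserved >=50% session share first.
    for i in order:
        s = spans[i]
        if (s.mc == "session" and used + s.cost <= budget
                and per_mc_cost.get("session", 0) < session_target):
            add(i, s)
    # Phase 2: greedy fill from all MCs, penalizing over-represented MCs.
    for i in order:
        s = spans[i]
        if i in chosen_idx:
            continue
        penalty = diversity_tau * per_mc_cost.get(s.mc, 0) / budget
        if s.value - penalty > 0 and used + s.cost <= budget:
            add(i, s)
    return [spans[i] for i in sorted(chosen_idx)], used
```

Hysteresis and the 4-edit-per-iteration cap would sit on top of this, diffing the new selection against the previous WC.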
7. Sparse MCs (Scalable Knowledge Storage)
To support large-scale composite MCs, each MC may store only anchors instead of full trees.
| Feature | Description |
|---|---|
| Anchors | High-level gists (LOD1/LOD2) with metadata (domain, span size, version). |
| Progressive Fill | Expanding an anchor reconstructs details using LOD1 → LOD0 decoding. |
| Cache | Expanded spans cached as embeddings + KV for reuse. |
| Consistency Loss | Penalizes drift between reconstructed tokens and gist anchors. |
| Versioning | Multi-version support allows CK updates without invalidation. |
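A toy sketch of the anchor, cache, and versioning bookkeeping from the table. Field names and the `decode` callback (standing in for LOD1 → LOD0 progressive fill) are hypothetical.

```python
# Minimal sparse-MC store: anchors with metadata, a per-version expansion cache,
# and multi-version updates that do not invalidate older cached expansions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    gist_id: str
    lod: int          # 1 or 2
    domain: str
    span_tokens: int  # size of the LOD0 span this anchor summarizes
    version: int

class SparseMC:
    """Stores anchors only; expanded spans are reconstructed and cached on demand."""
    def __init__(self):
        self.anchors = {}  # gist_id -> newest Anchor
        self.cache = {}    # (gist_id, version) -> expanded representation

    def put(self, a: Anchor):
        cur = self.anchors.get(a.gist_id)
        if cur is None or a.version > cur.version:
            self.anchors[a.gist_id] = a  # older cache entries remain valid

    def expand(self, gist_id: str, decode):
        """Progressive fill: decode LOD1 -> LOD0 once per (anchor, version)."""
        a = self.anchors[gist_id]
        key = (gist_id, a.version)
        if key not in self.cache:
            self.cache[key] = decode(a)
        return self.cache[key]
```

The consistency loss would compare `decode(a)` against the stored gist anchor to bound reconstruction drift.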
8. Loss Terms
| Loss | Description |
|---|---|
| LOD0 NLL | Standard next-token loss. |
| LOD1/LOD2 Regression | Cosine/MSE between predicted and reference gists. |
| CK Reliance Margin | Enforces reliance on CK: L_drop > L_full + m. |
| Focus Utility Reward | Reward for expansions that reduce ΔNLL. |
| Hysteresis Penalty | Prevents frequent oscillation of focus. |
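The LOD1/LOD2 regression row names cosine/MSE targets; one plausible blend is sketched below, with `alpha` as an assumed mixing weight rather than a spec value.

```python
# Assumed MSE + cosine-distance blend for gist regression (LOD1/LOD2 targets).
import math

def gist_regression_loss(pred, target, alpha: float = 0.5) -> float:
    """alpha * MSE + (1 - alpha) * cosine distance between gist vectors."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    dot = sum(p * t for p, t in zip(pred, target))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in target)))
    cos_dist = 1.0 - dot / norm if norm > 0 else 1.0
    return alpha * mse + (1 - alpha) * cos_dist
```

Identical predicted and reference gists give zero loss; the MSE term penalizes magnitude error that cosine distance alone ignores.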
9. Validation & Evaluation
All models are compared using equal-length working contexts (C2) to ensure fair compute.
| Model | Context Composition | MC Access | Expected Behavior |
|---|---|---|---|
| LLM | Last C2 L0 tokens | None | Baseline (no MC). |
| MC-LLM | C2 mix of L0 + gists from session C | Session MC | Compression & context recall. |
| C²MC-LLM | C2 mix from CK + session (and optional extra MCs) | Composite MC | Best performance, adaptive focus across knowledge tiers. |
Metrics
- Perplexity (ΔNLL) on CK-dependent and cross-MC spans.
- CK Reliance Index (CKRI) = (L_drop − L_full)/L_full.
- Focus attribution: % attention/focus toward each MC.
- Expansion utility: % of expansions that reduce loss.
- MC update responsiveness: change in accuracy after updating any MC.
- MC “poison” test: robustness to conflicting information.
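The CKRI formula and the expansion-utility metric above translate directly into code:

```python
from typing import List

def ck_reliance_index(l_full: float, l_drop: float) -> float:
    """CKRI = (L_drop - L_full) / L_full; higher means stronger CK reliance."""
    return (l_drop - l_full) / l_full

def expansion_utility(deltas: List[float]) -> float:
    """Fraction of expansions whose measured ΔNLL is negative (loss reduced)."""
    return sum(1 for d in deltas if d < 0) / len(deltas) if deltas else 0.0
```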
10. Practical Considerations
- Minimum session share: enforce ≥50% to ensure grounding in live context.
- Compute fairness: all model variants use the same C2 budget.
- Position encoding: Gaussian RoPE + ALiBi blend to stabilize across MCs.
- CK caching: precompute CKWC embeddings + KV caches.
- Safety: contradiction detector cross-attends prompt ↔ CK spans.
- Extensibility: support new MC types (news, logs, FS watcher) via the same API.
11. Success Criteria
- MC reliance: performance drops when CK removed (ΔNLL↑).
- Instant adaptability: immediate improvement after MC updates, no finetune.
- Equal compute: matches MC-LLM cost with superior factual accuracy.
- Externalized knowledge: a catastrophic drop without CK indicates success (facts live in the MC, not the weights).
- Cross-MC reasoning: integrates information across CK, session, and other MCs coherently.
12. Next Steps
- Implement Composite Focus Allocator (multi-MC support).
- Add Sparse MC infrastructure with metadata, caching, and expansion path.
- Integrate CK Reliance Margin and multi-MC supervision into e2e training.
- Build multi-MC evaluation harness (LLM vs MC-LLM vs C²MC-LLM).
- Extend composite MC support for Recent Events, Domain Knowledge, and File Watcher modules.
Version: v2.0 Author: MegaContext Research Team Date: {datetime.now().strftime('%Y-%m-%d')}