Cognitive-Core MegaContext — Training and Evaluation Plan
Status: Plan of record (POR). Use this PRD for Cognitive Core requirements; earlier POC/vision docs are reference only.
Overview
The Cognitive Core MegaContext Large Language Model (C²MC-LLM) extends the MegaContext framework to train a model that relies on a persistent, composable set of MegaContexts rather than memorizing facts in its weights. This document defines the training, inference, and evaluation procedures for C²MC-LLM, aligning with the end-to-end MegaContext + MegaPrediction ecosystem.
1. Motivation
Most LLMs embed knowledge directly into weights, which is:
- costly to update,
- prone to factual drift,
- memory-inefficient (facts occupy dense model parameters).
C²MC-LLM externalizes knowledge and long-term context into persistent MegaContexts (MCs). It enables:
- Instant knowledge updates (e.g., via Core Knowledge MC updates).
- Smaller, reasoning-focused models (“cognitive core”).
- Deep integration of retrieval and memory, eliminating tool-calling overhead while remaining compatible with RAG pipelines.
Once deployed, a C²MC-LLM ships with a Core Knowledge MegaContext (CK-MC) and expects to operate with it. However, this CK-MC is only one component of a larger composite MegaContext, which may include:
- CK-MC — permanent factual and procedural knowledge.
- Session-MC — ephemeral working memory (conversations, tasks).
- Recent-Events MC — continuously updated summaries of news or sensor feeds.
- Domain/Proprietary MCs — organizational or application-specific datasets.
- File Watcher MC — indexed, live-updating summaries of local or cloud storage.
Each of these MCs can be dynamically combined by the Focus Allocator and LensNet into a single Working Context (WC), creating a unified memory substrate analogous to a deeply integrated RAG system — without tool calls.
2. Architecture Components
| Component | Role |
|---|---|
| GistNet | Summarizes token spans into multi-level gists (LOD1, LOD2). |
| LensNet | Scores which spans to expand or compress; trained via ΔNLL supervision. |
| MegaContexts (MCs) | Persistent hierarchical trees of gists (CK, Session, etc.). |
| Focus Allocator | Selects and merges spans from multiple MCs into a fixed-size Working Context (WC). |
| Base Model (Cognitive Core) | Trained transformer performing next-token and LOD-level predictions (MegaPrediction). |
3. Core Knowledge MegaContext (CK-MC)
- Dataset: a curated factual corpus (a "MegaCuration", ~100M tokens) representing stable, high-value world knowledge.
- Processing:
  - Chunk the corpus into 32-token spans.
  - Summarize spans via a pretrained GistNet into a multi-level hierarchy (LOD1, LOD2, etc.).
  - Optionally prune into a sparse tree (anchors at LOD1/LOD2).
- Initial CK Working Context (CKWC):
  - LensNet + Focus Allocator produce a CKWC of size CL (e.g., 4k–8k tokens).
  - The CKWC acts as a "starter memory" for any prompt.
  - It is refocused per query by LensNet during inference.
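As a concrete sketch of the processing step above, the snippet below chunks a token stream into 32-token spans and pools span embeddings into a two-level gist hierarchy. Mean-pooling stands in for the pretrained GistNet, and the `FANOUT` of 4 is an illustrative assumption, not a spec value.

```python
# Toy sketch of CK-MC construction: 32-token spans -> LOD1/LOD2 hierarchy.
# Mean-pooling is a stand-in for GistNet; FANOUT is a hypothetical branching factor.
from typing import List

SPAN = 32    # LOD0 span size, per the plan
FANOUT = 4   # assumed number of LOD1 gists summarized into one LOD2 gist

def chunk_spans(tokens: List[int], span: int = SPAN) -> List[List[int]]:
    """Split a token stream into fixed-size spans (the last span may be short)."""
    return [tokens[i:i + span] for i in range(0, len(tokens), span)]

def pool(vectors: List[List[float]]) -> List[float]:
    """Mean-pool a group of vectors -- a stand-in for a GistNet summary."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def build_hierarchy(span_embeddings: List[List[float]], fanout: int = FANOUT):
    """Produce LOD1 (one gist per span) and LOD2 (one gist per `fanout` spans)."""
    lod1 = span_embeddings  # identity here; GistNet output in practice
    lod2 = [pool(lod1[i:i + fanout]) for i in range(0, len(lod1), fanout)]
    return lod1, lod2
```

Pruning to a sparse tree (Section 7) would then keep only the LOD1/LOD2 gists as anchors.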
4. Training Initialization
| Module | Source | Frozen? |
|---|---|---|
| Token Embeddings | MC-LLM | ✗ |
| GistNet | MC-LLM | ✗ (finetune slowly) |
| LensNet | MC-LLM | ✗ (finetune slowly) |
| Transformer Trunk | Random | ✗ |
| Prediction Heads (LOD0/LOD1) | Random | ✗ |
Rationale: the pretrained GistNet and LensNet preserve hierarchical compression/focus semantics, while the randomly initialized trunk starts with no memorized facts, forcing it to rely on external MCs.
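The initialization table maps naturally onto optimizer parameter groups. The sketch below is a minimal illustration; `base_lr` and the `slow_mult` finetune factor are hypothetical values, since the plan only says "finetune slowly".

```python
# Hypothetical parameter-group layout reflecting the table above:
# pretrained GistNet/LensNet finetune slowly, trunk and heads train at full rate.
def param_groups(base_lr: float = 3e-4, slow_mult: float = 0.1):
    """Return per-module learning-rate groups; all modules remain trainable."""
    return [
        {"modules": ["token_embeddings"], "lr": base_lr},
        {"modules": ["gistnet", "lensnet"], "lr": base_lr * slow_mult},
        {"modules": ["trunk", "head_lod0", "head_lod1"], "lr": base_lr},
    ]
```

These dicts could be passed to any framework's optimizer that supports per-group learning rates.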
5. Training Loop
Each batch samples a full context C and builds a composite working context (WC) mixing CK-MC and Session-MC spans within a shared token-equivalent budget (C2).
Step-by-Step
```text
Full training context C
├── Load CKWC (size CL) and session context C
├── Allocate joint CK+Session WC of total length C2
│   ├── CK spans = summarized (LOD1/LOD2)
│   ├── Session spans = LOD0 + summaries
│   └── Joint [[Focus Allocator]] selects mix under C2 budget
├── Evaluate base model ([[MegaPrediction]])
│   ├── LOD0 NLL (next-token prediction)
│   ├── LOD1/LOD2 regression (gist targets)
│   └── Backprop through Base + [[GistNet]]
├── [[LensNet]] refinement (ΔNLL supervision)
│   ├── Identify ideal WC(s) via regularized argmin
│   ├── Derive target focus scores
│   └── Backprop through [[LensNet]]
└── CK Reliance Margin (drop-test loss)
    ├── Compare L_full vs. L_drop (CK removed)
    └── Encourage L_drop > L_full + margin m
```
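The drop-test step can be written as a hinge penalty that vanishes once the CK-ablated loss exceeds the full loss by the margin m. A minimal sketch, with the default margin as an illustrative value:

```python
def ck_reliance_margin(l_full: float, l_drop: float, m: float = 0.2) -> float:
    """Hinge penalty encouraging L_drop > L_full + m.

    Zero when removing CK already hurts by at least the margin m;
    positive (and worth minimizing) when the model ignores the CK-MC.
    """
    return max(0.0, m - (l_drop - l_full))
```

Adding this term to the training loss pushes the trunk to actually consult CK spans rather than route around them.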
6. Composite Focus Allocator
Objective
Choose the optimal combination of spans from all available MCs (CK, Session, etc.) under a fixed working context budget C2.
Each span i has:
- Cost w_i (token-equivalent length)
- Predicted benefit v_i (ΔNLL reduction estimate)
- Penalty terms (hysteresis, redundancy, legality)

Maximize Σ_i (v_i − penalty_i) subject to Σ_i w_i ≤ C2.
Implementation Details
- v_i: estimated from LensNet scores + ΔNLL logs (EMA).
- MC diversity: penalize over-selection from a single MC.
- Session minimum share: ≥50% of C2 budget reserved for active context.
- Edits per iteration: up to 4 WC deltas (expand/compress spans).
- Dedup: remove identical WCs from different edit sequences.
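A minimal greedy approximation of this budgeted selection, honoring the session minimum share and a simple diversity penalty from the list above. The two-phase scheme, value-density ordering, and `diversity_tau` are assumptions; a production allocator would derive `value` from LensNet scores and ΔNLL EMAs and apply the full penalty set.

```python
# Greedy knapsack-style sketch of the Composite Focus Allocator (assumed scheme).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Span:
    mc: str        # source MegaContext ("session", "ck", ...)
    cost: int      # token-equivalent length w_i
    value: float   # predicted ΔNLL reduction v_i

def select_spans(spans: List[Span], budget: int,
                 session_min_frac: float = 0.5,
                 diversity_tau: float = 0.05) -> Tuple[List[Span], int]:
    """Pick spans by value density under the C2 budget, reserving a session share."""
    chosen_idx, used, per_mc_cost = set(), 0, {}
    session_target = int(budget * session_min_frac)

    def add(i: int, s: Span):
        nonlocal used
        chosen_idx.add(i)
        used += s.cost
        per_mc_cost[s.mc] = per_mc_cost.get(s.mc, 0) + s.cost

    order = sorted(range(len(spans)),
                   key=lambda i: spans[i].value / spans[i].cost, reverse=True)
    # Phase 1: fill the reserved >=50% session share first.
    for i in order:
        s = spans[i]
        if (s.mc == "session" and used + s.cost <= budget
                and per_mc_cost.get("session", 0) < session_target):
            add(i, s)
    # Phase 2: greedy fill from all MCs, penalizing over-represented MCs.
    for i in order:
        s = spans[i]
        if i in chosen_idx:
            continue
        penalty = diversity_tau * per_mc_cost.get(s.mc, 0) / budget
        if s.value - penalty > 0 and used + s.cost <= budget:
            add(i, s)
    return [spans[i] for i in sorted(chosen_idx)], used
```

Hysteresis and the 4-edit-per-iteration cap would sit on top of this, diffing the new selection against the previous WC.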
7. Sparse MCs (Scalable Knowledge Storage)
To support large-scale composite MCs, each MC may store only anchors instead of full trees.
| Feature | Description |
|---|---|
| Anchors | High-level gists (LOD1/LOD2) with metadata (domain, span size, version). |
| Progressive Fill | Expanding an anchor reconstructs details using LOD1 → LOD0 decoding. |
| Cache | Expanded spans cached as embeddings + KV for reuse. |
| Consistency Loss | Penalizes drift between reconstructed tokens and gist anchors. |
| Versioning | Multi-version support allows CK updates without invalidation. |
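A toy sketch of the anchor, cache, and versioning bookkeeping from the table. Field names and the `decode` callback (standing in for LOD1 → LOD0 progressive fill) are hypothetical.

```python
# Minimal sparse-MC store: anchors with metadata, a per-version expansion cache,
# and multi-version updates that do not invalidate older cached expansions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    gist_id: str
    lod: int          # 1 or 2
    domain: str
    span_tokens: int  # size of the LOD0 span this anchor summarizes
    version: int

class SparseMC:
    """Stores anchors only; expanded spans are reconstructed and cached on demand."""
    def __init__(self):
        self.anchors = {}  # gist_id -> newest Anchor
        self.cache = {}    # (gist_id, version) -> expanded representation

    def put(self, a: Anchor):
        cur = self.anchors.get(a.gist_id)
        if cur is None or a.version > cur.version:
            self.anchors[a.gist_id] = a  # older cache entries remain valid

    def expand(self, gist_id: str, decode):
        """Progressive fill: decode LOD1 -> LOD0 once per (anchor, version)."""
        a = self.anchors[gist_id]
        key = (gist_id, a.version)
        if key not in self.cache:
            self.cache[key] = decode(a)
        return self.cache[key]
```

The consistency loss would compare `decode(a)` against the stored gist anchor to bound reconstruction drift.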
8. Loss Terms
| Loss | Description |
|---|---|
| LOD0 NLL | Standard next-token loss. |
| LOD1/LOD2 Regression | Cosine/MSE between predicted and reference gists. |
| CK Reliance Margin | Enforces reliance on CK: L_drop > L_full + m. |
| Focus Utility Reward | Reward for expansions that reduce ΔNLL. |
| Hysteresis Penalty | Prevents frequent oscillation of focus. |
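The LOD1/LOD2 regression row names cosine/MSE targets; one plausible blend is sketched below, with `alpha` as an assumed mixing weight rather than a spec value.

```python
# Assumed MSE + cosine-distance blend for gist regression (LOD1/LOD2 targets).
import math

def gist_regression_loss(pred, target, alpha: float = 0.5) -> float:
    """alpha * MSE + (1 - alpha) * cosine distance between gist vectors."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    dot = sum(p * t for p, t in zip(pred, target))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in target)))
    cos_dist = 1.0 - dot / norm if norm > 0 else 1.0
    return alpha * mse + (1 - alpha) * cos_dist
```

Identical predicted and reference gists give zero loss; the MSE term penalizes magnitude error that cosine distance alone ignores.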
9. Validation & Evaluation
All models are compared using equal-length working contexts (C2) to ensure fair compute.
| Model | Context Composition | MC Access | Expected Behavior |
|---|---|---|---|
| LLM | Last C2 L0 tokens | None | Baseline (no MC). |
| MC-LLM | C2 mix of L0 + gists from session C | Session MC | Compression & context recall. |
| C²MC-LLM | C2 mix from CK + session (and optional extra MCs) | Composite MC | Best performance, adaptive focus across knowledge tiers. |
Metrics
- Perplexity (ΔNLL) on CK-dependent and cross-MC spans.
- CK Reliance Index (CKRI) = (L_drop − L_full)/L_full.
- Focus attribution: % attention/focus toward each MC.
- Expansion utility: % of expansions that reduce loss.
- MC update responsiveness: change in accuracy after updating any MC.
- MC “poison” test: robustness to conflicting information.
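The CKRI formula and the expansion-utility metric above translate directly into code:

```python
from typing import List

def ck_reliance_index(l_full: float, l_drop: float) -> float:
    """CKRI = (L_drop - L_full) / L_full; higher means stronger CK reliance."""
    return (l_drop - l_full) / l_full

def expansion_utility(deltas: List[float]) -> float:
    """Fraction of expansions whose measured ΔNLL is negative (loss reduced)."""
    return sum(1 for d in deltas if d < 0) / len(deltas) if deltas else 0.0
```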
10. Practical Considerations
- Minimum session share: enforce ≥50% to ensure grounding in live context.
- Compute fairness: all model variants use the same C2 budget.
- Position encoding: Gaussian RoPE + ALiBi blend to stabilize across MCs.
- CK caching: precompute CKWC embeddings + KV caches.
- Safety: contradiction detector cross-attends prompt ↔ CK spans.
- Extensibility: support new MC types (news, logs, FS watcher) via the same API.
11. Success Criteria
- MC reliance: performance drops when CK removed (ΔNLL↑).
- Instant adaptability: immediate improvement after MC updates, no finetune.
- Equal compute: matches MC-LLM cost with superior factual accuracy.
- Externalized knowledge: a catastrophic drop without CK indicates success (facts live in the MC, not the weights).
- Cross-MC reasoning: integrates information across CK, session, and other MCs coherently.
12. Next Steps
- Implement Composite Focus Allocator (multi-MC support).
- Add Sparse MC infrastructure with metadata, caching, and expansion path.
- Integrate CK Reliance Margin and multi-MC supervision into e2e training.
- Build multi-MC evaluation harness (LLM vs MC-LLM vs C²MC-LLM).
- Extend composite MC support for Recent Events, Domain Knowledge, and File Watcher modules.
Version: v2.0 Author: MegaContext Research Team Date: {datetime.now().strftime('%Y-%m-%d')}