Cognitive-Core MegaContext — Training and Evaluation Plan

Status: Plan of record (POR). Use this PRD for Cognitive Core requirements; earlier POC/vision docs are reference only.

Overview

The Cognitive Core MegaContext Large Language Model (C²MC-LLM) extends the MegaContext framework to train a model that relies on a persistent, composable set of MegaContexts rather than memorizing facts in its weights. This document defines the training, inference, and evaluation procedures for C²MC-LLM, aligning with the end-to-end MegaContext + MegaPrediction ecosystem.


1. Motivation

Most LLMs embed knowledge directly into their weights, an approach that is:

  • costly to update,
  • prone to factual drift,
  • memory-inefficient (facts occupy dense model parameters).

C²MC-LLM externalizes knowledge and long-term context into persistent MegaContexts (MCs). It enables:

  • Instant knowledge updates (e.g., via Core Knowledge MC updates).
  • Smaller, reasoning-focused models (“cognitive core”).
  • Deep integration of retrieval and memory, eliminating tool-calling overhead while remaining compatible with RAG pipelines.

Once deployed, a C²MC-LLM ships with a Core Knowledge MegaContext (CK-MC) and expects to operate with it. However, this CK-MC is only one component of a larger composite MegaContext, which may include:

  • CK-MC — permanent factual and procedural knowledge.
  • Session-MC — ephemeral working memory (conversations, tasks).
  • Recent-Events MC — continuously updated summaries of news or sensor feeds.
  • Domain/Proprietary MCs — organizational or application-specific datasets.
  • File Watcher MC — indexed, live-updating summaries of local or cloud storage.

Each of these MCs can be dynamically combined by the Focus Allocator and LensNet into a single Working Context (WC), creating a unified memory substrate analogous to a deeply integrated RAG system — without tool calls.


2. Architecture Components

| Component | Role |
| --- | --- |
| GistNet | Summarizes token spans into multi-level gists (LOD1, LOD2). |
| LensNet | Scores which spans to expand or compress; trained via ΔNLL supervision. |
| MegaContexts (MCs) | Persistent hierarchical trees of gists (CK, Session, etc.). |
| Focus Allocator | Selects and merges spans from multiple MCs into a fixed-size Working Context (WC). |
| Base Model (Cognitive Core) | Trained transformer performing next-token and LOD-level predictions (MegaPrediction). |

3. Core Knowledge MegaContext (CK-MC)

  1. Dataset: a MegaCurated factual corpus (~100M tokens) representing stable, high-value world knowledge.
  2. Processing:
    • Chunk into 32-token spans.
    • Summarize via pretrained GistNet into multi-level hierarchy (LOD1, LOD2, etc.).
    • Optionally prune into a sparse tree (anchors at LOD1/LOD2).
  3. Initial CK Working Context (CKWC):
    • LensNet + Focus Allocator produce a CKWC of size CL (e.g., 4k–8k tokens).
    • CKWC acts as a “starter memory” for any prompt.
    • Refocused per query using LensNet during inference.
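The CK-MC build pipeline above can be sketched as follows. This is a minimal illustration, not the real implementation: `gist_fn` is a stand-in for GistNet, and the group fan-out of 4 spans per higher-level gist is an assumption; only the 32-token span size and the LOD1/LOD2 levels come from the plan.

```python
# Sketch of the CK-MC build: chunk into 32-token spans, then summarize
# groups of spans into LOD1 and LOD2 gists. gist_fn stands in for GistNet;
# FANOUT is a hypothetical grouping factor.

SPAN = 32      # tokens per LOD0 span (from the plan)
FANOUT = 4     # assumption: spans summarized per higher-level gist

def chunk(tokens, size=SPAN):
    """Split the corpus into fixed-size spans (last span may be short)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def build_ck_mc(tokens, gist_fn, levels=2):
    """Build a hierarchical MC: LOD0 spans, then LOD1/LOD2 gists on top."""
    mc = {0: chunk(tokens)}
    for lod in range(1, levels + 1):
        prev = mc[lod - 1]
        mc[lod] = [gist_fn(prev[i:i + FANOUT])
                   for i in range(0, len(prev), FANOUT)]
    return mc

# Toy usage with a trivial stand-in gist (keep the first child of each group).
mc = build_ck_mc(list(range(256)), gist_fn=lambda group: group[0])
print(len(mc[0]), len(mc[1]), len(mc[2]))  # 8 LOD0 spans -> 2 LOD1 -> 1 LOD2
```

The resulting levels are exactly what the optional pruning step operates on: dropping LOD0 while keeping LOD1/LOD2 entries yields the sparse anchor tree described later.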

4. Training Initialization

| Module | Source | Frozen? |
| --- | --- | --- |
| Token Embeddings | MC-LLM | |
| GistNet | MC-LLM | ✗ (finetune slowly) |
| LensNet | MC-LLM | ✗ (finetune slowly) |
| Transformer Trunk | Random | |
| Prediction Heads (LOD0/LOD1) | Random | |

Rationale: GistNet and LensNet preserve hierarchical compression/focus semantics, while the new trunk starts with no memorized knowledge, forcing dependence on external MCs.


5. Training Loop

Each batch samples a full context C and builds a composite working context (WC) mixing CK-MC and Session-MC spans within a shared token-equivalent budget (C2).

Step-by-Step

Full training context C
 ├── Load CKWC (size CL) and session context C
 ├── Allocate joint CK+Session WC of total length C2
 │    ├── CK spans = summarized (LOD1/LOD2)
 │    ├── Session spans = LOD0 + summaries
 │    └── Joint [[Focus Allocator]] selects mix under C2 budget
 ├── Evaluate base model ([[MegaPrediction]])
 │    ├── LOD0 NLL (next-token prediction)
 │    ├── LOD1/LOD2 regression (gist targets)
 │    └── Backprop through Base + [[GistNet]]
 ├── [[LensNet]] refinement (ΔNLL supervision)
 │    ├── Identify ideal WC(s) via regularized argmin
 │    ├── Derive target focus scores
 │    └── Backprop through [[LensNet]]
 └── CK Reliance Margin (drop-test loss)
      ├── Compare L_full vs. L_drop (CK removed)
      └── Encourage L_drop > L_full + margin m
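The loss combination at the bottom of the tree above can be written compactly. A hinge form for the CK Reliance Margin (encourage L_drop > L_full + m) is a natural reading of the drop-test step, but the hinge, the margin value, and the weighting coefficients are illustrative assumptions, not specified by the plan.

```python
# Per-batch loss sketch for Section 5, assuming the individual loss terms
# (LOD0 NLL, gist regression, full/drop losses) are already computed.

def ck_reliance_margin(l_full, l_drop, m=0.5):
    """Hinge penalty: zero once dropping CK hurts by at least margin m."""
    return max(0.0, l_full + m - l_drop)

def total_loss(l_lod0, l_gist, l_full, l_drop, m=0.5, alpha=1.0, beta=1.0):
    """LOD0 NLL + LOD1/LOD2 gist regression + CK drop-test margin."""
    return l_lod0 + alpha * l_gist + beta * ck_reliance_margin(l_full, l_drop, m)

# If removing CK raises loss well past the margin, the penalty vanishes:
print(ck_reliance_margin(l_full=2.0, l_drop=3.0, m=0.5))  # 0.0
# If the model barely uses CK (drop hardly hurts), the penalty is positive:
print(ck_reliance_margin(l_full=2.0, l_drop=2.1, m=0.5))  # ≈ 0.4
```

In a real training step the hinge would be computed on differentiable batch losses so the gradient pushes the model toward consulting CK spans.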

6. Composite Focus Allocator

Objective

Choose the optimal combination of spans from all available MCs (CK, Session, etc.) under a fixed working context budget C2.

Each span i has:

  • Cost w_i (token-equivalent length)
  • Predicted benefit v_i (ΔNLL reduction estimate)
  • Penalty terms (hysteresis, redundancy, legality)

Maximize: Σ (v_i − penalty_i) subject to Σ w_i ≤ C2

Implementation Details

  • v_i: estimated from LensNet scores + ΔNLL logs (EMA).
  • MC diversity: penalize over-selection from a single MC.
  • Session minimum share: ≥50% of C2 budget reserved for active context.
  • Edits per iteration: up to 4 WC deltas (expand/compress spans).
  • Dedup: remove identical WCs from different edit sequences.
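A greedy sketch of the objective and the bullets above: maximize Σ (v_i − penalty_i) subject to Σ w_i ≤ C2, with a reserved session share and a diversity penalty on over-selected MCs. Value-density ordering, the penalty schedule, and the two-phase structure are assumptions; the real allocator works through bounded per-iteration WC edits rather than one-shot selection.

```python
# Greedy composite-allocator sketch. spans: list of (mc_name, cost w, benefit v).

def allocate(spans, budget, session_share=0.5, diversity_tau=0.1):
    """Pick spans under the C2 budget, reserving session_share for Session-MC."""
    # Phase 1: fill the reserved session share, best value-density first.
    session = sorted((s for s in spans if s[0] == "session"),
                     key=lambda s: s[2] / s[1], reverse=True)
    chosen, used = [], 0
    reserve = session_share * budget
    for mc, w, v in session:
        if used + w <= reserve:
            chosen.append((mc, w, v))
            used += w
    # Phase 2: fill the remainder from all MCs, penalizing repeated MCs
    # (a simple stand-in for the MC diversity penalty).
    remaining = [s for s in spans if s not in chosen]
    counts = {}
    for mc, w, v in sorted(remaining, key=lambda s: s[2] / s[1], reverse=True):
        penalty = diversity_tau * counts.get(mc, 0)
        if used + w <= budget and v - penalty > 0:
            chosen.append((mc, w, v))
            used += w
            counts[mc] = counts.get(mc, 0) + 1
    return chosen, used

spans = [("session", 32, 1.0), ("session", 32, 0.8),
         ("ck", 32, 0.9), ("ck", 32, 0.7), ("news", 32, 0.3)]
chosen, used = allocate(spans, budget=96)
print(used, sorted(mc for mc, _, _ in chosen))  # 96 ['ck', 'session', 'session']
```

Hysteresis and legality penalties would enter the same `v - penalty` term; they are omitted here for brevity.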

7. Sparse MCs (Scalable Knowledge Storage)

To support large-scale composite MCs, each may store anchors instead of full trees.

| Feature | Description |
| --- | --- |
| Anchors | High-level gists (LOD1/LOD2) with metadata (domain, span size, version). |
| Progressive Fill | Expanding an anchor reconstructs details using LOD1 → LOD0 decoding. |
| Cache | Expanded spans cached as embeddings + KV for reuse. |
| Consistency Loss | Penalizes drift between reconstructed tokens and gist anchors. |
| Versioning | Multi-version support allows CK updates without invalidation. |
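The anchor, progressive-fill, and cache features above suggest a simple data model. This is a hedged sketch: the field names, the dict-based layout, and the `decode_fn` hook are illustrative assumptions (the real cache would hold embeddings and KV state, not token lists).

```python
# Minimal data-model sketch for a Sparse MC with anchors, progressive fill,
# and a reuse cache. Field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Anchor:
    gist: list            # LOD1/LOD2 gist vector (placeholder type)
    lod: int              # level of detail of this anchor
    domain: str           # metadata: domain tag
    span_size: int        # tokens covered at LOD0
    version: int          # supports multi-version CK updates

@dataclass
class SparseMC:
    anchors: dict = field(default_factory=dict)   # anchor_id -> Anchor
    cache: dict = field(default_factory=dict)     # anchor_id -> expanded span

    def expand(self, anchor_id, decode_fn):
        """Progressive fill: reconstruct LOD0 detail from an anchor, cached."""
        if anchor_id not in self.cache:
            self.cache[anchor_id] = decode_fn(self.anchors[anchor_id])
        return self.cache[anchor_id]

mc = SparseMC()
mc.anchors["a0"] = Anchor(gist=[0.1, 0.2], lod=1, domain="ck",
                          span_size=32, version=1)
tokens = mc.expand("a0", decode_fn=lambda a: ["tok"] * a.span_size)
print(len(tokens), "a0" in mc.cache)  # 32 True
```

The consistency loss would compare `decode_fn`'s reconstruction against the stored gist; versioning lets two `Anchor` entries for the same span coexist during a CK update.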

8. Loss Terms

| Loss | Description |
| --- | --- |
| LOD0 NLL | Standard next-token loss. |
| LOD1/LOD2 Regression | Cosine/MSE between predicted and reference gists. |
| CK Reliance Margin | Enforces reliance on CK: L_drop > L_full + m. |
| Focus Utility Reward | Reward for expansions that reduce ΔNLL. |
| Hysteresis Penalty | Prevents frequent oscillation of focus. |

9. Validation & Evaluation

All models are compared using equal-length working contexts (C2) to ensure fair compute.

| Model | Context Composition | MC Access | Expected Behavior |
| --- | --- | --- | --- |
| LLM | Last C2 L0 tokens | None | Baseline (no MC). |
| MC-LLM | C2 mix of L0 + gists from session C | Session MC | Compression & context recall. |
| C²MC-LLM | C2 mix from CK + session (and optional extra MCs) | Composite MC | Best performance, adaptive focus across knowledge tiers. |

Metrics

  • Perplexity (ΔNLL) on CK-dependent and cross-MC spans.
  • CK Reliance Index (CKRI) = (L_drop − L_full)/L_full.
  • Focus attribution: % attention/focus toward each MC.
  • Expansion utility: % of expansions that reduce loss.
  • MC update responsiveness: change in accuracy after updating any MC.
  • MC “poison” test: robustness to conflicting information.
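The CK Reliance Index in the metrics list follows directly from its definition, CKRI = (L_drop − L_full) / L_full: it measures the relative loss increase when the CK-MC is removed, so higher values mean heavier reliance on externalized knowledge.

```python
# CK Reliance Index, computed from its definition in the metrics list.

def ckri(l_full, l_drop):
    """Relative loss increase when CK is dropped; higher = more reliance."""
    return (l_drop - l_full) / l_full

print(ckri(l_full=2.0, l_drop=3.0))  # 0.5  (dropping CK raises loss by 50%)
print(ckri(l_full=2.0, l_drop=2.0))  # 0.0  (model ignores CK entirely)
```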

10. Practical Considerations

  • Minimum session share: enforce ≥50% to ensure grounding in live context.
  • Compute fairness: all model variants use the same C2 budget.
  • Position encoding: Gaussian RoPE + ALiBi blend to stabilize across MCs.
  • CK caching: precompute CKWC embeddings + KV caches.
  • Safety: contradiction detector cross-attends prompt ↔ CK spans.
  • Extensibility: support new MC types (news, logs, FS watcher) via the same API.

11. Success Criteria

  1. MC reliance: performance drops when CK removed (ΔNLL↑).
  2. Instant adaptability: immediate improvement after MC updates, no finetune.
  3. Equal compute: matches MC-LLM cost with superior factual accuracy.
  4. Externalized knowledge: catastrophic drop without CK = success (facts externalized).
  5. Cross-MC reasoning: integrates information across CK, session, and other MCs coherently.

12. Next Steps

  1. Implement Composite Focus Allocator (multi-MC support).
  2. Add Sparse MC infrastructure with metadata, caching, and expansion path.
  3. Integrate CK Reliance Margin and multi-MC supervision into e2e training.
  4. Build multi-MC evaluation harness (LLM vs MC-LLM vs C²MC-LLM).
  5. Extend composite MC support for Recent Events, Domain Knowledge, and File Watcher modules.

Version: v2.0
Author: MegaContext Research Team