Diagram TODO: add multimodal MegaContext schematic once assets are available.
MegaContext can virtualize more than text. This note sketches how we ingest images, compress them into hierarchical gists, and align their positional signals so the frozen or co-trained base model can reason over massive visual memories alongside language.
Why images need special treatment
- 2D structure: Image patches carry spatial relationships that raster order alone cannot preserve.
- Scale explosion: High-resolution captures (target: 1B×1B) require mipmap-style pyramids to keep storage and compute manageable.
- Mixed modalities: Visual gists must coexist with textual entries inside the Working Context while obeying the same budget and invariants.
To serve multimodal runs, we extend the MegaContext Tree and Positional Encoding scheme to understand 2D positions and multi-channel embeddings.
Building hierarchical visual gists
- Patch tokenization
  - Start from a ViT-style encoder that maps fixed-size patches (e.g., 16×16 pixels) to embedding vectors.
  - Maintain per-patch metadata: (x, y) indices, resolution level, color statistics, modality tags.
- 2D block grouping
  - Instead of 1D 32-token spans, form 2D tiles (e.g., 8×8 patch blocks = 64 patches) that become the unit for compression.
  - Each tile feeds a Visual GistNet variant with separable 2D attention (row/column mixing) and optionally cross-attention to textual cues already present in the working view.
- Mipmap-style hierarchy
  - Stack levels analogous to GPU texture mipmaps: each parent summarizes a 2×2 grid of child tiles, yielding quad-tree fan-out (see the sketch below).
  - Persist LOD0 (patch embeddings), LOD1 (tile gists), and LOD2+ (mipmap gists) in parallel buffers within the MegaContext storage layout so they can be streamed independently.
- Cross-modal linking
  - Attach references from gists back to aligned textual spans (captions, OCR tokens), enabling LensNet to co-reason across modalities when assigning focus scores.
  - Store provenance metadata in Node Metadata to preserve auditability.
This structure mirrors the textual hierarchy but respects image topology, enabling the Focus Allocator to expand or collapse spatial regions as needed.
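A minimal sketch of the tile grouping and quad-tree mipmap construction, assuming patch embeddings arrive as an (H, W, D) NumPy array; `group_tiles`, `pool_gist`, and `build_visual_mipmaps` are illustrative names, and mean pooling merely stands in for the Visual GistNet.

```python
import numpy as np

def group_tiles(patches: np.ndarray, tile: int = 8) -> np.ndarray:
    """Group an (H, W, D) grid of patch embeddings into (h, w, tile*tile, D) tile blocks."""
    H, W, D = patches.shape
    h, w = H // tile, W // tile
    x = patches[: h * tile, : w * tile].reshape(h, tile, w, tile, D)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, tile * tile, D)

def pool_gist(block: np.ndarray) -> np.ndarray:
    """Placeholder for the Visual GistNet: mean-pool a block of child embeddings into one gist."""
    return block.mean(axis=-2)

def build_visual_mipmaps(patches: np.ndarray, tile: int = 8) -> list[np.ndarray]:
    """Return [LOD1, LOD2, ...]: tile gists, then successive 2x2 parent gists (quad-tree fan-out)."""
    tiles = group_tiles(patches, tile)       # (h, w, tile*tile, D)
    level = pool_gist(tiles)                 # LOD1 tile gists: (h, w, D)
    levels = [level]
    while min(level.shape[:2]) >= 2:
        h, w, D = level.shape
        h2, w2 = h // 2, w // 2
        quads = level[: h2 * 2, : w2 * 2].reshape(h2, 2, w2, 2, D)
        quads = quads.transpose(0, 2, 1, 3, 4).reshape(h2, w2, 4, D)
        level = pool_gist(quads)             # each parent summarizes a 2x2 grid of children
        levels.append(level)
    return levels
```

For instance, a 1024×1024-pixel image with 16×16 patches gives a 64×64 patch grid, 8×8 tile gists at LOD1, and 4×4, 2×2, and 1×1 parents at LOD2+.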
Positional encoding for images
Visual entries combine global MegaContext indices with local 2D coordinates:
- Global index: Every image ingest event receives a base index, keeping chronological ordering consistent with text and allowing ALiBi biases to operate across modalities. See Positional Encoding Track B for the compatibility hooks we still maintain.
- Local 2D encoding: Within each image, we apply a 2D rotary or sine-cosine scheme (separate phases for x and y). For gists, we pool the mean position and attach scale parameters (σ_x, σ_y) describing footprint size.
- Gaussian extension: When Gaussian RoPE drives Track A, image gists set σ terms based on tile width/height, naturally blurring high frequencies in both axes (see the sketch below).
- Slow-band disambiguation: Extremely large images layer ultra-slow rotary bands per axis so distant patches remain distinguishable even after mipmap compression.
Runtime integration inserts these signals alongside text positions before the base model consumes a multimodal working context.
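A minimal sketch of the local 2D signal under the sine-cosine variant: each axis gets its own frequency bands, and the gist footprint (σ_x, σ_y) damps high-frequency bands in the spirit of the Gaussian extension. The function names and the exponential attenuation form are illustrative assumptions, not the exact Track A formulation.

```python
import numpy as np

def axis_encoding(pos: float, sigma: float, dim: int, base: float = 10000.0) -> np.ndarray:
    """Sine-cosine encoding for one axis, with high frequencies damped by the footprint sigma.

    Frequencies follow the usual geometric schedule; each band is scaled by
    exp(-0.5 * (sigma * omega)**2), so wide footprints lose fine positional detail.
    """
    half = dim // 2
    omega = base ** (-np.arange(half) / half)       # per-band angular frequency
    damp = np.exp(-0.5 * (sigma * omega) ** 2)      # Gaussian attenuation from footprint size
    return np.concatenate([damp * np.sin(pos * omega), damp * np.cos(pos * omega)])

def encode_2d(x: float, y: float, sigma_x: float, sigma_y: float, dim: int) -> np.ndarray:
    """Concatenate per-axis encodings; (x, y) is the pooled mean position of a patch or gist."""
    return np.concatenate([
        axis_encoding(x, sigma_x, dim // 2),
        axis_encoding(y, sigma_y, dim // 2),
    ])

# An LOD0 patch at (12, 40) has effectively zero footprint; an 8x8-patch tile gist
# pooled at the same center carries sigma ~ half its extent and keeps only slow bands.
patch_pe = encode_2d(12.0, 40.0, 0.0, 0.0, dim=128)
tile_pe = encode_2d(12.0, 40.0, 4.0, 4.0, dim=128)
```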
Working context integration
- Budget accounting: Assign modality-aware costs (e.g., treat a visual tile gist as 1 token-equivalent for budget balance); mixed batches of text and image entries remain contiguous in chronological order (see the sketch below).
- Focus coordination: LensNet extends its input schema to include modality flags, 2D extents, and cross-modal attention summaries, letting it compare textual relevance against visual signals.
- Expansion policy: The Focus Allocator can refine a coarse visual gist into its child grid, similar to expanding text gists into LOD0 tokens, while respecting cooldowns and spatial contiguity.
This keeps inference uniform: the base model still sees a single sequence, but entries carry modality-specific embeddings and positional metadata.
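A minimal sketch of modality-aware budget accounting, assuming each working-context entry carries a kind tag and a global index; the `Entry` structure and the cost table are hypothetical, with only the 1 token-equivalent visual tile gist taken from the text above.

```python
from dataclasses import dataclass

# Hypothetical token-equivalent costs per entry kind; per the text, a visual tile
# gist counts as one token-equivalent, the other values are placeholder assumptions.
COSTS = {"text_token": 1, "text_gist": 1, "visual_patch": 1, "visual_tile_gist": 1}

@dataclass
class Entry:
    kind: str          # modality + LOD tag, e.g. "text_token" or "visual_tile_gist"
    global_index: int  # chronological MegaContext index shared across modalities

def within_budget(entries: list[Entry], budget: int) -> bool:
    """Check a mixed-modality working context against a single token-equivalent budget."""
    return sum(COSTS[e.kind] for e in entries) <= budget

def chronological(entries: list[Entry]) -> list[Entry]:
    """Keep text and image entries contiguous in ingest order via their global indices."""
    return sorted(entries, key=lambda e: e.global_index)
```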
Future directions
- Ultra-large imagery: For 1B×1B inputs, ingest occurs in streaming passes (tiled IO) that populate the MegaContext hierarchy level by level, using GPU texture hardware or CUDA kernels for efficiency (see the sketch after this list).
- Learned cross-modal gists: Co-train textual and visual gist encoders so captions influence which visual regions stay detailed.
- Multimodal telemetry: Extend Telemetry to record focus hit rates per modality, positional stretch usage, and ΔNLL when swapping visual LODs.
- Speculative rendering: Combine with MegaPrediction to queue future visual edits or annotations in a separate branch before committing to history.
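A minimal sketch of how the streaming, tiled ingest could be staged, assuming hypothetical `read_tile` and `encode_tile` callbacks; real ingestion would lean on GPU texture hardware or CUDA kernels as noted above, with higher mipmap levels built from these LOD1 gists afterwards.

```python
from typing import Callable, Iterator
import numpy as np

Tile = np.ndarray  # (tile_px, tile_px, 3) pixel block read from disk or object storage

def tile_coords(width: int, height: int, tile_px: int) -> Iterator[tuple[int, int]]:
    """Yield (x, y) tile origins in raster order so IO stays streaming and bounded."""
    for y in range(0, height, tile_px):
        for x in range(0, width, tile_px):
            yield x, y

def ingest_streaming(width: int, height: int, tile_px: int,
                     read_tile: Callable[[int, int], Tile],
                     encode_tile: Callable[[Tile], np.ndarray]) -> dict[tuple[int, int], np.ndarray]:
    """Populate LOD1 tile gists one tile at a time without holding the full image in memory."""
    lod1 = {}
    for x, y in tile_coords(width, height, tile_px):
        lod1[(x // tile_px, y // tile_px)] = encode_tile(read_tile(x, y))
    return lod1
```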
Related notes
- Positional Encoding — shared positional strategy across modalities.
- MegaContext Tree — storage layer hosting both textual and visual hierarchies.
- Working Context — describes budget management for mixed-LOD sequences.
- Future Plan Track C — highlights multimodal application milestones.