Gist Tokens (arXiv:2304.08467v3) — Report

PDF: Gist Tokens - 2304.08467v3.pdf

Overview

Introduces gisting, a training recipe that teaches an LM to compress long prompts into a small set of reusable “gist tokens.”
Compression is achieved by modifying attention masks during instruction fine-tuning: newly introduced gist tokens attend to the prompt, while tokens after the gist tokens cannot attend to the prompt and must rely on the gist tokens alone.
Demonstrates up to 26× prompt compression (often using a single gist token) with minimal quality loss on LLaMA-7B and FLAN-T5-XXL, yielding up to 40% FLOPs reduction, a 4.2% wall-time speedup, and storage savings when gists are cached.
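
A minimal sketch of the compress-once, reuse-many pattern behind the caching savings. `GistCache` and `toy_compress` are hypothetical names introduced for illustration; in a real system the cached value would be the gist tokens' key/value activations produced by the trained model, not a toy list.

```python
import hashlib
from typing import Callable, Dict, List


class GistCache:
    """Caches the compressed form of a prompt so it is encoded only once."""

    def __init__(self, compress: Callable[[str], List[float]]):
        # `compress` stands in for running the gist model over the prompt.
        self._compress = compress
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the expensive compression ran

    def get(self, prompt: str) -> List[float]:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compress(prompt)
        return self._store[key]


def toy_compress(prompt: str) -> List[float]:
    # Placeholder only: returns a one-element "gist" derived from prompt length.
    return [float(len(prompt))]
```

Repeated calls with the same prompt hit the cache, so the cost of encoding the full prompt is paid once per distinct prompt rather than once per query.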

Attention-masked training: During fine-tuning, mask the tokens that follow the gist tokens (the input and completion) from attending to the original prompt, so the model is forced to represent the prompt through the new gist tokens.
Reusable caches: Once trained, a task-specific prompt can be compressed once, cached as a short gist prefix, and reused without re-encoding the full prompt each time.
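Quality preservation: Automatic, model-based, and human evaluations (ROUGE-L, ChatGPT judgments, human annotators) show small degradation; failure cases include gists that drop prompt details or trigger repetitive output, and adding more gist tokens did not reliably help, possibly because the extra capacity enables overfitting.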
Compute/storage efficiency: Gains come both from shorter sequences and from the ability to reuse gist caches across queries.
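
The masking idea above can be sketched for the decoder-only (causal) case as follows. This is an illustrative sketch, not the paper's code: the function name `gist_attention_mask` is hypothetical, and it assumes the gist tokens form one contiguous block placed between the prompt and the input.

```python
from typing import List


def gist_attention_mask(is_gist: List[bool]) -> List[List[bool]]:
    """Return mask[q][k], True when query position q may attend to key k.

    Assumes a causal LM and a single contiguous block of gist tokens.
    """
    n = len(is_gist)
    has_gist = any(is_gist)
    first = is_gist.index(True) if has_gist else n
    last = n - 1 - is_gist[::-1].index(True) if has_gist else n
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            causal = k <= q
            # Positions after the gist block may not see anything before it,
            # so the prompt must flow through the gist tokens.
            blocked = q > last and k < first
            row.append(causal and not blocked)
        mask.append(row)
    return mask


# Layout: [prompt, prompt, GIST, input, output]
example = gist_attention_mask([False, False, True, False, False])
```

In `example`, the gist token (row 2) sees both prompt tokens, while the input and output tokens (rows 3 and 4) see only the gist token and each other. With no gist tokens present, the function falls back to a plain causal mask.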

Directly validates our GistNet architecture goal: compress token spans while preserving substitutability. Their attention-masking trick provides an empirical recipe for teaching hierarchical gist compression (see GistNet Training).
Suggests a practical path for prompt macro caching inside Working Context—store high-value task presets as single-token gists that can be injected alongside retrieved spans, reducing repeated instruction encoding.
Provides empirical evidence that aggressive compression (up to 26× in their experiments) can work with minimal quality loss, which supports, but does not directly test, our 32→1 POC Implementation target as measured by ΔNLL@H.
Highlights the need for quality guards (detecting degenerate gists), which aligns with our Telemetry requirements for monitoring gist entropy and preventing mode collapse during GistNet Training.

Implement a masked-attention curriculum during GistNet Training where gist slots must reconstruct downstream predictions without attending to source tokens—forces gists to encode sufficient information for substitutability.
Borrow their gist caching benchmark to evaluate FLOPs/latency savings from our MegaContext Tree compression (see MegaContext End-to-End Training success metrics).
Apply their observation that excessive gist capacity causes overfitting as a design constraint: keep gist slots minimal (K=32 → 1 in POC Implementation) to prevent degenerate memorization.
Adapt their distillation framing (matching the gist-conditioned output distribution to the full-prompt one) into our primary ΔNLL@H metric for both GistNet training and LensNet utility scoring during counterfactual labeling.
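
One way to make ΔNLL@H concrete, as a sketch only. The definition used here is an assumption, not something the paper specifies: the mean negative log-likelihood over the first H target tokens when the model sees the gist-substituted context, minus the same quantity with the full context. The function names are hypothetical.

```python
from typing import Sequence


def mean_nll(logprobs: Sequence[float], horizon: int) -> float:
    """Mean negative log-likelihood over the first `horizon` target tokens."""
    window = logprobs[:horizon]
    if not window:
        raise ValueError("need at least one token log-probability")
    return -sum(window) / len(window)


def delta_nll_at_h(
    full_logprobs: Sequence[float],
    gist_logprobs: Sequence[float],
    horizon: int,
) -> float:
    """Assumed definition: NLL with gists minus NLL with the full context.

    Positive values mean the gist-substituted context predicts worse.
    """
    return mean_nll(gist_logprobs, horizon) - mean_nll(full_logprobs, horizon)


# Per-token log-probabilities of the same target tokens under each context.
full = [-1.0, -2.0, -3.0]
gist = [-1.5, -2.5, -9.0]
delta = delta_nll_at_h(full, gist, horizon=2)  # third token is outside H=2
```

Under this definition a value near zero means the gist is a close substitute for the original span over the horizon.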

Training relies on instruction-tuning data distributions; domain shift (e.g., specialized code, technical jargon) may degrade compression quality. This is relevant for POC Implementation, where we test on project documentation and code.
Some gists degenerate into repetitive boilerplate, indicating a need for entropy regularizers or contrastive loss during GistNet Training (see GistNet Parameters).
Compression quality is content-dependent; diverse corpora (code, prose, structured data) may require domain-adaptive curricula or specialist GistNet variants (see Track B).
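
A rough sketch of a guard for the repetitive-output failure mode noted above. The two signals (token-level Shannon entropy and the fraction of repeated bigrams) and both thresholds are assumptions chosen for illustration; they are not taken from the paper and would need tuning on real outputs.

```python
import math
from collections import Counter
from typing import Sequence


def token_entropy_bits(tokens: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical token distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def repeated_ngram_fraction(tokens: Sequence[str], n: int = 2) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def looks_degenerate(
    tokens: Sequence[str],
    min_entropy_bits: float = 1.5,  # illustrative threshold
    max_repeat: float = 0.5,        # illustrative threshold
) -> bool:
    """Flag outputs that are low-entropy or dominated by repeated bigrams."""
    return (
        token_entropy_bits(tokens) < min_entropy_bits
        or repeated_ngram_fraction(tokens) > max_repeat
    )
```

A check like this could run on text decoded from gist-conditioned generations and feed a Telemetry counter, flagging a gist for review when it repeatedly trips the guard.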

LLMLingua / LLMLingua-2 for alternative prompt compression metrics rooted in token importance prediction (complements gisting’s generative approach).
Prompt caching and reuse work (e.g., provider-side prompt caching from OpenAI/Anthropic) for operational patterns when many prompts share structure.
Long-context distillation papers (e.g., LATS, LongLoRA) to understand how compression interacts with retrieval and adaptive context windows.

How to blend learned gist tokens with our hierarchical MegaContext Tree—should precomputed prompt gists live as pseudo-LOD1 nodes or as special metadata entries that bypass normal compression?
Can we precompute domain-specific gists (e.g., tool instructions, boilerplate) and cache them in Storage Format for instant Working Context injection, reducing cold-start overhead?
What Telemetry metrics detect gist drift when the base model is updated—track ΔNLL@H trends, attention pattern shifts, or embedding cosine similarity distributions?
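
As a starting point for the cosine-similarity option in the last question, a hedged sketch of a drift check. It assumes gist embeddings for the same source spans can be recomputed before and after a base-model update and are directly comparable, which may not hold if the embedding space itself shifts; the function names and the 0.9 threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as maximally dissimilar
    return dot / (norm_a * norm_b)


def gist_drift_report(
    old: List[Sequence[float]],
    new: List[Sequence[float]],
    threshold: float = 0.9,  # illustrative threshold
) -> Dict[str, object]:
    """Compare per-span gist embeddings from two model versions."""
    if len(old) != len(new) or not old:
        raise ValueError("need two non-empty, equally sized embedding lists")
    sims = [cosine_similarity(o, n) for o, n in zip(old, new)]
    return {
        "mean": sum(sims) / len(sims),
        "min": min(sims),
        "drifted": [i for i, s in enumerate(sims) if s < threshold],
    }
```

The `drifted` indices identify which cached gists would be candidates for recomputation; pairing this with ΔNLL@H trends would separate harmless embedding rotation from real quality loss.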