LLMLingua-2 (arXiv:2403.12968v2) — Report

PDF: LLMLingua-2 - 2403.12968v2.pdf

Overview

  • Presents a task-agnostic prompt compression pipeline centred on data distillation from GPT-4, producing faithful, short prompts without retraining the target LLM.
  • Builds a binary token-classification compressor that decides which tokens to keep, trained on a distilled dataset where GPT-4 supplies compressed references meeting strict fidelity constraints.
  • Achieves 3× compression on MeetingBank and other long-context benchmarks with equal or better downstream accuracy than prior baselines (Selective Context, LLMLingua v1).

Core Concepts

  • Data distillation: GPT-4 is prompted to compress documents under rules forbidding reordering or paraphrasing, yielding faithful shorter texts.
  • Alignment-focused labels: Tokens in original prompts are tagged “keep” vs “drop” by aligning to GPT-4 outputs; quality filters remove samples that break constraints or lose semantics.
  • Bidirectional compressor: A Transformer encoder with linear head uses bidirectional context to score token importance, enabling deterministic, low-latency compression.
  • Adaptive compression ratio: At inference, thresholds on keep-probabilities control final length, allowing user-specified trade-offs.
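
The inference-time selection step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `keep_probs` stands in for the per-token keep-probabilities a trained encoder would emit, and the ranking/threshold logic simply keeps the top fraction of tokens while preserving original order (extractive, no paraphrase).

```python
def compress(tokens, keep_probs, ratio):
    """Keep the top-`ratio` fraction of tokens by keep-probability,
    preserving original order (LLMLingua-2-style extractive selection)."""
    assert len(tokens) == len(keep_probs)
    k = max(1, int(len(tokens) * ratio))
    # Rank token indices by descending probability; break ties by position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-keep_probs[i], i))
    keep = set(ranked[:k])
    # Emit surviving tokens in their original order.
    return [t for i, t in enumerate(tokens) if i in keep]


tokens = ["the", "quarterly", "revenue", "rose", "by", "12", "percent"]
probs = [0.1, 0.8, 0.9, 0.7, 0.1, 0.95, 0.6]
compressed = compress(tokens, probs, 0.5)  # keeps 3 of 7 tokens
```

Because `ratio` is just a rank cutoff, the same scored sequence can be re-thresholded to any target length without rerunning the encoder, which is what makes the user-specified trade-off cheap.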

Relevance to MegaContext

  • Offers a token-importance perspective complementary to our learned gists; we can blend discrete keep/drop masks from binary classifiers with GistNet’s hierarchical compression for hybrid focus strategies.
  • Distillation methodology aligns with GistNet Training and LensNet Training requirements—their teacher-student pipeline provides a template for generating training labels that indicate which spans should remain in LOD0 vs compress to LOD1 or LOD2.
  • Provides faithfulness metrics (alignment span coverage, compression quality checks) relevant when Focus Allocator collapses spans; we can adopt these as automated quality gates in our Telemetry pipeline to flag over-aggressive compression.
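
A quality gate of the kind described could be approximated cheaply: since the compression is extractive, a compressed text should be an in-order subsequence of the original, and its length ratio should stay inside a target band. The sketch below is an assumption about how such a Telemetry check might look, not a metric taken from the paper; thresholds are illustrative.

```python
def is_faithful_subsequence(original_tokens, compressed_tokens):
    """True iff compressed tokens appear in the original, in order --
    i.e. no paraphrasing or reordering slipped in."""
    it = iter(original_tokens)
    # `tok in it` advances the iterator, enforcing left-to-right matching.
    return all(tok in it for tok in compressed_tokens)


def passes_gate(original, compressed, min_ratio=0.2, max_ratio=0.6):
    """Hypothetical quality gate: extractive-faithful AND within a
    compression-ratio band (bounds are illustrative defaults)."""
    o, c = original.split(), compressed.split()
    if not o or not is_faithful_subsequence(o, c):
        return False
    ratio = len(c) / len(o)
    return min_ratio <= ratio <= max_ratio
```

A gate like this would flag both over-aggressive collapses (ratio below the floor) and "compressions" that quietly rewrote content (subsequence check fails).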

What We Can Use

  • Adapt their distillation prompts for GPT-4/teacher LLM when generating training data for counterfactual utilities or GistNet Training supervision signals, ensuring the training data reflects faithful compression constraints.
  • Integrate their probability thresholding scheme into Focus Allocator Strategies: treat LLMLingua keep-probabilities as auxiliary signals that bias expansion/collapse decisions, augmenting LensNet Scoring with token-level importance priors.
  • Adopt their quality control pipeline (alignment coverage metrics, faithfulness checks) as automated validation in our Training & Operations workflows—reject training samples where gist substitutions violate fidelity constraints.
  • Explore hybrid compression modes where LOD1/LOD2 gists encode compressed semantics while LLMLingua-style masks select which LOD0 tokens must remain expanded for critical detail, optimizing Working Context budget allocation.
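
One way the thresholding integration above might look: treat LLMLingua-style keep-probabilities as a block-level prior and blend them with a LensNet focus score before the Focus Allocator decides to expand or collapse. Everything here is a hypothetical sketch; `lens_score`, the max-pooling aggregation, and the `alpha` weight are assumptions, not parts of either system.

```python
def block_priority(lens_score, token_keep_probs, alpha=0.7):
    """Blend a block-level focus score (e.g. from LensNet) with a
    token-level importance prior aggregated from keep-probabilities.
    `alpha` weights the block-level signal; max-pooling makes a single
    high-importance token enough to raise the whole block's priority."""
    token_prior = max(token_keep_probs) if token_keep_probs else 0.0
    return alpha * lens_score + (1 - alpha) * token_prior


# A block with one critical token gets a boost even if LensNet is lukewarm.
priority = block_priority(lens_score=0.5, token_keep_probs=[0.2, 0.9])
```

Max-pooling is only one design choice; mean-pooling or top-k averaging would dampen single-token spikes, which may be preferable for noisy classifiers.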

Limitations & Risks

  • Distillation relies on proprietary teacher models (GPT-4); reproducing internally requires substitute teachers (base model as teacher, domain experts for manual labels) or multi-teacher ensembles to avoid single-point quality bottlenecks.
  • Their token classifier shows domain bias toward meeting transcripts; applying to POC Implementation’s mixed corpus (code, project docs, structured data) requires corpus diversification and domain-adaptive training to prevent brittle compression on out-of-distribution content.
  • LLMLingua-2 is purely extractive (selects tokens to keep); cannot synthesize hierarchical abstractions like GistNet’s learned embeddings, so must be combined with generative compression to achieve MegaContext’s aggressive 32→1 and 1024→1 ratios while preserving substitutability.

Potential Follow-Up Reading

  • LLMLingua (v1) for perplexity-based token pruning—contrasts with distillation and offers lighter-weight heuristics.
  • Context pruning methods such as Selective Context, RetroPrompt, or Attend & Excise for alternative scoring strategies.
  • Faithful summarization literature (e.g., SummaC, QAGS) to design automated checks against hallucinated gists.

Open Questions for MegaContext

  • How to fuse token-level keep probabilities with LensNet block-level focus scores—should we combine them additively, use them as hard constraints on legal actions, or ensemble via learned gating during Focus Allocator Strategies?
  • Can we implement multi-teacher distillation (base model + domain-specialized variants) so the same training corpus generates both GistNet compression targets and LensNet expansion/collapse utilities without redundant forward passes?
  • What’s the optimal caching strategy for compression decisions across sessions—store learned keep/collapse patterns in Node Metadata, persist LLMLingua scores alongside gists in Storage Format, or recompute dynamically based on Working Context composition to handle evolving query distributions?
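
For the multi-teacher question, the simplest merge is a per-token majority vote over keep/drop labels from each teacher. The sketch below is one possible starting point, not a proposal from the paper; the strict-majority rule (ties drop the token) is an assumption worth revisiting if teachers disagree often.

```python
def merge_teacher_labels(label_sets):
    """Majority-vote keep(1)/drop(0) labels from multiple teachers over
    the same token sequence. Strict majority required: ties drop the
    token, biasing the merged labels toward shorter prompts."""
    n = len(label_sets)
    length = len(label_sets[0])
    assert all(len(ls) == length for ls in label_sets), "teachers must label the same tokens"
    return [sum(ls[i] for ls in label_sets) * 2 > n for i in range(length)]


# Three teachers, three tokens: only tokens 0 and 2 win a strict majority.
merged = merge_teacher_labels([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
```

With weighted teachers (e.g. a base model plus domain specialists), the vote generalizes to a weighted sum against a confidence threshold.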