Context Selection Has Been a Solved Problem Since 1951


Every large language model has a context window, and the window is always too small. The information you’d like the model to act on doesn’t fit; something has to be left out. So you choose — and the choosing is the whole game.

This is the central engineering problem of modern LLM deployment. It goes by different names depending on which corner of the field you’re standing in: RAG, context engineering, long-context memory, prompt compression. Billion-dollar companies are being built around various answers. Thousands of papers have been published. The approaches range from vector-database retrieval to learned summarization to sliding windows to hierarchical memory architectures.

None of them cite the mathematician who settled the question in 1951.

The problem, stated cleanly

Strip away the engineering vocabulary and what remains is a precise mathematical question. You have a fixed function f — a frozen language model — that accepts at most C tokens as input and produces an output distribution over possible responses. You have a growing stream of information I — the user’s accumulated conversations, documents, preferences, facts — that far exceeds C. You need a selector: a map that takes the stream and produces a C-token subset such that f applied to that subset approximates what f would output if it could see the entire stream.

That’s it. That’s the whole problem. A window of a couple hundred thousand tokens, a user whose history runs to tens of millions, and one function standing between them deciding what survives the cut.

Notice what isn’t here. There is no “memory” in this formulation. Memory is a metaphor imported from human cognition, and it’s a misleading one. Humans retrieve memories through association, reconstruction, emotional weighting, and a dozen other mechanisms evolved for biological brains navigating physical environments. A transformer does none of this. It computes a deterministic function of its input tokens. The question is not “what does the model remember” but “what should it see” — what belongs in the channel to make the fixed function’s output correct.

Notice also: the function is fixed. You’re not training the model. You’re not fine-tuning it. You’re choosing its input. This means encoder-decoder co-optimization — the bread and butter of information theory — doesn’t apply. You’re optimizing one side of a channel against a decoder you can’t touch. This is a mismatched problem in a specific technical sense, and the mismatch is what makes the standard information-theoretic toolkit largely inapplicable.

So what is the right toolkit?

This is the problem Blackwell solved

In 1951, David Blackwell published a ten-page paper in the Proceedings of the Second Berkeley Symposium titled “Comparison of Experiments.” The paper asks a deceptively simple question: when is one statistical experiment better than another?

An experiment, in Blackwell’s sense, is a channel — a way of generating data about something you’re trying to decide. Blackwell’s theorem gives a clean answer: experiment E₁ is at least as good as experiment E₂ for every possible decision problem if and only if E₂ can be obtained from E₁ by post-processing — by applying a stochastic transformation (a “garbling”) to E₁’s output. If you can get E₂’s data by taking E₁’s data and adding noise, then E₁ is at least as informative. The converse holds too: if E₁ is at least as informative for every decision, then E₂ must be a garbling of E₁.
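The easy direction of the theorem can be checked numerically on a toy case. The sketch below assumes two states, two signals, and a uniform prior; the matrices E1 and G and the helper names are made up for illustration and come from no library. It garbles E1 into E2 and confirms that the best achievable expected payoff never goes up, across a batch of random decision problems.

```python
import random

def matmul(A, B):
    """Multiply two row-stochastic matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def bayes_value(E, prior, utility):
    """Best expected payoff after observing one signal from experiment E.

    E[s][x] is P(signal x | state s); utility[a][s] is the payoff of action a
    in state s. For each signal we pick the action with the highest
    (unnormalised) posterior payoff and add up the results.
    """
    states = range(len(prior))
    total = 0.0
    for x in range(len(E[0])):
        total += max(sum(prior[s] * E[s][x] * u_a[s] for s in states) for u_a in utility)
    return total

# Hypothetical experiment E1 and garbling kernel G (rows sum to 1).
E1 = [[0.9, 0.1], [0.2, 0.8]]
G = [[0.7, 0.3], [0.4, 0.6]]
E2 = matmul(E1, G)  # E2 is E1 followed by extra noise

prior = [0.5, 0.5]
guess_state = [[1.0, 0.0], [0.0, 1.0]]  # payoff 1 for naming the true state
v1 = bayes_value(E1, prior, guess_state)  # about 0.85
v2 = bayes_value(E2, prior, guess_state)  # about 0.605

# The same ordering should hold for arbitrary payoffs, here 200 random
# three-action problems.
rng = random.Random(0)
random_problems = ([[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)] for _ in range(200))
never_worse = all(
    bayes_value(E1, prior, u) >= bayes_value(E2, prior, u) - 1e-12
    for u in random_problems
)
```

This only exercises the "garbling implies dominated" direction on one example; the converse is the deep half of the theorem and is not something a simulation can establish.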

This is a complete characterization, ten pages, no asymptotics, no side conditions. It has been sitting in the library ever since.

Thirteen years later, Lucien Le Cam asked the natural follow-up: what if you don’t care about every decision problem, only one specific decision? Le Cam’s deficiency δ measures how much worse one experiment is than another for a particular decision task. Where Blackwell gives an all-or-nothing partial order (one experiment dominates for all decisions or it doesn’t, and many pairs are simply incomparable), Le Cam gives a graded measure: this experiment is this much worse than that one, for this specific purpose.

Now map this onto context selection.

The full information stream is the reference experiment — the best possible data you could hand the model. The C-token context you actually select is a candidate experiment — a compressed, partial view. The single decision problem you care about is: emulate the output of f. Le Cam deficiency δ(selected context, full stream; emulate f) is literally the quantity you want to minimize. This is not an analogy you should hold at arm’s length. The objects are the same objects. The optimization target is the same optimization target. The only thing that changes is the vocabulary.
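For readers who want the symbols, one common way to write the quantity is sketched below. The notation is an assumption made here for illustration: E = (P_θ) stands for the selected context, F = (Q_θ) for the full stream, θ for whatever unknown the output of f depends on, K for a Markov kernel, and L for a loss bounded between 0 and 1. Textbook conventions differ by constant factors in the norm.

```latex
% Deficiency of E relative to F over all bounded decision problems:
\delta(E, F) \;=\; \inf_{K} \; \sup_{\theta \in \Theta}
  \bigl\lVert K P_\theta - Q_\theta \bigr\rVert_{\mathrm{TV}}

% Restricted to a single decision problem with loss L, 0 \le L \le 1:
% the smallest \varepsilon such that every rule \rho based on F can be
% matched, up to \varepsilon in risk, by some rule \sigma based on E.
\delta_L(E, F) \;=\; \inf \Bigl\{ \varepsilon \ge 0 \;:\;
  \forall \rho \;\, \exists \sigma \;\, \forall \theta \in \Theta, \quad
  R_E(\theta, \sigma) \;\le\; R_F(\theta, \rho) + \varepsilon \Bigr\}
```

In the essay's setting the second form is the relevant one, with the single loss being "how far is the model's output from what it would have been on the full stream." Zero deficiency in the first form recovers Blackwell dominance.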

Attention is the statistic

Here’s where it gets concrete for anyone who works with transformers. Every attention head computes a weighted sum: it takes the context tokens, computes softmax attention scores based on the query, and produces an output that is a convex combination of value vectors. Formally, the attention output is Σᵢ αᵢ vᵢ, where αᵢ = softmaxᵢ(q·kᵢ/√d) and d is the key dimension. That output is a statistic of the context.
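As a minimal sketch of that statistic, here is a single attention head in plain Python, with no positional encodings, masking, or learned projections (all simplifying assumptions). The toy query, keys, and values are invented. It also shows that, in this stripped-down form, the output does not change when the context is reordered.

```python
import math

def attention_output(q, keys, values):
    """Single-head attention: returns (weights, convex combination of values)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

# Toy context of three tokens (hypothetical numbers).
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, out = attention_output(q, keys, values)

# Reorder the context: same (key, value) pairs, different order.
order = [2, 0, 1]
_, out_permuted = attention_output(q, [keys[i] for i in order], [values[i] for i in order])
```

The weights are non-negative and sum to one, so the output always lies in the convex hull of the value vectors. Real models add positional information, which breaks the exact permutation invariance shown here.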

The question of whether this statistic is sufficient for the downstream decision is exactly Le Cam’s question. A sufficient statistic, in the classical sense, is one that preserves everything decision-relevant about the data. If the attention output captures all the information in the context that matters for the model’s output, then the statistic is sufficient. If attention loses something, meaning there is information in the raw tokens that matters for the output but gets washed out by the attention computation, then the statistic is deficient, and Le Cam’s δ measures how deficient.

Every ML practitioner already computes this object on every forward pass. The question of whether they’re computing a sufficient function of the context is the question Blackwell’s framework exists to answer.

Prompt compression is garbling

This connection runs deeper than the individual query. Consider what every context compression method actually does. LLMLingua takes a prompt and drops tokens based on perplexity scores. SelectiveContext filters based on information content estimates. Learned compressors train an encoder to summarize long contexts. RAG retrieves a subset of documents by similarity.

In every case, the method takes a context and applies a transformation — a Markov kernel — to produce a shorter context. In Blackwell’s language, each of these methods is a garbling. It takes one experiment and produces another by post-processing.

Blackwell’s theorem says: one experiment dominates another if and only if the second is a garbling of the first. This means there’s a clean partial order over compression methods. Method A dominates method B if B’s output can always be obtained by garbling A’s output. And there are 70 years of theory about which garblings lose what, under what conditions, for which decision problems.

None of the papers introducing these compression methods use this vocabulary. None appeal to this partial order. None ask whether their garbling is admissible in Blackwell’s sense. They evaluate empirically — run downstream benchmarks, measure perplexity recovery, compare ROUGE scores — without the formal apparatus that would let them say whether a given compression is optimal, or how far from optimal it is, or whether a fundamentally better compression exists.

The citation gap

I went looking for the citations. The major prompt compression survey of 2024 (Li et al., arXiv:2410.12388) covers the full landscape of context compression methods: token pruning, soft prompt compression, retrieval-based compression, summarization-based approaches. Neither Blackwell, nor Le Cam, nor Torgersen’s 1991 textbook Comparison of Statistical Experiments, nor the word “deficiency” in its decision-theoretic sense appears anywhere in it. Zero citations. You can open the file and search for the names yourself; they are not there.

The same holds across the broader RAG and long-context literature. I checked NeurIPS, ICML, ICLR, and the major survey papers. In papers specifically about choosing what goes into a language model’s context window — the exact problem Blackwell formalized — the classical framework does not appear.

This is not a moral failing. It’s a disciplinary geography problem. Blackwell’s comparison of experiments lives in mathematical statistics departments. It’s written in the language of measure theory, decision functions, and risk sets. The textbook treatment is Torgersen 1991, a monograph of several hundred pages published by Cambridge University Press in the Encyclopedia of Mathematics and its Applications series. It is not on any ML reading list I’ve encountered.

The ML community builds from information theory, from optimization, from empirical risk minimization — all useful frameworks, but none of them is the right framework for this particular problem, because this particular problem has a fixed decoder. The nearest framework the ML community does use — the information bottleneck (Tishby, Pereira, and Bialek, 1999) — would seem like a natural fit, and some readers may wonder why I haven’t invoked it. The reason is that the IB is degenerate for deterministic functions. Kolchinsky, Tracey, and Van Kuyk (2019) showed that when the mapping from input to output is deterministic — and transformer logits are a deterministic function of the input tokens — the IB Lagrangian becomes ill-posed or trivially piecewise constant. Since the whole point of our setup is a fixed, deterministic decoder, the IB framework doesn’t apply without artificial stochasticization. Blackwell’s framework, by contrast, handles deterministic decision rules natively. It was built for them.

The deeper reason for the gap is structural. Statistics departments and computer science departments read different journals, attend different conferences, and teach different prerequisite chains. A statistics PhD reads Le Cam; a CS PhD reads Shannon. Both traditions produce powerful tools, but they are different tools for different problem shapes. When the problem shape is “optimize both sides of a channel simultaneously at asymptotic rates,” Shannon wins. When the problem shape is “choose the best input to a fixed decision procedure,” Blackwell wins. LLM context selection is the second kind of problem. The community has been reaching for the first kind of tool.

The gap has, in fact, been recorded from inside. The most formal treatment of prompt compression to date — Nagle, Girish, and co-authors, “Fundamental Limits of Prompt Compression: A Rate–Distortion Framework for Black-Box Language Models” (arXiv:2407.15504, NeurIPS 2024) — studies exactly this setting, a fixed decoder nobody gets to redesign, and notes in its own text that “our model is one of compression for a fixed decoder,” a model that “has not been actively studied in the information theory literature before.” The authors then do what careful people do when the right tool has no name in their literature: they reach for the nearest one, rate–distortion, and do honest work inside it. The framework that takes the fixed decoder as its default had been sitting in the statistics literature for seventy years. The reach and the miss, recorded in the same paper, is the gap this essay names — stated by the field itself.

What people do instead

The engineering solutions to context selection form a spectrum, and all of them are resourceful responses to a real problem. They deserve a fair accounting.

Retrieval-augmented generation chunks documents, embeds them, stores the embeddings in a vector database, and retrieves the top-k most similar chunks at query time. This is effective and widely deployed. Its optimality criterion, to the extent it has one, is embedding similarity — which is a proxy for relevance, not a measure of sufficiency. Two chunks can be individually similar to the query and jointly redundant, or individually dissimilar and jointly necessary. Top-k similarity is not the right objective.
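The redundancy failure is easy to reproduce on a toy store. In the sketch below the chunk names, two-dimensional "embeddings", and the facts attached to each chunk are all invented for illustration. Top-2 retrieval returns two near-duplicates and misses the chunk that carries the other fact the answer needs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical chunk store: name -> (embedding, facts the chunk contains).
chunks = {
    "a": ([0.90, 0.10], {"deadline"}),
    "a_dup": ([0.88, 0.12], {"deadline"}),  # near-duplicate of "a"
    "b": ([0.30, 0.70], {"budget"}),        # less similar, but necessary
}
query = [1.0, 0.0]
needed = {"deadline", "budget"}  # facts required to answer correctly

def top_k(query, chunks, k):
    """Rank chunks by similarity to the query and keep the k best."""
    ranked = sorted(chunks, key=lambda name: cosine(query, chunks[name][0]), reverse=True)
    return ranked[:k]

selected = top_k(query, chunks, 2)
covered = set().union(*(chunks[name][1] for name in selected))
missing = needed - covered
```

Here `selected` is the two near-duplicates and `missing` is the budget fact. The point is not that similarity is useless, only that a per-chunk score cannot see joint redundancy or joint necessity.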

Learned rerankers add a second stage: retrieve broadly, then rerank with a cross-encoder. This helps with relevance but doesn’t change the fundamental framing. The reranker optimizes a relevance score, not a sufficiency measure.

Prompt compression methods — LLMLingua, SelectiveContext, and their descendants — attack the budget constraint directly by shrinking the prompt. As noted, each is a garbling. Some are better garblings than others, but none appeals to the theory that would let you characterize which.

Hierarchical summarization builds tree-structured summaries at multiple levels of abstraction and retrieves from the appropriate level. This is creative engineering and sometimes works well. It also introduces a new problem: the summarization step is itself lossy, and the loss is uncharacterized in sufficiency terms.

Sliding windows and recency weighting — the default in most conversational systems — keep the most recent C tokens. Picture the detail that decides the current question sitting eleven turns ago, dropping out of the window the moment it starts to matter. Under exchangeability assumptions on the information stream, this is provably not sufficient. The sufficient statistic for an exchangeable sequence is its empirical measure — the unordered histogram of what appeared — not its ordering. Recency weighting preserves ordering and discards content, which is exactly backwards. This is not a small point. Most deployed LLM memory systems privilege recency. Under the assumptions most charitable to them, this is demonstrably the wrong choice.
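A toy illustration of the point, assuming the stream is a list of already-extracted facts (the fact strings are invented): reordering the stream leaves the empirical measure untouched but changes what a sliding window keeps.

```python
from collections import Counter

def sliding_window(stream, C):
    """Keep the most recent C items."""
    return stream[-C:]

def empirical_measure(stream):
    """Unordered histogram of the stream, normalised to a probability measure."""
    n = len(stream)
    return {item: count / n for item, count in Counter(stream).items()}

stream = [
    "allergy:peanuts", "likes:jazz", "likes:jazz",
    "city:oslo", "likes:jazz", "city:oslo",
]
reordered = list(reversed(stream))  # same content, different arrival order

window_a = sliding_window(stream, 2)
window_b = sliding_window(reordered, 2)
same_histogram = empirical_measure(stream) == empirical_measure(reordered)
```

With C = 2 the first window drops the allergy and the second keeps it, even though the two streams contain exactly the same information. Under exchangeability that difference is pure accident of arrival order.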

Each of these approaches was built by smart people solving a real problem under real constraints. The observation is not that they’re bad — it’s that they’re all solving the same well-defined problem, and none of them uses the classical apparatus that would give them an optimality criterion. They are engineering around mathematics they don’t know exists.

What the reframing gives you

If you take Blackwell/Le Cam seriously as the formal home for context selection, several things follow.

A clean optimality criterion

The goal is to minimize Le Cam deficiency δ(selected context, full stream; emulate f). This is a single scalar quantity that captures exactly how much worse your selected context is than the full stream, for the specific purpose of making the model produce the right output. It’s not a proxy (like embedding similarity). It’s not a partial measure (like downstream task accuracy on one benchmark). It’s the thing itself.

An impossibility direction

Blackwell comparison gives structural lower bounds. Certain selectors cannot achieve zero deficiency for certain stream structures — not because the algorithm is bad, but because the budget is too small relative to the information the stream contains. These are theorems, not empirical observations. They tell you when to stop trying to improve your selector and start increasing your budget, or accepting degraded performance, or restructuring the problem.

A principled method comparison

Blackwell’s theorem gives a partial order over selectors. Selector A dominates selector B if and only if B’s output can be obtained by garbling A’s output. This is a structural comparison, not an empirical one. You don’t need a benchmark suite to determine dominance — you need to analyze the functional relationship between the two selectors’ outputs. When dominance doesn’t hold (when neither selector is a garbling of the other), Blackwell’s framework tells you that too: the selectors are incomparable, meaning each is better for some decision problems and worse for others.

The exchangeability collapse

Under exchangeability, when the information in the stream doesn’t depend on the order it arrived, the selection problem simplifies dramatically. The sufficient statistic for an exchangeable sequence is its empirical measure (the symmetry that underlies de Finetti’s representation theorem). Selection reduces to finding C tokens whose empirical distribution best approximates the stream’s empirical distribution. This is optimal quantization of a probability measure, which is a well-developed area with sharp results.

Zador’s theorem gives the asymptotic rate: the order-r quantization distortion decays as Θ(C^(−r/d)), equivalently the L^r error as Θ(C^(−1/d)), where d is the effective dimension of the distribution’s support and r is the order. The constructive algorithm is Lloyd’s (the k-means of measure quantization). This is qualitatively different from top-k similarity retrieval: it’s a coverage criterion, not a relevance criterion. You want C tokens that tile the distribution well, not C tokens that are individually close to the query.
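A bare-bones sketch of Lloyd's iteration in plain Python follows, assuming items are already embedded as points in the plane and using squared Euclidean distortion (order r = 2). The synthetic three-cluster data and all function names are illustrative. The last step, snapping each centroid to the nearest real item, is an extra assumption added here because a selector has to return actual tokens rather than averages.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lloyd(points, C, iters=20, seed=0):
    """Alternate nearest-centre assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, C)  # initialise from the data
    for _ in range(iters):
        cells = [[] for _ in range(C)]
        for p in points:
            j = min(range(C), key=lambda i: sq_dist(p, centers[i]))
            cells[j].append(p)
        centers = [
            tuple(sum(coord) / len(cell) for coord in zip(*cell)) if cell else centers[j]
            for j, cell in enumerate(cells)  # an empty cell keeps its old centre
        ]
    return centers

def distortion(points, centers):
    """Mean squared distance from each point to its nearest centre."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)

# Synthetic embedded stream: three loose clusters (made-up data).
data_rng = random.Random(1)
points = [
    (data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
    for cx, cy in [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    for _ in range(30)
]

initial = lloyd(points, 3, iters=0)  # the random starting centres
centers = lloyd(points, 3)           # after 20 Lloyd iterations
# Snap each centroid to the closest real item so the selection is a subset.
snapped = [min(points, key=lambda p: sq_dist(p, c)) for c in centers]
```

Lloyd's iteration never increases the distortion, but it only finds a local optimum, so the starting centres matter. Nothing here depends on the query, which is exactly the contrast with top-k retrieval.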

And it means recency bias is wrong — not heuristically suboptimal, but formally non-sufficient. The sufficient statistic is the histogram, not the timeline. Most LLM memory systems are optimizing the wrong thing.

A transformer-native connection

Geshkovski, Letrouit, Polyanskiy, and Rigollet (2023, further developed in the Bulletin of the AMS 2025) showed that tokens under repeated self-attention behave as an interacting particle system on the sphere, and the forward pass induces clustering dynamics that drive the token distribution toward a small number of clusters, with the self-attention matrix converging to a low-rank limit. How many clusters emerge depends on the model’s parameters — the value matrix among them. The transformer is already performing a quantization — collapsing its input toward a finite set of attractors.

This means selection should respect the quantization the transformer already performs. Don’t pick C tokens that tile the input space uniformly; pick C tokens that tile the attractor space of the transformer’s forward dynamics. The selector and the model are solving the same geometric problem, and the selector should be aware of the model’s solution.

A tractable approximation

The natural objection to all of this is: Le Cam deficiency is defined in terms of risk functions over all possible randomized decision rules. In practice, you can’t compute it exactly. Fair.

But Mu, Pomatto, Strack, and Tamuz (2021, Econometrica) proved a remarkable connection: in large samples, Blackwell dominance is equivalent to a comparison of Rényi divergences. Specifically, for a binary state of nature, one experiment generically dominates another in large samples if and only if its Rényi divergences of all orders are larger. Rényi divergences are estimable from model logits. You have the model’s output distribution, so you can compute divergences between the output under the full context and the output under the selected context, at least approximately.
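To make "estimable from model logits" concrete, the sketch below computes Rényi divergences between two next-token distributions. The two logit vectors are invented stand-ins for what a model might return under the full and the selected context, and a three-token vocabulary is assumed. This is a proxy one could track, not a computation of deficiency itself.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(p || q) for alpha > 0, in nats.

    alpha == 1 is handled as the Kullback-Leibler limit. Assumes q has full
    support, which holds for softmax outputs.
    """
    if alpha == 1.0:
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    s = sum((pi ** alpha) * (qi ** (1.0 - alpha)) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1.0)

# Hypothetical logits over a three-token vocabulary.
full_logits = [2.0, 1.0, 0.1]      # model output given the full stream
selected_logits = [1.5, 1.2, 0.3]  # model output given the selected context

p_full = softmax(full_logits)
p_selected = softmax(selected_logits)

orders = [0.5, 1.0, 2.0]
divergences = [renyi_divergence(p_full, p_selected, a) for a in orders]
```

The divergence is zero when the two distributions coincide and is non-decreasing in the order, so a profile over several orders says more than any single number. In practice the full-stream logits are exactly what you cannot compute, so they would have to be approximated, for instance on histories short enough to fit.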

This doesn’t make deficiency minimization a solved computational problem. But it provides a tractable off-ramp from the uncomputable ideal to something you can estimate and optimize against. The gap between “the theory is beautiful but uncomputable” and “here’s something you can actually run” is bridged, at least partially, by the Rényi connection.

Kernel herding as a constructive example

For the single-layer attention case, there’s an even more concrete bridge. Tsai et al. (2019) showed that attention is a kernel smoother — a Nadaraya-Watson estimator with a particular kernel determined by the query-key inner product. Under this identification, selecting C context tokens for one attention head is a kernel subsample selection problem.

Chen, Welling, and Smola (2010) and Bach, Lacoste-Julien, and Obozinski (2012) showed that kernel herding — a greedy Frank-Wolfe algorithm on the maximum mean discrepancy — drives the moment-matching error down at rate O(1/C) in a finite-dimensional RKHS, beating the O(1/√C) of random subsampling. The caveat is load-bearing here: Bach et al. (2012), via the Frank-Wolfe equivalence, established that in an infinite-dimensional RKHS — which is what the softmax/inner-product attention kernel induces — only the O(1/√C) rate is guaranteed. So for one attention head there is a constructive selector (herding), but whether it beats random subsampling at the attention kernel is open, not proven. The framework produces a working algorithm; its fast-rate guarantee at the attention kernel, and the extension to the full deep composition, are both genuinely open.
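Here is a small sketch of greedy herding over a finite pool, using the update usually attributed to Chen et al.: pick the point that maximises the mean embedding minus the average kernel similarity to what has already been chosen. Several things are assumptions made for illustration: the exponential dot-product kernel as a stand-in for the attention kernel, selection without replacement, and the toy pool of points.

```python
import math

def kernel(x, y):
    """Exponential dot-product kernel (assumed stand-in for softmax attention)."""
    return math.exp(sum(a * b for a, b in zip(x, y)) / math.sqrt(len(x)))

def herd(points, C):
    """Greedily choose C distinct indices whose kernel mean tracks the pool's."""
    n = len(points)
    gram = [[kernel(x, y) for y in points] for x in points]
    mean_embed = [sum(row) / n for row in gram]  # mu(x_i) for every candidate
    chosen = []
    running = [0.0] * n  # running[i] = sum of k(x_i, s) over chosen s
    for t in range(C):
        best = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: mean_embed[i] - running[i] / (t + 1),
        )
        chosen.append(best)
        for i in range(n):
            running[i] += gram[i][best]
    return chosen

def mmd_squared(points, chosen):
    """Squared maximum mean discrepancy between the pool and the chosen subset."""
    n, m = len(points), len(chosen)
    sub = [points[i] for i in chosen]
    pp = sum(kernel(x, y) for x in points for y in points) / (n * n)
    ps = sum(kernel(x, y) for x in points for y in sub) / (n * m)
    ss = sum(kernel(x, y) for x in sub for y in sub) / (m * m)
    return pp - 2.0 * ps + ss

# Toy pool of twelve two-dimensional points (made-up data).
points = [(0.1 * j, ((7 * j) % 5) / 5.0) for j in range(12)]
chosen = herd(points, 4)
error = mmd_squared(points, chosen)
```

The sketch makes no claim about rates. Consistent with the caveat above, whether this beats random subsampling at the attention kernel is open, so any comparison on real data would be an empirical observation and not a guarantee.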

What this does not claim

This essay does not claim that context selection is “solved” in any practical sense. The theory gives you the formal frame — the right definitions, the right optimality criterion, the right impossibility results. It does not hand you a production system. The gap between Le Cam deficiency as a concept and a selector you can deploy in a RAG pipeline is real and substantial.

It also does not claim that the existing engineering approaches are worthless. Many of them work well in practice, and some of them are approximating the right objective without knowing it. The claim is narrower: these approaches lack a formal foundation, and the formal foundation exists, and it has been sitting in the library for seventy years.

The formal framework has been actively developed throughout those seventy years. Torgersen’s 1991 textbook gives the canonical treatment. Mu et al. (2021) connect Blackwell dominance to Rényi divergences. Brooks, Frankel, and Kamenica (2024) extend the comparison of signals in directions relevant to sequential decision-making. This is not a dead field; it’s a living body of mathematics that the ML community has not engaged with.

What I am suggesting is this: the next time you design a context selection system, before you build the vector database and the reranker and the summarization pipeline, read Torgersen. Ask what the sufficient statistic is for your decision problem. Ask what garblings your compression method introduces. Ask whether your selector dominates the alternatives in Blackwell’s sense, or whether it’s incomparable, or whether it’s dominated.

The answers might change what you build.

This is the essay version. The full reference list and sources are in the published record on Zenodo.