Pointer or Preserve? Kolmogorov's Structure Function Already Answers

← Old Math

In a companion essay, I argued that context selection — choosing which tokens a frozen language model gets to see — is a sufficiency problem formalized by Blackwell and Le Cam seventy years ago. That essay names the outer problem: which pieces of information belong in the context window. This essay names the inner one. Before a piece of content competes for a slot, you face a prior question: can the model reconstruct this from a sparse cue, or must it be preserved verbatim?

Every LLM memory system makes this distinction. Importance scores assign each chunk a number; chunks below a threshold get summarized or dropped, chunks above get retained. Retrieval systems return verbatim passages for some queries and rely on the model’s parametric knowledge for others. Summarize-versus-retain policies, retrieval thresholds, token pruning scores — all are implementations of the same binary: does this content need to be in the window, or can the model get by without it?

The distinction is real and load-bearing. Content the model learned during training — common facts, standard patterns, widely-reproduced text — can be activated by a short pointer: a few tokens that cue the right region of weight space. Content the model has never seen — your proprietary data, your user’s idiosyncratic preferences, last Tuesday’s meeting notes — cannot be reconstructed from any cue. It must be in the window or it’s gone.

What’s missing is a formal definition of which side a given piece of content belongs on. The importance scores are approximating something, but the something has no name. There is no target function, no curve to fit, no theorem saying what the optimal pointer-versus-preserve boundary looks like. Everyone is navigating by instrument readings without a map.

The map exists. It was drawn in 2004. Almost no one in the LLM literature has read it.

The right formal object

The starting point is conditional Kolmogorov complexity. For a string x and a fixed model f, define K(x | f) as the length of the shortest program that produces x when given f on an auxiliary tape. This is the irreducible information content of x that f cannot supply. If K(x | f) is small, f already “knows” x — a short program plus the model suffices to reconstruct the content. If K(x | f) is large, x contains substantial information that f cannot provide. The content is novel relative to the model. A line every model has read ten thousand times is small under f; a stranger’s phone number is not.

The algorithmic mutual information I(f : x) = K(x) − K(x | f) quantifies the overlap. High mutual information means the model captures most of the content’s complexity; a pointer suffices. Low mutual information means the content is largely independent of the model; preservation is required. Gács, Tromp, and Vitányi formalized this object in 2001, building on Kolmogorov’s 1968 foundations.

But K(x | f) is a single number. It tells you the total novelty of x relative to f. It does not tell you the structure of that novelty — how much compression budget you need before the compression becomes lossy, or where the transition from lossless to lossy occurs. For that, you need the structure function.

Vereshchagin and Vitányi’s 2004 paper “Kolmogorov’s Structure Functions and Model Selection” defines the object precisely. The f-conditional structure function is:

h_x|f(α) = min{log |S| : x ∈ S, K(S | f) ≤ α}

In words: for a given complexity budget α, what is the smallest set S containing x that can be described in at most α bits given the model? The function traces a curve. At α = 0, no description budget is available, and S must be the set of all strings — h is maximal. As α increases, you can specify tighter sets, and h decreases. The curve records exactly how compressible x is at each level of descriptive investment.

The sufficiency line L(α) = K(x | f) − α enters from the upper left. Where h_x|f first touches L locates the f-relative algorithmic minimal sufficient statistic: the point at which the model-relative description has captured all the structure in x that can be captured, and what remains is incompressible noise. Below this point, further compression is provably lossy — you’re throwing away signal, not noise.

The distinction every importance score is trying to approximate is the level of the curve — not where it first touches the line. Content whose curve sits low — small K(x | f), because the model already holds most of it — is pointer-compressible: a brief description, combined with the model, recovers everything. Content whose curve rides the full line K(x | f) ≈ |x| is irreducibly novel: you need to preserve almost the entire string. The touch point is a different axis. It measures how non-stochastic x is, not how novel it is — and the two come apart at the extreme: a maximally novel string (random given the model, K(x | f) ≈ |x|) is in fact maximally stochastic, its curve being the sufficiency line, touched at α ≈ 0 and ridden at slope −1. A late touch — K(x | f) small yet no short model captures the structure — is the signature of an exotic, structured-but-non-stochastic string, not of novelty. Pointer-versus-preserve is read off the level; the touch point is read off orthogonally.

The staircase theorem: the distinction is not binary

Here is where Vereshchagin and Vitányi’s result becomes sharper than a binary classifier. Their main theorem proves that every non-increasing integer-valued staircase shape within the admissible region — bounded above by |x| and below by the sufficiency line K(x | f) − α — is realized, to logarithmic precision, as the structure function of some string. The structure function is not smooth. It is not sigmoidal. It does not transition cleanly from “pointer-compressible” to “preserve-worthy.” Content sits at arbitrary positions along arbitrary staircase curves.

This has a direct engineering consequence. Current importance-scoring methods collapse the structure function to a single number — a score, a threshold, a keep/discard decision. But the structure function is a curve. It tells you how much budget each piece of content needs: not “important or not” but “this content requires α bits of description to be losslessly compressed given the model.” Different content has different curves. The right question is not “is this above the threshold?” but “what does this content’s compression curve look like, and how much budget does it need?”

Budget allocation should be content-adaptive. A flat importance threshold treats all content as if it had the same structure function shape. The staircase theorem says this is wrong in the strongest possible sense — every staircase shape occurs, so no single threshold can be correct across all content.

The citation gap

This object — defined in 2004, published in IEEE Transactions on Information Theory, with a full textbook treatment in Li and Vitányi’s An Introduction to Kolmogorov Complexity and Its Applications — is absent from the LLM compression literature. I checked the major prompt compression surveys, the RAG literature, the context-window optimization papers. The Vereshchagin-Vitányi structure function does not appear. Twenty-two years in print, in the field’s central information-theory journal, and the compression literature walks straight past it.

One recent paper comes close. Pan et al. (arXiv:2504.09597, 2025) use the Kolmogorov structure function to analyze how an LLM’s compression capability develops with scale — a genuine engagement with the right mathematical object. But they apply it to training dynamics: how the model’s compression capability develops as it learns. This essay applies the same object at the opposite end of the lifecycle — inference-time selection against a frozen decoder, where the model is no longer learning and the question is which content the existing weights can already reconstruct from a cue, and which must be preserved verbatim. Same object, different stage. The pointer-versus-preserve question remains unaddressed in their framework.

The forty-year survey of algorithmic statistics (Vereshchagin and Shen, 2017) traces the full intellectual history: Kolmogorov’s 1968 foundations, Gács-Tromp-Vitányi 2001, the 2004 structure function paper, and subsequent extensions. The field is alive and well-developed. It is simply not on the ML community’s reading list.

The elephant: non-computability

Kolmogorov complexity is not computable. This is not a technicality — it is the standard and correct objection to any framework built on K, and it deserves a direct answer rather than a hand-wave.

The non-computability runs deeper than “we can’t calculate K(x).” Vereshchagin and Vitányi (2004) proved — in the appendices of the structure-function paper — that locating the minimal sufficient statistic on the structure function, the precise point where h touches the sufficiency line, is not even approximable in the Kolmogorov-complexity sense. You cannot get arbitrarily close to the right answer by running longer computations. The exact formal object is out of reach.

So why is this framework still useful?

Because the framework tells you what you should be approximating, even when you can’t compute the exact answer. And people are already approximating it — they just don’t know what they’re approximating.

Perplexity is already a one-point approximation

When LLMLingua scores tokens by perplexity under the target model, or when SelectiveContext filters by self-information, they are computing a single-point estimate of the structure function. Here’s why.

Cross-entropy of x under model f — the quantity − log f(x) — upper-bounds K(x | f) to within an additive constant, under standard assumptions. Delétang et al. (2023) established this connection rigorously in “Language Modeling Is Compression,” showing that language models are compressors with empirical rates that track Kolmogorov-complexity-based bounds. The per-token cross-entropy is an estimate of how many bits the model needs to encode this token. Low cross-entropy means the model predicts this token well — it’s pointer-compressible. High cross-entropy means the token is surprising — it carries information the model doesn’t have.

This is a single number. The structure function is a curve. Perplexity-based importance scoring evaluates the structure function at one point — roughly, the point where α equals the model’s own descriptive capacity — and uses that single evaluation to make the keep/discard decision. It’s like evaluating a function at one input and declaring you know its shape. The framework doesn’t demand you compute K exactly; it tells you the object has more structure than a single number, and that the additional structure — the curve’s shape, its staircase steps, its transition points — is decision-relevant information you’re currently discarding.

The computable sibling: MDL

The Minimum Description Length principle (Grünwald, 2007) is the computable-in-practice version of algorithmic statistics. Where Kolmogorov complexity uses the shortest program on a universal Turing machine, MDL uses the shortest description within a restricted model class. This makes everything computable at the cost of model-class dependence.

MDL gives a practical path to approximating the structure function: define a model class, compute description lengths within it, trace the trade-off curve. The MDL tradition is well-developed, and neural models can serve as the model class — giving “MDL with neural priors” as a computability bridge.

The caveat is real. Grünwald and Langford (2007) proved that MDL-based inference can be asymptotically inconsistent under model misspecification — when the true data-generating process is not in the model class. Grünwald and van Ommen (2017) extended this to Bayesian inference generally: under misspecification, posteriors can concentrate on wrong answers with increasing confidence. A frozen language model f applied to an arbitrary user’s data stream is certainly misspecified — the model was trained on a different distribution than this particular user generates. The misspecification caveat is not hypothetical; it’s guaranteed. Any MDL-based approximation of the structure function must account for this.

Individual strings are richer than ensembles

There is a second, deeper reason why the formal framework matters even when the exact object is non-computable.

Vereshchagin and Vitányi (2010) proved that individual strings can have rate-distortion profiles — curves describing the trade-off between compression rate and distortion — that are richer than anything Shannon’s ensemble-average rate-distortion theory can produce. Shannon’s R(D) curve gives the average trade-off across a distribution of possible strings. Individual-string R(D) curves, defined via Kolmogorov complexity, can take shapes that no ensemble average matches.

This matters because your user’s data is not a draw from an ensemble. It is a specific string — a particular sequence of conversations, documents, preferences. When you evaluate a compression method by its average performance across a benchmark, you’re computing the ensemble R(D) curve. When you deploy it on a specific user’s specific data, the relevant object is the individual-string curve — and the two can diverge sharply.

This is the formal justification for per-item analysis. Aggregate compression benchmarks can systematically misstate what’s achievable for specific content. A method that scores well on average can fail badly on the particular data that matters to a particular user — not because the method is poorly implemented, but because ensemble averages cannot capture individual-string structure. The structure function is an individual-string object. It gives you the right resolution.

What knowing the target gives you

Current importance scores are not useless — many work well enough in practice. The claim isn’t that they’re wrong; it’s that they’re approximating an unnamed target, and naming it sharpens every approximation.

Evaluation shifts from correlation to tracking. Without a formal target, you evaluate importance scores by downstream task performance: does this scoring method lead to better model outputs? This is a reasonable but blunt instrument. With a formal target, you can ask a sharper question: does this scoring method track the structure function? A score that correlates with downstream performance but diverges from the structure function in predictable ways — say, by systematically underestimating the budget needed for certain staircase shapes — can be improved by correcting for the divergence. The target tells you where to look for failure modes.

Budget allocation becomes content-adaptive. If the structure function says one piece of content needs 12 bits of description to be losslessly compressed and another needs 200, you should not give them equal space in the context window. The curve-not-threshold insight means budget allocation should respect the heterogeneity of content complexity. Some content is cheap to represent; some is expensive. A fixed threshold wastes budget on the cheap content and starves the expensive content.

The pointer-versus-preserve question connects back to the outer selection problem. Essay 1 argued that context selection is sufficiency under a budget — Le Cam deficiency measures how much worse the selected context is than the full stream. The structure function tells you what to do with each piece of content within the selection. The two frameworks are complementary: Blackwell/Le Cam says which pieces belong in the window; the structure function says how to represent each piece once it’s in. One is the outer optimization, the other is the inner. Together they define the full problem.

Computable geometric proxies are an open direction. The structure function is the formal target. Cross-entropy is one approximation of one point on the curve. But there may be other approximations — geometric, manifold-based, dimension-aware — that capture more of the curve’s shape. If the model’s internal representations have geometric structure (and they do — the representation-learning literature has documented this extensively), then compression properties of content relative to the model may be approximable through the geometry of the representation space. This is an open area with multiple possible approaches: manifold learning, neural compression bounds, dimension-aware importance scoring. The structure function tells you what these approaches should be targeting.

A note on the information bottleneck

Readers familiar with the information bottleneck (Tishby, Pereira, and Bialek, 1999) may wonder why it doesn’t appear in this analysis. The reason is the same as in Essay 1: the IB is degenerate for deterministic functions. When f is a frozen transformer computing a deterministic output from its input, the IB Lagrangian becomes ill-posed (Kolchinsky, Tracey, and Van Kuyk, 2019). The structure function handles deterministic decoders natively — it was built for the relationship between a specific string and a specific description system, with no stochasticity requirement.

The object exists

The Vereshchagin-Vitányi structure function is twenty-two years old. It is the definitionally correct object for answering “can this content be compressed to a pointer, or must it be preserved?” The LLM community is approximating it — through perplexity scores, through importance weights, through compression-based token pruning — without knowing what it is.

This is not an indictment. The engineering works, more or less, because any reasonable proxy for “how much does the model already know about this content” will correlate with the structure function to some degree. Perplexity-based scores do correlate. Learned rerankers do correlate. The correlation is not an accident — it reflects the fact that these methods are, implicitly, estimating aspects of K(x | f).

But knowing the target makes the approximations better. It tells you the object is a curve, not a number. It tells you the curve can take any staircase shape. It tells you individual strings can behave differently from ensemble averages. It tells you where the non-computability bites and where computable approximations can still help. It gives you a formal criterion for when a compression is lossy and when it’s lossless — relative to the specific model, for the specific content.

The formal object exists. The papers exist. The textbook exists. Read Vereshchagin and Vitányi 2004. Read the rate-distortion extension (2010). Read Li and Vitányi for the full development. The map is in the library. Use it.

This is the essay version. The full reference list and sources are in the published record on Zenodo.