Piijala R&D · Old Math

The Object the Field Keeps Reaching For


One operation gets done over and over inside an AI system, at a dozen different points, by different teams working on it as if it were a dozen different problems.

The operation is this. You have more information than you can put into the next computation. You have to choose what enters. A fixed function downstream will do its work on what you chose, and you want the function to behave — as nearly as possible — the way it would have behaved if it had seen everything.

At the moment of inference, that operation is context selection. Inside the KV-cache, it is eviction. Before the model is even trained, it is pretraining-data curation. In a RAG pipeline, it is reranking. When you ask a reasoning model to think step-by-step, it is chain-of-thought compression. When you compile a teacher into a student, it is distillation. When you cut a sixteen-bit float to four, it is quantization. When you decide whether last week’s interaction belongs in the weights or on a disk, it is memory consolidation. When you choose which unlabeled examples to label next, it is active learning. When you tune the codebook of a vector index, it is quantization-of-measure. When an agent compresses its observation stream into a worldview, it is something whose name has not been agreed.

These are the same operation, named differently in each setting. The function on the other side is fixed in each case — a frozen decoder at inference, a frozen reranker after retrieval, a checkpoint at curation time, a student net at distillation. The bound is finite in each case — a token budget, a memory budget, a description budget. The thing being chosen, in every case, is what enters a channel whose far end is fixed in place.

There is a precise mathematical criterion for what should enter that channel. It has existed since 1951 for the outer problem and since 2004 for the inner one. I made the case for both in two companion essays — Blackwell and Le Cam for the outer question of which pieces belong in the channel; Vereshchagin and Vitányi for the inner question of how each piece should be represented once it gets there. The two essays argue the criterion at one layer (context selection at inference). This essay generalizes the recognition across the pipeline:

Every layer of the AI pipeline named above is, in 2026, independently converging on the need for a decision-theoretic selection criterion, and reaching for adjacent-but-degenerate objects — the information bottleneck, causal sufficiency, influence functions, KL heuristics — because the correct object, Le Cam deficiency together with the Vereshchagin–Vitányi structure function relative to a fixed decoder, has never been named in this setting. The convergence is the evidence. The misreach is the gap.

The claim is specific and current: at this moment, in the most recent literature on each layer, the field is deciding that each layer needs a principled objective rather than a heuristic, and reaching for the adjacent objects nearest to hand. Convergence on adjacent objects, repeated across the pipeline by independent groups who work in separate literatures, is exactly what evidence for a missing common object looks like.

What follows walks the layers in order, beginning with the showpiece. Each line of work named here was done by careful people who got real things right. The divergence I locate is a fact about the tool the framework made available — structural rather than personal. The structural reading is the one that gives the thesis its weight.

KV-cache eviction — the showpiece

Three years ago, KV-cache eviction was the territory of working heuristics. H2O kept tokens with high accumulated attention. SnapKV picked the keys most attended-to in a local window before generation. StreamingLLM held the first few tokens plus the recent ones, on the empirical observation that attention sinks live at the start of the sequence. These were engineering ideas, evaluated on benchmarks, defended by ablations. They worked. They were also, as their authors mostly acknowledged, theory-poor: the shared objective that would say why one heuristic should dominate another, or where each heuristic would run out, had yet to be written down.
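Two of these heuristics are simple enough to write as keep-masks. The sketch below is a minimal illustration, assuming a single head and a precomputed attention matrix. The function names and the toy matrix are hypothetical, and the published methods differ in detail (H2O, for instance, accumulates its scores online as decoding proceeds and also retains recent tokens).

```python
import numpy as np

def streaming_keep(seq_len, n_sink, n_recent):
    """StreamingLLM-style mask: keep the first n_sink and the last n_recent positions."""
    keep = np.zeros(seq_len, dtype=bool)
    keep[:n_sink] = True
    if n_recent > 0:                      # guard: keep[-0:] would select everything
        keep[-n_recent:] = True
    return keep

def heavy_hitter_keep(attn, budget):
    """H2O-style mask: keep the `budget` keys with the highest accumulated attention.

    attn has shape (num_queries, num_keys); each row holds one query's weights.
    """
    scores = attn.sum(axis=0)                  # accumulated attention per key
    top = np.argsort(scores)[::-1][:budget]    # indices of the highest scores
    keep = np.zeros(attn.shape[1], dtype=bool)
    keep[top] = True
    return keep

# Hypothetical attention weights: 3 queries over 4 keys.
toy_attn = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.1, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
])
sink_mask = streaming_keep(seq_len=10, n_sink=2, n_recent=3)
hitter_mask = heavy_hitter_keep(toy_attn, budget=2)
```

Neither mask refers to the decoder's output: each scores positions or attention mass, which is the sense in which the heuristics lack a shared objective.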

That has changed. In late 2025 and 2026, four papers attempt to give KV-cache eviction a principled objective.

CapKV (Yang et al., arXiv:2604.25975, April 2026) recasts eviction through the Information Bottleneck. The authors model the retained cache as a set of linear communication channels and derive a closed-form mutual-information objective: the effective information capacity of the retained subset. Under this view, a range of existing heuristics — attention-based, structure-based — fall out as different approximations of one capacity-maximization principle. The paper proposes a selector that picks tokens by statistical leverage scores against the capacity matrix.

OBCache (arXiv:2510.07651) recasts eviction as Optimal Brain Damage at the cache level: a second-order Taylor expansion estimates the perturbation to the output when an entry is dropped, and the estimated perturbation is minimized greedily. The approach descends from a tradition that goes back to LeCun, Denker, and Solla in 1989, ported into the cache.

Ada-KV (Feng et al., arXiv:2407.11550, NeurIPS 2025) proves a theoretical loss upper bound between pre- and post-eviction attention output, and uses the bound to argue for head-wise adaptive budget allocation rather than the uniform-per-head allocation that prior methods assume.

Write-Gated KV (Huang et al., arXiv:2512.17452, December 2025) goes earlier in the pipeline still: rather than evict tokens after the cache fills, it predicts the future utility of a token before the token is admitted to the cache. The authors name the missing primitive directly: KV Admission, distinct from selection and eviction, operating pre-write as a gatekeeper.

What did each of these get right? CapKV correctly identified that eviction needs an information objective rather than a heuristic. OBCache correctly identified that a perturbation analysis is the right tool when you have a fixed decoder and you are removing one of its inputs. Ada-KV correctly identified that uniform budget across attention heads is a poor default when head behaviour is heterogeneous, and pinned down a bounded-loss criterion for the budget split. Write-Gated KV correctly identified that the eviction-vs-selection literature was missing a third primitive: write-path judgment. Independently — without any of the language of this essay — Write-Gated KV rediscovers the idea that the right place to apply the criterion is at admission, before persistence. That is exactly what the criterion would say, if it had been named. It is independent confirmation that the layer is structurally where the criterion bites.

Where do they diverge? The clearest case is CapKV’s, and it is the showpiece because the contortion shows up inside the cleanest piece of theory.

The Information Bottleneck is built on Shannon mutual information, which requires a stochastic channel — a joint distribution over the variables in play. A frozen transformer running inference is a deterministic function of its input tokens; the output can be sampled at temperature greater than zero, but the encoder of interest — the path from KV-cache to attention output — is deterministic. Kolchinsky, Tracey, and Van Kuyk (2019) proved that the IB Lagrangian becomes ill-posed or trivially piecewise constant in the deterministic case. The IB framework was built for stochastic channels; applied to a deterministic f, the formal object the framework defines either becomes piecewise constant or vanishes.

There is a variant built for exactly the degeneracy Kolchinsky identified, and it deserves to be named before it is set aside. The deterministic Information Bottleneck (Strouse and Schwab, 2017) replaces the compression term I(X;T) with the entropy H(T), and is well-posed for deterministic encoders — it is the right repair for the problem it addresses. It does not rescue the IB framing here, for reasons orthogonal to determinism. DIB remains an ensemble object: it requires a source distribution over inputs and a relevance variable Y with a joint distribution, where the cache problem has a single context and no label — what must be preserved is the fidelity of a fixed decision rule’s output, a risk statement, not a mutual-information statement about a label. And H(T) prices codebook entropy where the binding constraint is a slot budget. The deterministic-compatible variant lands adjacent for the same structural reason the original does: ensemble where the problem is individual, label-information where the problem is decision-risk. The object that takes the deterministic, fixed-rule, individual-instance case as its default remains the one this essay names.

CapKV’s authors recognise this constraint, and tell you so. They write that their linear-Gaussian surrogate “is adopted solely for analytical tractability and is not intended to faithfully model the full nonlinear dynamics of Transformer architectures.” They mean it. The surrogate replaces the deterministic attention computation with a linear-Gaussian channel — a transmission model that introduces a noise variable so that mutual information becomes definable. The authors were careful enough to tell you, in the paper, that they had to stochasticize a deterministic system. That self-awareness is the same instinct that led the strongest cache-theory papers to look for a unified objective at all.

The pivot, stated as a fact about the tool: the adjacent object forces a contortion the matching one does without. Le Cam deficiency is defined directly for a fixed decision rule against a deterministic decoder; it takes that case as its default. The full apparatus of comparison-of-experiments was built for the regime in which the decoder is fixed and the channel is deterministic — which is the regime CapKV operates in. The cost of the missing object is visible in the analytic detour the strongest, newest theory paper on the cache takes around it.

That is what I mean by “the convergence is the evidence.” CapKV is the most theoretically careful of the four. Ada-KV’s “bounded loss between pre- and post-eviction attention output” is a deficiency-against-f statement in plain English, written without the vocabulary. Write-Gated KV’s “predict future utility before admission” is the marginal-deficiency-contribution-of-this-entry-against-f test. Four independent groups, all reaching for one object, all giving it a different local name, and the most careful of them paying for analytic tractability with an explicitly-flagged stochasticization. That is the shape of a missing common object in the literature.

The matching object at this layer is: minimize δ(retained cache, full cache; f) at the admission and eviction policy. The decoder is fixed. The framework was built for that.
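As a rough illustration of what a decoder-relative objective measures, the sketch below compares one attention head's output on the full cache with its output on a retained subset. The L1 gap is a stand-in chosen for simplicity, in the spirit of the pre- versus post-eviction comparison described above; it is not Le Cam deficiency itself. All names and the toy tensors are assumptions.

```python
import numpy as np

def attention_output(q, K, V):
    """Single-head softmax attention output for one query vector q."""
    logits = K @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())     # subtract the max for numerical stability
    w /= w.sum()
    return w @ V

def eviction_gap(q, K, V, keep):
    """L1 distance between the full-cache and retained-cache attention outputs."""
    full = attention_output(q, K, V)
    kept = attention_output(q, K[keep], V[keep])
    return float(np.abs(full - kept).sum())

# Hypothetical cache with one dominant key (index 0) and two minor ones.
q = np.array([1.0, 0.0])
K = np.array([[10.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

gap_drop_minor = eviction_gap(q, K, V, np.array([True, False, True]))
gap_drop_major = eviction_gap(q, K, V, np.array([False, True, True]))
```

Dropping a minor key barely moves the output, while dropping the dominant key moves it almost as far as it can go. An admission or eviction policy scored this way is judged by what the fixed decoder does, not by a property of the tokens alone.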

Pretraining data curation — the dollar-leverage layer

This is the layer where the missing object is worth the most money. The training run is where the compute bill is paid; the data that goes into it determines how much capability per FLOP you get. The literature here is correspondingly active, theorized, and contested.

The current attempts span four approaches. PDS (Gu et al., arXiv:2410.07064) treats data selection as an optimal control problem and uses Pontryagin’s maximum principle to derive a selector that explicitly optimizes for downstream performance along the training trajectory. MATES (model-aware data selection via influence) approximates the per-example impact on a target objective using influence functions, in the tradition of Koh and Liang (2017). The 2025 line on downstream free energy and the Widely-applicable Bayesian Information Criterion (arXiv:2410.05612, ICML 2025) frames checkpoint adaptability through Watanabe’s singular learning theory, scoring data by what it does to the model’s loss-landscape geometry. Multi-actor collaborative selection (ACL 2025) ensembles a set of heuristic actors and learns to mix them.

What did each get right? Each one of these is reaching for “the marginal value of this data point against the current model.” That target is correct. Influence functions, in particular, are a real local approximation of exactly the object that should be the criterion — they linearize “what would change if I trained on this example a bit more” against the current checkpoint. PDS is correct that the right framing pays attention to the training trajectory, not just the static distribution of data against the static model. The free-energy work correctly identifies that what makes a data point valuable is what it does to the model’s capacity to adapt, which is a different question from how surprising it is at one instant.

Where do these diverge? The object each one approximates is the same object, and it remains unnamed in this setting. The algorithmic mutual information I(f : x) = K(x) − K(x | f) is the global form. Influence functions are a local linearization of this object around the current checkpoint. The free-energy / WBIC framing is the same adaptability question seen through the loss-geometry lens. Each is one face of one object, and the object is the one this essay is asking the field to name.

What follows from naming it. Low-K(x | f) content — content the model can already reproduce from short cues — has near-zero training value: the model already covers it. High-K(x | f) content — content the model cannot produce on its own — is what moves weights. The right ranking on a data point is its K(x | f) against the current checkpoint, and the right approximation is exactly what cross-entropy under f delivers, with Delétang et al. (2023) as the licensing theorem.
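The ranking can be sketched with a toy stand-in for the checkpoint. Here f is assumed to be an add-one-smoothed character unigram model, which is far weaker than any real language model; the point is only the shape of the computation, scoring each candidate by its code length under f. The names, the alphabet, and the example strings are all hypothetical.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def fit_unigram(text, alphabet=ALPHABET):
    """Toy stand-in for the checkpoint f: add-one-smoothed character frequencies."""
    counts = Counter(text)
    total = len(text) + len(alphabet)
    return {ch: (counts[ch] + 1) / total for ch in alphabet}

def bits_per_char(x, model):
    """Cross-entropy of x under the model, in bits per character.

    Used here as an observable proxy for K(x | f), normalized by length.
    """
    return -sum(math.log2(model[ch]) for ch in x) / len(x)

f = fit_unigram("the cat sat on the mat the cat sat on the mat")
candidates = ["the cat sat", "quiz jinx vex"]

# Highest code length first: content the model covers least ranks highest.
ranked = sorted(candidates, key=lambda x: bits_per_char(x, f), reverse=True)
```

With a real checkpoint the same ranking would use per-token log-probabilities in place of the unigram table. A raw cross-entropy ranking also promotes noise, so in practice it would need a filter for incompressible junk; that refinement is outside this sketch.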

A corollary that falls out clean: the synthetic-data ceiling. Model-generated text y drawn from f has algorithmic mutual information with f bounded above by K(f), and in expectation much less. The “in expectation much less” clause is conservation of independence — randomized processing of f’s outputs cannot increase their algorithmic mutual information with f (Levin 1974). There is a finite limit to how much a model can teach itself, expressible in bits, and the limit is set by f’s own complexity. It is the formal statement of what the model-collapse observations have only gestured at.
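The upper bound can be written out in one step from symmetry of information. This is the standard statement for plain Kolmogorov complexity, which holds up to logarithmic additive terms that the prose above suppresses; the notation follows the definition of I(f : x) given earlier.

```latex
\begin{aligned}
I(f : y) &= K(y) - K(y \mid f) \\
         &= K(f) - K(f \mid y) + O\bigl(\log K(f, y)\bigr) \\
         &\le K(f) + O\bigl(\log K(f, y)\bigr).
\end{aligned}
```

The second line is symmetry of information; the third drops the non-negative term K(f | y). The bound says nothing about how far below K(f) a typical sample sits, which is the part the conservation argument carries.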

RAG reranking — the cleanest untouched gap

The 2026 production reranking stack is similarity. Cross-encoders score query-document pairs by learned relevance. Reciprocal Rank Fusion blends rankings from multiple retrievers. Cohere and Voyage rerankers, instruction-following rerankers, all score against a relevance objective.

This is the layer where the criterion-side work has barely begun, which is why it is the cleanest untouched case. The flickers in the right direction live on the evaluation side: RAGAS measures “sufficient context” — whether the retrieved chunks would let any competent reader answer the question. HyperRAG’s “generation upper bound” framing asks how good the answer could possibly be given what was retrieved, independent of how it was generated.

What did these get right? RAGAS and HyperRAG correctly identify that the measurement you want is sufficiency relative to the answer, rather than relevance to the query. A set of chunks can be individually relevant and jointly insufficient (each is topical, the joint coverage falls short), or individually marginal and jointly sufficient (each is a piece of a puzzle whose assembly answers the question). Sufficiency is the right concept, and the evaluation work names it.

Where do they diverge? RAGAS and HyperRAG name sufficiency as a property of the retrieved set after the fact; the corresponding work on the selection criterion itself — sufficiency as the objective the reranker optimizes during selection — is the next step. The reranker itself, in every production system I can find, is still scoring relevance. The matching reframing is direct: build a reranker whose objective is to minimize δ(reranked chunks, full corpus; f-answers-the-query), with the answer distribution under f as the decision problem. Sufficiency, not similarity. The evaluation literature has already named the property. Pushing it from evaluation to objective is the missing step.
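The difference between the two objectives shows up even in a toy. The sketch below assumes an oracle that knows which facts each chunk carries and which facts the answer needs, which no production reranker has; it uses greedy coverage as a crude stand-in for a sufficiency objective. The chunk records, scores, and function names are hypothetical.

```python
def top_k_by_relevance(chunks, k):
    """Baseline: pick the k chunks with the highest individual relevance score."""
    ranked = sorted(chunks, key=lambda c: c["relevance"], reverse=True)
    return [c["id"] for c in ranked[:k]]

def greedy_by_coverage(chunks, needed, k):
    """Pick chunks one at a time by how many still-missing facts each one adds."""
    missing, chosen, pool = set(needed), [], list(chunks)
    while pool and missing and len(chosen) < k:
        best = max(pool, key=lambda c: len(c["facts"] & missing))
        if not best["facts"] & missing:
            break                          # nothing left in the pool helps
        chosen.append(best["id"])
        missing -= best["facts"]
        pool.remove(best)
    return chosen

def covered(chunks, ids):
    """Union of the facts carried by the chunks with the given ids."""
    return set().union(*(c["facts"] for c in chunks if c["id"] in ids))

needed = {"a", "b", "c"}
chunks = [
    {"id": "A", "relevance": 0.90, "facts": {"a"}},
    {"id": "B", "relevance": 0.85, "facts": {"a"}},   # topical but redundant
    {"id": "C", "relevance": 0.40, "facts": {"b"}},
    {"id": "D", "relevance": 0.30, "facts": {"c"}},
]
by_relevance = top_k_by_relevance(chunks, k=3)
by_coverage = greedy_by_coverage(chunks, needed, k=3)
```

With the same budget of three chunks, the relevance ranking spends a slot on a redundant chunk and leaves one fact uncovered, while the coverage selection is jointly sufficient. A real sufficiency reranker would have to estimate coverage through the decoder rather than read it from labels.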

Reasoning compression — the near-miss on the name

Chain-of-thought compression is a busy layer in 2025 and 2026, because reasoning models — which generate long internal traces before producing the user-facing answer — are expensive at inference, and the question of which intermediate steps could be skipped without changing the answer has become an operational concern.

The most theoretically engaged recent work here is KAPPA (Li et al., arXiv:2511.00699, NeurIPS 2025), which scores candidate reasoning branches by a combination of KL divergence between branch output distributions, confidence, and entropy, then prunes the lowest-scoring branches. Around it sit RAC (reconstruction-based pruning), TokenSkip and LightThinker (redundant-token pruning), DeepConf (confidence-weighted majority voting), and at the most semantically engaged end, Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning (Yu et al., arXiv:2506.09853, NeurIPS 2025).

What did these get right? KAPPA’s KL term is the deficiency surrogate. If you ask “does this branch change the model’s output distribution from where it would have been,” and you measure the change by KL, you have a one-sided proxy for δ(branch-output, no-branch-output; f-answers-the-question). KAPPA is computing the criterion under the name of “branch quality.” The causal-sufficiency paper is genuinely reaching for a sufficiency notion: the authors recognize that step-pruning should be governed by whether the step is necessary for the conclusion, which is exactly the right framing.

Where do they diverge? There are two distinct cases worth separating. KAPPA’s KL is used as a branch heuristic, which leaves downstream choices about how to weight it against confidence and entropy in the realm of tuning rather than derivation. The causal-sufficiency paper uses Pearl causal sufficiency (the requirement that all relevant causes of the outcome are observed), which is a distinct object from Blackwell statistical sufficiency. The two notions come from different lineages: Pearl’s is about identifying causal effects in a graphical model; Blackwell’s is about whether one experiment is informationally adequate for a decision relative to another. The CoT paper means the Pearl one. The near-miss is honest and worth naming for what it is: a sufficiency notion from one lineage gesturing at a sufficiency notion in another.

A third pattern to flag, because it appears in adjacent CoT framings: the Information Bottleneck. CoT compression has been framed in some work as IB on the reasoning trace — “compress the trace while preserving information about the answer.” For the same reason as at the cache, IB’s formal object collapses here. The model that produces the final answer from the (possibly pruned) trace is fixed and deterministic at the point the pruner runs; stochasticization is required to get an IB problem at all.

The matching object: prune the steps that fail to lower deficiency toward the final-answer distribution under the fixed model. The criterion is the same as at the cache — a deficiency-against-f test — applied to internal reasoning tokens rather than to context tokens. The cache and the reasoning trace are the same layer at different abstraction levels of “what enters the model’s next computation.”
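A minimal sketch of that test, assuming the answer distribution with each step ablated has already been obtained by re-running the fixed model. KL in bits serves as the one-sided surrogate discussed above; the distributions, the tolerance, and the function names are hypothetical.

```python
import numpy as np

def kl_bits(p, q, eps=1e-12):
    """KL(p || q) in bits between two discrete answer distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * (np.log2(p + eps) - np.log2(q + eps))))

def prunable_steps(full_dist, ablated_dists, tol_bits):
    """Indices of steps whose removal moves the answer distribution by < tol_bits."""
    return [i for i, d in enumerate(ablated_dists)
            if kl_bits(full_dist, d) < tol_bits]

# Hypothetical answer distributions over three candidate answers.
full = [0.90, 0.05, 0.05]          # model run on the complete trace
ablated = [
    [0.89, 0.06, 0.05],            # step 0 removed: the answer barely moves
    [0.40, 0.35, 0.25],            # step 1 removed: the answer collapses
    [0.90, 0.05, 0.05],            # step 2 removed: no change at all
]
to_prune = prunable_steps(full, ablated, tol_bits=0.01)
```

Ablating one step at a time ignores interactions between steps, so a trace pruned this way should be re-checked as a whole; the sketch only shows the per-step test.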

The remaining layers, briefly

The pattern continues across the rest of the pipeline. Each of the following deserves a longer treatment than it gets here; the gesture is only to show that the criterion’s reach runs wider than the four layers I have walked.

Prompt compression (LLMLingua and its descendants) operates at a different stack level from the criterion above; it composes with the selection problem rather than competing with it. A curve-aware selector decides what enters the prompt; LLMLingua-style methods then compress the resulting prompt token-by-token. Both can be deficiency-minimizing against f, at different granularities. The layer also carries the pattern’s cleanest written record: the most formal treatment of prompt compression to date (Nagle, Girish, et al., arXiv:2407.15504, NeurIPS 2024) reached for rate–distortion, and noted in its own text that its model “is one of compression for a fixed decoder” — a model that “has not been actively studied in the information theory literature before.” The reach and the gap, documented together by the reaching paper.

Distillation and quantization are the same keep-versus-compress decision at the weight level. The structure function applied to a student net asks which behaviours of the teacher need to be preserved verbatim versus which can be represented by a shorter description (a smaller student, a lower bit-width). The bridge here is the same cross-entropy ↔ K(·|f) connection as in the rest of the program (Delétang et al. 2023); without that bridge stated, the analogy reads as suggestion rather than identification.

Memory consolidation — the question of when to fold a fact from episodic storage into the weights via fine-tuning — is a threshold on cross-task Blackwell utility against retrieval cost. A fact that pays its keep across many future tasks at the expected retrieval-cost differential goes into weights; one that pays only on a narrow set stays on disk.

Active learning and preference-data selection are highest-marginal-information-against-current-f tests at the labeling budget. The active learning literature has known it needs an optimality target for decades; the target it has lacked is I(f : x) against the current checkpoint.

Vector-index design — codebook size, PQ subspace count, IVF cluster count — is Graf–Luschgy quantization of the empirical measure of the corpus. The optimal codebook size is the one that achieves the right rate-distortion trade-off on the corpus’s distribution; the effective dimension of that distribution sets the rate, and the rate sets the codebook. Most production indices set these by hand or by grid search. The Graf–Luschgy answer is in the literature.
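The trade-off can be seen empirically with a small sketch: fit codebooks of increasing size to a toy corpus and record the mean squared quantization error. Lloyd's algorithm with a deterministic farthest-point start is an assumption of convenience here, not the Graf–Luschgy analysis itself, and the synthetic four-cluster data and all names are hypothetical.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic start: first point, then repeatedly the point farthest away."""
    centers = [X[0]]
    for _ in range(k - 1):
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[np.argmax(d2)])
    return np.array(centers, dtype=float)

def lloyd(X, k, iters=25):
    """Lloyd's algorithm; returns the codebook and the mean squared distortion."""
    C = farthest_point_init(X, k)
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):               # leave an empty cell's center unchanged
                C[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return C, float(d2.min(axis=1).mean())

# Synthetic corpus: four tight clusters at the corners of a square.
rng = np.random.default_rng(0)
corners = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.concatenate([c + 0.1 * rng.standard_normal((50, 2)) for c in corners])

distortion = {k: lloyd(X, k)[1] for k in (1, 2, 4)}
```

On this data the distortion drops sharply until the codebook size matches the number of clusters, after which extra codewords would buy little. That knee is the empirical counterpart of the rate set by the corpus's effective structure.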

The last entry is flagged separately as a conjecture:

Agentic world models, in the sense an agent maintains a compressed internal representation of its environment, may be the Conant–Ashby good regulator (1970) of that environment — a minimal sufficient statistic of the observation history for the task of choosing the next action. If so, memory consolidation and world-model formation are one sufficiency object in two costumes: the same minimal sufficient statistic, computed once over a long timescale for consolidation and once over a short timescale for world-modelling. I conjecture this; the result is open. The good-regulator literature has people who will check, and offered as a conjecture this is a gift to the field; offered as a result it would be an overreach.

A finding from the core data

The primary paper proves the criterion at one layer end-to-end, across seven decoders in four families. The headline result is what the theory predicts: the curve-aware advantage is positive in every resample for every decoder. The data carries a second finding worth lifting into this essay, because it sharpens what the criterion reads and adds texture a too-clean reading of the layers above could miss. A too-clean reading would have f behave like a smooth monotone of how much context it has — more context, more facts recovered, a tidy descending curve. The data deviates from that picture in two places. Both deviations are the criterion operating in a richer regime than the clean reading allows.

The first deviation concerns capacity scaling. Across the seven-decoder roster, the zero-context known-content floor does not rise with parameter count. DeepSeek 4 flash, the smallest decoder in the roster, sits at 0.82; Sonnet 4.6, the largest Anthropic model, sits at 0.58. Epistemic style — verbose training-knowledge fallback versus hedged epistemic restraint — appears to dominate parameter count at the scales tested. The structure function is decoder-relative by construction; the dimension of the decoder it reads off is the response policy under cue scarcity. The criterion measures something more interesting than how much the model knows. It measures how the model deploys what it knows when cued sparsely.

The second deviation is the U-shape. Qwen 3.5 (397B), the largest decoder in the roster, produces a known-content retention curve of 0.90 → 0.66 → 0.41 → 0.57 → 0.79 as context falls from full to zero, that is, as compression rises from none to total. The trough at fifty-percent context is real: substring scoring and an independent semantic judge agree to within two percentage points on every item at every level, including the unusual zero-context cells. The mechanism is decoder-state heterogeneity. Qwen with zero context produces a verbose training-knowledge biography that covers every expected fact by sheer volume. With intermediate context it produces a concise faithful summary of the partial context, omitting anything the partial context did not carry. With full context the response is clean again. The decoder’s response policy depends on cue length, and the curve is what cue-length-dependent policy looks like when read through the structure function.

This is the theory in a richer regime. The criterion reads the joint signature of content, decoder, and cue regime — the curve is what (content × decoder × cue-length) produces. The paper’s curve-aware allocator achieves its +78.4% advantage on Qwen precisely by detecting the U-shape from the per-item curves and routing budget around the trough — promoting items either to full context or to the floor where Qwen’s training-knowledge fallback fires.
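The arithmetic behind routing around the trough can be shown with the quoted curve alone. The sketch below is a simplification, not the paper's allocator: it assumes every item shares the same curve, that the five quoted values sit at evenly spaced context fractions from full to zero, and that the budget is an average context fraction. It does not attempt to reproduce the +78.4% figure.

```python
# Known-content retention for the U-shaped decoder, as quoted in the text.
# ASSUMPTION: the five values correspond to context fractions 1.00 ... 0.00.
curve = {1.00: 0.90, 0.75: 0.66, 0.50: 0.41, 0.25: 0.57, 0.00: 0.79}

def uniform_retention(curve, level):
    """Every item receives the same context fraction."""
    return curve[level]

def two_level_retention(curve, budget, hi=1.00, lo=0.00):
    """Split items between two levels so the average context fraction equals budget."""
    share_hi = (budget - lo) / (hi - lo)   # fraction of items sent to the high level
    return share_hi * curve[hi] + (1 - share_hi) * curve[lo]

uniform_at_half = uniform_retention(curve, 0.50)       # every item in the trough
split_at_half = two_level_retention(curve, budget=0.50)  # half full, half zero
```

At the same average budget, the uniform policy parks every item at the bottom of the U, while the split policy averages the two ends of the curve. The gain comes entirely from the curve being non-convex in the middle, which is what a per-item, curve-reading allocator can detect and a uniform policy cannot.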

There is an older intuition that this result settles in its favor. A model that has learned a concept well needs only one word — sometimes none — to produce content from weights. Qwen at zero context is the empirical demonstration of that regime: the silent prompt as the trigger for a verbose recovery from training. The 50% trough is the regime between — enough cue to anchor the model to a partial picture, not enough to give it the whole one, and not so little that the silent-prompt fallback fires. The criterion sees the trough exactly where the intuition predicts. The sign of the curve-aware advantage is positive in every resample for every decoder; the shape of the curve carries decoder-specific information that the allocator can route around. That richer picture reframes one of the directions the next section names.

The open problems the framing exposes

There is a particular kind of contribution that comes from naming a problem the field did not know it had. The reframing in this essay exposes two open problems that the foundational pair points to, and a third direction that the core data is already asking for. Stating them clearly is its own offering.

The deep-composition selector. For a single attention head viewed as a kernel smoother (Tsai et al. 2019 supplies the kernel identification), kernel herding (Chen, Welling, and Smola 2010; Bach, Lacoste-Julien, and Obozinski 2012) — Frank-Wolfe on the maximum mean discrepancy — gives a constructive selector. Its O(1/C) rate, which beats random subsampling’s O(1/√C), is proven only for finite-dimensional RKHS; for the infinite-dimensional RKHS the softmax attention kernel induces, only O(1/√C) is guaranteed (Bach et al. 2012). So even the single-head case leaves a rate question open: whether herding beats random selection at the attention kernel at all. The composition theorem — extending the selection guarantee through a deep nonlinear stack of attention layers — is the larger gap: the rate at the output, given per-layer rates, is an unknown function of depth and nonlinearity. This is the gap between “we have a constructive selector for one layer” and “we have a selector, with guarantees, for the whole model.” The single-layer construction is the seed; both its rate at the attention kernel and the composition theorem are the next pieces of mathematics.
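For the single-layer seed, a minimal herding sketch over a finite pool looks like the following. It swaps in an RBF kernel on synthetic points rather than the softmax attention kernel, and it selects without replacement, which is a simplification of the original procedure. All names and parameters are hypothetical, and a toy of this size says nothing about the open rate question.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian RBF Gram matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_herding(K, m):
    """Greedy herding over a finite pool with Gram matrix K; returns m indices."""
    mu = K.mean(axis=1)               # mean embedding evaluated at each pool point
    running = np.zeros(K.shape[0])    # sum of k(., x_s) over the selected points
    chosen = []
    for t in range(m):
        scores = mu - running / (t + 1)
        if chosen:
            scores[chosen] = -np.inf  # no repeats (a simplification)
        i = int(np.argmax(scores))
        chosen.append(i)
        running += K[:, i]
    return chosen

def mmd2(K, idx):
    """Squared MMD between the uniform measure on idx and on the full pool."""
    idx = np.asarray(idx)
    return float(K[np.ix_(idx, idx)].mean() - 2 * K[idx].mean() + K.mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))     # hypothetical pool of 200 points
K = rbf_kernel(X, X)

chosen = kernel_herding(K, 20)
herding_mmd2 = mmd2(K, chosen)
random_mmd2 = float(np.mean(
    [mmd2(K, rng.choice(200, size=20, replace=False)) for _ in range(50)]
))
```

Comparing `herding_mmd2` with `random_mmd2` shows the selector's advantage on one layer's worth of kernel. Whether that advantage survives a stack of nonlinear layers is exactly what the composition theorem would have to settle.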

Segmentation. The framework, throughout this essay, presupposes that the information stream comes in units — tokens, sentences, chunks, examples, time-steps. Selection then chooses among the units. Partitioning the stream into units and selecting among units are co-dependent: the right partition depends on what will be selected, and the right selection depends on how the stream is partitioned. They are mutually defining and, formally, circular. The joint objective for “partition and select simultaneously, against a fixed downstream decoder” is open. This is the second open problem.

A direction the core data opens: the cue-regime-aware allocator. The curve-aware allocator in the primary paper is two-tier — full context or minimum context, with the threshold set by the local steepness of the structure function. On the API decoders the match to the curve shape is tight. On Qwen the two-tier choice misses an opportunity the data names directly: known-content retention at zero context (0.79) sits above uniform’s 50% trough (0.41). A three-bucket allocator — full / minimum / zero — would exploit the training-knowledge fallback explicitly for items the structure function flags as pointer-compressible, letting the verbose-fallback policy carry the load on content the model already covers from weights. The generalization the data points at is deeper than three buckets. The optimal selector for a decoder whose response policy varies with cue regime is a cue-regime-aware one. The criterion remains the same δ against f; the cue regime becomes an explicit decision the selector makes alongside what enters the channel. The single-decoder case is tractable; the general theorem is open.

The three sit on different timescales. Cue-regime-aware allocation is the one the data is already asking for and the one a careful next paper could close; deep-composition and segmentation are the longer arc. All three are exposed by the reframing. The framework gives a sharp definition of the optimal selector for any given partition against any given decoder in any given cue regime, and locates where the remaining mathematics lives.

Measuring emulation divergence (KL or Rényi divergence between f(S) and f(I)) alongside task scores in the existing seven-decoder grid would test how δ_emul and δ_task track each other across cue regimes; the U-shaped decoder is the natural first cell.

A close

The proven case is narrow, and it lives elsewhere — Le Cam deficiency as the optimality criterion at the inference-time context-selection layer, the Vereshchagin–Vitányi structure function as the keep-versus-compress criterion per item, both observable through cross-entropy under f, carried end-to-end across seven decoders in the two companion essays and the primary paper. The breadth is what this essay added: one object walked from the KV-cache through curation, reranking, and the reasoning trace, on to distillation, memory, active learning, and the index — borrowed by reference at the proven layer, extended by recognition across the rest.

The thesis is the one I stated at the open. Across every layer of the AI pipeline, the field is, right now, converging on the need for a decision-theoretic selection criterion and reaching for the adjacent objects nearest to hand — IB at the cache and in CoT, influence functions and free energy at curation, similarity at reranking, Pearl causal sufficiency at CoT — because the matching object, Le Cam deficiency together with the Vereshchagin–Vitányi structure function against a fixed decoder, has yet to be named in this setting. The convergence is the evidence. The misreach is the gap — a fact about which tool was within reach in each literature, not about who reached.

The mathematics on the table here is unpatentable and belongs to the people who wrote it. Blackwell published in 1951. Le Cam in 1964. Vereshchagin and Vitányi in 2004. Kolchinsky’s IB-degeneracy result is from 2019. Delétang’s cross-entropy-as-compression bound is from 2023. Mu and co-authors’ Rényi-Blackwell equivalence is from 2021. The contribution this essay makes, alongside its companions, is naming the object and connecting the apparatus that surrounds it to the layers that ask for it. The mission, for which the contribution is offered openly, is reducing AI’s energy footprint at the layer where it bites hardest: denser context rather than bigger windows, at the edge and at local inference, where the Jevons rebound is weakest because the budget is set by the device rather than by demand elasticity.

The formal paper proves one layer to referee standard. The companion essays argue the foundational pair. This essay argues the territory the foundational pair generalizes to. The next paper at any one of the other layers walked here — the cache, the curation pipeline, the reranker, the reasoning trace — would now have an object to optimize against. Read Torgersen. Read Vereshchagin and Vitányi. The map is in the library, and it has been waiting.


This is the essay version. The full reference list and sources are in the published record on Zenodo.