Why Benchmarks Fail in Analog Systems
Benchmarks assume that the same input produces the same output. Large language models broke that assumption. As systems become contextual and analog-like, single scores stop measuring intelligence and start measuring compliance.
Part II — You Don’t Benchmark the Weather
Tinkering with Time, Tech, and Culture #36
Editor's note: This essay is a companion to Never Twice the Same Color — Part I — How Transformers Became Analog Again, which explores why large language models exhibit variation, context sensitivity, and analog-like behavior despite running on digital hardware. If that piece explains why these systems no longer behave like deterministic machines, this one asks what it means to keep evaluating them as if they do.
In the NTSC era, no two televisions showed the same color—not because the broadcast was wrong, but because every receiver introduced its own drift. The signal was real. The interpretation was variable.
A benchmark does not measure a model. It measures the receiver.
Benchmarks fail for the same reason NTSC did. They assume a perfect antenna, a neutral context, a stable observer. But models do not exist in isolation. They exist inside prompts, constraints, workflows, and human intent. A static benchmark cannot capture that any more than a 1950s television could capture "true" color.
Modern language models make this tension unavoidable. They are probabilistic at the surface—even when deterministic at the core. You can force greedy decoding, always choosing the top token, the last purely digital island these systems have left. And still, the behavior remains analog. Change a comma in the prompt and the entire trajectory shifts. Same weights. Same algorithm. Different initial condition. A butterfly wing in text form.
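To make that sensitivity concrete, here is a minimal sketch using the Hugging Face transformers library, with the small gpt2 checkpoint as a stand-in (any causal model would do). Greedy decoding is perfectly repeatable for a fixed prompt; the interesting part is how far apart two nearly identical prompts can land.

```python
# Minimal sketch: greedy decoding is repeatable for a fixed prompt, yet a
# one-character edit can send the continuation down a different trajectory.
# Assumes the Hugging Face `transformers` library; "gpt2" is just a small
# stand-in checkpoint, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "The results of the experiment which ran for a week were",
    "The results of the experiment, which ran for a week, were",  # commas added
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    # do_sample=False forces greedy decoding: always take the top token.
    output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(repr(tokenizer.decode(output[0], skip_special_tokens=True)))

# Rerunning either prompt reproduces its own continuation exactly; any divergence
# between the two typically comes from the changed initial condition, not sampling.
```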
This is why benchmarks decay, a phenomenon economists call Goodhart's Law: once a measure becomes a target, it ceases to be a good measure. Models are now trained to pass benchmarks, their training data contaminated with test items, their answers rehearsed, their outputs optimized for the scoreboard. That is not evaluation. It is teaching to the test. And when we treat an analog system like a digital scorecard, we don't reveal its capability; we erase the signal we were trying to observe.
You Don't Benchmark the Weather
No one evaluates a weather model by asking it the same question twice.
You don't run tomorrow's forecast, rerun it with the exact same inputs, and declare the model "broken" because the cloud cover shifted by five percent. Weather systems are chaotic. They're sensitive to initial conditions, measurement error, and internal feedback loops. Small changes compound. Trajectories diverge.
So we evaluate weather models differently.
We ask whether they're useful:
- Do they capture large‑scale patterns?
- Do they improve as more context is added?
- Are they stable under small perturbations?
- Do they fail gracefully, or catastrophically?
We don't expect exact reproducibility. We expect bounded variation.
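As a rough sketch of what bounded-variation evaluation could look like in code: score the same task under a few prompt perturbations and report a band rather than a point. The run_model hook is a hypothetical stand-in for whatever system you are evaluating, and the numbers in the example are illustrative placeholders, not real measurements.

```python
# Sketch of bounded-variation evaluation: score the same task under small prompt
# perturbations and report a band (center plus spread), not a single number.
# `run_model` is a hypothetical hook for whatever system is being evaluated;
# the scores under __main__ are illustrative placeholders, not real measurements.
import statistics
from typing import Callable, Iterable

def bounded_variation_report(scores: Iterable[float]) -> dict:
    """Summarize per-perturbation scores as a band: center plus spread."""
    scores = list(scores)
    return {
        "mean": round(statistics.mean(scores), 3),
        "stdev": round(statistics.pstdev(scores), 3),
        "range": round(max(scores) - min(scores), 3),
    }

def evaluate_under_perturbations(run_model: Callable[[str], float],
                                 variants: list[str]) -> dict:
    """Score one task phrased several slightly different ways."""
    return bounded_variation_report(run_model(v) for v in variants)

if __name__ == "__main__":
    # Pretend scores from three paraphrases of the same question (illustrative only).
    print(bounded_variation_report([0.82, 0.79, 0.64]))
    # A wide range is information, not noise: the model is sensitive to phrasing
    # on this task, which a single averaged score would hide.
```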
Large language models live in the same category of system.
They operate in high‑dimensional phase spaces, propagate tiny differences forward, and surface outcomes that are shaped by context as much as by input. Asking them to produce the same answer every time is like demanding that a storm retrace its path because the barometer hasn't moved.
Benchmarks, however, still assume weather is a spreadsheet.
They score models as if:
- The input fully specifies the system state
- The output is a fixed function of that input
- Variation is noise, not information
That assumption worked when models were shallow, narrow, and brittle. It breaks when systems become dense enough that behavior emerges from interaction, not instruction.
The failure mode isn't that benchmarks give "wrong" numbers.
It's that they give confident numbers to the wrong question.
And once a system crosses into analog‑like behavior, precision scoring stops measuring intelligence and starts measuring compliance.
The Illusion of a Single Score
Why the "Best" Model Depends on Who's Asking
This section extends the argument introduced in Never Twice the Same Color — Part I — How Transformers Became Analog Again, where scale and density push deterministic systems into analog-like behavior.
Leaderboards are seductive.
They compress complexity into a single number. One score, one rank, one winner. They promise clarity in a noisy landscape and let us argue less about why something is good and more about who is on top.
That works when the thing being measured has a single objective.
Compilers optimize for correctness and speed. Databases optimize for throughput and latency. Image classifiers optimize for label accuracy. In those domains, a single scalar score can meaningfully summarize progress.
Large language models don't live in that world.
A single score assumes that "good" means the same thing to everyone. That there is one task, one definition of success, one axis of improvement. But LLMs are not tools with a fixed purpose—they are general cognitive surfaces that adapt to how they're used.
The hidden variable in every benchmark is the user.
Prompt style matters. Tolerance for verbosity matters. Creativity versus precision matters. Risk tolerance matters. Whether you want one-shot brilliance or iterative collaboration matters.
Two people can run the same benchmark and walk away preferring different models—not because one is wrong, but because they are optimizing for different things. Benchmarks treat this variation as noise. In reality, it's signal.
When we average scores across tasks, we aren't just averaging performance. We're averaging values.
That averaging hides the most important truth: there is no single "best" model in a system whose behavior is contextual, trajectory-dependent, and shaped by interaction.
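A toy example makes the point. The models, tasks, scores, and weights below are invented for illustration; the only thing the sketch shows is that the same score table produces different winners once tasks are weighted by what a particular user values.

```python
# Toy illustration: the same per-task score table crowns different "best" models
# once tasks are weighted by what a particular user values.
# Models, tasks, scores, and weights are invented for illustration.

scores = {
    "model_a": {"code": 0.95, "long_context": 0.55, "tool_use": 0.60},
    "model_b": {"code": 0.70, "long_context": 0.80, "tool_use": 0.75},
}

profiles = {
    "flat_leaderboard_average": {"code": 1 / 3, "long_context": 1 / 3, "tool_use": 1 / 3},
    "code_heavy_user":          {"code": 0.70, "long_context": 0.15, "tool_use": 0.15},
}

def rank(weights: dict[str, float]) -> list[tuple[str, float]]:
    """Weighted score per model, highest first."""
    totals = {
        model: sum(per_task[t] * w for t, w in weights.items())
        for model, per_task in scores.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for name, weights in profiles.items():
    print(name, rank(weights))
# The flat average puts model_b on top; the code-heavy profile puts model_a on top.
# Neither ranking is wrong. They encode different values.
```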
A leaderboard doesn't tell you which model thinks best. It tells you which model most closely matches the benchmark author's idea of "good."
And once systems become analog-like, sensitive to context, history, and sampling, insisting on a single score stops being scientific rigor and starts being a category error.
Constraint‑First Evaluation (A Field Note)
When I evaluate models locally, I don't start with leaderboards.
I start with constraints.
I have a single RTX 3090. Finite VRAM. Finite context length. Finite patience for latency. Real tooling requirements: structured output, long conversations, reliable function calls.
So the question I'm actually asking isn't "Which model is best?" It's "Which model is optimal under my constraints?"
That distinction matters.
A model that scores higher on a public benchmark but:
- stalls at long context
- degrades under quantization
- breaks tool schemas
- requires prompt babysitting
is not better in practice. It's just better somewhere else.
What I've found—repeatedly—is that the models that work best for me do not line up with published rankings. Not because the benchmarks are wrong, but because they're measuring a different reality. They assume different hardware, different latency tolerances, different objectives.
Constraint‑first evaluation flips the process around.
Instead of asking how close I can get to an abstract notion of "state of the art," I ask which model survives longest inside my actual workflow. Which one stays coherent across turns. Which one respects structure. Which one does useful work without constant correction.
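Here is roughly what that looks like as code, with invented model specs standing in for real checkpoints: hard constraints act as gates first, and only the survivors get ranked by anything resembling a leaderboard.

```python
# Sketch of constraint-first evaluation: hard constraints filter first,
# preferences rank second. The candidate entries are illustrative,
# not measurements of real checkpoints.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    vram_gb: float          # memory footprint at the quantization actually used
    max_context: int        # usable context length in tokens
    tool_call_pass: float   # fraction of local tool-schema tests passed
    benchmark_rank: int     # position on a public leaderboard (lower is "better")

CANDIDATES = [
    Candidate("model_x", vram_gb=22.0, max_context=32_768, tool_call_pass=0.97, benchmark_rank=5),
    Candidate("model_y", vram_gb=30.0, max_context=128_000, tool_call_pass=0.99, benchmark_rank=1),
    Candidate("model_z", vram_gb=18.0, max_context=16_384, tool_call_pass=0.80, benchmark_rank=3),
]

# Hard constraints for a single RTX 3090 workflow: these are gates, not scores.
def fits_my_constraints(c: Candidate) -> bool:
    return c.vram_gb <= 24.0 and c.max_context >= 32_768 and c.tool_call_pass >= 0.95

survivors = [c for c in CANDIDATES if fits_my_constraints(c)]

# Only now do preferences (or leaderboard position) break ties among survivors.
best = min(survivors, key=lambda c: c.benchmark_rank) if survivors else None
print([c.name for c in survivors], "->", best.name if best else "nothing fits")
# model_y tops the public leaderboard but never makes it past the VRAM gate.
```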
The "best" model, in that frame, isn't universal. It's personal. And it changes as constraints change.
That isn't a failure of evaluation. It's what evaluation looks like once systems stop behaving like machines and start behaving like environments.
Closing: From Color Drift to Cognitive Drift
NTSC failed because analog systems amplify small imperfections.
Digital computing succeeded because it eliminated them.
Large language models sit in between.
They are built on perfectly digital components, yet behave like analog systems once they cross a threshold of scale and density. Small differences propagate. Context reshapes outcomes. Repetition does not guarantee sameness.
"Never Twice the Same Color" explained why this happens.
This essay explains why our measurements haven't caught up.
We keep trying to evaluate modern LLMs with tools designed for deterministic machines—expecting identical outputs, stable rankings, and universal definitions of quality. But intelligence, once it becomes contextual, can't be flattened without losing what makes it useful.
You don't benchmark the weather. You don't score conversations with a ruler. And you don't measure emergent systems by pretending they haven't emerged yet.
The uncomfortable conclusion is also the liberating one:
The question isn't "Which model is best?" It's "Best for whom, under what conditions, and for what kind of thinking?"
That shift doesn't make evaluation impossible. It makes it honest.
And as NTSC taught us decades ago: when the signal gets rich enough, you stop calibrating the receiver and start listening to what it's actually saying.