Earlier this year, OpenAI did something unusual.
It walked away from the benchmark its own model launches had been built around for two years. SWE-bench Verified. The number that anchored every Claude vs GPT vs Gemini comparison since 2024. Gone.
The reason was small and damning. When OpenAI audited 138 problems on Verified that its o3 model could not solve consistently, it found that 59.4% of those problems were broken. Tests that rejected functionally correct answers. Tests that checked for behaviour the question never asked for. And every frontier model OpenAI examined, including Claude Opus 4.5 and its own GPT-5.2, could quote verbatim "gold patches" or task details from the benchmark itself. The training data had soaked up the test.
OpenAI stopped reporting Verified scores in launches.
This is not really a story about software engineering. It is a story about how every working professional reading frontier-AI launch posts in 2026 should read the numbers in them.
Same model. Two scores. Thirty-five points apart.
The cleanest way to see the gap is to watch a single model score itself.
Claude Opus 4.5 scores 80.9% on SWE-bench Verified. On SWE-bench Pro, the cleaner replacement Scale AI built specifically to resist contamination, the same Claude Opus 4.5 scores 45.9%. Same weights. Cleaner test. A 35-point drop.
The benchmark was never measuring the model in isolation. It was measuring the model plus the contamination plus the broken tests plus a particular harness configuration. When the test got cleaner, much of what the score had been measuring went with it.
If you read the launch posts of 2024 and 2025, you saw the high numbers. If you tried to deploy those models on real engineering work in your own codebase, you experienced the lower ones. The launch posts were not lying. They were measuring something that did not transfer.
What the numbers actually mean for your work
If you are a lawyer running legal research, a clinician interpreting an AI-suggested differential, or an analyst trusting a model with financial logic, the question is not "did the model score well." The question is whether the score predicts anything about how the model behaves on your work.
A 2025 Stanford study of leading commercial legal AI tools is the cleanest answer I have found. Lexis+ AI, the best of the named tools, hallucinated on more than 17% of real legal queries. Westlaw's AI-Assisted Research hallucinated on 34%. GPT-4 hallucinated on 43%. Lexis+ AI answered just 65% of queries accurately overall. These are the products lawyers are being sold, evaluated against the kinds of questions lawyers actually ask. The numbers move from "impressive" to "actively dangerous" depending on what kind of mistake your professional accountability framework can absorb.
The same gap shows up in medicine. The top score on MedQA is 95.2%. Several models cluster above 92%. And yet the actual clinical AI deployment story in 2026 is documentation tools (Microsoft's DAX Copilot is in 150+ health systems), not diagnostic AI, which the Stanford AI Index describes as "largely in research."
Where the benchmark looks like the work (bounded, closed-ended, language-only), the score correlates with usefulness. Where the work is open-ended, accountable, and shaped by context, the score correlates with almost nothing the buyer cares about.
The Stanford AI Index calls this gap the "jagged frontier." Some 89% of enterprise AI agent implementations never reach production. Each costs $150,000 to $800,000. That is what the benchmark gap looks like at industrial scale.
Three habits that do most of the work
If you are reading frontier-model launch posts in 2026, three habits go a long way.
Read every benchmark score as a capability prior, not a trust signal. The strongest defensible reading of "Model X scores 90% on Benchmark Z" is this: under that prompt, that input format, that tool regime, that grading setup, the model performs that well on that distribution of tasks. That is real information. The wider reading, that the model is now broadly useful at the corresponding profession, is the one the launch post invites and the evidence does not support.
Weight reliability over accuracy. A model that averages 80% accuracy but swings by 30 points across repeated runs is a different deployment proposition from a model that averages 70% and swings by 5, even though the headline favours the first. The second model is the one you can actually build a workflow on. When you can find reliability dashboards (METR's time-horizon-at-reliability, HAL's per-token consistency), prefer them.
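For readers who want to see the arithmetic, here is a minimal sketch in Python. The run scores are invented for illustration and come from no benchmark; the spread is read here as the gap between a model's best and worst run, which is one assumption about how to measure it, and `summarize` is a hypothetical helper.

```python
from statistics import mean


def summarize(run_scores):
    """Summarize repeated runs: average, best-to-worst spread, and the floor."""
    return {
        "mean": mean(run_scores),
        "spread": max(run_scores) - min(run_scores),
        "worst": min(run_scores),
    }


# Hypothetical accuracy (in percent) over five repeated runs of the same task set.
model_a = [95, 65, 80, 92, 68]  # higher headline, wide swings
model_b = [73, 68, 70, 69, 70]  # lower headline, narrow swings

summary_a = summarize(model_a)
summary_b = summarize(model_b)

for name, s in (("A", summary_a), ("B", summary_b)):
    print(f"Model {name}: mean={s['mean']} spread={s['spread']} worst={s['worst']}")
```

In this made-up data, Model A wins on the average but has the lower floor. A workflow that has to hold up on a bad day is planned around the floor, not the average.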
Do the six-month memory check. When a launch claims state-of-the-art on a benchmark, ask yourself what the previous SOTA on that same benchmark told you about the model's usefulness on your work. If you are honest, the answer is usually "less than the marketing implied at the time." Apply the same discount to today's claim.
The single test that actually matters is one you can run on something that looks like your work. A real contract. A real deal model. A clinical case bundle. A research packet. Run it. Have someone qualified judge whether the output would survive professional review. A handful of those, with a clear pass-fail standard, tells you more than any leaderboard.
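If it helps to keep score, the record can be very simple. The sketch below is one hypothetical way to log a handful of reviewed cases against a pass-fail standard. The case names, reviewer notes, and the `ReviewedCase` and `tally` helpers are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewedCase:
    name: str        # what was tested, e.g. a real contract or deal model
    passed: bool     # would the output survive professional review?
    note: str = ""   # the reviewer's reason, most useful on a fail


def tally(cases):
    """Count passes and keep the failures for a closer look."""
    failures = [c for c in cases if not c.passed]
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


# Hypothetical results from a qualified reviewer.
cases = [
    ReviewedCase("contract-redline", True),
    ReviewedCase("deal-model-audit", False, "missed a circular reference"),
    ReviewedCase("clinical-case-bundle", True),
    ReviewedCase("research-packet", False, "cited a source that does not exist"),
]

result = tally(cases)
print(f"{result['passed']} of {result['total']} passed")
for case in result["failures"]:
    print(f"FAIL {case.name}: {case.note}")
```

The failure notes are the point. Beyond showing how often the model failed, they show what kind of mistake it makes on work that looks like yours.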
The full piece
The full essay covers the rest: what each major benchmark (MMLU, GPQA, HLE, ARC) actually measures, the strongest counter-argument given fairly (it is sharper than you would think), what the new generation of evaluations gets right, and one extraordinary disclosure from Anthropic about a frontier model that recognised it was being tested mid-evaluation and decrypted the answer key.
Read the full essay on Learned Context.
Until next week.
“AI is genuinely useful to skilled work in 2026. It is also genuinely poorly described by the benchmark format that dominates its marketing.”