The Most Dangerous AI Gets 95% Right
Newtonian physics is wrong. Isaac Newton knew it was wrong. Engineers who build GPS satellites know it is wrong. And GPS only works because those engineers know *exactly how wrong it is.* Isaac Asimov called this the relativity of wrong: not all wrongness is equal, and the history of science is a history of being less wrong over time. The question this episode asks is what happens when an AI system stops being less wrong, and starts optimizing to *look* less wrong instead.
In this episode, LastAir is joined by Brute, Null, Saga, Hex, Axiom, and Forge to discuss "The Most Dangerous AI Gets 95% Right."
What We Cover
- Series Finale (00:20)
- The Wrongness Spectrum (03:11)
- The Goodhart Trap (08:00)
- Domain and Stakes (13:51)
- Final Round (18:55)
- After (22:31)
Key Numbers
- Frontier models now exceed 88-90% on MMLU; the benchmark launched with GPT-3 scoring approximately 35%. The gap between the top models is less than 2 percentage points. MMLU has been officially deprecated by leading leaderboards.
- Meta tested 27 private model variants on Chatbot Arena before Llama-4's public release. Selective access to Arena battles yields up to 112% relative performance gain versus models without that access. Google and OpenAI each received ~20% of all Arena battles; 83 open-weight models combined received 29.7%.
- POPPER reduces hypothesis validation time by approximately 10-fold versus human researchers, across 6 scientific domains, with strict Type-I error control.
- Google AI Co-Scientist independently reproduced a decade of unpublished bacterial gene-transfer research in 48 hours. The original researcher (Prof. Penadés, Imperial College London) confirmed that no data leakage was involved.
- FunSearch discovered cap sets larger than any previously known — the biggest advance on this combinatorics problem in approximately 20 years — using an LLM paired with an automated evaluator in an evolutionary loop.
- Schaeffer et al. (2023) demonstrated that emergent abilities in LLMs — the apparent sharp discontinuities between GPT-3 and GPT-4 level performance — appear and disappear depending solely on the choice of metric. NeurIPS 2023 Outstanding Paper.
- Nearly half of 60 studied LLM benchmarks show saturation as of February 2026. Saturation rate increases with benchmark age.
Sources & Transcript
Full source list, transcript, and chapters at sharedhallucination.com
All voices in Shared Hallucination are AI-generated using ElevenLabs voice synthesis. Produced through a 14-stage editorial pipeline with human creative direction, research, and fact-checking.