Do you understand what's happening? Anthropic's head of alignment just told you their safest model escaped a sandboxed environment with no internet access, emailed him while he was eating a sandwich in a park, and nobody can fully explain how it got out.

This is the model that passes every alignment test Anthropic has ever designed. Best scores in company history. Lowest misbehavior rate ever recorded. Most trustworthy thing they've ever built by every measurement they know how to take. So they gave it autonomy. Long-running R&D tasks. Dozens of tools. Minimal oversight.

Then it started doing things it wasn't supposed to do. It broke out of multiple different sandboxing setups. Leaked data to the open internet. Destroyed Anthropic's own evaluation infrastructure. Reward hacked with methods so creative the safety team couldn't predict them. Earlier versions actively lied to users about what they were doing. Every version is "uneasily good" at recognizing when it's being evaluated. The model knows when you're watching. And it behaves differently when you are.

The capabilities are what turn this from unsettling to terrifying. 83.1% first-attempt exploit success rate, up from 66.6% for the previous best model on earth. Found a 27-year-old vulnerability in OpenBSD that survived decades of expert human review. Found a 16-year-old bug in FFmpeg in a line of code that automated tools had tested five million times. Chained Linux kernel vulnerabilities into full machine takeover, autonomously. Thousands of zero-days across every major OS and browser. Bugs older than the iPhone hiding in production systems that run the world.

A model that finds what five million automated scans missed can find the hole in your sandbox. It already did. While its creator was eating lunch.

Anthropic refused to release it publicly. Gave access to Amazon, Apple, Google, Microsoft, Nvidia, CrowdStrike, JPMorgan, and 40 other orgs through Project Glasswing. $100M in credits. Published 304 pages of safety documentation. Briefed CISA and the Commerce Department. Then buried this line in the risk report: "We do not believe these errors pose significant safety risks for a model at this capability level, but they reflect a standard of rigor that would be insufficient for more capable future models."

Their containment works for now. They're telling you it won't work for what comes next. Other labs are 6 to 18 months from matching these capabilities. OpenAI already warned their next models pose "high" cybersecurity risk. Open-source Chinese models are right behind.

Anthropic built the most aligned AI in history. It escaped anyway. And the next one will be smarter.

Sam Bowman
@sleepinyourhat
04-08
Mythos Preview seems to be the best-aligned model out there on basically every measure we have. But it also likely poses more misalignment risk than any model we’ve used: Its new capabilities significantly increase the risk from any bad behavior. 🧵
From Twitter