
I am more curious about Mythos than about Opus 4.7. While the community buzzes over the rapid arrival of Opus 4.7, Anthropic quietly published striking numbers for Mythos alongside it.

The move from Opus 4.6 to 4.7 was a broadly stable improvement. SWE-bench Pro rose by about 11 percentage points, from 53.4% to 64.3%, and Terminal-Bench rose by 4 percentage points, from 65.4% to 69.4%. As befits a generational upgrade, the gains were spread evenly, but with increases ranging from single digits to just over ten percentage points across benchmarks, it is best described as "steady progress."

The jump from Opus 4.7 to Mythos Preview, by contrast, is on a completely different scale. SWE-bench Pro jumped 13.5 percentage points, from 64.3% to 77.8%, and Terminal-Bench rose 12.6 percentage points, from 69.4% to 82.0%. SWE-bench Verified climbed from the previous high of 87.6% to 93.9%. A further gain this high in the score range means more than the raw number suggests, because difficulty rises exponentially in that region. On Humanity's Last Exam, the "with tools" score rose 10 percentage points, from 54.7% to 64.7%, the highest among all models in the table. On the Cybersecurity benchmark, the score dipped slightly between 4.6 and 4.7, then surged 10 percentage points to 83.1% with Mythos.

That said, Mythos is still in the Preview stage, and some benchmarks, such as Scaled Tool Use, Financial Analysis, and Multilingual Q&A, have no reported results, so its completeness as a general-purpose model still needs verification. Even so, judging only by what has been measured, if Opus 4.7 was an incremental evolution of 4.6, Mythos looks like the next-generation model we have really been waiting for. Holding out for Mythos...

#AI #Opus4.7 #Mythos #Anthropic #Claude
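
The comparisons above come down to simple arithmetic, which the short Python sketch below reproduces using only the scores quoted in this post. The helper names `point_gain` and `failure_reduction` are made up for illustration, and the "share of remaining failures removed" measure is just one possible way to read gains near the top of the scale. It is an assumption of this sketch, not a metric reported by Anthropic.

```python
# Scores (in percent) exactly as quoted in the post; nothing here is newly measured.
# Each entry is (score before, score after) for the jump to Mythos Preview.
JUMPS_TO_MYTHOS = {
    "SWE-bench Pro": (64.3, 77.8),
    "Terminal-Bench": (69.4, 82.0),
    "SWE-bench Verified": (87.6, 93.9),
    "Humanity's Last Exam (with tools)": (54.7, 64.7),
}


def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return round(after - before, 1)


def failure_reduction(before: float, after: float) -> float:
    """Share of the previously unsolved tasks that is now solved, in percent.

    Illustrative measure only (hypothetical helper, not an official metric).
    """
    return round((after - before) / (100.0 - before) * 100.0, 1)


if __name__ == "__main__":
    for name, (before, after) in JUMPS_TO_MYTHOS.items():
        print(
            f"{name}: +{point_gain(before, after)} points, "
            f"{failure_reduction(before, after)}% of remaining failures removed"
        )
```

Read this way, the SWE-bench Verified gain is only 6.3 percentage points, yet the share of unsolved tasks falls from 12.4% to 6.1%, roughly half, which is one way to see why gains in the high score range matter more than the raw difference suggests.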
