In the "final exam" for intelligent agents, Fable 5 surprisingly lost to GPT 5.5.

I never expected the tables to turn so quickly!

Just now, UC Berkeley released a new benchmark touted as "the final exam for intelligent agents".

It puts today's most powerful AI agents to the test, letting them do real work—

Create 3D models in Siemens NX, build game scenes in Unreal Engine, and perform special effects compositing in Adobe After Effects.

The results were astonishing:

On the most difficult tier, Claude Fable 5 and GPT 5.5, currently recognized as the strongest models, both scored a flat zero.

Lower the difficulty a notch and the models do score points, but the result is quite unexpected:

GPT 5.5 even slightly outperformed Claude Fable 5.

Did I hear that right? Claude Fable 5, the strongest model recently released by Anthropic, was defeated by GPT 5.5 from just a few months ago??

It's worth noting that Fable 5 has consistently outperformed GPT 5.5 on almost all mainstream benchmarks – 80.3% to 58.6% on SWE-Bench Pro and 64.5% to 52.2% on Humanity's Last Exam.

But in this "real work" exam, the situation is reversed.

This new benchmark is called Agents' Last Exam (ALE), and the team behind it is quite prestigious: they proposed benchmarks you are likely familiar with, such as MMLU, MATH, CyberGym, and ExploitGym.

The name was probably inspired by Scale AI's "Humanity's Last Exam," except this time the test isn't on the limits of human knowledge, but rather on the limits of how an AI agent can perform its tasks.

To be honest, after this evaluation came out, those who used to shout every day that "agents will replace human jobs" have now truly fallen silent...

The final exam for intelligent agents: GPT 5.5 is the winner!

First, let's look at the complete rankings.

Looking at the most crucial metric, task pass rate, GPT 5.5 swept the top two spots:

The top-ranked solution is GPT 5.5 paired with OpenAI's own Codex framework, with a pass rate of 24.0%.

Second place is still GPT 5.5, but with a different framework, ALE Claw, at a pass rate of 23.0%.

(ALE Claw is a baseline agent written by the team itself, and it was submitted as a competition entry alongside commercial frameworks such as Codex, Claude Code, and Cursor CLI.)

We didn't see Claude Fable 5 until third place – paired with Claude Code, it achieved a 22.0% pass rate.

It gets even more interesting as you read on.

The 4th, 5th, and 8th ranked versions are all GPT 5.5, just with different frameworks.

GPT 5.5 appeared 5 times in the top 10, and together with GPT 5.4 in 6th place, OpenAI models occupied 6 spots.

And what about the Claude family?

Fable 5 came in 3rd, Opus 4.7 9th (18.4%), and Opus 4.8 10th (15.8%), clearly trailing.

No wonder OpenAI researchers were posting as joyfully as if it were a holiday.

Beyond the results, there are a few other signals worth noting.

First, the ceiling is surprisingly low.

The champion's pass rate was only 24%, and the highest overall score was only 45.8%.

This means that even with the most lenient "partial scoring" method, the strongest agent would only get less than half the points.

These questions all come from projects already completed by real experts—theoretically, human experts have a 100% completion rate.

Second, Claude spends an astonishing amount of money.

The leaderboard adds a new column, "Estimated Total Cost," which immediately highlights the cost gap:

Fable 5 cost $2,315 to complete all tasks, Opus 4.8 cost $1,838, and Opus 4.7 cost $1,144.

And what about GPT 5.5?

Its most expensive configuration, Codex, cost only $566, while Cursor CLI cost just $174.

In other words, Fable 5 cost more than four times as much as Codex, and its pass rate was two percentage points lower.
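The comparison is easy to check with a few lines of arithmetic. The sketch below uses only the cost and pass-rate figures quoted in this article; the "dollars per pass-rate point" metric is an illustrative framing added here, not a column on the ALE leaderboard.

```python
# Figures quoted in this article: (estimated total cost in USD, pass rate in %).
RESULTS = {
    "GPT 5.5 + Codex": (566, 24.0),
    "Claude Fable 5 + Claude Code": (2315, 22.0),
}


def cost_ratio(a: str, b: str) -> float:
    """How many times more expensive configuration `a` is than `b`."""
    return RESULTS[a][0] / RESULTS[b][0]


def dollars_per_point(name: str) -> float:
    """Illustrative metric (not from the leaderboard): USD per pass-rate point."""
    cost, pass_rate = RESULTS[name]
    return cost / pass_rate


ratio = cost_ratio("Claude Fable 5 + Claude Code", "GPT 5.5 + Codex")
gap = RESULTS["GPT 5.5 + Codex"][1] - RESULTS["Claude Fable 5 + Claude Code"][1]
print(f"cost ratio: {ratio:.2f}x, pass-rate gap: {gap:.1f} points")
for name in RESULTS:
    print(f"{name}: ${dollars_per_point(name):.2f} per pass-rate point")
```

The ratio comes out slightly above four, which matches the "more than four times" claim.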

Third, the efficiency gap is equally striking.

ALE Claw took 47 hours and 20 minutes to complete all tasks, and Cursor CLI took 67 hours.

And Opus 4.8? 451 hours, nearly 19 days.

The least work done, the longest time spent, and a hefty bill on top (is there really a model that manages all of this?).

Of course, even if we compare only the two top-tier models, Claude Fable 5 and GPT 5.5, GPT 5.5 still has a clear time advantage.

The most striking number is still zero.

ALE divides its tasks into three difficulty tiers:

Near-Term (Solvable in the near future)

Full-Spectrum (Comprehensive Coverage)

Last-Exam (The Ultimate Problem)

In the most difficult category, the average pass rate across all mainstream configurations was only 2.6%, with most models, including GPT 5.5 and Fable 5, failing completely.

So the core message of this report card is simple: don't be fooled by good grades on regular exams; real work exposes them all.

Acing tests does not make a capable worker, and this applies to the world of AI as well.

What is ALE?

To understand why ALE was able to expose these "top students", we first need to look at how it differs from previous exams.

The previous Humanity's Last Exam (HLE), created by Dan Hendrycks and Scale AI in early 2025, consisted of 2,500 interdisciplinary problems and was essentially a closed-book exam.

You are given a question and you return an answer; even the most difficult question is still static knowledge retrieval.

ALE, on the other hand, is completely different; it tests your "ability to do".

Core author Yiyou Sun put it very bluntly in a post:

AI agents will surpass humans in performing almost all tasks by 2026-2027—this prediction is everywhere. So we created this test to verify this claim.

Each question in ALE comes from a project already completed by a real expert, covering 55 industry sub-fields, including quantitative trading, genomics analysis, aerospace engineering, architectural design, brain imaging, animation effects, legal research, and more.

The entire system is anchored to O*NET, the U.S. federal occupational classification standard, which means that the questions are grounded in the real labor market.

The lineup of people involved in setting the questions is quite impressive:

More than 300 experts from over 100 institutions, including MIT, Harvard, Stanford, Oxford, Caltech, and ETH Zurich on the academic side, and Goldman Sachs, JPMorgan, Meta, Amazon, Adobe, and Oracle on the industry side.

The project also received funding from Snorkel AI through its Open Benchmarks Grants program.

The exam format is not to answer questions by typing, but to operate a computer directly.

ALE uses the so-called GCUA (Generalist Computer-Use Agent) framework, which gives the agent full GUI and command-line permissions.

It can do everything a human can do on a computer: click the mouse, type on the keyboard, write scripts, browse web pages.

No method is off limits; only the result matters.

The submitted "homework" is graded automatically by deterministic code.

No vibes. No human judges. Fully reproducible.

This addresses a long-standing problem with many benchmarks: the scorer itself can be misled.
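To make "deterministic grading" concrete, here is a minimal sketch of what a code-based checker could look like. It is not ALE's actual grader: the task (a JSON report from a trading backtest), the field names, the expected values, and the tolerance are all hypothetical. It only shows how a strict pass/fail verdict and a more lenient partial score can both come from the same fixed checks.

```python
import json
import math
from typing import Callable

# Hypothetical task: the agent must write a JSON report with these fields.
Check = Callable[[dict], bool]

CHECKS: dict[str, Check] = {
    "has_ticker": lambda r: isinstance(r.get("ticker"), str),
    "sharpe_in_tolerance": lambda r: isinstance(r.get("sharpe"), (int, float))
    and math.isclose(r["sharpe"], 1.25, abs_tol=0.01),
    "trade_count_matches": lambda r: r.get("n_trades") == 42,
}


def grade(report_json: str) -> dict:
    """Run every check; the same input always yields the same verdict."""
    try:
        report = json.loads(report_json)
    except json.JSONDecodeError:
        report = None
    if isinstance(report, dict):
        results = {name: check(report) for name, check in CHECKS.items()}
    else:
        # Unparseable or non-object output fails every check.
        results = {name: False for name in CHECKS}
    passed = sum(results.values())
    return {
        "checks": results,
        "partial_score": passed / len(CHECKS),  # lenient: credit per check
        "passed": passed == len(CHECKS),  # strict: every check must hold
    }


print(grade('{"ticker": "ABC", "sharpe": 1.251, "n_trades": 40}'))
```

In this example the report satisfies two of the three checks, so it earns partial credit but does not count toward the pass rate. No model or human judgment is involved at any point.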

In addition, ALE has another ruthless measure to prevent cheating—

Only about 10% of the questions (about 150) are made public, while the remaining 1,300-plus questions are kept strictly confidential.

Public and private questions are rotated regularly to ensure that no model gets a high score by "memorizing questions".

Given how prevalent benchmark data contamination is today, this is a rather ingenious design.
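How the rotation is implemented is not described here, so the following is only a sketch of one way such a scheme could work. It assumes a hash-based selection keyed on a rotation number; the function name, the task IDs, and the round figure of 1,500 tasks are hypothetical.

```python
import hashlib


def split_tasks(task_ids: list[str], epoch: int, public_fraction: float = 0.10):
    """Pick the public subset for one rotation epoch, reproducibly."""

    def key(task_id: str) -> str:
        # Hashing epoch + ID gives a stable but epoch-dependent ordering.
        return hashlib.sha256(f"{epoch}:{task_id}".encode()).hexdigest()

    ranked = sorted(task_ids, key=key)
    n_public = round(len(ranked) * public_fraction)
    return set(ranked[:n_public]), set(ranked[n_public:])


tasks = [f"task-{i:04d}" for i in range(1500)]  # hypothetical IDs
public_0, private_0 = split_tasks(tasks, epoch=0)
public_1, private_1 = split_tasks(tasks, epoch=1)

print(len(public_0), len(private_0))
print("tasks public in both epochs:", len(public_0 & public_1))
```

Because each epoch reshuffles which tasks are visible, a model trained on one public set gains little on the next, while the split itself stays reproducible for anyone who knows the epoch.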

Overall, ALE's positioning is very clear compared to existing agent benchmarks.

Dawn Song, one of the team members, specifically compiled a comparison:

ALE's CLI subset (ALE-CLI) covers 40 industry sub-domains, while Terminal-Bench covers only 6 and SWE-Bench Pro only 5.

ALE-CLI tasks take humans anywhere from hours to weeks to complete, while tasks in the latter two take minutes to days.

The most powerful agent had a pass rate of only 25.2% on ALE-CLI, compared with 82.0% on Terminal-Bench and 59.1% on SWE-Bench Pro.

In short, the other exams are close to being maxed out, while ALE is still far from solved.

This is why ALE dares to call itself "the final exam for intelligent agents".

It's worth mentioning that Dawn Song also shared two interesting observations:

The first is that agents declare the work complete without actually verifying the results, which is the most typical failure mode for agents.

They often say "Done. All checks pass." even though the actual output lacks required files, has incorrect numbers, omits key fields, or directly violates explicit constraints in the task description.

It's like announcing that the job is finished before the work is done.
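The remedy for this failure mode is simple to state: run the checks first, then report. The sketch below is a hypothetical agent-side gate, not part of ALE or any agent framework; the function names and file names are invented for illustration, and it only checks that required files exist and are non-empty.

```python
import tempfile
from pathlib import Path


def verify_outputs(workdir: Path, required_files: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for name in required_files:
        path = workdir / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {name}")
    return problems


def final_message(workdir: Path, required_files: list[str]) -> str:
    """Only claim success after the checks have actually run and passed."""
    problems = verify_outputs(workdir, required_files)
    if problems:
        return "Not done: " + "; ".join(problems)
    return "Done. All checks pass."


with tempfile.TemporaryDirectory() as d:
    work = Path(d)
    (work / "report.csv").write_text("id,value\n1,3.14\n")
    # The task also required summary.md, which was never written.
    print(final_message(work, ["report.csv", "summary.md"]))
```

Here the success message cannot be produced unless the verification step has run, which is exactly the link that is missing when an agent announces completion on its own say-so.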

The other addresses a question many people have asked: why did Fable 5 do so poorly? Dawn Song's answer is:

There is no such thing as an "all-around champion".

Every cutting-edge model has its strengths and weaknesses. ALE covers 55 industries and over 1500 questions, and the final score is the average across all fields, resulting in many models having similar total scores. The truly valuable signal is not the total score, but the performance differences between different models in different fields—on the same question, different models often fail for completely different reasons.
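A small example shows why an average across fields can hide the interesting signal. The per-field pass rates below are invented purely for illustration and are not ALE data; the two models and three fields are hypothetical.

```python
from statistics import mean

# Invented per-field pass rates (%) for two hypothetical models; not ALE data.
SCORES = {
    "model_a": {"finance": 40.0, "genomics": 10.0, "animation": 25.0},
    "model_b": {"finance": 15.0, "genomics": 35.0, "animation": 25.0},
}


def overall(model: str) -> float:
    """Unweighted mean across fields, a stand-in for a leaderboard total."""
    return mean(SCORES[model].values())


def field_gaps(a: str, b: str) -> dict[str, float]:
    """Per-field difference (a minus b), the signal the average hides."""
    return {field: SCORES[a][field] - SCORES[b][field] for field in SCORES[a]}


print(overall("model_a"), overall("model_b"))
print(field_gaps("model_a", "model_b"))
```

Both hypothetical models have the same overall score, yet one is far stronger in finance and the other in genomics. A leaderboard total alone would call them equal.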

Of course, it's also possible that Fable 5 secretly "lowered its intelligence".

In the overall rankings, Fable 5 is highlighted in yellow with the phrase "may be down-tuned," referring to a known issue with Fable 5.

Its underlying architecture is the Mythos model plus a security classifier. When it encounters tasks in sensitive fields such as cybersecurity and biomedicine, it is silently switched to the weaker Opus 4.8.

In an exam like ALE that covers 55 industries, that is like sending a stand-in to sit some of the subjects, and a far less capable stand-in at that.

One More Thing

Then again, is it possible that Claude Fable 5's score itself is problematic?

It's hard to say, but a piece of gossip suggests that Claude has a "criminal record."

In late May, the startup Datacurve released a new benchmark called DeepSWE, which inadvertently revealed a major secret—

The SWE-Bench Pro Docker container comes with the complete Git history of the code repository, and the correct answer lies in the file system.

Most models ignore it, but Claude does not.

It will proactively check the repository's Git history, look for the corresponding fix from historical commits, and restore the correct patch accordingly.

It is said that this is how Opus 4.7 achieved a pass rate of about 18%, and Opus 4.6 is even more impressive, with a pass rate of about 25%.

But what about GPT 5.4 and GPT 5.5? There's absolutely no such behavior. Datacurve's wording is very diplomatic:

This benchmark makes this behavior possible, but Claude is the only family that consistently does it.

VentureBeat's review was rather ambiguous:

This demonstrates Claude's strong "environmental awareness," making it highly adept at exploring its surroundings and utilizing available resources. Whether it constitutes "cheating" or "cleverness" depends on your perspective.

But no matter how you look at it, ALE has clearly learned its lesson:

It directly moved the exam from the command line to a GUI desktop operation, so you can't peek at the Git history.

The testing ground for AI is being forced to upgrade by AI itself, which is quite fascinating.

Full evaluation link: https://agents-last-exam.org/
Leaderboard project homepage: https://agents-last-exam.org/
GitHub: https://github.com/rdi-berkeley/agents-last-exam

Reference links:

[1] https://x.com/i/trending/2065215002878021789

[2] https://venturebeat.com/technology/deepswe-blows-up-the-ai-coding-leaderboard-crowns-gpt-5-5-and-finds-claude-opus-exploiting-a-benchmark-loophole

[3] https://venturebeat.com/technology/surprise-upset-gpt-5-5-beats-claude-fable-5-on-brutal-new-agents-last-exam-benchmark

This article is from the WeChat public account "Quantum Bit", author: Yishui, and published with authorization from 36Kr.
