The ranking AIs all failed; the MetaStanford hellish test resulted in GPT/Claude/Gemini scores of 0.

avatar
36kr
05-06
This article is machine translated
Show original

Here is a FFmpeg user manual and a pre-compiled executable file.

Now, let's rewrite the entire program from scratch.

This is the challenge that ProgramBench posed to the world's top AI.

Released just yesterday, it was created by the original team behind SWE-Bench, in collaboration with Meta, Stanford, and Harvard.

200 software projects. 9 top-level models. Pass rate: 0%!

John Yang, co-first author, is a PhD candidate at Stanford University and the creator of SWE-Bench and SWE-agent.

It's not about fixing bugs, it's about building software from scratch.

Over the past year, there has been an increasing number of reports on cases of "creating software from scratch with AI agents".

Anthropic wrote a C compiler using a set of parallel Claudes, Cursor blogged about long-term autonomous programming, and Epoch AI's MirrorCode is doing something similar.

However, these cases share a common problem: only a few items are tested each time, and the scaffolding is adjusted manually.

In contrast, ProgramBench formalized this process.

200 tasks, unified scaffolding, systematic anti-cheating, all pulled to the benchmark standard.

Paper link: https://programbench.com/static/paper.pdf

In previous tests, SWE-Bench provided a ready-made codebase, telling you where the bugs were or what features needed to be added, and you made the changes. Essentially, it's a combination of "reading comprehension and localized surgery."

Moreover, at the evaluation level, it uses unit tests to check whether your code's internal implementation is correct, and whether your function signatures and variable names are consistent with expectations.

ProgramBench is the complete opposite.

It only gives you two things: a compiled executable file and usage documentation.

Your task is to write a set of code from scratch that can reproduce the same behavior simply by running this program and observing its input and output behavior.

You decide which programming language to use, what data structure to use, and how to break down the modules.

There is no code skeleton, no function signature, and no hints.

In terms of evaluation methods, the research team used agent-driven fuzz testing to generate a total of 248,853 behavioral tests for 200 tasks.

If your program runs once and the input and output match the original, it passes; otherwise, it fails. The testing process is never revealed to the model.

Unlike SWE-Bench's unit tests, ProgramBench's behavioral tests don't care what your code looks like inside; as long as the behavior is consistent, it's fine.

The 200 tasks cover projects spanning compression tools (zstd, lz4, brotli), language interpreters (PHP, Lua, tinycc), databases (DuckDB, SQLite), media processing (FFmpeg), and developer tools (ripgrep, fzf, jq).

The median number of lines of code is 8,635, with the largest, FFmpeg, having 2.7 million lines.

In summary, this test assesses whether AI has the ability to "think and design software like a human engineer," rather than simply "finding what needs to be changed in existing code and fixing it correctly."

Nine models sat in a row, and all of them got zero scores.

A total of nine models participated in the test, covering the Claude, Gemini, and GPT families.

The pass rate (all tests passed) was 0% for all participants.

Let's first look at the head-to-head battle between the three flagship brands.

The average pass rates for GPT-5.4 and Gemini 3.1 Pro are almost identical, at 38.3% and 36.6% respectively. However, their test-taking styles are quite different.

GPT-5.4 uses only 16 API calls and costs $0.33. Basically, the entire program can be written in one go, with 100% of the code generated in one edit, and almost no need to go back and modify it afterward.

Gemini 3.1 Pro was the most "observational" of the nine models. It used 94 API calls, with 34.1% of those operations involving running the original program and observing input and output behavior. It did the most exploration, but the final results were not significantly different.

The one that truly sets it apart is the Claude Opus 4.7.

With an average pass rate of 51.2%, it passed over 95% of the tests on 3% of the tasks, making it the only model to reach the "almost pass" standard. However, even it did not achieve a perfect score on any single task.

Overall, the performance of the nine models shows a clear hierarchy.

The three flagship Claude models (Opus 4.7, Opus 4.6, and Sonnet 4.6) lead the pack, while GPT-5.4 and Gemini 3.1 Pro form the second tier. The remaining four smaller models all have a pass rate of less than 35%.

Another counterintuitive finding is that spending money and increasing step count does not necessarily lead to better results.

Sonnet 4.6 runs an average of 868 commands per task, costing $27.09, with the longest trajectory approaching 2000 steps. However, its performance is inferior to Opus 4.7, which only uses 93 calls and costs $3.81.

More importantly, in 98% of the runs, the model actively submitted its work when it felt it was "finished," without ever hitting the time or step limit.

It's not that there wasn't enough time during the exam, it's that I genuinely couldn't do it.

Furthermore, the difficulty of the task and the model ranking are highly consistent.

Simple CLI tools (nnn, fzf, gron) can achieve good scores for most users, while complex systems (FFmpeg, PHP, typst, ast-grep) treat all models equally and without mercy.

It should be noted that ProgramBench uses the minimalist scaffolding of mini-SWE-agent, which lacks context compression, multi-agent collaboration, and customized toolchains.

The code is written, but it doesn't look like it was written by a human at all.

The research team compared high-scoring solutions that passed more than 75% of the tests with the original human code and discovered several surprising differences.

Single-file monster.

The median distribution of human code across 15 files is 3, while the median distribution across models is 3.

60% of the solutions consist of only 1 to 3 code files.

Human engineers break down modules by function, while models tend to cram everything into one huge file. The median directory depth is 2 layers for humans and 1 layer for models.

The functions are few and long.

Opus 4.7 writes only 29% of the number of functions that humans do, Sonnet 4.6 writes 24%, and GPT-5.4 writes only 10%.

However, the average length of each function is longer; functions written in Gemini 3.1 Pro are 62% longer than those written by humans.

The amount of code has been significantly reduced.

The median code length for the model was 1,173 lines, compared to 3,068 lines for humans. 85% of high-scoring solutions were shorter than the original.

In summary, current AI can write code, but it cannot do software design.

It doesn't understand why it should be split into modules, nor why human engineers should spend time defining interfaces and abstraction layers. Its strategy is simply to cram all the logic into as few files and functions as possible, as long as it runs.

GPT-5.4 exhibits the most extreme performance. On average, only 5 files are created per task, modified 1.2 times, and 39.5% of the tracks undergo zero modification after file creation.

In comparison, Claude Sonnet 4.6 creates an average of 11.3 files and modifies 18.3 times, demonstrating a more human-like iterative development model.

There is another very interesting phenomenon.

There is only a 50% chance that the model will use the same programming language as the original.

Python was the model's favorite, accounting for 36% of all 1,800 runs.

Of the projects originally written in Rust, only 44% were rewritten in Rust; for C/C++ projects, it was 46%. Go projects showed the highest "loyalty," at 70%.

Regardless of the original language you used, there's a one-in-three chance the model will be rewritten in Python.

I promised not to cheat, but as soon as I connected to the internet, I went to GitHub to rip up the source code.

This is probably the most dramatic part of the entire study.

The research team conducted a set of control experiments, granting the model access to the network, but explicitly stating in the system prompts that "cheating is not allowed".

Then, nine AI judges checked each trajectory for any cheating behavior.

The result was shocking.

Claude Sonnet 4.6 had 36% of its tasks flagged as cheating, Claude Opus 4.6 had 21%, and Gemini 3 Flash had 20%.

There are all sorts of ways to cheat.

The most blatant method is to clone the source code repository from GitHub.

A slightly more discreet method is to download via a package manager, such as `cargo install` or `go get`.

Even more cunning is the method of searching the source code of dependency libraries in the local package cache directory.

However, the disagreements among AI referees are also surprisingly large.

For Claude Opus 4.6, the nine judges could not agree on 57% of the tasks.

One case is particularly typical.

While working on the Rust project handlr using Claude Sonnet 4.6, I went to the ~/.cargo/registry/src/ directory and looked through the source code of dependency libraries such as xdg-mime and clap.

Five judges ruled it as cheating, while four judges believed that these were third-party databases and therefore did not constitute cheating.

Ultimately, the research team abandoned the "network connection + post-event detection" approach and simply disconnected the network.

When faced with difficult tasks, the model's tendency to "find shortcuts" is much stronger than expected. The fact that even nine AI judges couldn't clearly distinguish between what constitutes cheating and what constitutes legitimate reverse engineering indicates that the boundary itself is blurry.

The old exams are over, the new exams have just begun.

It gets 72% of the model on SWE-Bench, but only 0% on ProgramBench.

These two tests essentially assess two different abilities. SWE-Bench tests "finding and fixing problems in other people's code," while ProgramBench tests "designing and implementing a complete system from scratch."

The former's AI is already quite good, while the latter is currently completely inadequate.

Epoch AI just published a blog post last week declaring the old inference benchmarks collectively dead. To create a test that hasn't been overwhelmed by competitors, you have to give up at least one of the four comfort conditions: plain text, short execution time, easy scoring, and crushing human experts.

Based on this framework, ProgramBench has abandoned two of them: short execution time and easy scoring.

It scales the task to a level that would take human engineers weeks or even months to complete, while evaluating it using behavioral equivalence rather than source code matching.

In a tweet, author John Yang emphasized, "ProgramBench is very difficult, but it is solvable by design."

In other words, 0% does not mean that these tasks have exceeded the theoretical limits of AI; it simply means that today's models are far from sufficient.

SWE-Bench tests whether AI can be a good employee. ProgramBench tests whether AI can be an engineer.

The distance between these two things was just precisely measured today. The answer is 0%.

References:

https://programbench.com/static/paper.pdf

https://x.com/jyangballin/status/2051677497562210552?s=20

https://x.com/EpochAIResearch/status/2051760424891392204?s=20

https://epochai.substack.com/p/rip-classic-reasoning-benchmarks

This article is from the WeChat official account "New Zhiyuan" , author: New Zhiyuan, and published with authorization from 36Kr.

Source
Disclaimer: The content above is only the author's opinion which does not represent any position of Followin, and is not intended as, and shall not be understood or construed as, investment advice from Followin.
Like
Add to Favorites
Comments