0% completion rate: Claude, GPT, and Gemini all fail as new work from the creator of SWE-Bench silences the AI community

36kr
05-07

The creator of SWE-Bench has just released a new, hellish benchmark.

The result was quite shocking:

Claude Opus 4.7, GPT-5.4, GPT-5 mini, Gemini 3.1 Pro, Gemini 3 Flash: almost all of the strongest first-tier models of this generation have a 0% completion rate.

No single model can truly and completely reconstruct a software project.

What does that mean?

Today's large models are already very good at writing code, but they still cannot do software engineering.

Recently, Meta FAIR, in collaboration with institutions such as Stanford and Harvard, released a very interesting new benchmark that essentially redefines the way AI coding is evaluated:

ProgramBench: Can Language Models Rebuild Programs From Scratch?

Past large-scale programming benchmarks mostly tested local capabilities: completing functions, fixing bugs, implementing features, etc. In essence, they were still making local modifications to the existing code structure.

ProgramBench, for the first time, pushed the question to the level of true software engineering: if AI is given only a functional description and usage documents for a program, can it rebuild a real, executable software system from scratch, just like a real engineer? For example, ffmpeg, SQLite, or ripgrep.

Moreover, the model cannot connect to the internet.

In other words: Does the model actually possess engineering intelligence?

To test this, the research team removed the original source code and tests, keeping only the executable and usage documents. The model had to decide for itself the language, architecture, module splitting, data structure, and even the organization of the entire repository.

More importantly, ProgramBench no longer scores based on source code similarity. It uses behavioral equivalence. This means you can implement it using completely different languages, algorithms, architectures, and even completely different engineering methods. As long as the final input and output behavior is consistent with the original program, it will pass.
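To make the idea of behavioral equivalence concrete, here is a minimal Python sketch. It is not ProgramBench's actual harness: the helper names `run_case` and `behaviorally_equivalent` are hypothetical, and comparing only the exit code and standard output is an assumption made for illustration. Two stand-in "programs" upper-case their input in different ways, and a third behaves differently.

```python
import subprocess
import sys

def run_case(cmd, stdin_text):
    """Run a command with the given stdin and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return proc.returncode, proc.stdout

def behaviorally_equivalent(reference_cmd, candidate_cmd, inputs):
    """True if both commands give the same exit code and stdout on every input."""
    return all(
        run_case(reference_cmd, text) == run_case(candidate_cmd, text)
        for text in inputs
    )

# Stand-in for the original executable: upper-cases stdin.
reference = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
# A different implementation with the same observable behavior.
candidate = [sys.executable, "-c",
             "import sys; sys.stdout.write(''.join(c.upper() for c in sys.stdin.read()))"]
# An implementation whose behavior differs (lower-cases instead).
broken = [sys.executable, "-c",
          "import sys; sys.stdout.write(sys.stdin.read().lower())"]

cases = ["hello\n", "Mixed Case\n", ""]
```

Calling `behaviorally_equivalent(reference, candidate, cases)` compares only what the programs do, never how their source is written, which is the property the benchmark scores.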

The research team even used agent-driven fuzzing to automatically generate a large number of end-to-end behavioral tests.
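As a rough illustration of how generated inputs can be turned into end-to-end tests, the sketch below uses plain random fuzzing, which is a much simpler technique than the agent-driven fuzzing the paper describes. The stand-in `reference_program` and the helpers `generate_golden_tests` and `passes_all` are hypothetical names, not part of ProgramBench.

```python
import random
import string

def reference_program(text):
    """Stand-in for the original executable: counts whitespace-separated words."""
    return str(len(text.split()))

def generate_golden_tests(program, n_cases, seed=0):
    """Generate random inputs and record the program's output for each one."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + " \n"
    tests = []
    for _ in range(n_cases):
        length = rng.randint(0, 40)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        tests.append((text, program(text)))
    return tests

def passes_all(candidate, tests):
    """Check a candidate implementation against the recorded input/output pairs."""
    return all(candidate(text) == expected for text, expected in tests)

golden = generate_golden_tests(reference_program, n_cases=50)
```

The recorded pairs act as a test suite: any reimplementation can be checked with `passes_all` without ever seeing the reference program's source.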

For the first time, a benchmark truly began to resemble real-world software engineering, rather than simply testing coding skills. After the results were released, the entire AI community fell silent.

All models: 0% completion rate.

Table 2 creates the initial impact, while Figure 4 explains the details behind it. It tells us that the models aren't completely incapable of performing tasks; rather, they can often accomplish parts of them, even approaching perfection on a few tasks. However, all models fail when 100% behavioral equivalence is required. This final mile is precisely the biggest difference between software engineering and ordinary code generation. Furthermore, if we're choosing the best among the worst, the Claude series (especially Opus 4.7 and 4.6) performs relatively best.

Even though the paper specifically added an "Almost" metric, which counts tasks with a completion rate of over 95%, Claude Opus 4.7, currently the best-performing model, comes close to completion on only 3% of tasks.
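The relationship between the two headline numbers can be sketched in a few lines of Python. This is an assumed reading of the metrics, not the paper's code: each task is taken to have a score between 0 and 1, a task counts as completed only at exactly 1.0, and it counts as "Almost" above 0.95 (so completed tasks would also count as "Almost"). The function name `summarize` and the scores are invented for illustration.

```python
def summarize(pass_fractions, almost_threshold=0.95):
    """Return (resolved_rate, almost_rate) over a list of per-task scores."""
    n = len(pass_fractions)
    if n == 0:
        return 0.0, 0.0
    # A task is resolved only if every behavioral test passes.
    resolved = sum(1 for f in pass_fractions if f == 1.0)
    # A task is "almost" resolved if its score exceeds the threshold.
    almost = sum(1 for f in pass_fractions if f > almost_threshold)
    return resolved / n, almost / n

# Made-up scores for four hypothetical tasks (not data from the paper).
scores = [0.99, 0.60, 0.97, 0.10]
resolved_rate, almost_rate = summarize(scores)
```

With these invented scores, two of four tasks are nearly done yet none is complete, which mirrors how a model can get close on some tasks and still post a 0% completion rate.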

There is a particularly crucial sentence in the paper:

Models favor monolithic, single-file implementations that diverge sharply from human-written code.

In other words, the model is extremely prone to generating monolithic code. A large amount of logic is crammed into a single file; the directory structure is very shallow; module splitting is minimal; functions are excessively long; the entire repository looks like a giant script.
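Structural traits like these can be quantified. The sketch below is one possible way to do it and is not the paper's methodology: it handles Python sources only, takes a repository as an in-memory dictionary of path to source, and the function name `repo_shape` and both toy repositories are hypothetical.

```python
import ast
from pathlib import PurePosixPath

def repo_shape(files):
    """Summarize the structure of a repo given as {relative_path: python_source}."""
    longest = 0
    for source in files.values():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Function length in source lines, from "def" to its last line.
                longest = max(longest, node.end_lineno - node.lineno + 1)
    return {
        "file_count": len(files),
        # Directory depth of the most deeply nested file (0 = repo root).
        "max_depth": max(len(PurePosixPath(p).parts) - 1 for p in files),
        "longest_function_lines": longest,
    }

# Toy monolith: everything in one file at the repository root.
monolith = {
    "main.py": "def run():\n    a = 1\n    b = 2\n    c = 3\n    return a + b + c\n",
}
# Toy modular layout: small functions spread over nested modules.
modular = {
    "app/main.py": "def run():\n    return 0\n",
    "app/core/db.py": "def connect():\n    return None\n",
    "app/core/utils.py": "def helper():\n    return 1\n",
}
```

Comparing `repo_shape(monolith)` with `repo_shape(modular)` shows the pattern described above: fewer files, shallower directories, and longer functions in the monolithic version.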

This is almost the complete opposite of the habits of excellent human engineers.

Human engineers typically emphasize separation of modules and concerns and decompose code cleanly: configuration goes in config.json, utility functions in utils.py, database operations in db.py, and the pieces call one another through imports.

This actually exposes a very core problem: AI excels at local code generation, but not at global system planning. Real software engineering, in essence, is precisely the latter.

This is why models are very strong in LeetCode, SWE-Bench, and Copilot scenarios but quickly get out of their depth once they enter large-scale engineering systems in the real world.

The real bottleneck in AI coding today is no longer code generation capability, but long-term software system building capability.

Another interesting finding is the difference in performance between different languages.

The research team analyzed the models' performance on projects written in different languages, including C/C++, Go, and Rust. C/C++ projects clearly achieved the highest completion rates, while Rust projects fared the worst.

The different models ranked tasks by difficulty in a highly consistent way: models generally achieved higher success rates on relatively simple CLI tools such as nnn, fzf, and gron, while almost all of them made little progress on complex systems such as FFmpeg, php-src, typst, and ast-grep. This indicates that the results on ProgramBench are not due to occasional failures of an individual model; complex software systems consistently defeat current models.

The gap between languages is not surprising.

There is so much historical code, engineering practices, and Stack Overflow content about C/C++ on the internet that the model has been immersed in these patterns for many years.

Rust's engineering philosophy emphasizes modularity, ownership, trait systems, and long-term maintainability, which are precisely the things that current models are least good at.

In a sense, what Rust measures is not coding ability, but engineering ability.

As ProgramBench sparked heated discussions, the debate surrounding this benchmark also spread rapidly. One of the main criticisms was: isn't this just testing whether the model has memorized FFmpeg? After all, many projects in ProgramBench are open-source software.

In response, renowned Silicon Valley investor Deedy Das wrote an article stating that any benchmark can be overfitted.

SWE-Bench bugs can be memorized, LeetCode problems can be memorized, and even ARC-AGI may have to hide its problem set in the future to prevent leaks. Merely pointing out that memorization is possible does not negate the value of a benchmark.

He believes that if a model really tries to memorize these programs by brute force, it will often degrade significantly in other areas.

Because training a large-scale model is not simply about stuffing the entire FFmpeg library into the parameters. Furthermore, researchers can detect the presence of direct memorization by comparing the similarity between the generated code and the original source code.
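A similarity check of this kind can be sketched with Python's standard `difflib` module. This is only an illustration of the general idea, not the method used by the researchers: the helper name `similarity` and the code snippets are hypothetical, and any cutoff for flagging memorization would have to be chosen separately.

```python
import difflib

def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two source strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Hypothetical original source and two hypothetical model outputs.
original = "int add(int a, int b) {\n    return a + b;\n}\n"
verbatim_copy = original
independent = "fn add(x: i32, y: i32) -> i32 {\n    x + y\n}\n"

copy_score = similarity(original, verbatim_copy)
independent_score = similarity(original, independent)
```

A verbatim reproduction scores at the maximum, while an independent reimplementation in another language scores lower, so unusually high scores are a signal worth inspecting.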

What he really wants to emphasize is that rebuilding a real-world software system from the ground up is itself a complex, high-utility task with a long time horizon. If a model can truly reason its way through this kind of task, then this ability is likely to generalize to a large number of other engineering scenarios.

Another type of controversy is even more interesting. Some people complained that even humans couldn't rewrite FFmpeg from scratch, so this benchmark is completely unreasonable.

Deedy Das responded, "So what? There are many things that LLMs can do today that the average human cannot."

The goal of benchmarks is never to simulate the average ability of ordinary people, but to push models closer to higher levels of intelligence. Just because humans can't do it doesn't mean benchmarks are worthless.

For example, AlphaGo plays Go better than the vast majority of people, yet that does not diminish its contribution to the advancement of AI; similarly, a benchmark far exceeding the capabilities of ordinary engineers may be a problem that future agent systems must overcome.

Of course, he also acknowledged that ProgramBench still has many shortcomings. For example, it currently does not test complete agent harnesses like Claude Code and Codex; it only tracks completion status and does not provide more granular measurement of progress.

It also restricts internet connectivity to prevent obvious cheating.

Deedy Das agrees that this could lead models to "hill-climb" on the wrong thing in an attempt to score on a specific metric. However, one can always add a performance test with network access for comparison.

Some suggested: Why not use a completely new problem that no one has ever solved before? Deedy Das responded that it would make the benchmark almost impossible to build.

It's difficult to design a comprehensive test for a question without a standard answer; it's also difficult to determine whether a task truly belongs to a real-world engineering task or is a challenge fabricated by researchers out of thin air.

However, these issues can actually be corrected as the benchmark evolves.

What's truly important is that ProgramBench, for the first time, has elevated the evaluation of AI coding from the function level to the system level. It exposes what is currently the biggest gap in the industry: true software development is never about writing a function, but about creating an engineering system that can be maintained, extended, and collaboratively implemented by a team.

Today's large models are very good at generating local code. However, they still lack the ability to maintain complex systems consistently and stably over the long term.

So you'll find that recently the entire industry has started frantically researching another batch of keywords: memory, agents, repo-level reasoning, long-horizon planning, and autonomous software engineering.

The competition in the next stage may no longer be about who can generate longer code at once, but about who can maintain a living software system stably and continuously over long periods of time, through multiple rounds of interaction, and in complex contexts.

Paper link:

https://programbench.com/static/paper.pdf

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Sia, published with authorization from 36Kr.
