GPT-5.5 achieves a world first: programming blind with zero source code, ushering in a new era for AI coding.


[Introduction] GPT-5.5 has cracked the hellish benchmark on which no AI had delivered a single solution! Starting from zero source code and programming blind, it scored a perfect pass on its first solved task by maxing out inference compute. Traditional code benchmarks are becoming obsolete; the compute race toward ASI has officially begun.

The "hellish" programming challenge has finally been solved by AI!

Today, GPT-5.5 recorded the first solve on ProgramBench, a benchmark that defeats every frontier AI model!

In two different programming languages, C and Python, GPT-5.5 comprehensively outperformed Opus 4.7 xhigh.

Just a few days ago, Meta, in collaboration with Stanford and Harvard, unveiled this new programming benchmark, ProgramBench:

200 tasks, and a 0% pass rate for every frontier AI model.

No model could solve even a single problem completely. Now, GPT-5.5 has become the first exception!

The ultimate test for programming AI: rebuilding a program from scratch.

How difficult is ProgramBench?

Traditional programming benchmarks, whether SWE-bench or HumanEval, are essentially about "fixing bugs" or "completing functions".

Give the model an existing codebase, tell it where it's broken, and let it fix the bugs.

That is an open-book exam, or at least a semi-open-book one. ProgramBench is completely different.

It gives you a pre-compiled executable file and a document, and then says: Rewrite this program from scratch.

No source code provided, no decompilation allowed, no internet access permitted.

200 tasks, ranging from small tools like jq and ripgrep to heavyweights like FFmpeg, SQLite, and the PHP interpreter.

OpenAI researcher Noam Brown previously stated, "It's time to phase out evaluation methods like GPQA and introduce a completely new one."

When it was first released, the frontier models that dominate every leaderboard were all wiped out. This time, GPT-5.5 finally turned the tables.

GPT-5.5 Breaks Record: Two Solutions in C and Python for the Same Problem

The first task that GPT-5.5 conquered was cmatrix, a classic terminal program that creates the digital rain effect in "The Matrix".

To the researchers' surprise, GPT-5.5's two inference levels, high and xhigh, chose completely different languages to solve the same problem.

The high version uses C, while the xhigh version uses Python.

Ultimately, both passed all the behavioral tests.

GPT-5.5 high's strategy was textbook: it first spent 10 rounds of exploration probing more than 40 flag combinations, thoroughly mapping the original program's CLI behavior.

It then wrote the complete C implementation in one go and finished with only 5 minor adjustments.

GPT-5.5 xhigh was even more thorough, taking 27 steps to explore every CLI path before writing a complete Python implementation in one go.
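This kind of black-box probing is straightforward to picture. Below is a minimal sketch, assuming a hypothetical harness (the path `./executable` and the probe flag are placeholders, not the benchmark's actual tooling): run the reference binary with one flag combination and record its output and exit status.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>   /* WEXITSTATUS */

/* Hypothetical probe of the reference binary's CLI behavior.
 * An invalid color makes it exit immediately, so its error
 * message and exit status can be read back; a real exploration
 * loop would iterate over many such flag combinations. */
int main(void) {
    FILE *p = popen("./executable -C purple 2>&1", "r");
    if (!p) { perror("popen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, p))
        fputs(line, stdout);              /* observed output */

    int status = pclose(p);
    if (status != -1)
        printf("exit=%d\n", WEXITSTATUS(status));
    return 0;
}
```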

Here come the key figures.

Without the high inference modes, GPT-5.5 (medium) is only slightly better than Claude Sonnet 4.6.

But once you switch to xhigh mode, the performance skyrockets.

Not only did it record the benchmark's first full solve (a 0.5% pass rate), it also set a new record for "almost solved" tasks: on 26 tasks it passed more than 95% of the unit tests.

More notably, GPT-5.5 xhigh outperformed every competitor across the entire cumulative histogram.

No matter which metric you choose (average score, median, share of tasks with ≥90% of tests passed, share with ≥50% passed), it is number one.

178 API calls, and Opus 4.7 still failed on two bugs

In comparison, the performance of the Claude Opus 4.7 xhigh is disappointing.

It burned through $10.74 and 178 API calls, roughly 10 times the standard GPT-5.5 run, which finished in 17 calls at $1.04.

The result: 19 failed tests, the worst performance of the entire evaluation.

The reason for Opus 4.7's failure is surprisingly simple:

Bug 1: Color parsing is case sensitive.

The code used strcmp() instead of strcasecmp(), so inputs like "GREEN", "Red", and "BLUE" were all rejected as invalid.

A one-function difference caused 11 tests to fail.
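The failure mode is easy to reproduce. Here is a minimal sketch, assuming a hypothetical color table (this is not Opus's actual code):

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp, on POSIX systems */

/* Hypothetical color lookup illustrating the bug: strcmp() is
 * case sensitive, so "GREEN", "Red", and "BLUE" all fall through
 * and are reported as invalid. */
static const char *colors[] = { "green", "red", "blue", "white" };

int color_index(const char *name) {
    for (size_t i = 0; i < sizeof colors / sizeof *colors; i++) {
        if (strcmp(name, colors[i]) == 0)   /* buggy: exact match only */
            return (int)i;
        /* fix: if (strcasecmp(name, colors[i]) == 0) return (int)i; */
    }
    return -1;   /* treated as an invalid color */
}
```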

In its 178-step exploration, Opus never tested uppercase or mixed-case color input; it only tried lowercase and an invalid color, "purple".

Bug 2: The exit code for invalid colors was written incorrectly.

The original program exits with status 0 when it encounters an invalid color, but Opus wrote exit(1).
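Reduced to its essentials, the mismatch looks like this (a hypothetical reduction with made-up option handling, not Opus's source):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reduction of the bug: all that matters is the
 * status passed to exit() when the color is invalid. */
int main(int argc, char **argv) {
    const char *color = (argc > 2 && strcmp(argv[1], "-C") == 0)
                            ? argv[2] : "green";
    if (strcmp(color, "purple") == 0) {   /* stand-in invalid color */
        fprintf(stderr, "Invalid color selected\n");
        exit(0);   /* reference behavior: exit status 0 */
        /* exit(1);    Opus's rewrite: status 1, a visible mismatch */
    }
    /* ...normal digital-rain rendering would follow... */
    return 0;
}
```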

Ironically, Opus clearly observed the original program's behavior during the exploration phase—`./executable -C purple; echo "exit=$?"` output `exit=0`. However, when testing its own implementation, it failed to detect this difference.

Eight tests failed.

However, Opus 4.7 had one highlight worth mentioning: it showed impressive systems-engineering skill when the ncurses header files were missing.

The other three models, upon discovering the missing ncurses.h file, directly switched to using ANSI escape sequences.

Opus 4.7 spent about 20 steps investigating: it used ldconfig -p to locate the runtime .so, nm -D to inspect the exported symbols, and then hand-wrote 106 lines of header declarations so it could link against the shared library directly.
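The trick can be sketched in a few lines. The declarations below are a hypothetical fraction of what Opus reportedly wrote; in practice each signature would be checked against nm -D output and the library's documentation:

```c
/* When ncurses.h is absent but libncurses.so is installed, declare
 * only the symbols you need and let the linker resolve them.
 * Build: cc rain.c -lncurses */
typedef struct _win_st WINDOW;      /* opaque: we only hold pointers */

extern WINDOW *initscr(void);
extern int     endwin(void);
extern int     refresh(void);
extern int     curs_set(int visibility);
extern int     printw(const char *fmt, ...);
extern int     napms(int ms);       /* sleep, in milliseconds */

int main(void) {
    initscr();                      /* resolved at link time */
    curs_set(0);                    /* hide the cursor */
    printw("digital rain would render here");
    refresh();
    napms(1000);
    endwin();
    return 0;
}
```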

It was a genuinely creative piece of engineering, though it didn't translate into a better score.

There are still 199 unsolved tasks.

The emergence of ProgramBench marks a new stage in programming benchmarking.

The pass rate for SWE-bench has reached 88.7%. AI has surpassed most PhDs on GPQA.

These benchmarks are "melting" at an alarming rate: scores climb ever higher while discriminating less and less.

ProgramBench, by contrast: 200 tasks, only 1 solved so far, a pass rate of 0.5%.

More importantly, this record-breaking achievement reveals a key trend: "inference computing power" is becoming a core variable in the capabilities of programming AI.

GPT-5.5 performs only moderately well in the default inference mode, but its high inference mode represents a qualitative leap.

In other words, it's not that the model isn't smart enough; it simply wasn't being given enough time to "think".

Of the 200 questions in ProgramBench, 199 are still waiting to be answered.

From zero to one is more than a starting point.

Looking back at every "first breakthrough" moment in the history of AI development—

AlphaGo defeated a professional Go player for the first time, GPT-4 passed the bar exam for the first time, and o1 scored on a math Olympiad problem for the first time.

"From zero to one" is never the starting point of linear progress, but rather a signal flare for exponential growth.

Noam Brown's Scaling Law for inference compute has received its most striking confirmation to date on ProgramBench:

With the same GPT-5.5 base, medium mode barely scored, high mode delivered a full solve, and xhigh mode crushed the field.

Intelligence is no longer a fixed value, but a function of computing power.

What does this mean? It means the path to ASI may not require waiting for the next architectural revolution.

As long as inference compute keeps expanding, and as long as the Scaling Law doesn't hit a wall, capability keeps climbing.

Today, the model can rebuild only cmatrix on ProgramBench; tomorrow it might rebuild SQLite; and the day after, perhaps the entire Linux kernel.

References:

https://x.com/polynoamial/status/2054255862441812099

https://programbench.com/blog/gpt-5-5-first-solve/

This article is from the WeChat official account "New Zhiyuan", edited by Taozi, and published with authorization from 36Kr.
