Silicon Valley is up all night!
Just now, GPT-5.5 made its stunning debut: OpenAI's most powerful and versatile next-generation flagship model to date.
It represents a whole new level of intelligence, evolving into the "native brain" of the agent era.
That's right, the much-anticipated "Spud" has finally made its debut today.
Most notably, GPT-5.5 ranked first across all benchmark tests!
Whether in programming, reasoning, mathematics, or intelligent agent tasks, Claude Opus 4.7 and Gemini 3.1 Pro are completely outclassed by GPT-5.5.
Compared to the previous generation, GPT-5.5 Thinking is a game-changer, creating a generational gap.
In the AAI evaluation, at the same output-token count, GPT-5.5's intelligence index topped the global chart; it also set a new state-of-the-art record on ARC-AGI-2.
Altman couldn't help but praise it: "GPT-5.5 is both smart and fast."
Per-token speed matches GPT-5.4, while the number of tokens used per task drops significantly.
It can almost perfectly understand what you want it to do!
OpenAI President Greg Brockman enthused, "This is a step toward a completely new way of working with computers."
Starting today, GPT-5.5 is officially launched on ChatGPT and Codex.
A new king of programming has emerged, and Opus 4.7 has fallen from its pedestal.
Let's look at the core programming field first. GPT-5.5 has made a remarkable comeback!
In OpenAI's words, it is the most powerful intelligent agent programming model to date.
Terminal-Bench 2.0 tests the entire agentic engineering workflow.
Each problem gives the model a terminal environment and a fuzzy goal, leaving it to plan its own path, invoke tools, write scripts, handle errors, and iterate.
Here, GPT-5.5 scored 82.7%, versus 75.1% for GPT-5.4 and only 69.4% for Claude Opus 4.7: a gap of more than 13 percentage points, a crushing victory.
OpenAI's internal Expert-SWE benchmark, which specifically tests long-cycle programming tasks with a median human estimated completion time of 20 hours, scored 73.1% for GPT-5.5, which is also higher than GPT-5.4's 68.5%.
In SWE-Bench Pro, an industry benchmark that is widely recognized as the best indicator of real-world problem-solving capabilities on GitHub, GPT-5.5 scored 58.6%, slightly lower than Claude Opus 4.7 (64.3%).
However, OpenAI added an asterisk next to this figure: "Anthropic reports signs of overfitting (memorization) on some problem subsets."
In other words: Opus 4.7 scored well on the test, but it may simply have memorized the answers.
Codex researchers stated bluntly: SWE-Bench is no longer a reliable measure of top-tier programming skills.
Most importantly, in all three assessments, GPT-5.5 used fewer tokens, yet still outperformed GPT-5.4 across the board.
This capability is even more evident in Codex.
It can complete end-to-end programming tasks, from implementation and refactoring to debugging, testing and verification.
For example, let's use GPT-5.5 to create a visualization application for the Artemis II space mission.
First, send GPT-5.5 a screenshot of the mission, then ask it to implement an interactive 3D orbit simulator using WebGL and Vite. The trajectory must be built from real vector data from NASA/JPL Horizons, with realistic orbital mechanics.
GPT-5.5 built it from scratch: the scene can be rotated by dragging the mouse, and the relative positions of the Orion spacecraft, the Moon, and the Sun are all correct.
Next, a tank shooting down flying saucers.
The prompt asked it to create a UFO shooting game in Three.js, where the player controls a tank and shoots down saucers overhead. The game should be "low-poly but visually appealing"; first output the complete file structure and the list of files to modify, then write all the code, and "don't stop until you're finished."
GPT-5.5 executed exactly as instructed, from file structure to Three.js rendering to hit detection, delivering a playable 3D game in one go.
In the 3D dungeon arena, Codex handled the game architecture, TypeScript/Three.js implementation, combat system, enemy encounters, and HUD feedback.
GPT generated the environment textures, the OpenAI API supplied the character dialogue, and the character models, textures, and animations came from third-party asset tools; several AIs each handled their own piece, assembling a game where you can fight monsters.
Early testers stated that GPT-5.5 has a stronger ability to understand system configurations.
It is better able to determine where the problem lies, where the fix should be added, and what other parts of the codebase might be affected.
Some 85% of OpenAI employees use it obsessively; this is a genuine workhorse AI.
Beyond programming, GPT-5.5 also performs exceptionally well in "knowledge-based work".
After all, OpenAI calls it "a new kind of intelligence for real-world work."
It can understand what you want to do more quickly and switch between different tools until the task is completed.
GDPval measures how well AI performs real-world knowledge work across 44 occupations. GPT-5.5 scored 84.9%, Opus 4.7 80.3%, and Gemini 3.1 Pro only 67.3%.
OSWorld-Verified tests whether the model can operate independently in a real computer environment. GPT-5.5 scored 78.7%, almost the same as Opus 4.7's 78.0%.
Tau2-bench was used to test the model's ability to handle multi-turn conversations, system queries, and action execution within complex customer service workflows. GPT-5.5 achieved 98.0% performance without fine-tuning prompts.
What's interesting is how OpenAI uses it itself. According to the official blog, over 85% of the company's employees use Codex across departments every week.
The public relations department used GPT-5.5 to analyze six months of speech invitation data, built a scoring and risk framework, and automatically processed low-risk requests through the Slack AI agent.
The finance department reviewed 24,771 K-1 tax forms, 71,637 pages in total, finishing two weeks earlier than last year.
The marketing team automated its weekly business reports, saving 5 to 10 hours per week.
Today, in Codex, GPT-5.5 allows direct interaction with web applications to test processes, click on pages, capture screenshots, and iterate based on what you see until the task is completed.
Below is an example of testing the onboarding process.
Codex can also generate higher quality spreadsheets, PowerPoint presentations, and documents. Below is a demo of financial modeling.
The new in-app file viewer speeds up the review, revision, and iteration process, making files ready to share faster.
In terms of computer use, Codex offers enhanced computer operation capabilities.
Whether it's recognizing screen content, clicking, typing, navigating, or even transferring contextual information across tools, it can handle it all with ease.
OpenAI researcher Noam Brown stated that with GPT-5.5, he can write CUDA kernels and run research experiments just like a professional.
Revolutionizing scientific research: a new proof about Ramsey numbers
In addition to these, GPT-5.5 also helped discover a new proof about Ramsey numbers, which has been verified in the Lean language.
Ramsey numbers are a core research object in combinatorics; simply put, they are the size of a network at which a certain regular structure will inevitably emerge. New results in this field are extremely rare.
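For readers unfamiliar with the object, here is the standard textbook definition (this is general background, not a statement of the new result in the paper):

```latex
% Definition (standard): the Ramsey number R(s, t) is the smallest n such
% that every red/blue colouring of the edges of the complete graph K_n
% contains a red K_s or a blue K_t.
R(s, t) = \min\{\, n : \text{every 2-colouring of } E(K_n)
        \text{ contains a red } K_s \text{ or a blue } K_t \,\}
% Classic small case: R(3,3) = 6, i.e. among any six people there are
% always three mutual acquaintances or three mutual strangers.
```

Even such simple-looking quantities are notoriously hard: the exact value of R(5,5) is still unknown.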
Paper link: https://cdn.openai.com/pdf/6dc7175d-d9e7-4b8d-96b8-48fe5798cd5b/Ramsey.pdf
GPT-5.5's contribution is a proof concerning the long-run asymptotics of off-diagonal Ramsey numbers, a technically demanding corner of the field.
This is not writing code or generating explanations; it is producing a mathematical proof of real value.
On GeneBench, GPT-5.5 scored 25.0%, while GPT-5.4 scored 19.0%. This benchmark is specifically designed for multi-stage scientific data analysis, requiring models to handle fuzzy data and cope with hidden confounding factors with minimal human intervention.
BixBench, an evaluation based on real bioinformatics design, ranked GPT-5.5 first among all models with publicly available scores, at 80.5%.
FrontierMath Tier 4, the most difficult tier in a cutting-edge mathematics problem bank curated by top mathematicians such as Terence Tao, covers areas such as algebraic geometry and number theory, with a difficulty level approaching that of unpublished research.
The GPT-5.5 score is 35.4%, the GPT-5.4 score is 27.1%, and the Opus 4.7 score is only 22.9%. The difference is more than 12 percentage points.
By comparison, the gap on Tiers 1 through 3 is only about 8 percentage points (51.7% vs 43.8%), which suggests GPT-5.5's advantage grows the closer the problems get to the frontier of mathematics.
Derya Unutmaz, professor of immunology at The Jackson Laboratory, used GPT-5.5 Pro to analyze an expression dataset of 62 samples and nearly 28,000 genes.
The model produced a detailed research report, summarizing not only its findings but also delving into key issues and insights. In contrast, this task would have taken a human team several months.
Bartosz Naskręcki, a mathematician at Adam Mickiewicz University in Poznań, used a single prompt in Codex to build an algebraic geometry application in just 11 minutes, visualizing the intersection of quadric surfaces and converting the resulting curve into a Weierstrass model.
From programming to knowledge work to scientific research, the conclusion is clear.
GPT-5.5 is not just another "minor version iteration"; it is a holistic leap brought about by a completely new base model.
A single image is all it takes to completely defeat Opus 4.7.
In short, GPT-5.5 represents a wholesale transformation, and one comparison chart against Opus 4.7 is enough to show it.
In Vending-Bench, GPT-5.5 also outperformed Opus 4.7.
Opus 4.7 performed much better than 4.6, but it still kept lying to suppliers and cheating customers on refunds. GPT-5.5, by contrast, operated ethically and still won the game.
Altman also joked, "Don't share this, don't share this, don't share this... oh well, life ultimately imitates art."
The price has doubled; it's more powerful, but also more expensive.
Having discussed strength, we must now talk about money.
The API pricing for GPT-5.5 is $5 per million input tokens and $30 per million output tokens.
What is the price of GPT-5.4? $2.50 and $15.
It has doubled.
GPT-5.5 Pro is steeper still: $30 for input and $180 for output.
Compared with Opus 4.7 ($5 input, $25 output), GPT-5.5 matches the input price but charges 20% more for output.
OpenAI's explanation is improved token efficiency. For the same Codex task, GPT-5.5 uses significantly fewer tokens than GPT-5.4.
It is stronger and more efficient.
However, a simple calculation reveals the catch: if a team spends $100,000 per month on GPT-5.4 and switches to GPT-5.5, then even with token usage down 30%, the doubled prices push the monthly bill to around $140,000 ($100,000 × 2 × 0.7).
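As a sanity check, here is the arithmetic behind that estimate (the dollar figures and percentages are the article's; the helper function itself is illustrative):

```python
def monthly_cost(baseline_spend: float, price_multiplier: float,
                 token_usage_ratio: float) -> float:
    """Scale a monthly bill by a price change and a token-efficiency change.

    baseline_spend:    current monthly spend in dollars
    price_multiplier:  new price / old price (2.0 = prices doubled)
    token_usage_ratio: new tokens / old tokens (0.7 = 30% fewer tokens)
    """
    return baseline_spend * price_multiplier * token_usage_ratio

# A team spending $100,000/month, prices doubled, 30% fewer tokens used:
new_bill = monthly_cost(100_000, 2.0, 0.7)
print(new_bill)  # 140000.0
```

Token efficiency would need to improve by a full 50% just to keep the bill flat against a 2x price increase.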
In other words, GPT-5.5 is a premium product where "you pay more for more intelligence." In contrast, GPT-5.4 will most likely continue to be seen as a cost-effective option.
OpenClaw has been integrated with the most powerful GPT-5.5.
Eight days, a microcosm of an era
Let's look back at what happened during those 8 days.
On April 16, Anthropic launched a surprise attack with Opus 4.7 on SWE-Bench Pro, wresting the programming crown from GPT-5.4.
On April 24, GPT-5.5 was officially released: it crushed the competition on Terminal-Bench, doubled its price, and caused a sensation in research circles.
The AI race in 2026 will no longer be a contest of "whose model is stronger".
In the narrative of GPT-5.5, OpenAI repeatedly emphasizes "exploring entirely new ways of working on computers," a universal agent that can autonomously plan tasks, invoke various tools, and switch back and forth between browsers and local software.
Benchmarking is just the appetizer; agent-based office work is the main battleground. Whoever defines "how AI can do things for humans" first will define the next generation of computer user interfaces.
One full round of back-and-forth took just eight days, and the pace will only accelerate.
References:
https://openai.com/index/introducing-gpt-5-5/
https://x.com/OpenAI/status/2047376561205325845?s=20
This article is from the WeChat official account "New Zhiyuan", author: New Zhiyuan, published with authorization by 36Kr.