Every time you open an AI tool, you probably have to think for a second: which model should I use for this task? Writing code is one thing, looking up information is another, and having AI help you operate your computer requires opening another window.
After today, this sense of division finally has an answer.
Just now, OpenAI officially released GPT-5.4, which brings programming, reasoning, computer control, web search, and a million-token context together in a single model, without sacrificing any of these capabilities for the sake of integration.
OpenAI CEO Sam Altman also posted a short tweet on X, highlighting five areas: stronger knowledge work, better web search, native computer control, million-token context support, and the ability to intervene at any point while the model is responding.
These few words precisely address the five most prominent pain points in the application of AI over the past two years.
Knowledge work: eight times out of ten, AI beats the professionals.
To understand the advancements of GPT-5.4 in knowledge work, it is necessary to first understand the design logic of the GDPval benchmark.
It spans 44 professions across the nine industries that contribute the most to the US GDP. The tasks are real-world jobs that happen every day in the workplace: writing financial models for investment banks, scheduling emergency room shifts for hospitals, and creating presentations for sales teams.
After each task is completed, the output is blind-scored by real practitioners in the industry to see how often the AI's work matches or beats that of its human counterparts.
The answer for GPT-5.4 is 83.0%: in more than eight out of ten comparisons, industry professionals judged the AI's output to have reached or exceeded the level of its human counterparts. The previous generation, GPT-5.2, scored 70.9%, a gain of just over 12 percentage points.
The progress is most evident in spreadsheet modeling. On a task simulating a junior investment-banking analyst's modeling work, GPT-5.4 averaged 87.3%, compared to 68.4% for GPT-5.2 and 79.3% for GPT-5.3-Codex, a gain of nearly 20 percentage points over GPT-5.2.
Harvey's BigLaw Bench test results were equally impressive, with a GPT-5.4 score of 91%, and it also took first place in Mercor's APEX-Agents benchmark.
Accuracy is also a concern. The problem of hallucinations has always been the biggest obstacle for AI to enter professional scenarios, and every percentage point reduction means that it can be used safely in more scenarios.
Data shows that compared to GPT-5.2, GPT-5.4 has a 33% lower probability of errors in a single statement and an 18% lower probability of errors in a complete response.
Programming: one model covers writing and testing code.
GPT-5.4 folds the programming capabilities of GPT-5.3-Codex into the mainline model. For developers, this means you no longer need to switch to a separate model to write code, and the programming capability itself is not compromised in any way.
SWE-Bench Pro is specifically designed to test real-world software engineering tasks. GPT-5.4 scores 57.7% on it, versus 56.8% for GPT-5.3-Codex and 55.6% for GPT-5.2. After integration the programming score actually went up, while the model also gained a whole set of general-purpose capabilities such as computer control, leaving almost no obvious weakness.
After trying it out, well-known AI review blogger Dan Shipper wrote: "This is the best planning capability we've seen from OpenAI in recent times. The code review is also very strong, and the cost is about half that of Opus."
He pointed out two specific dimensions. First, planning ability is crucial to the success of long-term tasks, and GPT-5.4 is significantly more organized in task breakdown and continuous progress. Second, compared to Claude Opus, it costs about half the price; for developers who need to make large-scale API calls, this difference will be very noticeable on the bill.
Enabling the /fast mode in Codex can increase the token generation speed of GPT-5.4 by up to 1.5 times, allowing users to maintain a smooth workflow during coding, iteration, and debugging.
At the same time, the newly introduced experimental feature Playwright Interactive takes the programming experience of GPT-5.4 a step further.
When building web or Electron applications, GPT-5.4 enables real-time debugging through a visual browser. The model can write code and test the application it is building at the same time, playing the roles of developer and tester simultaneously.
OpenAI showcased a prime example: with just a single lightweight prompt, GPT-5.4 generated a complete isometric theme park simulation game, encompassing a tile-based path-laying and attraction construction system, AI-powered visitor navigation and queuing behavior, and a comprehensive score that is dynamically updated in real time across four metrics: funding, visitor numbers, satisfaction, and cleanliness.
Playwright Interactive undertook multiple rounds of automated testing throughout the process, verifying the correctness of path laying, camera navigation, visitor response, and UI metrics. From writing code to testing and acceptance, the model completed the entire process autonomously.
Blogger Angel also created a Minecraft clone using GPT-5.4. The model took about 24 minutes to build and ran smoothly without any crashes. He tweeted, "Minecraft is basically cracked. Now I need to find a new test."
Wharton professor Ethan Mollick also received early access. Using the same prompt as two years ago, adding only the instruction "Make it better," he had GPT-5.4 Pro generate a 3D scene inspired by Piranesi, and it ran without a single error. He then placed the result side by side with the version GPT-4 generated back then, and the difference was immediately apparent.
Computer control: it's now better at it than you are.
This is the most noteworthy change in the GPT-5.4 release. Previously, OpenAI's computer manipulation capabilities were a separate module, with a clear separation between them and the model's language understanding and code generation.
The two systems previously operated independently, requiring information to be transmitted back and forth, which naturally reduced efficiency. Now that this separation is gone, GPT-5.4 uses the model's own reasoning capabilities when controlling the computer, eliminating the need for a roundabout approach.
This is also OpenAI's first product to natively integrate computer use capabilities into a general model, and I believe this will be a new starting point for future discussions on AI Agents.
The OSWorld-Verified benchmark measures desktop navigation: completing real operating-system tasks from screenshots using mouse and keyboard interaction. GPT-5.4 achieved a success rate of 75.0%, compared to a human baseline of 72.4% and 47.3% for GPT-5.2.
In short, it has not only caught up with humans, but has also surpassed them.
In the Online-Mind2Web benchmark, which tests browser control using only screenshot mode, GPT-5.4 achieved 92.8%, while the comparison target, ChatGPT Atlas, achieved 70.9% in Agent Mode.
Real-world deployment examples speak for themselves. Mainstay used GPT-5.4 for automated form filling on approximately 30,000 property tax portals, achieving a 95% first-time success rate and a 100% success rate within three attempts, compared to only 73% to 79% for previous similar models. Session completion speed increased by approximately three times, and token consumption decreased by approximately 70%.
This is inseparable from the improvement in visual perception capabilities. Controlling a computer is ultimately about "seeing clearly"—seeing clearly what's on the interface, where the buttons are, and whether the clicks are accurate.
GPT-5.4 has been specifically strengthened at this layer. It introduces an original-resolution image input mode that supports high-fidelity images of up to 10.24 megapixels, with a maximum side length of 6,000 pixels; the ceiling of the existing high mode has also been raised to 2.56 megapixels, with a maximum side length of 2,048 pixels.
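For a concrete sense of those limits, here is a minimal sketch that checks whether a screenshot fits either mode. The function name is illustrative, and the numbers are taken from the article's description, interpreted as a total-pixel cap plus a longest-side cap:

```python
def fits_input_mode(width: int, height: int, mode: str = "original") -> bool:
    """Check whether an image fits the stated input limits.

    Limits as described in the release notes (illustrative, not an official API):
      - "original": up to 10.24 megapixels, max side 6,000 px
      - "high":     up to 2.56 megapixels,  max side 2,048 px
    """
    limits = {
        "original": (10_240_000, 6_000),  # (total pixels, longest side)
        "high": (2_560_000, 2_048),
    }
    max_pixels, max_side = limits[mode]
    return width * height <= max_pixels and max(width, height) <= max_side

# A 4,000 x 2,500 screenshot (10 MP) fits the original mode...
print(fits_input_mode(4000, 2500, "original"))  # True
# ...but exceeds the high mode's pixel budget.
print(fits_input_mode(4000, 2500, "high"))      # False
```

Note that both constraints must hold: a 7,000 x 100 strip is well under the pixel cap but still fails the side-length check.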
Tool usage and web search: persistence is the core competitiveness.
A complex AI Agent system may have dozens of MCP tools behind it. The old approach was to cram every tool description into the prompt before each conversation began, whether or not the tools would ever be used, paying the token cost up front.
GPT-5.4 takes a different approach: it first gives the model only a lightweight list of tool names (a tool search mechanism). When a tool is actually needed, its detailed description is retrieved on demand, and tools that have been used once are cached so they don't need to be retrieved again.
In a test of 250 tasks, with a full configuration of 36 MCP servers enabled, the tool search mode reduced total token consumption by 47% while maintaining the same accuracy. Nearly half the cost was saved, without sacrificing any accuracy.
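The mechanism described above can be sketched as a lazy registry: cheap names up front, expensive descriptions on demand, with a cache for repeat use. The class and method names here are illustrative, not an actual OpenAI API:

```python
class ToolRegistry:
    """Sketch of lazy tool-description retrieval with caching."""

    def __init__(self, tools: dict):
        # tools maps tool name -> full (token-expensive) description
        self._tools = tools
        self._cache = {}

    def list_tools(self) -> list:
        """Cheap: only the names go into the initial prompt."""
        return sorted(self._tools)

    def describe(self, name: str) -> str:
        """Fetch the full description only when the model asks for it,
        then cache it so later turns pay nothing extra."""
        if name not in self._cache:
            self._cache[name] = self._tools[name]
        return self._cache[name]

registry = ToolRegistry({
    "search_flights": "Search flights. Args: origin, destination, date...",
    "book_hotel": "Book a hotel room. Args: city, check_in, nights...",
})
print(registry.list_tools())                 # ['book_hotel', 'search_flights']
print(registry.describe("book_hotel")[:12])  # 'Book a hotel'
```

The savings come from the gap between the two calls: with 36 MCP servers attached, most conversations touch only a handful of tools, so most descriptions are never paid for at all.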
In web search, GPT-5.4 scored 82.7% on the BrowseComp benchmark, 17 percentage points higher than GPT-5.2's 65.8%, with the Pro version reaching 89.3%, setting a new industry record. Zapier's CEO commented that GPT-5.4 continues searching where other models give up, making it the most persistent model they have ever tested.
Million-token context: long, long, long.
GPT-5.4 supports context windows with up to 1 million tokens in its API, which means that all relevant documents for a complete project can be crammed into the same conversation at once.
However, based on the test results, 128K to 272K is the most stable range and is suitable for daily use.
Accuracy begins to decline above 256K, so specific tasks should be validated before relying on it. The score in the 512K-to-1M range drops to 36.6%, which makes that range still experimental and unsuitable for production tasks requiring high accuracy.
Another practical cost issue to note is that requests exceeding 272K will be counted towards the quota at twice the usage rate. In other words, sending a request with an excessively long context consumes the same amount of quota as two normal requests. It's worth considering carefully whether you really need such a long context before making such a request.
On the ARC-AGI-2 visual abstract reasoning benchmark, GPT-5.4 Pro scored 83.3%, while the previous-generation GPT-5.2 Pro managed only 54.2%.
FrontierMath Tier 4 is widely recognized as one of the most difficult mathematical benchmarks, containing 50 research-level problems that might take human mathematicians several weeks to solve. GPT-5.4 Pro scored 38.0% on it, compared to 31.3% for its predecessor.
For context: a year ago the best result on this benchmark was o3's 2%, and the best open-source model today scores 4.2%.
Blogger Deedy tweeted that the jump from 2% to 38% was "simply astonishing." With tool use, GPT-5.4 Pro scored 58.7% on Humanity's Last Exam, versus 50.0% for GPT-5.2 Pro, a gain of nearly 9 percentage points.
Adjustments during implementation, not rework after completion.
Anyone who has used AI to handle long tasks has probably had this experience: after the model has run a long section, you realize that it is going in the wrong direction and you have to start all over again, wasting all your time.
GPT-5.4 Thinking introduces a new "interruption" feature in ChatGPT: before tackling complex tasks, the model presents a work plan outline and then begins execution. Users can intervene at any time during execution to adjust the direction, without having to wait for the result and start over.
This feature moves the correction process from "completed" to "in progress," making a noticeable difference in user experience for tasks requiring multiple rounds of collaboration. The feature is currently available on chatgpt.com and the Android app, with an iOS version coming soon.
Starting today, GPT-5.4 is available to ChatGPT Plus, Team, and Pro users, replacing GPT-5.2 Thinking as the default thinking model.
GPT-5.2 Thinking will be retained until its official retirement on June 5 of this year. Enterprise and Edu users can get early access once their administrators enable it in the admin console. GPT-5.4 Pro is available only on the Pro and Enterprise plans.
The standard API is priced at $2.50 per million tokens for input, $0.25 per million tokens for cached input, and $15 per million tokens for output. The Pro version is priced at $30 per million tokens for input and $180 per million tokens for output. Batch and Flex processing are offered at half the standard price, while Priority Processing is priced at twice the standard price.
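The prices quoted above are enough for a rough per-request estimate. A sketch using only the figures from the article (the function and its parameters are illustrative, not an official SDK):

```python
# USD per million tokens, as quoted in the article.
PRICES = {
    "standard": {"input": 2.50, "cached_input": 0.25, "output": 15.00},
    "pro":      {"input": 30.00, "output": 180.00},
}

def request_cost(tier, input_toks, output_toks, cached_toks=0, multiplier=1.0):
    """Estimate one request's cost.

    multiplier: 0.5 for Batch/Flex, 2.0 for Priority, per the article.
    cached_toks: the portion of input billed at the cached-input rate
    (the article quotes a cached rate only for the standard tier).
    """
    p = PRICES[tier]
    fresh = input_toks - cached_toks
    cost = (fresh * p["input"]
            + cached_toks * p.get("cached_input", p["input"])
            + output_toks * p["output"]) / 1_000_000
    return cost * multiplier

# 100K input (half of it cached) plus 10K output on the standard tier:
print(round(request_cost("standard", 100_000, 10_000, cached_toks=50_000), 4))
# The same traffic on Pro:
print(round(request_cost("pro", 100_000, 10_000), 2))
```

Running the numbers makes the tier gap concrete: the same 110K-token exchange costs roughly $0.29 on the standard tier and $4.80 on Pro, which is the arithmetic behind the advice below to reserve Pro for tasks that truly need it.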
Of course, powerful reasoning ability has its downsides. Hyperbolic co-founder Justin Jin complained on X that GPT-5.4 Pro is the model most prone to overthinking: it launched into serious reasoning after a simple "Hi," burning through $80.
This is not an isolated case. The nature of reasoning models means they tend to think deeply about any input, even when the problem doesn't call for it. For everyday, lightweight tasks, the standard version may be the better choice; the Pro version's reasoning capacity is better reserved for situations where it truly pays off.
Over the past two years, discussions of AI capability have mostly centered on the "intelligence" of benchmark scores. What GPT-5.4 demonstrates is a different kind of intelligence: the ability to reliably take on responsibility in real-world workflows.
In the past, AI could only output text; a human still had to do the actual operating to make anything happen. Now the model can open a browser, fill out a form, click a button, and record the results on its own, closing the loop on a complete task.
AI is transforming from a system adept at answering questions to one adept at completing tasks. And this transformation is happening much faster than most people anticipated.
Reference:
https://openai.com/index/introducing-gpt-5-4/
This article is from the WeChat official account "APPSO" and is published with authorization from 36Kr.


