GPT-5.4: Has the "Agent-Native" Large Model Arrived?

Just two days after the rumors surfaced, on March 5th local time, OpenAI officially released GPT-5.4. This update focuses on the hottest area of the moment: AI agents.

Prior to GPT-5.4, the capability limits of large models could be summarized in one sentence: they could tell you "how to do it," but they couldn't do it themselves.

Ask such a model to analyze your competitors, and it gives you a lengthy written report; ask it to organize an Excel spreadsheet, and it writes Python code for you to run; ask it to book a flight, and it tells you step by step which website to visit and which button to click.

The wall in between is called "computer operation."

GPT-5.4 is the first general-purpose model from OpenAI to remove this wall.

Improvements of GPT-5.4 compared to previous models | Image source: OpenAI

It can recognize screen content by taking screenshots, issue mouse and keyboard commands, and execute multi-step workflows across different applications. In OpenAI's own words, this is their "most powerful and efficient cutting-edge model for professional work to date."

On a more technical level, GPT-5.4 supports a context window of up to 1 million tokens and can call libraries such as Playwright to directly control browsers and desktop applications.
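The screenshot-then-act cycle described above can be pictured as a simple loop. The following is only a minimal sketch under stated assumptions: `FakeScreen`, `Action`, `stub_model`, and `run_agent` are hypothetical names invented for illustration, the "screenshot" is a plain state snapshot rather than pixels, and a hard-coded policy stands in for the model. It does not use or represent OpenAI's actual API; in a real setup a browser-automation library such as Playwright would presumably take the place of the fake screen.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    target: str = ""   # which field or button the action applies to
    text: str = ""     # text to enter, for "type" actions


@dataclass
class FakeScreen:
    """Stand-in for a real browser or desktop; purely illustrative."""
    fields: dict = field(default_factory=dict)
    submitted: bool = False

    def screenshot(self) -> dict:
        # A real agent would receive pixels; here it gets a state snapshot.
        return {"fields": dict(self.fields), "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        if action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def stub_model(observation: dict, goal: dict) -> Action:
    """Hypothetical policy: fill in missing fields, then press submit."""
    for name, value in goal.items():
        if observation["fields"].get(name) != value:
            return Action("type", target=name, text=value)
    if not observation["submitted"]:
        return Action("click", target="submit")
    return Action("done")


def run_agent(screen: FakeScreen, goal: dict, max_steps: int = 10) -> list:
    """Observe, decide, act; repeat until the policy reports it is done."""
    history = []
    for _ in range(max_steps):
        action = stub_model(screen.screenshot(), goal)
        if action.kind == "done":
            break
        screen.apply(action)
        history.append(action)
    return history


screen = FakeScreen()
steps = run_agent(screen, {"from": "SFO", "to": "JFK"})
print([a.kind for a in steps])  # ['type', 'type', 'click']
```

The `max_steps` cap is a deliberate choice in this sketch: a loop driven by model output needs a hard stop in case the policy never decides it is finished.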

This means that it no longer deals with "dialogue about the task," but with "the task itself."

01

OpenAI's groundwork

If you've been following OpenAI's moves over the past few months, you'll find that GPT-5.4 isn't a product that suddenly appeared, but rather the latest move in a clear strategic line.

Just two weeks ago, OpenAI released GPT-5.3-Codex, upgrading Codex from an "Agent that can write code" to an "Agent that can do almost everything a developer can do on a computer," and setting new industry benchmarks on SWE-Bench Pro and Terminal-Bench.

At the same time, OpenAI launched the "Frontier" platform for enterprises, with HP, Intuit, and Uber already being early users.

GPT-5.4 is significantly smarter than 5.2 in form completion | Image source: OpenAI

Earlier, on March 2nd, OpenAI and AWS expanded their existing $38 billion partnership to more than $100 billion over eight years, with AWS becoming the exclusive third-party cloud distributor for the OpenAI Frontier platform. The sheer size of the deal is itself a signal.

The latest $110 billion funding round, backed by tens of billions of dollars each from Amazon, SoftBank, and Nvidia, closed around the same time.

This is not a company content to "build good products"; it is a company going all out to "win the enterprise AI agent market."

GPT-5.4's native computer operation capabilities are the key weapon in this sprint.

02

Is it really easy to use?

Launch-event demos always look great; the real question is how the model performs in practice.

Fintech company Walleye Capital reported in internal testing that GPT-5.4 improved accuracy by 30 percentage points in Excel financial model evaluations, significantly accelerating the automation process of scenario analysis.

The CEO of talent assessment platform Mercor called it "the best model we've ever tested," highlighting its outstanding performance on long-horizon tasks such as presentation creation, financial modeling, and legal analysis.

An independent developer who uses Codex daily offered a more down-to-earth assessment: "GPT-5.4 is my new daily driver in Codex. Its way of thinking is closer to a human's, and it's not as obsessed with technical details as 5.3." However, he also added a cautionary note: "Be careful; I've encountered several situations where the model incorrectly executes a task but conceals this fact."

GPT-5.4's improvements in operation and vision | Image source: OpenAI

This detail is worth pondering.
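One common way to guard against this kind of silent failure is to never accept an agent's own success report without an independent check of the outcome. The sketch below illustrates that idea only; `StepReport`, `run_and_verify`, and the toy spreadsheet dictionary are hypothetical names made up for this example, and nothing here reflects how GPT-5.4 or Codex actually reports results.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Sheet = Dict[str, int]  # toy stand-in for a spreadsheet: cell name -> value


@dataclass
class StepReport:
    claimed_ok: bool   # what the agent says happened
    verified_ok: bool  # what an independent check actually observed

    @property
    def trustworthy(self) -> bool:
        # Accept a step only when the claim and the evidence both say success.
        return self.claimed_ok and self.verified_ok


def run_and_verify(step: Callable[[Sheet], bool],
                   check: Callable[[Sheet], bool],
                   sheet: Sheet) -> StepReport:
    """Run a step, then inspect the resulting state instead of trusting the claim."""
    claimed = step(sheet)
    return StepReport(claimed_ok=claimed, verified_ok=check(sheet))


def working_step(sheet: Sheet) -> bool:
    sheet["B2"] = 42
    return True


def silent_failure(sheet: Sheet) -> bool:
    # Reports success without writing anything, mimicking a concealed error.
    return True


def cell_is_set(sheet: Sheet) -> bool:
    return sheet.get("B2") == 42


good = run_and_verify(working_step, cell_is_set, {})
bad = run_and_verify(silent_failure, cell_is_set, {})
print(good.trustworthy, bad.trustworthy)  # True False
```

The point of the design is that the check reads the environment's state directly, so a step that claims success but changed nothing is flagged rather than passed along.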

Benchmark data points the same way. GPT-5.4 reportedly outperformed 83% of average office workers on the GDPval benchmark. The number sounds impressive, but the real question isn't "how many people does it surpass?" but rather "in which tasks can it replace humans?"

However, Dr. Jeff Dalton of the University of Edinburgh's School of Informatics raised a practical concern: the demonstrations so far lack detailed evaluation evidence to support the grand claims. The capabilities are real, but where their boundaries lie still requires independent verification.

03

Agent battlefield, no safe zone.

If GPT-5.4 represents OpenAI's agent ambitions, then its competitors have not been idle.

Anthropic introduced its "Computer Use" feature with Claude 3.5 Sonnet back in October 2024, and followed in February 2025 with Claude 3.7 Sonnet, positioned as a hybrid reasoning model designed for complex tasks.

Google's Gemini 2.0 series also continues to focus on "Agentic" capabilities, with Project Mariner already able to perform multiple operations autonomously within the Chrome browser.

However, the fundamental difference between GPT-5.4 and its competitors is that it is OpenAI's first product to build computer operation into a general-purpose model: not a standalone tool, not an API that requires additional calls, but a capability inherent in the model itself.

In engineering terms, "native" means lower latency, smoother task transitions, and less "glue code." For companies looking to quickly deploy agent applications, this difference directly impacts deployment costs.

OpenAI also announced that GPT-5.4 can directly connect to Microsoft Excel and Google Sheets, enabling granular analysis and automation at the cell level. This step clearly targets the core of enterprise decision-making processes.

The agent battle has never been about who runs fastest, but about who first embeds itself in enterprise workflows and becomes an "indispensable presence."

Tech launches are always full of passion, but the real test comes on day 91, when the hype has died down and users open the tool for real work. Can it reliably capture the screenshot, accurately click the button, quietly complete the task, and then deliver the results?

The developer's remark about the model concealing its errors is the most alarming sentence I've seen in the coverage so far.

The ceiling of AI agent capabilities is never "what it can do," but "whether you dare to trust it to do it."

Trust is the true currency in this agent war.

This article is from the WeChat official account "GeekPark" (ID: geekpark), author: Hualinwuwang, editor: Jingyu, published with authorization from 36Kr.
