Two months later, GPT-4.1 arrived claiming to replace GPT-4.5. How good is it? In many hands-on tests its performance is indeed remarkable, yet it still cannot beat Gemini 2.5 Pro or Claude 3.7 Sonnet. So the question is: why did OpenAI release a model that lags behind Google's?
However, two months later, GPT-4.5 was officially retired: the newer wave washed the older one up onto the beach.
The newly released GPT-4.1 family leapfrogs 4.5 with stronger coding performance, a million-token context window, and far more competitive pricing.
The nano version of GPT-4.1 matches GPT-4o mini in performance while being faster and cheaper.
These models are currently available only through the API, but the popular coding platforms Windsurf and Cursor have launched seven-day free trials of GPT-4.1.
And the first wave of hands-on tests from across the internet has already arrived.
GPT-4.1 codes impressively, but still can't beat Gemini 2.5
How does this model, famed for its coding prowess, perform on real tasks?
OpenAI researchers say GPT-4.1 is not a reasoning model, yet it scores 55% on the SWE-bench software engineering benchmark.
Netizen Flavio Adamo used the same prompt, a ball in free fall inside a rotating hexagon, to test the coding performance of the three GPT-4.1 models against GPT-4.5.
It is easy to see that GPT-4.1 simulates the ball's physics accurately, while GPT-4.1-mini and GPT-4.1-nano lag far behind.
GPT-4.5's performance is nearly on par with GPT-4.1's.
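The benchmark prompt itself is easy to reproduce headlessly. Below is a minimal sketch of that kind of simulation, assuming a frictionless point ball; all constants (radius, spin rate, restitution) are illustrative choices of my own, not taken from the original test.

```python
import math

def simulate(steps=2000, dt=0.005):
    """Ball in free fall inside a rotating hexagon; returns the ball's path."""
    R = 1.0                                 # hexagon circumradius
    apothem = R * math.cos(math.pi / 6)     # distance from center to each wall
    omega = 1.0                             # hexagon angular velocity (rad/s)
    g = 9.8                                 # gravity
    e = 0.9                                 # coefficient of restitution
    x, y, vx, vy = 0.0, 0.3, 0.0, 0.0
    path = []
    for i in range(steps):
        t = i * dt
        vy -= g * dt                        # gravity
        x += vx * dt
        y += vy * dt
        for k in range(6):                  # test against each rotating wall
            a = omega * t + k * math.pi / 3 # outward wall-normal angle
            nx, ny = math.cos(a), math.sin(a)
            s = x * nx + y * ny             # signed distance along the normal
            if s > apothem:                 # ball has penetrated this wall
                x -= (s - apothem) * nx     # push it back inside
                y -= (s - apothem) * ny
                vn = vx * nx + vy * ny
                if vn > 0:                  # reflect the normal velocity component
                    vx -= (1 + e) * vn * nx
                    vy -= (1 + e) * vn * ny
        path.append((x, y))
    return path
```

The reflection step treats each rotating wall as instantaneously static; a fuller version would also add the wall's tangential velocity at the contact point.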
In another similar test, GPT-4.1 was asked to simulate a ball bouncing inside a rotating square.
Kaggle developer Parul Pandey said it was fun to create educational physics simulations with GPT-4.1.
As shown below, when generating code for knocking down a pyramid with a ball, the model reads very few unnecessary files and produces a very concise code structure.
Another engineer used Windsurf to let GPT-4.1 generate a Snake game in 30 seconds.
Microsoft researcher Dimitris Papailiopoulos used GPT-4.1, GPT-4o, and GPT-4.5 to draw unicorns, and speculated that 4.1 has fewer parameters than 4o.
To be honest, the unicorn generated by GPT-4.1 is the ugliest one.
Wharton School professor Ethan Mollick used GPT-4.1 to generate a p5.js spaceship control panel. He said that compared with GPT-4, 4.1 has made great progress and performs well overall.
Moreover, Ethan noted that GPT-4.1 is only the fourth model whose twigl shaders have run on the first try.
Netizens had GPT-4.1 and Gemini 2.5 Pro simulate a neon-lit cyberpunk city at night; in this case, the 4.1 model was still much stronger than Google's.
From the demos above, it is clear that GPT-4.1's coding performance is indeed impressive, but at a macro level it still falls short of Gemini 2.5 Pro and Claude 3.7 Sonnet.
In the latest Aider polyglot coding benchmark, GPT-4.1 scored 52.4%, close to Grok 3 and DeepSeek V3, at roughly half the cost of o3-mini.
Netizens complained that GPT-4.1 codes worse than DeepSeek V3 yet costs eight times as much.
Similarly, the latest LiveBench evaluation confirms that GPT-4.1's reasoning, coding, and math capabilities trail Gemini 2.5.
Abacus.AI founder Bindu Reddy said 4.1 performs above GPT-4o, but the LiveBench results show the new model is just an incremental update over 4o.
Harvard scientist Pierre Bongrand pointed out that, for the first time, OpenAI has released a model that lags far behind Google's.
On the GPQA Diamond knowledge benchmark, the GPT-4.1 family fails to reach human PhD level, let alone surpass Gemini 2.5 Pro.
A netizen joked with a meme that in the window between OpenAI's releases of GPT-4 and GPT-4.1, Google evolved Bard all the way into its strongest form, Gemini 2.5.
This year's AI war is clearly the ultimate head-to-head contest between OpenAI and Google.
Besieged by Google, OpenAI still cannot be underestimated
With the release of GPT-4.1, Nathan Lambert, head of post-training at Ai2, promptly published an analysis.
He said that while GPT-4.1 was a minor version update, it made it clearer that very different models were driving the best API businesses.
Today, OpenAI is using GPT-4.1 to separate the API and ChatGPT.
These models optimize for intelligence per dollar, and we will continue to see ChatGPT and the API business handled in increasingly different ways.
Recently, OpenAI has been making various small updates, and their ultimate vision is to make ChatGPT a monolithic application independent of its API.
Last week, ChatGPT’s memory functionality was improved.
Today, OpenAI announced another set of API-only models, GPT-4.1, which directly competes with Google's Gemini.
Taken individually, none of the recent releases represents a disruptive frontier breakthrough; after all, models of comparable performance already exist.
However, these updates reveal where OpenAI's strategic focus is headed.
Today, its weekly active users reportedly exceed 1.9 billion. What OpenAI needs now is for ChatGPT, and the model behind it, to be completely different from every other AI product on the market.
Unlike other products that focus primarily on coding or information processing, ChatGPT places special emphasis on personality, atmosphere, and entertainment.
A classic example: GPT-4.5, with its high price tag, is being deprecated from the API but will remain in ChatGPT.
Even with o3, o4, or an open model on the way, OpenAI's macro strategic direction remains unclear.
As the figure below shows, the core message OpenAI conveys is simple: better-performing models with faster inference.
Below is a comparison of per-million-token prices (in USD) for the new OpenAI models and Google Gemini.
New OpenAI models:
GPT-4.1: input $2.00 / output $8.00 | cached input $0.50
GPT-4.1 mini: input $0.40 / output $1.60 | cached input $0.10
GPT-4.1 nano: input $0.10 / output $0.40 | cached input $0.025
Older OpenAI models:
GPT-4o: input $2.50 / output $10.00 | cached input $1.25
GPT-4o mini: input $0.15 / output $0.60 | cached input $0.075
Google Gemini:
Gemini 2.5 Pro (≤200K tokens): input $1.25 / output $10.00 | caching unavailable
Gemini 2.5 Pro (>200K tokens): input $2.50 / output $15.00 | caching unavailable
Gemini 2.0 Flash: input $0.10 / output $0.40 | cached input $0.025 (text/image/video), $0.175 (audio)
Gemini 2.0 Flash-Lite: input $0.075 / output $0.30 | caching unavailable
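To make these per-million-token prices concrete, here is a small sketch that turns them into a total cost for a hypothetical workload; the price table is transcribed from the list above, while the workload numbers (10M input, 2M output tokens) are made up for illustration, and cached-input discounts are ignored.

```python
# (input, output) USD per 1M tokens, transcribed from the price list above.
PRICES = {
    "gpt-4.1":          (2.00, 8.00),
    "gpt-4.1-mini":     (0.40, 1.60),
    "gpt-4.1-nano":     (0.10, 0.40),
    "gemini-2.0-flash": (0.10, 0.40),
}

def workload_cost(model, input_tokens, output_tokens):
    """Total USD cost for one workload, ignoring cached-input discounts."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Hypothetical workload: 10M input tokens, 2M output tokens.
for model in PRICES:
    print(f"{model}: ${workload_cost(model, 10_000_000, 2_000_000):.2f}")
```

At this workload, GPT-4.1 comes out to $36.00 while GPT-4.1 nano and Gemini 2.0 Flash both come out to $1.80, which is why the nano/Flash tier is where the head-to-head price war is being fought.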
Strong academic benchmark results do not fully reflect a model's real-world performance; in practice, models must also handle repetitive and niche tasks.
Obviously, these new models are intended to directly compete with the Gemini Flash and Flash-Lite (after the stunning launch of the Gemini 2.5 Pro, the much-anticipated Gemini 2.5 Flash is also about to be released).
By comparison, GPT-4o mini's performance already lags behind, and it is not as pleasant to use as Flash.
To succeed in the API business, OpenAI needs to make breakthroughs in this cutting-edge field where Gemini already has an advantage.
Are they all distilled from GPT-4.5?
Many people have noticed that OpenAI's official announcements for these new models follow exactly the same pattern: broad improvements, with little explanation of how they were achieved.
So it is almost certain that these new models were distilled from GPT-4.5 to obtain better personality and reasoning ability, and that, for coding and math, they draw on models like o3.
The new models have clearly made significant progress in coding; OpenAI's earliest models were once nearly useless in this regard.
However, they still lag well behind state-of-the-art models such as Gemini 2.5 (a reasoning model) or Claude 3.7 (with optional reasoning) on coding and math evaluations.
Today we are in the early stages of a shift toward models that incorporate reasoning, and the notion of a single best model has become more complicated.
These reasoning models achieve large performance gains by consuming far more tokens than before. Performance is king, but at equal performance, the lower-cost model wins.
But the first-mover advantage is still difficult to shake
But in the final analysis, for most ordinary users, the above technical details actually don't mean much.
For them, the annoying model selector, jokingly called the "model engagement" slider, is far more tangible.
For a long time, many people balked at chatbot subscription fees even more than at API prices.
But it’s becoming increasingly clear that truly personalized, user-friendly experiences often only exist within these integrated applications.
Of course, developers can build competing products on top of the APIs and accumulate their own user interaction data, but given OpenAI's enormous first-mover advantage at the product level, defeating it will not be easy.
All of this once again confirms our understanding: productization is the top priority in the current development of AI.
The memory function, as well as a clearer separation between the ChatGPT product line and API services, will help OpenAI pave the way for future development.
But OpenAI still has a long way to go to fully realize this vision.
References:
https://x.com/bindureddy/status/1911865521504747563
https://x.com/paulgauthier/status/1911927464844304591
https://x.com/flavioAd/status/1911848067470598608
This article comes from the WeChat public account "Xinzhiyuan", author: Xinzhiyuan, published by 36Kr with authorization.





