Yesterday, OpenAI announced adjustments to its API usage rules.
Going forward, access to OpenAI's latest large models via the API will require organization verification with a government-issued ID from one of the countries/regions OpenAI supports, and one ID can verify only one organization every 90 days. Failing verification will restrict access to those models.
The controversy over the new rules had not yet died down when, early this morning, OpenAI launched three GPT-4.1 series models. They can only be used through the API and will not appear directly in ChatGPT.
GPT-4.1: the flagship model, best in class at coding, instruction following, and long-context understanding, suitable for complex tasks.
GPT-4.1 mini: A small and efficient model that outperforms GPT-4o in multiple benchmarks while reducing latency by nearly half and cost by 83%, making it suitable for scenarios that require efficient performance.
GPT-4.1 nano: OpenAI's first ultra-small model, the fastest and cheapest, with a 1 million token context window, suitable for low-latency tasks such as classification and auto-completion.
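For developers, the three tiers are selected simply by model name. Below is a minimal sketch using the official openai Python SDK, assuming an OPENAI_API_KEY environment variable and that the API identifiers match the names in the model card (gpt-4.1, gpt-4.1-mini, gpt-4.1-nano); the prompt is purely illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed API identifiers for the three tiers announced above.
MODELS = ["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"]

for model in MODELS:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "In one sentence, what is binary search?"}],
        max_tokens=60,
    )
    print(f"{model}: {resp.choices[0].message.content.strip()}")
```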
Although everyone was braced for OpenAI's confusing naming logic, GPT-4.1 still drew plenty of criticism online. Even OpenAI's Chief Product Officer Kevin Weil joked: "Our naming skills definitely didn't improve this week."
GPT-4.1 model card 🔗 https://platform.openai.com/docs/models/gpt-4.1
Programming + long context: GPT-4.1 > GPT-4.5?
In the end, the technology is what counts. The name may have drawn criticism, but GPT-4.1's strengths are plain to see.
OpenAI claims that the GPT-4.1 series of models performs well in multiple benchmarks and is one of the most powerful programming models currently available.
Completes complex coding tasks independently
Improved front-end development capabilities
Fewer unnecessary code modifications
Better adherence to diff formats
More consistent and stable tool calls
OpenAI even likened GPT-4.1 to a "quasar", implying that, like a quasar, it radiates powerful energy and influence across the AI field.
In the SWE-bench Verified benchmark test, a standard for evaluating real software engineering capabilities, GPT-4.1 scored 54.6%, an increase of 21.4 percentage points over GPT-4o and 26.6 percentage points over GPT-4.5.
GPT‑4.1 has been specially trained on diff formats, which lets it output only the modified fragments more reliably, saving latency and cost. In addition, OpenAI has raised GPT‑4.1's output limit to 32,768 tokens to accommodate full-file rewrites when they are needed.
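A minimal sketch of what diff-style editing looks like in practice with the chat completions endpoint. The instruction wording and the file contents are illustrative assumptions, not OpenAI's documented diff prompt; the point is that the model is asked to return only a unified diff rather than the full rewritten file, and max_tokens can be raised toward the 32,768-token output ceiling when a full rewrite is wanted.

```python
from openai import OpenAI

client = OpenAI()

original = """def mean(xs):
    return sum(xs) / len(xs)   # crashes on an empty list
"""

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system",
         "content": "You are a code editor. Reply ONLY with a unified diff "
                    "(---/+++/@@ format) of the minimal change. No prose."},
        {"role": "user",
         "content": f"Make mean() return 0.0 for an empty list.\n\n{original}"},
    ],
    max_tokens=300,  # could be raised toward the 32,768-token output limit for full-file rewrites
)
print(resp.choices[0].message.content)
```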
In the front-end development task, OpenAI blind test results showed that 80% of evaluators preferred web pages generated by GPT-4.1.
OpenAI also invited Varun Mohan, founder and CEO of Windsurf, to share his experience during this morning's livestream. Varun revealed that Windsurf's internal benchmarks showed GPT-4.1 performing 60% better than GPT-4.
Given the outstanding performance of GPT-4.1, Windsurf decided to provide all users with a one-week free trial of GPT-4.1, and then continue to provide the model at a significant discount. In addition, Cursor users can now also use GPT-4.1 for free.
In real conversations, especially in multi-round interaction tasks, it is crucial for the model to remember and correctly reference information in the context. In Scale’s MultiChallenge benchmark, GPT‑4.1 improved by 10.5 percentage points over GPT‑4o.
IFEval is a test set based on clear instructions (such as content length and format restrictions) to evaluate whether the model can output content according to specific rules. GPT-4.1 still outperforms GPT-4o.
In the uncaptioned long video category of the multimodal long context benchmark Video-MME, GPT-4.1 set a new record with a score of 72.0%, leading GPT-4o by 6.7 percentage points.
Model miniaturization is an inevitable trend in the commercialization of AI.
The "small but powerful" GPT‑4.1 mini even surpassed GPT-4o in many tests, while maintaining similar or higher intelligent performance as GPT‑4o, with latency almost halved and cost reduced by 83%.
OpenAI researcher Aidan McLaughlin wrote that with GPT-4.1 mini/nano, it is now possible to get GPT-4-level quality at a fraction of the price (25 times cheaper), which is extremely cost-effective.
GPT‑4.1 nano is OpenAI’s fastest and lowest-cost model, suitable for tasks that require low latency.
It also supports a 1-million-token context window and scored 80.1% on MMLU, 50.3% on GPQA, and 9.8% on the Aider polyglot coding benchmark, all higher than GPT-4o mini, making it well suited to lightweight tasks such as classification and auto-completion.
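As a sketch of the kind of lightweight, low-latency task nano is positioned for, here is a hedged classification example; the label set, prompt, and ticket text are invented for illustration, and gpt-4.1-nano is assumed to be the API name for the nano tier.

```python
from openai import OpenAI

client = OpenAI()

def classify_ticket(text: str) -> str:
    """Route a support ticket into one of three invented categories."""
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",          # assumed API identifier for the nano tier
        messages=[
            {"role": "system",
             "content": "Classify the ticket as exactly one of: billing, bug, other."},
            {"role": "user", "content": text},
        ],
        max_tokens=3,
        temperature=0,                 # deterministic labels for routing
    )
    return resp.choices[0].message.content.strip().lower()

print(classify_ticket("I was charged twice this month."))  # expected: billing
```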
Again, GPT-4.1 is available only through the API and will not appear directly in ChatGPT. The good news is that ChatGPT's GPT-4o has already quietly absorbed some of GPT-4.1's improvements, and more will follow.
GPT‑4.5 Preview will be retired on July 14, 2025, and GPT-4.1 will gradually take over as the core model in the developer API.
According to official explanations, GPT-4.1 is superior in performance, cost, and speed, while the creative expression, text quality, sense of humor, and delicate style that users love in GPT-4.5 will continue to be retained in future models.
GPT-4.1 has also been upgraded in instruction following: format requirements, content constraints, complex multi-step tasks, and even consistency across multi-turn conversations are all handled better.
Long text is a highlight of the GPT-4.1 series. It supports ultra-long context processing capabilities of up to 1 million tokens, which is approximately equivalent to 8 complete sets of React source code, or hundreds of pages of documents, far exceeding the 128,000 tokens of GPT-4o. It is suitable for tasks such as large code base analysis and multi-document review.
In the "needle in a haystack" test, GPT-4.1 accurately retrieved ultra-long context information and performed better than GPT-4o; in the search test, it had a stronger ability to distinguish similar requests and cross-location reasoning, with an accuracy rate of 62%, far exceeding GPT-4o's 42%.
Despite supporting ultra-long contexts, GPT-4.1's response speed is not slow. A 128K token request takes about 15 seconds, and the nano model takes less than 5 seconds. OpenAI has also optimized the prompt cache mechanism, increasing the discount from 50% to 75%, making it cheaper to use.
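The practical way to benefit from prompt caching is to keep the large, unchanging context at the front of every request so that consecutive calls share an identical prefix, with only the question varying at the end. A minimal sketch, assuming the cache is applied automatically to repeated prompt prefixes as described in OpenAI's caching documentation; the file name is hypothetical.

```python
from openai import OpenAI

client = OpenAI()

# Large, unchanging context goes first so consecutive requests share a cacheable prefix.
with open("big_codebase_dump.txt") as f:   # hypothetical file
    static_context = f.read()

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "Answer questions about the provided codebase."},
            {"role": "user", "content": static_context},   # identical across calls
            {"role": "user", "content": question},          # only this part changes
        ],
    )
    return resp.choices[0].message.content

print(ask("Where is the HTTP retry logic implemented?"))
print(ask("Which modules import the logging helper?"))  # second call can reuse the cached prefix
```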
In the live demonstration session early this morning, OpenAI fully demonstrated GPT-4.1's powerful long-context processing capabilities and strict instruction-following capabilities through two cases, which may also be quite practical usage scenarios for developers.
In the first case, the demonstrator had GPT-4.1 create a website that could upload and analyze large text files, and then used this newly created website to upload a NASA server request log file from August 1995.
The demonstrator "secretly" inserted a single non-standard HTTP request record into the log file and asked GPT-4.1 to analyze the entire file and find the anomaly. The model successfully located that line in a file of roughly 450,000 tokens.
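A rough reconstruction of that test as a single API call; the log file path and the exact prompt are assumptions (the demo itself ran through the generated web page), and the point is simply that the 1M-token context window makes one-shot analysis of a file this size feasible.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical path to the August 1995 NASA access log used in the demo (~450K tokens).
log_text = open("nasa_aug1995_access.log").read()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system",
         "content": "You are a log analyst. Scan the entire log and report any line "
                    "that is not a well-formed HTTP request record, quoting it verbatim."},
        {"role": "user", "content": log_text},
    ],
)
print(resp.choices[0].message.content)
```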
In the second case, the presenter set up a system message telling the model to act as a log-analysis assistant, stipulating that input data must appear inside a <log_data> tag and user questions inside a <query> tag.
When the presenter asked a question without a <query> tag, the model refused to answer; when the tags were used correctly, it accurately answered questions about the log file. GPT-4o, by contrast, ignored these rules and answered the question directly.
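A minimal sketch of that second demo's setup. The wording of the system message and the log snippet are reconstructed for illustration, not OpenAI's verbatim prompt; the contrast it shows is the one described above, with the untagged question expected to be refused.

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You are a log analysis assistant. Input data is only valid inside <log_data>...</log_data> "
    "tags and user questions only inside <query>...</query> tags. "
    "If a message does not follow this format, refuse and explain the required format."
)

log_snippet = '<log_data>127.0.0.1 - - [01/Aug/1995] "GET /index.html" 200</log_data>'

# Correctly tagged question: the model should answer.
ok = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": log_snippet + "\n<query>How many requests returned 200?</query>"},
    ],
)

# Untagged question: per the demo, GPT-4.1 should refuse instead of answering.
refused = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": log_snippet + "\nHow many requests returned 200?"},
    ],
)
print(ok.choices[0].message.content)
print(refused.choices[0].message.content)
```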
In short, the core advantages of GPT-4.1 include ultra-long context support, powerful retrieval reasoning, excellent multi-document processing, low latency and high performance, and high cost-effectiveness. It is suitable for scenarios such as law, finance, and programming, and is an ideal choice for tasks such as code search, smart contract analysis, and customer service.
OpenAI's real trump card is a reasoning model that can think like Feynman
OpenAI has not officially launched o3 yet, but some news has already come out.
According to The Information, citing three people familiar with the test, the new AI model that OpenAI plans to launch this week will be able to integrate concepts across disciplines and propose new experimental ideas ranging from nuclear fusion to pathogen detection.
OpenAI launched its first reasoning-centered model last September. This type of model performs especially well on verifiable problems such as mathematical theorems, and the longer it is allowed to think, the better the results.
As the scaling laws hit a bottleneck, OpenAI has shifted its R&D focus to reasoning. It believes it will eventually be able to offer a subscription of up to US$20,000 (about RMB 140,000) per month to support doctoral-level research.
Like Nikola Tesla or the physicist Richard Feynman, this kind of reasoning model can integrate knowledge from fields such as biology, physics, and engineering to arrive at original insights. In reality, such interdisciplinary results require time-consuming and laborious teamwork, but OpenAI's new model can complete similar tasks on its own.
ChatGPT's "Deep Research" tool supports browsing web pages and organizing reports, which scientists can use to summarize literature and propose new experimental methods, demonstrating its potential in this area. According to a tester, scientists can use this AI to read public literature in multiple scientific fields, summarize existing experiments, and propose new methods that have not yet been tried.
Existing reasoning models have also greatly improved scientific research efficiency.
The Information cited the example of Sarah Owens, a molecular biologist at Argonne National Laboratory in Illinois, who used the o3-mini-high model to quickly design experiments that applied ecology-related techniques to detect sewage pathogens, saving days.
Chemist Massimiliano Delferro used AI to design a plastic decomposition experiment, obtaining a complete solution including temperature and pressure ranges, with efficiency far exceeding expectations. In the "AI Improvisational Experiment" in February this year, testers used o1-pro and o3-mini-high to assess the potential environmental impact of building power plants or mines in specific geographical areas, and the results were also far beyond expectations.
According to reports, at an experimental event held at Oak Ridge National Laboratory in Tennessee, OpenAI President Greg Brockman told thousands of scientists from nine federal institutes:
“We are moving toward a trend where AI will spend a lot of time ‘thinking hard’ about important scientific problems, and this will make you ten or even a hundred times more efficient in the next few years.”
Currently, OpenAI has committed to giving multiple national laboratories private access to reasoning models hosted on supercomputers at Los Alamos National Laboratory.
However, reality falls short of the ideal. In many cases, there is still a gap between the suggestions AI produces and scientists' ability to verify them. For example, the model can recommend laser intensities to release a specific amount of energy, but these still need to be verified in a simulator, and suggestions involving chemistry or biology require laboratory testing.
OpenAI also released an AI agent called Operator, but it was criticized for frequent errors.
According to people familiar with the matter, OpenAI plans to improve performance through "reinforcement learning based on human feedback" (RLHF), screening failure cases based on actual user usage data and training Operator with successful examples.
David Luan, head of Amazon AGI SF Lab and former OpenAI engineering director, provided an interesting perspective. He said that before the advent of inference models, if a traditional AI model "discovered a new mathematical theorem", it would be "punished" because it was not in the training data.
In addition, OpenAI is also developing more advanced programming agents. OpenAI CFO Sarah Friar revealed at the Goldman Sachs Summit in London in March this year:
“The next thing we’re going to launch is a product we call A-SWE. By the way, our marketing skills are not the best (laughs). A-SWE stands for ‘Agentic Software Engineer’.”
She said that A-SWE is not just an assistant to your team's software engineers, as Copilot is today, but a software engineer with genuine "autonomous capabilities" that can build an application for you on its own.
You simply hand it a PR (pull request) as you would to any other engineer, and it completes the entire development process independently.
“Not only can it complete development, but it can also do all the work that engineers hate the most: it will do its own QA (quality assurance), test and fix bugs, and write documentation—things that are usually difficult for engineers to do on their own initiative. So the combat effectiveness of your engineering team will be greatly amplified.”
On the one hand, models like GPT-4.1 can handle more complex tasks than ever before through ultra-long contexts and precise instruction-following capabilities; on the other hand, reasoning models and autonomous agents are breaking the limitations of traditional AI and moving towards true autonomous thinking capabilities.
This article comes from the WeChat public account "APPSO", author: APPSO, and is published by 36Kr with authorization.




