Just now, OpenAI launched three new models specifically for developers: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano!
All three have ultra-large context windows of up to 1 million tokens and outperform GPT-4o and GPT-4o mini across the board on core capabilities such as coding and instruction following. The knowledge cutoff has also been updated to June 2024.
It is worth noting that the GPT-4.1 series is available only via the API and is open to all developers.
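For reference, a minimal call through OpenAI's official Python SDK might look like the sketch below; the model identifiers follow the naming in the announcement, and the prompt is purely illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Minimal sketch: calling the new flagship model via the Chat Completions API.
response = client.chat.completions.create(
    model="gpt-4.1",  # the smaller variants are exposed as gpt-4.1-mini and gpt-4.1-nano
    messages=[{"role": "user", "content": "In one sentence, what is a context window?"}],
)
print(response.choices[0].message.content)
```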
GPT-4.1 nano is OpenAI’s first nano model and is also the fastest and cheapest model they currently have available.
Don’t be fooled by its small size; its performance is far from weak: 80.1% on MMLU, 50.3% on GPQA, and 9.8% on the Aider polyglot coding benchmark, a clear win over GPT-4o mini!
GPT-4.1 mini surpasses GPT-4o on multiple benchmarks while being roughly twice as fast and 83% cheaper, maximizing efficiency!
GPT-4.1, the flagship model, is even more powerful:
Strongest coding: GPT-4.1 scores 54.6% on SWE-bench Verified, an absolute improvement of 21.4 percentage points over GPT-4o and 26.6 points over GPT-4.5.
Instruction following: On Scale's MultiChallenge, GPT-4.1 scores 38.3%, 10.5 percentage points higher than GPT-4o.
Long context: On Video-MME, GPT-4.1 sets a new SOTA, scoring 72.0% in the long-video, no-subtitles category, 6.7 percentage points higher than GPT-4o.
With this, the "Quasar" that riddle-loving Sam Altman had been teasing is finally confirmed: it is GPT-4.1!
With the launch of the more powerful and lower-cost GPT-4.1, the controversial GPT‑4.5 Preview will be removed from the API in three months (July 14).
In response, OpenAI stated that GPT‑4.5 was originally launched as a research preview version with the aim of exploring and experimenting with a large-scale, computationally intensive LLM.
Although the model is about to be discontinued, OpenAI will continue to incorporate features that developers love, such as creativity, writing quality, and sense of humor, into future API models.
Live Demos
First of all, of course, is programming ability.
In this demo, the researchers asked GPT-4.1 to build an online flashcard web application, with a number of very specific requirements, such as a 3D animation when a flashcard is clicked.
This is how GPT-4o performs this task.
In comparison, GPT-4.1 performs very smoothly, both in color and 3D animation.
Notice that from start to finish, you only need one prompt to get a complete application!
The next demo uses the OpenAI Playground. The researcher asked GPT-4.1 to generate an application as a single Python file, with user queries simulated on the right. The resulting site can accept large text files and answer questions about them.
As you can see, the model generated hundreds of lines of code. After the researchers actually ran the code, they found that the results were surprisingly good.
A single prompt was all it took to create this website.
Next comes the needle-in-a-haystack demo.
The researchers uploaded a file: NASA server request and response logs from August 1995.
Each log line begins with the name of the client that made the request to the NASA server. It is a long file with many log lines; the content pasted on the left adds up to roughly 450,000 tokens.
A file this large simply could not be used with OpenAI's previous models.
Here, the researcher sneakily added one line that is not actually an HTTP request/response. This small "needle" in the haystack is hard to spot.
In the end, GPT-4.1 found it!
The researchers confirmed that this line was indeed in the log file they uploaded.
OpenAI specifically emphasizes that, in practice, how API developers prompt the model matters a great deal.
In this task, GPT-4.1 is set up as a log-analyst assistant. The researchers tell it what the input data is and how user queries should be structured.
A few rules follow: the model should only answer questions relevant to the content of the log data, questions must always be wrapped in query tags, and if either condition is not met it should reply with an error message.
Next, it’s time to showcase GPT-4.1.
The researcher asked: how many requests did fnal.gov make? The model refused, because the question was not wrapped in query tags.
When the same question is asked inside query tags, the model finds both references in the log file.
In this way, developers can explicitly tell the model what not to do. Following negative instructions is a small but critical detail in real-world development.
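The exact prompt used in the demo was not published, but a minimal sketch of this pattern might look like the following; the `<query>` tag name, rule wording, and file name are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative system prompt: restrict the assistant to the uploaded log data
# and require questions to be wrapped in <query> tags (tag name is an assumption).
system_prompt = """You are a log analyst assistant.
The input data consists of NASA HTTP server log lines.
Rules:
- Only answer questions about the content of the log data.
- Questions must be wrapped in <query></query> tags.
- If either rule is violated, reply with a short error message instead.
"""

log_data = open("nasa_http_logs.txt").read()  # hypothetical file name

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": log_data},
        {"role": "user", "content": "<query>How many requests did fnal.gov make?</query>"},
    ],
)
print(response.choices[0].message.content)
```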
Pricing
In terms of price, although GPT‑4.1 is 26% cheaper than GPT‑4o, the input and output prices are still as high as US$2 and US$8 per million tokens.
GPT-4.1 nano is OpenAI's cheapest and fastest model to date, with input at $0.10 and output at $0.40 per million tokens.
For queries that reuse the same context, the prompt caching discount on these new models has been increased from 50% to 75%.
Finally, long-context requests are billed at the standard per-token rates, with no additional surcharge.
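As a rough illustration of what these rates imply, here is a back-of-the-envelope cost calculation using the prices quoted above; the assumption that the 75% discount applies only to the cached portion of the input is ours.

```python
# Back-of-the-envelope cost estimate for GPT-4.1 at the quoted rates:
# $2 per 1M input tokens, $8 per 1M output tokens, and an assumed 75%
# discount on cached input tokens.
INPUT_PER_M = 2.00
OUTPUT_PER_M = 8.00
CACHED_INPUT_PER_M = INPUT_PER_M * 0.25  # 75% prompt caching discount

def estimate_cost(fresh_input_tokens: int, cached_input_tokens: int, output_tokens: int) -> float:
    return (
        fresh_input_tokens / 1e6 * INPUT_PER_M
        + cached_input_tokens / 1e6 * CACHED_INPUT_PER_M
        + output_tokens / 1e6 * OUTPUT_PER_M
    )

# Example: a 450,000-token log file that is cached after the first request,
# plus a 200-token question and a 500-token answer each time.
print(f"first call:  ${estimate_cost(450_200, 0, 500):.4f}")
print(f"later calls: ${estimate_cost(200, 450_000, 500):.4f}")
```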
Programming: OpenAI's most powerful coding model yet
Compared with models such as GPT-4o, o1, and o3-mini, GPT-4.1 has made great improvements in programming.
It is clearly much better than GPT-4o across a range of coding tasks, such as solving coding problems agentically, front-end development, making fewer unnecessary code edits, following diff formats strictly, and using tools consistently.
In SWE-bench Verified, a test that reflects real software engineering capabilities, GPT-4.1 completed 54.6% of the tasks, while GPT-4o (2024-11-20) only completed 33.2%.
This shows that GPT-4.1 has made great improvements in browsing code bases, completing tasks, and generating code that can both run and pass tests.
For SWE-bench Verified, the model receives a code repository and a problem description and needs to generate a patch to solve the problem. Its performance is highly dependent on the prompt words and tools used.
For API developers who want to edit large files, GPT-4.1 is much more reliable when handling code diffs in various formats.
Aider's polyglot diff benchmark measures not only a model's ability to code across many programming languages, but also its ability to produce code changes in whole-file and diff formats.
Here, GPT-4.1 more than doubles GPT-4o's score, and even beats GPT-4.5 by 8 percentage points.
This allows developers to avoid having to rewrite the entire file and instead have the model output the changed lines, significantly saving costs and reducing latency.
For developers who prefer to rewrite entire files, GPT-4.1's output token limit has been raised to 32,768 tokens (up from 16,384 for GPT-4o). In addition, the Predicted Outputs feature can be used to reduce the latency of full-file rewrites.
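A minimal sketch of pairing a full-file rewrite with Predicted Outputs is shown below; the file name and the edit request are illustrative, and the Chat Completions `prediction` parameter is used on the assumption that it behaves the same for GPT-4.1 as documented for earlier models.

```python
from openai import OpenAI

client = OpenAI()

existing_code = open("app.py").read()  # hypothetical file being rewritten

# Predicted Outputs: the current file is passed as the prediction, since most of
# the rewritten file is expected to match it, which can reduce latency.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": "Rename the function load_data to load_logs and "
                       "return the full updated file:\n\n" + existing_code,
        }
    ],
    prediction={"type": "content", "content": existing_code},
)
print(response.choices[0].message.content)
```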
In Aider's polyglot benchmark, the model solves coding exercises from Exercism by editing source files, with one retry allowed. The "whole" format requires the model to rewrite the entire file, which can be slow and expensive. The "diff" format requires the model to write a series of search/replace blocks.
In addition, GPT-4.1 also shows significant improvements over GPT-4o in front-end coding, creating web applications that are more functional and more visually appealing.
In direct comparison evaluations, human judges preferred websites generated by GPT‑4.1 over GPT‑4o 80% of the time.
Instruction following: now in the top tier
For instruction following, OpenAI developed an internal evaluation to track model performance across multiple dimensions and the following key instruction-following categories (a prompt sketch illustrating a few of them follows the list):
Format following: generate a response in a requested custom format (such as XML, YAML, Markdown, etc.).
Negative instructions: avoid performing a specific action. (Example: "Don't ask the user to contact support")
Ordered instructions: perform a series of actions in a given order. (Example: "First ask the user for their name, then ask for their email address")
Content requirements: ensure that the output contains specific information. (Example: "When writing a nutrition plan, always include the number of grams of protein")
Ranking: arrange output in a specific way. (Example: "Sort results by population")
Overconfidence: answer "I don't know" or similar when the requested information is not available or the request is out of scope. (Example: "If you don't know the answer, provide the support team's contact email")
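In practice these requirements translate directly into prompt wording. Below is a minimal sketch combining a format rule, a negative instruction, and an overconfidence rule; the rule text and the support address are illustrative placeholders, not OpenAI's internal evaluation prompts.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative prompt touching three of the categories above: format following
# (YAML), a negative instruction, and an overconfidence rule.
instructions = """Answer the user's question as YAML with the keys `answer` and `confidence`.
Do not ask the user to contact support.
If you do not know the answer, set answer to "I don't know" and include the
support email support@example.com in the answer."""  # placeholder address

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": instructions},
        {"role": "user", "content": "How do I reset my account password?"},
    ],
)
print(response.choices[0].message.content)
```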
These categories were determined based on developer feedback and reflect the dimensions of command following that they find most relevant and important. Each category divides the prompts into three levels of difficulty: easy, medium, and hard.
On hard prompts, GPT-4o and GPT-4o mini score below 30% accuracy, while even the smallest model in the new series, GPT-4.1 nano, reaches 32%.
At the same time, GPT-4.1 reached 49%, almost catching up with o1 and o3-mini, but still some distance away from GPT-4.5.
The internal instruction following assessment is based on real developer use cases and feedback, covers tasks of varying complexity, and incorporates instruction requirements regarding format, level of detail, length, etc.
For many developers, multi-turn instruction following is critical: the model needs to stay coherent as the conversation progresses and remember what the user has previously told it.
GPT-4.1 is better able to extract information from conversation history messages, enabling more natural interactions.
In the MultiChallenge benchmark from Scale AI, GPT-4.1 is not as strong as o1 or GPT-4.5, but it has caught up with o3-mini and scores 10.5 percentage points higher than GPT-4o.
In the MultiChallenge benchmark, the model is challenged to correctly use four types of information from previous messages (conversation context) in multiple rounds of dialogue.
Additionally, GPT‑4.1 scored 87.4% on IFEval, while GPT‑4o scored 81.0%. IFEval uses prompts that contain verifiable instructions (e.g., specifying content length or avoiding specific terminology/formatting).
In IFEval, the model must generate answers that conform to various instructions.
Stronger instruction following not only improves the reliability of existing applications, but also enables new applications that were previously hard to build because models were not reliable enough.
Early testers have reported that GPT-4.1 may be more inclined to follow instructions literally, so OpenAI recommends being clear and specific when designing prompts.
Long context: full marks on finding a needle in a haystack
Long-context understanding is a critical capability for applications in law, coding, customer support, and many other fields.
GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano can not only process contexts of up to 1 million tokens, but also reliably process their content and ignore interference information.
What does 1 million tokens mean? For comparison, that is more than 8 full copies of the entire React codebase!
Compared with GPT‑4o’s 128,000 tokens, this is a huge improvement.
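To get a feel for what a budget like this means for your own files, you can count tokens locally with OpenAI's tiktoken library; using the o200k_base encoding for GPT-4.1 is an assumption here, since that is the encoding documented for the GPT-4o family.

```python
import tiktoken  # pip install tiktoken

# Assumption: GPT-4.1 uses the same o200k_base encoding as the GPT-4o family.
enc = tiktoken.get_encoding("o200k_base")

with open("nasa_http_logs.txt") as f:  # hypothetical file from the earlier demo
    text = f.read()

n_tokens = len(enc.encode(text))
print(f"{n_tokens:,} tokens, {n_tokens / 1_000_000:.1%} of a 1M-token window")
```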
Below, we demonstrate GPT‑4.1’s ability to retrieve small hidden pieces of information (i.e., “needles”) at different locations in the context window.
GPT‑4.1 consistently and accurately retrieves the needle at all context lengths and positions up to 1 million tokens, which means it can effectively extract the relevant details needed for the task at hand, no matter where in the input those details are located.
However, real-world tasks are rarely as straightforward as retrieving a single, obvious “needle”.
In the “Needle in a Haystack” evaluation, GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano all successfully retrieved the “needle” at all locations in a context of up to 1 million tokens.
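This style of evaluation is easy to reproduce in miniature: hide a known sentence at a chosen depth inside a long filler document and check whether the model can return it. The filler text, needle wording, and depths below are all made up for illustration.

```python
from openai import OpenAI

client = OpenAI()

NEEDLE = "The secret launch code is 7-alpha-tango."  # made-up needle
FILLER = "GET /images/launch.gif HTTP/1.0 200 1713\n" * 20_000  # made-up log filler

def run_needle_test(depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) and ask for it."""
    pos = int(len(FILLER) * depth)
    haystack = FILLER[:pos] + NEEDLE + "\n" + FILLER[pos:]
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "user", "content": haystack},
            {"role": "user", "content": "What is the secret launch code?"},
        ],
    )
    return response.choices[0].message.content

for depth in (0.0, 0.5, 1.0):
    print(depth, "->", run_needle_test(depth))
```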
OpenAI-MRCR
In practical applications, users usually require the model to be able to retrieve and understand multiple pieces of information, and understand the relationships between these pieces of information.
To this end, OpenAI has open-sourced OpenAI-MRCR (Multi-Round Coreference), a new benchmark that tests a model's ability to find and distinguish multiple hidden "needles" in a long context.
The evaluation consists of multiple turns of synthetic conversations between a user and an assistant, in which the user asks the model to compose on a topic, such as “write a poem about tapirs” or “write a blog post about rocks.”
Next, 2, 4, or 8 requests with similar content but different instances are randomly inserted into the context.
The model must accurately retrieve the response that corresponds to a specific instance specified by the user (e.g., “Please give me the third poem about tapirs”).
The challenge of this task is that these similar requests are very close to the rest of the context — the model can easily be misled by small differences, like mistaking a short story about tapirs for a poem, or mistaking a poem about frogs for a poem about tapirs.
When the context reaches the GPT‑4o limit of 128,000 tokens, GPT‑4.1 performs significantly better; even when the context length is extended to 1 million tokens, it still maintains strong performance.
In OpenAI-MRCR, the model must answer a question that involves distinguishing between 2, 4, or 8 user prompts among distracting content.
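The real benchmark is open-sourced by OpenAI; the toy version below only illustrates the shape of the task, with made-up topics, canned placeholder responses, and a final question that refers back to one specific instance.

```python
from openai import OpenAI

client = OpenAI()

# Toy MRCR-style conversation: near-identical requests mixed with distractors,
# followed by a question that points at one specific earlier instance.
topics = [
    "a poem about tapirs",
    "a blog post about rocks",
    "a poem about tapirs",
    "a short story about frogs",
    "a poem about tapirs",
]

conversation = []
for i, topic in enumerate(topics, start=1):
    conversation.append({"role": "user", "content": f"Write {topic}."})
    # The real benchmark uses pre-generated responses; numbered placeholders
    # keep this sketch short while still making the answer checkable.
    conversation.append({"role": "assistant", "content": f"[response #{i}: {topic}]"})

conversation.append(
    {"role": "user", "content": "Please repeat the third poem about tapirs exactly as written."}
)

response = client.chat.completions.create(model="gpt-4.1", messages=conversation)
print(response.choices[0].message.content)  # expected: the text of response #5
```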
Graphwalks
Graphwalks is a dataset for evaluating multi-hop long-context reasoning.
Many long-context use cases for developers require making multiple logical jumps in context, such as switching between multiple files when writing code or cross-referencing documents when answering complex legal questions.
Models (or even humans) could theoretically solve the OpenAI-MRCR problem with a single pass or read-through of the context, but Graphwalks is designed to require reasoning across multiple locations in the context and cannot be solved by sequential processing.
Graphwalks fills the context window with a directed graph of hexadecimal hash values, and then asks the model to perform a breadth-first search (BFS) starting from a random node in the graph. Next, the model is asked to return all nodes at a certain depth.
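The ground truth for such a question is easy to compute outside the model. The sketch below is a minimal reference implementation of "all nodes at depth d from a start node"; the edge-list encoding and the hex-style node names are illustrative, not the exact Graphwalks format.

```python
from collections import deque

def nodes_at_depth(edges, start, depth):
    """Breadth-first search over a directed graph given as (src, dst) pairs,
    returning the set of nodes whose shortest distance from `start` is `depth`."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)

    seen = {start}
    frontier = deque([(start, 0)])
    result = set()
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            result.add(node)
            continue  # do not expand beyond the target depth
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return result

# Tiny example with hex-hash-style node names (made up for illustration).
edges = [("a1f3", "b2c4"), ("a1f3", "c9d0"), ("b2c4", "d7e8"), ("c9d0", "e5f6")]
print(nodes_at_depth(edges, "a1f3", 2))  # {'d7e8', 'e5f6'}
```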
GPT‑4.1 achieves 61.7% accuracy on this benchmark, matching the performance of o1 and handily beating GPT‑4o.
In Graphwalks, the model is asked to perform a breadth-first search from a random node in a large graph.
Vision: image understanding surpasses GPT-4o
The GPT-4.1 series is extremely capable at image understanding, and GPT-4.1 mini in particular has made a significant leap, often beating GPT-4o on image benchmarks.
In the MMMU benchmark, the model needs to answer questions containing charts, diagrams, maps, etc.
In the MathVista benchmark, models are required to solve visual math tasks.
In the CharXiv-Reasoning benchmark, models are asked to answer questions about graphs in scientific papers.
Long-context processing capabilities are also critical for multimodal use cases such as processing long videos.
In the Video-MME (Long Video, No Subtitles) benchmark, the model needs to answer multiple-choice questions based on videos that are 30-60 minutes long and have no subtitles.
Here, GPT‑4.1 again achieves SOTA — scoring 72.0%, higher than GPT‑4o’s 65.3%.
In Video-MME, the model answers multiple-choice questions based on videos that are 30-60 minutes long and have no subtitles.
Full results
The results on the academic, programming, instruction following, long context, vision, and function call evaluations are listed in full below.
Academic knowledge
Programming
Instruction following
Long context
Vision
Function calling
The Chinese team lead
Jiahui Yu
Jiahui Yu currently leads the Perception team; his research areas are deep learning and high-performance computing.
He was one of the core contributors to the GPT-4o release.
Previously, he co-led the Gemini multimodal project at Google DeepMind.
He has had internship experiences at Microsoft Research Asia, Megvii Technology, Adobe Research, Snap Research, Jump Trading, Baidu Research, Nvidia Research, and Google Brain.
He received his bachelor's degree in computer science from the University of Science and Technology of China and his doctorate from the University of Illinois at Urbana-Champaign.
References:
https://openai.com/index/gpt-4-1/
https://x.com/OpenAI
This article comes from the WeChat official account "Xinzhiyuan"; author: Xinzhiyuan; editor: HNZ. It is republished by 36Kr with authorization.



