Zhidongxi reported on April 15 that OpenAI has just released three models in the GPT-4.1 series, describing them as the smallest, fastest, and most affordable models in its history, with overall performance exceeding that of GPT-4o and GPT-4o mini.
The GPT-4.1 series comprises three models: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. Each offers a 1-million-token context window, a maximum output of 32,768 tokens, and a knowledge cutoff of June 2024. OpenAI's benchmarks show the series scoring above GPT-4o and GPT-4o mini in coding, instruction following, and long-text understanding.
The GPT-4.1 series models are available only via the API and are now open to all developers. OpenAI will also begin deprecating the GPT-4.5 Preview in the API, since the GPT-4.1 series offers similar performance on many key capabilities at lower cost and latency; GPT-4.5 Preview will be shut down on July 14 this year.
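As a rough illustration of what API access looks like, here is a minimal sketch using the official OpenAI Python SDK with the published model identifiers; the prompt and settings are placeholders, not taken from OpenAI's announcement:

```python
# pip install openai  (requires an OPENAI_API_KEY environment variable)
from openai import OpenAI

client = OpenAI()

# The three published model identifiers; swap in "gpt-4.1" or "gpt-4.1-nano" as needed.
response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Summarize what a breadth-first search does."},
    ],
)
print(response.choices[0].message.content)
```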
The specific performance optimizations focus on coding, instruction following, and long-text understanding:
Coding: GPT-4.1 scored 54.6% on SWE-bench Verified, an absolute improvement of 21.4 percentage points over GPT-4o and 26.6 points over GPT-4.5.
Instruction Following: On Scale's MultiChallenge benchmark, which measures instruction-following ability, GPT-4.1 scored 38.3%, 10.5 percentage points higher than GPT-4o.
Long Text Understanding: On the Video-MME benchmark for multimodal long-context understanding, GPT-4.1 scored 72.0% in the long, no-subtitles category, 6.7 percentage points higher than GPT-4o.
For latency-sensitive scenarios, OpenAI highlighted GPT-4.1 nano, calling it its fastest and cheapest model. GPT-4.1 nano scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding, all higher than GPT-4o mini.
OpenAI mentioned in their blog that the better-performing and more economical GPT-4.1 series models will open up new possibilities for developers to build intelligent systems and complex agent applications.
On pricing, GPT-4.1 is 26% cheaper than GPT-4o for median queries. For queries that reuse the same context, OpenAI has raised the prompt caching discount from 50% to 75%. Finally, long-context requests incur no surcharge beyond the standard per-token rates.
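To make the cache discount concrete, here is a small illustrative calculation; the per-token price below is a placeholder rather than a figure from the article, and only the 75% cached-input discount comes from OpenAI's announcement:

```python
# Hypothetical price (USD per 1M input tokens) -- placeholder for illustration only.
INPUT_PRICE_PER_M = 2.00
CACHE_DISCOUNT = 0.75  # cached prompt tokens are billed at a 75% discount

def input_cost(total_tokens: int, cached_tokens: int) -> float:
    """Cost of an input where `cached_tokens` of `total_tokens` hit the prompt cache."""
    uncached = total_tokens - cached_tokens
    cached_price = INPUT_PRICE_PER_M * (1 - CACHE_DISCOUNT)
    return (uncached * INPUT_PRICE_PER_M + cached_tokens * cached_price) / 1_000_000

# Reusing a 100k-token context where 90k tokens are served from the cache:
print(f"${input_cost(100_000, 90_000):.4f}")   # vs ${input_cost(100_000, 0):.4f} uncached
```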
▲GPT-4.1 Test Results in MultiChallenge
IFEval uses prompts containing verifiable instructions, such as specifying the length of the response or avoiding certain terms or formats. GPT-4.1 scored 87.4% on it, compared with 81.0% for GPT-4o.
▲GPT-4.1 Test Results in IFEval
Early testers noted that GPT-4.1 tends to interpret instructions more literally, so OpenAI recommends that developers make their prompts explicit and specific.
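As a hedged illustration of that advice (the wording is this article's own example, not OpenAI's), a vague request can be rewritten with the kind of explicit, verifiable constraints IFEval measures:

```python
# Illustrative only: rewriting a vague request as explicit, verifiable instructions.
vague_prompt = "Write something about our new API."

explicit_prompt = (
    "Write a product announcement for our new API.\n"
    "- Length: exactly 3 paragraphs.\n"
    "- Do not use the words 'revolutionary' or 'game-changing'.\n"
    "- End with a bulleted list of exactly 3 supported languages.\n"
    "- If any required detail is missing, ask a clarifying question instead of guessing."
)
```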
03.
Long Text Understanding: Suited to Large Codebases and Long Documents
Finding a Needle in a Haystack is No Problem
The GPT-4.1 series models can handle a 1-million-token context, up from GPT-4o's previous 128,000-token window. One million tokens is more than eight times the size of the entire React codebase, making this long context well suited to processing large codebases or extensive documents.
OpenAI has also trained the GPT-4.1 models to ignore distracting information in both long and short context lengths, which is a key capability for enterprise applications in fields like law, coding, and customer support.
In their blog, OpenAI demonstrated GPT-4.1's ability to retrieve a hidden piece of information (a "needle") at different positions within the context window, essentially a "needle in a haystack" capability.
▲OpenAI's Internal Assessment of GPT-4.1's "Needle in a Haystack" Capability
The results show that GPT-4.1 accurately retrieves the hidden information (the "needle") at every position and at all context lengths up to 1 million tokens: regardless of where the relevant details sit in the input, it can extract the ones that matter for the task at hand.
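The idea behind such a test can be reproduced in a few lines; the sketch below is a simplified, hypothetical harness (filler text, needle wording, and position are invented), not OpenAI's internal evaluation:

```python
from openai import OpenAI

client = OpenAI()

# Simplified "needle in a haystack" probe -- not OpenAI's internal eval.
needle = "The secret launch code is MAGENTA-42."           # hidden fact (made up)
filler = "The sky was a uniform shade of grey that day. "  # distractor sentence
depth = 0.5                                                 # relative position of the needle

haystack = [filler] * 2_000                    # tens of thousands of tokens of filler
haystack.insert(int(len(haystack) * depth), needle)
context = "".join(haystack)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user",
               "content": context + "\n\nWhat is the secret launch code?"}],
)
print(response.choices[0].message.content)     # expect: MAGENTA-42
```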
In practice, users often need the model to retrieve multiple pieces of information from the context and understand how they relate to one another. To evaluate this capability, OpenAI is open-sourcing a new evaluation, OpenAI-MRCR (Multi-Round Co-reference Resolution), which tests a model's ability to find and distinguish multiple needles hidden in the context.
The evaluation consists of multi-turn synthetic dialogues between a user and an assistant, in which the user repeatedly asks for a piece of writing on a topic, such as "write a blog post about rocks". Two, four, or eight identical requests are scattered through the conversation as distractors, and the model must retrieve the reply corresponding to one specific instance of the request, disambiguating it from the near-identical turns around it.
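A much-simplified sketch of that setup follows (this is not the open-sourced OpenAI-MRCR tool itself; the topic, duplicate count, and grading are invented for illustration):

```python
import random

# Much-simplified MRCR-style probe (not OpenAI's released eval): build a
# conversation containing N identical requests, then ask for one specific reply.
def build_mrcr_conversation(n_duplicates: int = 4):
    messages = []
    for i in range(n_duplicates):
        messages.append({"role": "user", "content": "Write a blog post about rocks."})
        messages.append({"role": "assistant",
                         "content": f"(blog post number {i + 1} about rocks goes here)"})
    target = random.randint(1, n_duplicates)
    messages.append({"role": "user",
                     "content": f"Reproduce, word for word, blog post number {target} "
                                "about rocks from earlier in this conversation."})
    return messages, target

messages, target = build_mrcr_conversation(8)
# In a real run the assistant turns would be genuine generations buried in long
# filler dialogue; the model's final answer is string-matched against turn `target`.
```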
▲Assessment Results in OpenAI-MRCR with 2 Distractors Added to Model's Question Answering
▲Assessment Results in OpenAI-MRCR with 4 Distractors Added to Model's Question Answering
▲Assessment Results in OpenAI-MRCR with 8 Distractors Added to Model's Question Answering
The challenge is that these requests closely resemble the rest of the context, so the model can easily be misled by subtle differences. OpenAI found that GPT-4.1 outperforms GPT-4o at context lengths up to 128K tokens.
OpenAI also released the Graphwalks dataset for evaluating multi-hop long-context reasoning. Many long-context developer use cases require multiple logical hops within the context, such as jumping between files while writing code or cross-referencing documents when answering complex legal questions.
Graphwalks fills the context window with a directed graph composed of hexadecimal hashes, then asks the model to perform a breadth-first search (BFS) starting from a random node in the graph and return all nodes at a certain depth.
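A rough sketch of how such a prompt and its ground truth could be constructed (the graph size, hash format, and phrasing are illustrative; this is not the released Graphwalks dataset):

```python
import random
import secrets
from collections import deque

# Illustrative Graphwalks-style construction: a random directed graph over
# hexadecimal-hash node names, serialized as an edge list for the prompt,
# plus a BFS-based ground truth for grading the model's answer.
nodes = [secrets.token_hex(8) for _ in range(200)]
edges = [(random.choice(nodes), random.choice(nodes)) for _ in range(600)]
prompt = "Edges of a directed graph:\n" + "\n".join(f"{a} -> {b}" for a, b in edges)

def nodes_at_bfs_depth(start: str, depth: int) -> set[str]:
    """Ground truth: nodes whose BFS distance from `start` is exactly `depth`."""
    adjacency: dict[str, list[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return {n for n, d in distance.items() if d == depth}

start = random.choice(nodes)
question = f"\n\nStarting from node {start}, list all nodes exactly 2 hops away (BFS)."
expected = nodes_at_bfs_depth(start, 2)
# `prompt + question` is sent to the model; its answer set is compared with `expected`.
```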
▲Graphwalks Assessment Results
GPT-4.1 achieved 61.7% accuracy in this benchmark, performing comparably to o1 and outperforming GPT-4o.
Beyond accuracy, developers also need models that respond quickly. OpenAI has improved its inference stack to reduce the time to first token, and prompt caching can cut latency further while saving costs.
In OpenAI's initial tests, GPT-4.1's p95 latency to first token is roughly fifteen seconds with 128,000 tokens of context, and around half a minute with a full million tokens. GPT-4.1 mini and nano are faster: GPT-4.1 nano typically returns the first token in under five seconds for queries with 128,000 input tokens.
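Time to first token can be measured directly by streaming the response; a minimal sketch (the model choice and prompt are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI()

# Measure time to first token by streaming the response.
start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:  # first chunk with text
        print(f"first token after {time.perf_counter() - start:.2f}s")
        break
```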
04.
Multimodal Understanding: Answering Questions from Unsubtitled Videos, Solving Math Problems from Images
Outperforming GPT-4o Across the Board
In image understanding, GPT-4.1 mini in particular stands out, beating GPT-4o on image benchmarks.
For multimodal use cases like processing long videos, long context performance is also crucial. In Video-MME (long unsubtitled), where models answer multiple-choice questions based on 30-60 minute unsubtitled videos, GPT-4.1 scored 72.0%, higher than GPT-4o's 65.3%.
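For reference, these vision capabilities are reached through the same chat interface by passing an image alongside the text; the URL and question below are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Ask a visual question about an image (URL and question are placeholders).
response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the area of the triangle in this figure?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/triangle-problem.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```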
▲Model results on MMMU (answering questions involving charts, diagrams, maps, etc.)
▲Model results on MathVista (solving visual math tasks)
▲Model results on CharXiv-Reasoning (answering questions about charts from scientific papers)
05.
Conclusion: Opening Possibilities for Building Complex Intelligent Agents
The improvements in GPT-4.1 map onto developers' real, everyday needs, from coding and instruction following to long-context understanding. The better-performing, more economical GPT-4.1 series opens up new possibilities for building intelligent systems and complex agent applications.
Going forward, developers may combine these models with various APIs to build more useful and reliable intelligent agents for real-world software engineering, extracting insights from large document collections, resolving customer requests with minimal human intervention, and other complex tasks.
This article is from the WeChat public account "Zhidx" (ID: zhidxcom), author: Cheng Qi, editor: Yun Peng, published with authorization from 36Kr.



