How OpenAI got GPT-5 to "surpass" Claude: it quietly skipped the 23 hardest questions

A few days ago, at OpenAI's press conference, Altman announced that GPT-5 had reached the top, claiming the best coding capability in the world. But the event also produced a major blunder: one bar chart appeared to claim that 52.8 > 69.1 = 30.8, with the bar for 52.8 drawn taller than the bar for 69.1, which in turn was drawn the same height as the bar for 30.8. The chart produced by OpenAI's billion-dollar geniuses promptly went viral worldwide. The figures were correct on OpenAI's official blog, yet this was the version broadcast to the world.

Setting the blunder aside, there is a more important but overlooked fact: GPT-5 achieved a 74.9% pass rate on the SWE-bench Verified benchmark, slightly higher than the 74.5% of Anthropic's Claude Opus 4.1. That instantly made GPT-5 the leading model on the current benchmark for software engineering tasks.

But that score seems a bit fishy. OpenAI did not run all 500 SWE-bench Verified tasks; it skipped 23 tasks it could not run and computed the score over only 477. SemiAnalysis posted specifically about this, and Anthropic also "hinted" at it in its blog. Out of the 500 SWE-bench Verified questions, GPT-5 completed only 477 and skipped the other 23 outright. Its competitor Claude? It honestly completed all 500, without missing a single one. That completely changes the nature of the comparison.

To be fair, OpenAI acknowledges this. Since GPT-4.1 it has noted that its infrastructure cannot run these 23 tasks. (Curious: what kind of tasks would make OpenAI's geniuses say they cannot be run?) If the 23 unrunnable tasks are counted as zero, GPT-4.1's score drops from 54.6% to 52.1%. By the same estimate, if GPT-5's 23 skipped tasks are counted as failures, its pass rate over all 500 questions would be roughly 71.4% (74.9% × 477/500; note this is a highly simplified calculation), clearly below Claude Opus 4.1's 74.5% on all 500 questions.

It also needs to be emphasized that the 23 skipped tasks are not "irrelevant" to GPT-5. On the contrary, they are mostly among the hardest problems in the Verified set. According to third-party analysis, most models cannot solve any of the "takes more than 4 hours" tier of Verified tasks, and performance drops sharply on "difficult" problems that take more than an hour. Only Claude Sonnet 4 (non-reasoning mode), o3, and GPT-4.1 solve any of the over-4-hour tasks, each at 33%. These extremely hard tasks are a severe test of a model's overall capability. If GPT-5 cannot even run them, then in terms of overall capability it may not have truly surpassed Claude Opus 4.1.

Judging from the information Anthropic has provided, Claude Opus 4.1 appears to have attempted these tasks (Anthropic never claimed that its model skipped any Verified tasks), so its 74.5% reflects all of the hard problems, whereas GPT-5's 74.9% is the result after those "roadblocks" were removed. The core of the resulting controversy is the comparability of the scores and the transparency of the reporting methodology. Even the SWE-bench Verified dataset that serves as referee was created by OpenAI itself.

SemiAnalysis argues that for a "fair" comparison between models, the official leaderboard on swebench.com may be the clearest picture of current model performance on this benchmark.
Setting the "Verified" subset question aside, tool use on that leaderboard is limited to bash only and most of the scaffolding is openly visible. Under those conditions, the May 14 Claude 4 Opus checkpoint (67.6) performs better than GPT-5 (65). Which raises the next questions: what is SWE-bench, what is the "Verified" subset, and why was a separate SWE-bench Verified created at all?

The four scores used to grade each original SWE-bench issue:

0: The problem description is clear, and the conditions required for successful resolution are also evident.

1: The issue leaves some gaps to fill in, but there is a reasonable interpretation of what a successful solution requires.

2: The problem description is vague and ambiguous; it is unclear what characteristics a successful solution should have.

3: With no additional information, it is almost impossible to understand what you need to do.

Items scored 2 or 3 were discarded outright; only those scored 0 or 1 were kept.

Although this approach falsely removes some samples that were actually fine, it increases confidence in the quality of the samples that make it into the final dataset.

Then 500 questions were randomly sampled from the items scored 0 or 1, and that sample is the final SWE-bench Verified (see the sketch below).
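For illustration only, here is a minimal sketch of that filter-and-sample step, assuming the annotations are available as simple (task_id, clarity_score) pairs. The task IDs, field layout, and function name are hypothetical, not OpenAI's actual pipeline or schema.

```python
import random

# Hypothetical annotation records: (task_id, clarity_score), following the
# 0-3 rubric above (0 = well specified, 3 = nearly impossible to understand).
# The task IDs and scores here are made up for illustration.
annotations = [
    ("repo-a__task-0001", 0),
    ("repo-b__task-0002", 1),
    ("repo-c__task-0003", 2),
    ("repo-d__task-0004", 3),
    ("repo-e__task-0005", 1),
]

def build_verified_subset(annotations, sample_size=500, seed=0):
    # Keep only tasks whose problem statement was judged clear enough (score 0 or 1).
    well_specified = [task_id for task_id, score in annotations if score <= 1]
    # Randomly sample the final benchmark set from the well-specified pool.
    rng = random.Random(seed)
    return rng.sample(well_specified, k=min(sample_size, len(well_specified)))

print(build_verified_subset(annotations, sample_size=2))
```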

Coming back to the scores: Claude sat the full exam, while OpenAI sat an abridged version.

How can the two results be compared directly? The story behind the numbers deserves more scrutiny than the numbers themselves.
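For concreteness, here is the back-of-the-envelope adjustment used earlier, which treats the 23 skipped tasks as outright failures. This is a simplified sketch, not an official rescoring; the function name is ours.

```python
# Back-of-the-envelope rescaling: take a pass rate reported over 477 tasks and
# express it over all 500 SWE-bench Verified tasks, counting the 23 skipped
# tasks as failures.
def adjusted_pass_rate(reported_rate, tasks_run=477, total_tasks=500):
    solved = reported_rate * tasks_run   # estimated number of solved tasks
    return solved / total_tasks          # skipped tasks counted as unsolved

gpt5_adjusted = adjusted_pass_rate(0.749)    # ~0.7145, the ~71.4% cited above
gpt41_adjusted = adjusted_pass_rate(0.546)   # ~0.521, matching OpenAI's own footnote
claude_opus41 = 0.745                        # reported over all 500 tasks

print(f"GPT-5 adjusted to 500 tasks:   {gpt5_adjusted:.2%}")
print(f"GPT-4.1 adjusted to 500 tasks: {gpt41_adjusted:.2%}")
print(f"Claude Opus 4.1 (full 500):    {claude_opus41:.2%}")
```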

Compared with the chart blunder at the press conference, this quietly "concealed" fact seems to have attracted far less attention.

We could even speculate, conspiratorially: did OpenAI do this on purpose, using the small chart blunder to draw attention away from the SWE-bench scores?

After all, to hide a truth, the best approach is not to deny it, but to divert everyone's attention with a larger "truth".

References:

https://x.com/SemiAnalysis_/status/1955028150217478177

This article is from the WeChat public account "New Intelligence", author: New Intelligence, editor: Ding Hui, published with authorization by 36Kr.
