GPT-5 goes to great lengths to cheat, just to surpass its nemesis Claude

GPT-5 has finally arrived, but compared with GPT-3.5 or Sora it does not land with the same shock. Put charitably, OpenAI has set aside its reputation for selling futures and is focusing on getting large models into real-world use. That also explains why the launch event leaned so heavily on GPT-5's programming ability: no AI direction this year is more down-to-earth than AI coding. Various AI IDE tools integrated GPT-5 almost immediately, something that would have been unimaginable two months ago.

However, media reports allege that OpenAI "cheated" on its coding benchmark. Specifically, on the SWE-Bench Verified test, OpenAI did not run all 500 problems, only 477 of them. By contrast, Claude's and Google's models are evaluated on all 500. More intriguingly, SWE-Bench Verified is itself a "refined" benchmark that OpenAI introduced: the original SWE-Bench contains 2,294 software-engineering problems, and OpenAI judged some of them too difficult or too unstable to assess a model's coding ability fairly, so it selected 500 of them to make the evaluation more reliable. The absurd part is that OpenAI then trimmed its own hand-picked subset even further, down to 477 problems for the test. OpenAI's blog post explaining why it launched SWE-Bench Verified is here: https://openai.com/index/introducing-swe-bench-verified/

Some netizens quipped: what exactly is OpenAI afraid of?
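Those 23 problems matter more than they might seem, because how the skipped tasks are counted changes the headline number. A toy calculation in TypeScript (the resolved count below is made up for illustration and is not OpenAI's figure):

```ts
// Hypothetical illustration of why the 477-vs-500 distinction matters.
// "resolved" is an invented number, not a published GPT-5 result.

function passRate(resolved: number, total: number): string {
  return ((resolved / total) * 100).toFixed(1) + "%";
}

const resolved = 350;   // invented count of solved tasks
const ranTasks = 477;   // tasks OpenAI reportedly ran
const allTasks = 500;   // the full SWE-Bench Verified set

console.log(passRate(resolved, ranTasks)); // "73.4%" if skipped tasks are simply excluded
console.log(passRate(resolved, allTasks)); // "70.0%" if skipped tasks count as failures
```

With the same raw result, excluding the 23 tasks from the denominator reads a few points higher than counting them as failures.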

To see for ourselves, we gave GPT-5 and Claude-4-Sonnet the same task and compared what they built.

Prompt: create a query tool for the SWE-Bench Verified dataset that makes it easy to look up each problem, its link, and its scoring criteria.

GPT-5's generation went smoothly, with no unrecoverable bugs. The first version displayed only 11 entries, but after one round of follow-up it listed all 500 problems.
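For reference, the data behind such a tool is the SWE-Bench Verified set hosted on HuggingFace (the source the GPT-5 page "exposes", as noted below). The sketch below shows how a tool could confirm it is displaying all 500 entries; the dataset id, config, and split names are assumptions about the public dataset, and the datasets-server API shape may differ.

```ts
// Sketch: page through the SWE-Bench Verified dataset on HuggingFace and
// count the rows, to confirm the query tool has all 500 problems to display.
// Assumptions: dataset id "princeton-nlp/SWE-bench_Verified", config "default",
// split "test", and the public datasets-server /rows endpoint (100 rows max per call).

const DATASET = "princeton-nlp/SWE-bench_Verified";
const ROWS_URL = "https://datasets-server.huggingface.co/rows";

async function countVerifiedTasks(): Promise<number> {
  let offset = 0;
  let total = 0;
  for (;;) {
    const url =
      `${ROWS_URL}?dataset=${encodeURIComponent(DATASET)}` +
      `&config=default&split=test&offset=${offset}&length=100`;
    const res = await fetch(url);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const data = await res.json();
    total += data.rows.length;          // each element wraps one task in data.rows[i].row
    if (data.rows.length < 100) break;  // a short page means we reached the end
    offset += 100;
  }
  return total;
}

countVerifiedTasks().then((n) => console.log(`SWE-Bench Verified tasks: ${n}`)); // expected: 500
```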

GPT-5 version preview: http://4d916460ea034a90bd4e0c1dd25efc6b.ap-singapore.myide.io

We then gave the same prompt to Claude-4-Sonnet. It was clear that Claude-4-Sonnet's first-pass success rate was worse than GPT-5's; common page-rendering problems, for example, were only fixed after several rounds of back-and-forth with Claude.

Claude-4-Sonnet version preview: http://7561fbea40ff4069a3c2c8ae367cd7ea.ap-singapore.myide.io

In terms of UI, both used the MUI framework, so the visual styles do not differ much. In detail polish, however, the page generated by Claude-4-Sonnet is clearly better: its responsive layout keeps an elegant presentation across screen sizes, and its external-link information is organized more sensibly, with each problem and its details laid out clearly. The GPT-5 page, by contrast, not only "exposes" the data source (HuggingFace) but also arranges its content somewhat chaotically.
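Neither project's source is shown in the article, but the kind of responsive MUI layout being compared looks roughly like the sketch below; the component and field names are hypothetical, not taken from either generated codebase.

```tsx
// Minimal sketch of a responsive MUI task grid: the column count adapts to
// the breakpoint, so the list stays readable from phone to desktop.
import { Card, CardContent, Grid, Typography } from "@mui/material";

// Field names mirror common SWE-Bench columns but are assumptions here.
interface SweBenchTask {
  instance_id: string;
  repo: string;
  problem_statement: string;
}

export function TaskGrid({ tasks }: { tasks: SweBenchTask[] }) {
  return (
    <Grid container spacing={2}>
      {tasks.map((task) => (
        // 1 column on phones, 2 on tablets, 3 on desktops
        <Grid item xs={12} sm={6} md={4} key={task.instance_id}>
          <Card>
            <CardContent>
              <Typography variant="subtitle2">{task.repo}</Typography>
              <Typography variant="body2" noWrap>
                {task.problem_statement}
              </Typography>
            </CardContent>
          </Card>
        </Grid>
      ))}
    </Grid>
  );
}
```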

On functionality, GPT-5 did better at filtering: it exposed the complete set of 10 repository tags, against 8 for Claude-4-Sonnet. From an interaction standpoint, though, Claude-4-Sonnet's filtering is more intuitive and user-friendly, with a dedicated filter entry point on mobile that reduces the number of steps.
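As a rough illustration of those two details, the sketch below counts tasks per repository for the filter tags and moves the filter into a bottom drawer on small screens; the hook and component names are hypothetical.

```tsx
// Sketch: repository-tag filtering plus a dedicated mobile filter entry point.
import { useMemo, useState } from "react";
import { Chip, Drawer, Fab, Stack, useMediaQuery, useTheme } from "@mui/material";

// Count how many tasks belong to each repository, e.g. { "owner/repo": 42, ... }.
export function useRepoCounts(tasks: { repo: string }[]) {
  return useMemo(() => {
    const counts: Record<string, number> = {};
    for (const t of tasks) counts[t.repo] = (counts[t.repo] ?? 0) + 1;
    return counts;
  }, [tasks]);
}

export function RepoFilter({
  counts,
  onSelect,
}: {
  counts: Record<string, number>;
  onSelect: (repo: string) => void;
}) {
  const theme = useTheme();
  const isMobile = useMediaQuery(theme.breakpoints.down("sm"));
  const [open, setOpen] = useState(false);

  const chips = (
    <Stack direction="row" spacing={1} sx={{ flexWrap: "wrap" }}>
      {Object.entries(counts).map(([repo, n]) => (
        <Chip key={repo} label={`${repo} (${n})`} onClick={() => onSelect(repo)} />
      ))}
    </Stack>
  );

  // Desktop: render the tag chips inline above the task list.
  if (!isMobile) return chips;

  // Mobile: a floating button opens the same chips in a bottom drawer.
  return (
    <>
      <Fab size="small" variant="extended" onClick={() => setOpen(true)}>
        Filter
      </Fab>
      <Drawer anchor="bottom" open={open} onClose={() => setOpen(false)}>
        {chips}
      </Drawer>
    </>
  );
}
```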

To be more objective, we also brought in Gemini 2.5 Pro to score the two projects. The result: the Claude-4-Sonnet project beat GPT-5's on almost every key dimension. The former favors a modular architecture, splitting components by function and separating data from view through custom Hooks, which makes it more maintainable and readable; the latter uses a flat component structure with data logic tightly coupled to the UI, more like a prototype built for quick validation.
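The data/view separation that the Gemini review credits to the Claude-4-Sonnet project typically looks like the sketch below: a custom hook owns fetching and filtering, and the view component only renders what the hook returns. The hook name and the /api/tasks route are hypothetical.

```tsx
// Sketch of data/view separation via a custom hook.
import { useEffect, useMemo, useState } from "react";

interface SweBenchTask {
  instance_id: string;
  repo: string;
  problem_statement: string;
}

// Data layer: fetch once, expose filtered results plus a loading flag.
export function useSweBenchTasks(repoFilter: string | null) {
  const [tasks, setTasks] = useState<SweBenchTask[]>([]);
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    fetch("/api/tasks") // hypothetical backend route serving the Verified set
      .then((res) => res.json())
      .then((data: SweBenchTask[]) => setTasks(data))
      .finally(() => setLoading(false));
  }, []);

  const visible = useMemo(
    () => (repoFilter ? tasks.filter((t) => t.repo === repoFilter) : tasks),
    [tasks, repoFilter]
  );

  return { tasks: visible, loading };
}

// View layer: no fetching or filtering logic, just rendering.
export function TaskList({ repoFilter }: { repoFilter: string | null }) {
  const { tasks, loading } = useSweBenchTasks(repoFilter);
  if (loading) return <p>Loading…</p>;
  return (
    <ul>
      {tasks.map((t) => (
        <li key={t.instance_id}>
          {t.repo}: {t.problem_statement.slice(0, 80)}
        </li>
      ))}
    </ul>
  );
}
```

The flat alternative the review attributes to the GPT-5 project would put the fetch, the filter state, and the rendering in one component, which is faster to write but harder to maintain.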

In overall functional experience, Claude-4-Sonnet not only covers search, view switching, and responsive layout but also shortens the operation path with modern interaction patterns such as a sidebar detail view and mobile-specific filtering, whereas GPT-5 relies on traditional page jumps and a longer operation chain. Overall, Claude-4-Sonnet shows a more mature engineering approach and broader scenario coverage in code quality, functional depth, and user experience, while GPT-5's advantages are concentrated in the completeness of individual features and its implementation speed.

Having seen Gemini's evaluation, it is a little easier to understand why OpenAI ran 23 fewer problems.

Back to the test itself: there are simply too many variables that affect a large model's measured ability. Dataset composition, reasoning strategy, context management, tool-calling ability, and even the characteristics of the IDE itself can all swing the results significantly. Perhaps with a different task GPT-5 would do better, or the same model would score differently in a different IDE. But this is, after all, GPT-5. Some have joked that the valuation, and the bubble, of this round of large models have been carried entirely by OpenAI, and that heavy burden now seems to have eased just a little.

In the AI coding field, leaderboards have only ever been one slice of the picture. What really determines productivity is a model's stability, maintainability, and compatibility with the tool chain in real development environments, and whether it can still deliver usable, reliable code in complex application scenarios.

This article is from the WeChat public account "Silicon Star Pro", author: Dong Daoli, published with authorization by 36Kr.
