Prompt: Create a SWE-Bench Verified database query tool that makes it easy to look up the issues in SWE-Bench Verified, their links, and their scoring criteria.
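Neither generated codebase is shown in this article, but since SWE-Bench Verified is published as a HuggingFace dataset (the "database source" the GPT-5 page exposes, as noted below), a tool like the one prompted for would presumably need to page through that dataset. A minimal TypeScript sketch under that assumption - the dataset id, config/split names, and field names here come from the public dataset, not from either generated app - might look like this:

```typescript
// Minimal sketch, not code from either generated project.
// Assumptions: dataset id "princeton-nlp/SWE-bench_Verified", config "default",
// split "test", and the field names below; none of these come from the article.

interface SweBenchRow {
  instance_id: string;        // identifier such as "django__django-11099" (format assumed)
  repo: string;               // source repository, what the tag filters group by
  problem_statement: string;  // the issue text
  FAIL_TO_PASS: string;       // tests encoding the scoring criteria (assumed field name)
}

const ROWS_API = "https://datasets-server.huggingface.co/rows";

// Fetch one page of rows from the HuggingFace datasets-server (max 100 rows per call).
async function fetchVerifiedRows(offset = 0, length = 100): Promise<SweBenchRow[]> {
  const params = new URLSearchParams({
    dataset: "princeton-nlp/SWE-bench_Verified",
    config: "default",
    split: "test",
    offset: String(offset),
    length: String(length),
  });
  const res = await fetch(`${ROWS_API}?${params}`);
  if (!res.ok) throw new Error(`datasets-server returned HTTP ${res.status}`);
  const data = await res.json();
  // The API wraps each record in a { row_idx, row } envelope.
  return data.rows.map((r: { row: SweBenchRow }) => r.row);
}

// Usage: page through all 500 entries and print each instance with its repository.
async function loadAll(): Promise<void> {
  for (let offset = 0; offset < 500; offset += 100) {
    const rows = await fetchVerifiedRows(offset);
    for (const row of rows) {
      console.log(row.instance_id, row.repo);
    }
  }
}

loadAll().catch(console.error);
```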
The GPT-5 generation process went smoothly, with no irreversible bugs. The first version displayed only 11 entries, but after one round of feedback it was completed with all 500.
GPT-5 version preview: http://4d916460ea034a90bd4e0c1dd25efc6b.ap-singapore.myide.io
Next, running the same prompt through Claude-4-Sonnet, it was clear that Claude-4-Sonnet's first-pass success rate was not as good as GPT-5's: common webpage display problems, for example, were only resolved after several rounds of back-and-forth with Claude.
Claude-4-Sonnet version preview: http://7561fbea40ff4069a3c2c8ae367cd7ea.ap-singapore.myide.io
In terms of UI, since both used the MUI framework, the visual styles did not differ much. In the finer details, however, the page generated by Claude-4-Sonnet was clearly superior: its responsive layout held up elegantly across screen sizes, and its external link information was organized more sensibly, with project issues and details laid out clearly. The GPT-5 page, by contrast, not only "exposed" the underlying database source (HuggingFace) but also arranged its content somewhat chaotically.
In terms of functionality, GPT-5 did better on filtering, offering a more complete set of repository tags (10, versus Claude-4-Sonnet's 8). From an interaction standpoint, though, Claude-4-Sonnet's filtering was more intuitive and user-friendly, providing a dedicated filter entry point on mobile and cutting down the number of steps.
To be more objective, we also brought in Gemini 2.5 Pro to score the two projects. The results showed that the Claude-4-Sonnet project beat GPT-5's on almost every key dimension. The former used a modular architecture, splitting components by function and separating data from views through custom Hooks, giving it better maintainability and readability; the latter used a flat component structure with data logic tightly coupled to the UI, more like a prototype built for validation.
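The "custom Hooks" point is essentially React's standard way of keeping data logic out of rendering code. As a rough illustration only (neither generated codebase is public; the hook, component, and field names below are invented), the separation Gemini credited to the Claude-4-Sonnet version might look roughly like this in React/TypeScript:

```typescript
// Rough illustration only: "useSweBenchIssues", "IssueList", and the Issue shape
// are invented names, not code from either generated project.
import { useEffect, useMemo, useState } from "react";

interface Issue {
  instance_id: string;
  repo: string;
  problem_statement: string;
}

// Placeholder loader; a real app might hit the HuggingFace datasets-server as sketched earlier.
async function fetchVerifiedRows(): Promise<Issue[]> {
  const res = await fetch(
    "https://datasets-server.huggingface.co/rows?dataset=princeton-nlp/SWE-bench_Verified&config=default&split=test&offset=0&length=100"
  );
  const data = await res.json();
  return data.rows.map((r: { row: Issue }) => r.row);
}

// Custom Hook: owns loading and filtering, so the view below contains no data logic.
function useSweBenchIssues(repoFilter: string | null): Issue[] {
  const [issues, setIssues] = useState<Issue[]>([]);

  useEffect(() => {
    fetchVerifiedRows().then(setIssues).catch(console.error);
  }, []);

  return useMemo(
    () => (repoFilter ? issues.filter((i) => i.repo === repoFilter) : issues),
    [issues, repoFilter]
  );
}

// View component: renders whatever the hook hands it, nothing more.
function IssueList({ repoFilter }: { repoFilter: string | null }) {
  const issues = useSweBenchIssues(repoFilter);
  return (
    <ul>
      {issues.map((i) => (
        <li key={i.instance_id}>
          {i.repo}: {i.problem_statement.slice(0, 80)}
        </li>
      ))}
    </ul>
  );
}

export default IssueList;
```

The flat alternative Gemini described for GPT-5 would put the fetch call, the filter state, and the list markup inside one component, which is fine for a quick prototype but couples the data logic to the UI.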
In terms of overall functional experience, Claude-4-Sonnet not only integrated search, view switching, and responsive layout, but also shortened the operation path through modern interaction patterns such as a sidebar detail panel and mobile-specific filtering. GPT-5, by contrast, relied on traditional page navigation with a longer operation chain. Overall, Claude-4-Sonnet demonstrated a more mature software engineering approach and broader coverage of application scenarios in code quality, functional depth, and user experience, while GPT-5's advantages were mainly concentrated in the completeness and implementation speed of specific features.
After seeing Gemini's evaluation, one almost understands why OpenAI ran 23 fewer problems on SWE-Bench Verified.
Returning to the test: in reality, too many variables affect a large model's capability - dataset composition, reasoning strategy, context management, tool-calling ability, and even the characteristics of the IDE itself can all cause significant swings in results. Perhaps with a different task GPT-5 would perform better, or with a different IDE the same model would produce different scores. But this is, after all, GPT-5. Some have joked that the valuation and the bubble of this round of large models have been carried entirely by OpenAI, and now that heavy burden seems to have been lifted slightly.
In the AI Coding field, leaderboards have always been just one slice of the picture. What truly determines productivity is a model's stability, maintainability, and compatibility with tool chains in real development environments, and whether the product can still deliver usable, reliable code in complex application scenarios.
This article is from the WeChat public account "Silicon Star Pro", author: Dong Daoli, published with authorization by 36Kr.