AI researcher Andrej Karpathy explained that this test probes a large language model's ability to lay out multiple elements on a 2D grid, which is very difficult for these models: they do not "see" the way humans do, but "grope in the dark," arranging the layout through text alone.
GPT-4.5's result is shown below; compared with GPT-4o, it is actually not bad.
GPT-4.5 generated
GPT-4o generated
The caveat: that holds only as long as you don't compare it with Claude 3.7 Sonnet (without reasoning enabled). Next to that, GPT-4.5 looks like a plain downgrade.
Claude 3.7 Sonnet generated
Even Andrej Karpathy suspects that Claude was specially optimized for SVG capabilities during training.
As for coding ability, I used a prompt from X user @AGI_FromWalmart to generate an interactive weather animation card, comparing Claude 3.7 Sonnet against GPT-4.5.
GPT-4.5 generated successfully on the first try, but the design was a bit crude.
GPT-4.5 generated
Claude 3.7 Sonnet generated
Claude 3.7 Sonnet (without reasoning enabled) had a bigger problem: on the first attempt it forgot the interactive functionality entirely, and only after a reminder did it produce a result that met the requirements. In this round, GPT-4.5 edged out a slight victory.
This time, I didn't want to ask GPT-4.5 yet again how many R's are in "strawberry," since that is essentially a tokenization problem. What I really wanted to test was a brain teaser that has recently gone viral for repeatedly tripping up large models: can a 5.5 m pole pass through a 3 m × 4 m door?
This problem is trivial for us: just carry the pole in lengthwise, level with the ground. But large models tie themselves in knots, reasoning as if the world were 2D rather than 3D: the door's diagonal is only 5 m, they conclude, so a 5.5 m pole cannot fit.
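The geometry behind the trap and its resolution can be sketched in a few lines (the figures come from the puzzle itself; the "depth" framing is one way to formalize why 3D makes the answer trivial):

```python
import math

door_w, door_h = 3.0, 4.0   # door opening, in metres
pole = 5.5                  # pole length, in metres

# The 2D trap: compare the pole only against the door's diagonal.
diagonal_2d = math.hypot(door_w, door_h)   # sqrt(3^2 + 4^2) = 5.0
fits_2d = pole <= diagonal_2d              # False -> "won't fit"

# The 3D answer: a doorway is an opening in space, not a slot in a plane.
# Carried lengthwise (perpendicular to the door plane), only the pole's
# thin cross-section has to pass through the 3 m x 4 m opening, so any
# length fits. Equivalently, tilting the pole into the depth axis d gives
# an available diagonal sqrt(w^2 + h^2 + d^2), which grows without bound.
depth_needed = math.sqrt(pole**2 - diagonal_2d**2)  # tilt depth that suffices

print(fits_2d)                  # False  (the 2D reasoning)
print(round(depth_needed, 2))   # 2.29   (metres of tilt already enough)
```

The models' mistake, in other words, is stopping at `fits_2d` instead of noticing the third dimension.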
Even the reasoning-capable Claude 3.7 Sonnet was dragged into the ditch.
So how did GPT-4.5 fare? Well, it wasn't spared either.
GPT-4.5 currently has another issue: access through the API is a bit slow. It isn't crawling out one character at a time, but it still feels sluggish.
Moreover, GPT-4.5's price is steep: $75 per million input tokens and $150 per million output tokens. By comparison, Claude 3.7 Sonnet charges $3 per million input tokens and $15 per million output tokens (including tokens consumed during its thinking process).
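To make the gap concrete, here is a rough cost comparison using the per-token rates above; the workload size is an arbitrary assumption for illustration:

```python
# Prices in USD per million tokens, as quoted above
GPT_45 = {"in": 75.0, "out": 150.0}
CLAUDE_37 = {"in": 3.0, "out": 15.0}

def cost(prices: dict, in_tokens: float, out_tokens: float) -> float:
    """Total USD cost for a workload of in_tokens input and out_tokens output."""
    return prices["in"] * in_tokens / 1e6 + prices["out"] * out_tokens / 1e6

# Hypothetical monthly workload: 2M input tokens, 0.5M output tokens
print(cost(GPT_45, 2e6, 0.5e6))     # 225.0
print(cost(CLAUDE_37, 2e6, 0.5e6))  # 13.5
```

On this sample workload GPT-4.5 costs roughly 17× more, which is why the pricing draws so much attention.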
The first batch of X netizens who tested it also summarized some of the advantages of GPT-4.5, such as high emotional intelligence, strong image reading and writing abilities, and proficiency in creative tasks and data extraction...
OpenAI employees' own framing of GPT-4.5 is that it is neither a reasoning model nor a benchmark killer, but a low-key research preview. For complex math, coding, and strict instruction-following tasks, they recommend o1 or o3-mini instead.
In short, as the last non-chain-of-thought model, GPT-4.5's positioning is a bit awkward. Its capabilities have improved, but the improvement is hard to feel in practice, and at this price it is hard to call it truly great. All we can do is hope GPT-5 arrives soon and ushers in a world of reasoning.
This article is from the WeChat public account "APPSO" (author: Discovering Tomorrow's Products), published with authorization by 36Kr.