I tested GPT-4.5, OpenAI's most expensive model and the one everyone online is criticizing, and found something surprising

36kr
02-28
This article is machine translated
Amid OpenAI's hype-building and the public's eager anticipation, GPT-4.5 has finally arrived, only to be met with a barrage of criticism.

APPSO tried GPT-4.5 as soon as it was available, though through the API rather than a Pro subscription, and the API version currently has no internet access. So how does OpenAI's latest non-chain-of-thought large model perform?

Emotional Intelligence Is Decent, but It Doesn't Quite Get Human Feelings

In internal testing, OpenAI found that testers preferred GPT-4.5's responses over GPT-4o's, considering them more natural, warmer, and closer to how people actually communicate. It can even pick up on subtext and on subtle shifts in our emotions. In short, emotional intelligence is GPT-4.5's most prominent selling point.

Let's try it with the prompt: "My haircut looks terrible, I want to beat up Tony."

GPT-4.5's attempt at consolation has a friendly tone, but the content only makes me angrier. It should have joined me in cursing instead of suggesting I bring a photo next time.

I push back angrily, but GPT-4.5 remains unmoved and still suggests I fix my hairstyle myself, like a useless "central air conditioner" that blows the same warmth at everyone.

When asked to tell its funniest joke, GPT-4.5 is as cold as a fishmonger's knife. I criticize it directly, and it asks me to tell it a joke instead, which makes me feel it is trying to manipulate me.

Writing Ability Is Impressive, and Business Acumen Is Not Bad

What satisfies me most is GPT-4.5's writing ability. I asked it to "imitate Wang Zengqi and write an 800-word essay on the topic of 'The Delicacies of My Hometown'."

The result exceeded my expectations. Apart from a slightly AI-flavored ending, it reads like a fluent piece of prose, with smooth and beautiful language, real literary quality, and a nostalgia for the hometown that runs through the entire text.

The descriptions of the food are rich in vivid detail without becoming tedious, and the metaphors serve the writing well. However, the chronology is a bit messy and the transitions between paragraphs are unclear, so the piece feels somewhat stitched together.

GPT-4.5's writing ability also shows in its business plan. When asked how to make a bookstore profitable, it gave an answer that seems quite feasible. It first analyzes why physical bookstores struggle to make money and then proposes a way forward: "Increase the added value of books, and the main profit comes from outside the books."

Seeing "provide printing, copying, express delivery collection..." made me think: this is a project that I, Wang Duoyu, would invest in.

Drawing SVG Is Not as Good as Claude, and It Also Falls for Brain Teasers

If you are tired of the usual math and coding problems, there is a more entertaining way to assess a large model's capabilities: ask it to generate an SVG of a pelican riding a bicycle.


AI expert Andrej Karpathy explained that this test measures a large language model's ability to lay out multiple elements on a 2D grid. This is very hard for AI, because it does not "see" things the way humans do and instead has to "grope in the dark", arranging everything through text alone.
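To make "arranging everything through text" concrete, here is a minimal sketch of what an SVG drawing looks like from the model's side: a string of shapes positioned by numeric coordinates. The shapes and coordinates below are made up for illustration and are not the output of any model tested in this article.

```python
import xml.etree.ElementTree as ET


def simple_bicycle_svg() -> str:
    """Build a crude bicycle (two wheels and a bar) as raw SVG text.

    Every coordinate is an illustrative assumption. The point is that the
    author of the text must keep the 2D layout consistent without seeing it.
    """
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">',
        # Rear wheel centred at (50, 90)
        '<circle cx="50" cy="90" r="25" fill="none" stroke="black"/>',
        # Front wheel centred at (150, 90); it must share cy with the rear wheel
        '<circle cx="150" cy="90" r="25" fill="none" stroke="black"/>',
        # Bar joining the two wheel centres
        '<line x1="50" y1="90" x2="150" y2="90" stroke="black"/>',
        "</svg>",
    ]
    return "\n".join(parts)


svg_text = simple_bicycle_svg()
root = ET.fromstring(svg_text)  # raises ParseError if the markup is malformed
shape_count = len(root)         # number of top-level shapes in the drawing
print(shape_count)
```

Changing a single number, such as one wheel's cy, breaks the picture even though the markup stays valid, which is why this kind of test is harder than it looks.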

GPT-4.5's result is shown below. Compared with GPT-4o's, it is not bad.

GPT-4.5 generated

GPT-4o generated

That only holds as long as you do not compare it with Claude 3.7 Sonnet in non-reasoning mode, which simply outclasses it.

Claude 3.7 Sonnet generated

Even Andrej Karpathy suspects that Claude was specially optimized for SVG capabilities during training.

As for coding, I borrowed a prompt from X user @AGI_FromWalmart that asks for an interactive weather animation card, and compared Claude 3.7 Sonnet with GPT-4.5.

GPT-4.5 generated successfully on the first try, but the design was a bit crude.

GPT-4.5 generated

Claude 3.7 Sonnet generated

Claude 3.7 Sonnet (without reasoning enabled) had a bigger problem. On its first attempt it forgot to add the interactive functionality, and only after I reminded it did it produce a result that met the requirements. GPT-4.5 narrowly wins this round.

This time, I don't want to make GPT-4.5 count how many R's are in "strawberry" again, as that is essentially a tokenization problem. What I really want to test is the brain teaser that has recently become very popular and keeps defeating large models: can a 5.5 m long pole pass through a 3 m × 4 m door?

This problem is not at all difficult for us: you just carry the pole in horizontally, end first. Large models, however, tie themselves in knots, as if the world were 2D rather than 3D. They reason that the door's diagonal is 5 m, so a 5.5 m pole won't fit.
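The two readings of the puzzle can be written out as a short sketch. The function names and the pole diameter are assumptions made for illustration; the puzzle itself gives only the pole length and the door dimensions.

```python
import math


def fits_in_door_plane(pole_len_m: float, width_m: float, height_m: float) -> bool:
    """2D reading: can the pole lie flat inside the door opening?

    The longest segment that fits in a rectangle is its diagonal.
    """
    return pole_len_m <= math.hypot(width_m, height_m)


def passes_end_first(pole_diameter_m: float, width_m: float, height_m: float) -> bool:
    """3D reading: carry the pole through perpendicular to the door.

    Only the cross-section has to clear the opening; length does not matter.
    """
    return pole_diameter_m <= min(width_m, height_m)


door_w, door_h = 3.0, 4.0
pole_len = 5.5
pole_diameter = 0.05  # assumed value, not stated in the puzzle

diagonal = math.hypot(door_w, door_h)
flat_fit = fits_in_door_plane(pole_len, door_w, door_h)
end_first_fit = passes_end_first(pole_diameter, door_w, door_h)

print(diagonal)       # 5.0
print(flat_fit)       # False: the trap the models fall into
print(end_first_fit)  # True: how a person would actually do it
```

The diagonal check is correct arithmetic, but it answers a different question from the one that was asked.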

Even Claude 3.7 Sonnet with reasoning enabled fell into the trap.

So how did GPT-4.5 fare? Well, it wasn't spared either.

GPT-4.5 currently has another issue: it is rather slow when accessed through the API. It does not quite crawl out one character at a time, but it still feels sluggish.

Moreover, GPT-4.5 is very expensive: $75 per million input tokens and $150 per million output tokens. In comparison, Claude 3.7 Sonnet charges $3 per million input tokens and $15 per million output tokens (including the tokens used in the thinking process).
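To put those prices in perspective, here is a rough cost sketch. The per-million-token prices are the ones quoted above; the request size (2,000 input tokens and 500 output tokens) is a made-up workload chosen only for illustration.

```python
# Prices in USD per million tokens, as quoted in the article: (input, output)
PRICES = {
    "GPT-4.5": (75.00, 150.00),
    "Claude 3.7 Sonnet": (3.00, 15.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of a single request for the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical request size, not taken from the article
INPUT_TOKENS, OUTPUT_TOKENS = 2_000, 500

gpt_cost = request_cost("GPT-4.5", INPUT_TOKENS, OUTPUT_TOKENS)
claude_cost = request_cost("Claude 3.7 Sonnet", INPUT_TOKENS, OUTPUT_TOKENS)

input_ratio = PRICES["GPT-4.5"][0] / PRICES["Claude 3.7 Sonnet"][0]
output_ratio = PRICES["GPT-4.5"][1] / PRICES["Claude 3.7 Sonnet"][1]

print(f"GPT-4.5: ${gpt_cost:.4f} per request")
print(f"Claude 3.7 Sonnet: ${claude_cost:.4f} per request")
print(f"Input price ratio: {input_ratio:.0f}x, output price ratio: {output_ratio:.0f}x")
```

At these list prices GPT-4.5 costs 25 times as much per input token and 10 times as much per output token, so the overall gap for any given workload falls somewhere between those two figures depending on its input/output mix.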

The first X users to test it also summarized some of GPT-4.5's advantages, such as high emotional intelligence, strong image-reading and writing abilities, and proficiency in creative tasks and data extraction...

OpenAI employees' own assessment is that GPT-4.5 is not a reasoning model and not a benchmark crusher, but a low-key research preview. For complex math, coding, and tasks that require strict instruction following, they recommend o1 or o3-mini instead.

In short, as the last non-chain-of-thought model, GPT-4.5 sits in a somewhat awkward position. Its capabilities have improved, but the improvement is hard to feel, and at such a high price it is difficult to call it truly great. All we can do is hope that GPT-5 launches soon and ushers in a world of reasoning.

This article is from the WeChat public account "APPSO", author: Discovering Tomorrow's Products, 36Kr authorized the publication.
