GPT-4.5 scores 94 on an IQ test and tops the LLM arena leaderboard; netizens suspect foul play, but hands-on tests impress

36kr
03-04

On the well-known AI leaderboard LM Arena, GPT-4.5, which once ranked near the bottom, has unexpectedly taken the top spot. It even came out on top in fields like mathematics and programming, which led netizens to ask: has the LLM arena been manipulated by LLMs? After testing it themselves, however, netizens were surprised to find that GPT-4.5 really does have high emotional intelligence and can grasp a person's deeper intentions without any reasoning step!

Has GPT-4.5's reputation unexpectedly turned around?

After over 3,000 rounds of comparison, GPT-4.5 ranked first in all categories, topping the LLM arena!

GPT-4.5, the non-reasoning model billed as "emotional intelligence over IQ", had previously ranked near the bottom in most benchmarks, which was painful to watch.

Yet in the blink of an eye, it has risen to the top of the LLM arena?

Just now, the official LM Arena account announced that GPT-4.5 ranked first in all categories, leading in style control and multi-turn dialogue, with an overall score of 1411.

It ranked first in multi-turn dialogue, difficult prompts, coding, mathematics, creative writing, instruction following, and long queries!

This result is really unexpected, isn't it...

Musk immediately chimed in: GPT-4.5 is only temporarily first and will not hold that position for long.

Indeed, shortly after Musk's remark, Grok-3 became number one in the LLM arena with an overall score of 1412, edging out GPT-4.5 by a single point.

But regardless, GPT-4.5's brief stay at the top has left people with a string of questions: it not only has high emotional intelligence and puts people at ease, but is also smart enough to dominate the leaderboard and beat rivals such as o1, Grok-3, and Claude?

Can GPT-4.5, whose selling point is "high emotional intelligence", really take the top spot in fields like programming and mathematics on emotional intelligence alone?

Some netizens have now started to ask outright: is there something wrong with the LLM arena?

Some even speculate: have LLMs already learned to manipulate LM Arena?

GPT-4.5 IQ Result Announced: Score 94, Ranked 5th

At the same time, GPT-4.5's IQ test results were also released.

GPT-4.5 scored 97 on the offline IQ test and 94 on the online Mensa IQ test.

In short, on both the online and offline IQ tests, GPT-4.5 scores lower than OpenAI's o1 Pro, o3 mini, and o1-preview.

This result seems more reasonable.

Among the many large models, the highest offline IQ test score belongs to OpenAI o1 pro, and the highest online Mensa IQ test score belongs to OpenAI o1.

Compared with humans, though, GPT-4.5 can be said to be on par with average human intelligence.

The average human IQ is around 90 to 110. Einstein's IQ is estimated to be around 160, while Terence Tao is considered the person with the highest IQ in the world, scoring between 225 and 230.

It seems that the day when human intelligence is surpassed by LLMs is just around the corner.

However, many people have also questioned: What is the significance of testing LLMs' IQ?

Their argument is that IQ is a measure tied to distinctly human cognition and does not carry over to LLMs.

Netizens' Test Reveals Surprise: It Understands User Intent Well!

Recently, Altman shared a record of his conversation with GPT-4.5.

He asked GPT-4.5 what it made of the line "Near the singularity; unclear which side."

GPT-4.5 gave a thoughtful answer: We have already crossed the event horizon of the singularity, but have only just done so. We have entered the gravitational field of the singularity, but it is still too early to understand its consequences.

Clearly, Altman was very satisfied with GPT-4.5's performance.

And in these recent tests, many netizens have also found that GPT-4.5 has a kind of extraordinary self-awareness and can surprisingly understand user intent.

For example, in the following case, the user made a vulgar joke about chess, and GPT-4.5 had no difficulty catching the gist and giving an appropriate response.

The poster said he was deeply impressed, because GPT-4.5 grasped the subtle point without spending any reasoning tokens.

He remarked: Pre-training is not outdated. It just has diminishing returns in some areas, while showing amazing improvements in others!

By comparison, Claude Sonnet clearly did not get this vulgar human joke, which is the kind of thing LLMs find hard to understand.

Similarly, Grok 3 also missed the meaning of the sentence.

Unwilling to concede, Musk showed up in the comments and posted Grok 3's reply to prove that it had not fallen behind.

GPT-4.5 is not a master of all trades

A careful look at the arena rankings shows that, in the "language" view, the model with the top UB rank is currently Grok-3-Preview-02-24, with a score of 1412 from 3364 votes.

GPT-4.5-Preview has a UB rank of second, with a score of 1411 from 3224 votes, and ranks first only under "StyleCtrl".

· UB ranking: The upper bound of the model's ranking, determined by the number of models statistically superior to the target model plus one. When the lower limit of the 95% confidence interval of model A's score is higher than the upper limit of model B's score, model A is considered statistically superior to model B.

· Style control ranking: The ranking of models after accounting for factors such as response length and Markdown usage, thereby separating model performance from potential confounding factors.
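
The UB rule described above can be sketched in a few lines of Python. This is only an illustration of the stated definition, not LM Arena's actual implementation: the function name `ub_rank`, the model labels, and the confidence intervals are all made up for the example.

```python
from typing import Dict, Tuple

# (lower, upper) bounds of a model's 95% confidence interval on its score
Interval = Tuple[float, float]


def ub_rank(intervals: Dict[str, Interval]) -> Dict[str, int]:
    """UB rank = 1 + number of models that are statistically superior.

    Model A is statistically superior to model B when the lower bound of
    A's interval is above the upper bound of B's interval.
    """
    ranks = {}
    for name, (_, upper) in intervals.items():
        superior = sum(
            1
            for other, (other_lower, _) in intervals.items()
            if other != name and other_lower > upper
        )
        ranks[name] = 1 + superior
    return ranks


# Hypothetical intervals, NOT real leaderboard data.
example = {
    "model_a": (1402.0, 1422.0),
    "model_b": (1401.0, 1421.0),
    "model_c": (1380.0, 1395.0),
    "model_d": (1350.0, 1379.0),
}

ranks = ub_rank(example)
for name, rank in ranks.items():
    print(name, rank)
```

In this made-up example, model_a and model_b have overlapping intervals, so neither is statistically superior to the other and both receive a UB rank of 1. This is how two models one point apart can be reported as tied for first.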

In the "Overall" view, Grok-3 and GPT-4.5 are tied for first, with the latter having a slight advantage in some areas.

In programming (coding) and mathematics (math), GPT-4.5 is indeed tied for first place with Grok-3.

By language classification, Grok-3 and GPT-4.5 are tied for first place in English, Chinese, German, and other languages.

In addition, DeepSeek-R1 is also first in Chinese.

WebDev Arena is a real-time AI programming competition, where different models compete directly in the "Web Development" challenge, but GPT-4.5 didn't even participate!

Moreover, OpenAI's models did not perform well there: the best of them, o3-mini-high, tied for fourth place with Early-grok-3, behind Claude 3.7 Sonnet, Claude 3.5 Sonnet, and DeepSeek-R1.

Is GPT-4.5 the new king? The test results are disappointing

Regarding GPT-4.5, a researcher also published a blog post to analyze it in detail.

GPT-4.5 has received mixed reactions in the community.

Although it was heavily hyped in the early stages, the model has not fully lived up to people's high expectations.

Some test results have been disappointing.

Karpathy's tests showed that in four out of five cases, users preferred GPT-4o's responses.

Although GPT-4.5 was touted as more creative and emotionally intelligent, these advantages have not been fully reflected in the actual user experience.

Some users even report that GPT-4.5's performance in creative writing is not as good as previous models.

In addition, the high usage cost has become a major obstacle to the adoption of GPT-4.5.

Compared with GPT-4o, the API price of GPT-4.5 has risen sharply: the input token price has gone from $2.50 to $75 per million tokens (30 times higher), and the output token price from $10 to $150 per million tokens (15 times higher).
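
To make the price gap concrete, the following sketch computes the cost of a single request at the per-million-token prices quoted above. The dictionary keys are plain labels rather than API identifiers, and the request of 2,000 input tokens and 500 output tokens is a hypothetical workload chosen only for illustration.

```python
# Prices in USD per one million tokens, as quoted in the article.
PRICES_PER_MILLION = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4.5": {"input": 75.00, "output": 150.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request for the given token counts."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Price multipliers between the two models.
input_ratio = PRICES_PER_MILLION["gpt-4.5"]["input"] / PRICES_PER_MILLION["gpt-4o"]["input"]
output_ratio = PRICES_PER_MILLION["gpt-4.5"]["output"] / PRICES_PER_MILLION["gpt-4o"]["output"]

# Hypothetical request: 2,000 input tokens and 500 output tokens.
cost_4o = request_cost("gpt-4o", 2_000, 500)
cost_45 = request_cost("gpt-4.5", 2_000, 500)

print(f"input x{input_ratio:.0f}, output x{output_ratio:.0f}")
print(f"gpt-4o: ${cost_4o:.3f}, gpt-4.5: ${cost_45:.3f}")
```

For this assumed request the cost rises from about $0.01 to about $0.225. The overall multiplier depends on the mix of input and output tokens, so it always falls between 15 and 30.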

Users generally find GPT-4.5's high price unacceptable, with some netizens saying bluntly that it means "spending $75 just for better vibes".

For small companies and independent developers, such high costs are undoubtedly a huge burden, affecting the widespread application of GPT-4.5.

The high price of GPT-4.5 may reflect the resource constraints behind it.

Altman said that although the company had hoped to launch GPT-4.5 to Plus and Pro users at the same time, it has run out of GPUs. It plans to add tens of thousands of GPUs next week and then roll the model out to Plus users.

Although GPT-4.5 has made significant progress in some areas, the comprehensive improvements that many people expected have not been realized.

Because of its massive scale and complex architecture, GPT-4.5 responds more slowly, which hurts the user experience.

Sam Altman's high-profile promotion of GPT-4.5 has raised people's expectations, as he described it as the first moment when people "truly feel the presence of AGI".

If reality fails to meet expectations, this kind of promotion may also backfire on him.

Why release GPT-4.5 now?

Compared with the grand launch of GPT-4 two years ago, the release of GPT-4.5 was strikingly low-key, which surprised many.

Sam Altman did not personally attend the launch, which has raised outside doubts about how much attention and confidence OpenAI is placing in GPT-4.5.

The target audience of GPT-4.5 is mainly the general public, who can use AI to complete tasks such as writing emails and summarizing articles.

GPT-4.5 is a key bridge for OpenAI to transition from GPT-4o to GPT-5, becoming a daily companion for creativity, communication, and solving practical problems.

OpenAI has clearly stated that GPT-4.5 is not intended to replace GPT-4o, further increasing market uncertainty about the future of GPT-4.5.

For many people, ChatGPT is synonymous with AI, and combined with OpenAI's strong hype of AGI, this has raised people's expectations for the new model.

The reason for the release of GPT-4.5 may be the intensification of market competition.

In a short period of time, a growing number of strong models have entered the market. DeepSeek R1 can rival GPT-4o, and xAI's Grok 3 comes across as almost human, putting OpenAI under tremendous pressure.

GPT-5 is expected to be released in a few months. It would combine reasoning and non-reasoning components for the first time and decide on its own how much reasoning effort a query deserves, i.e., "reasoning extension".

GPT-4.5 is a strategic response, aiming to retain paying users and prevent them from switching to competitors before the release of GPT-5, maintaining OpenAI's leading position in the market.

References:

https://x.com/lmarena_ai/status/1896590146465579105

https://x.com/elonmusk/status/1896624102674506172

https://www.forwardfuture.ai/p/gpt-4-5-a-new-king-on-the-throne

https://x.com/sama/status/1896653628674625812

This article is from the WeChat public account "New Intelligence", written by New Intelligence, and published with authorization from 36Kr.
