Claude 3.7 keeps Mario alive for 90 seconds, GPT-4o dies at the start, Karpathy declares an evaluation crisis, and games become a new battlefield for LLM evaluation

Karpathy raised a soul-searching question: what metrics should AI evaluation really be looking at? The answer may be hidden in classic games. Recently, UC San Diego's Hao AI Lab used Super Mario and other games to evaluate AI agents, and Claude 3.7's results were jaw-dropping.

Is the "golden standard" of LLM evaluation benchmarks becoming ineffective?

Early in the morning, AI guru Andrej Karpathy voiced his doubts: "There is currently an evaluation crisis; I really don't know what metrics to look at now."

Benchmarks like MMLU, SWE-Bench Verified, and Chatbot Arena each have their own pros and cons.

If these fall short, could games be worth considering?

After all, AlphaGo once reigned as the world's top Go AI, and even OpenAI ventured into games early on: its self-developed Dota 2 AI achieved outstanding results at The International tournament.

Recently, the emergence of Claude 3.7 has made "Pokémon" a new benchmark for LLM evaluation.

UCSD's Hao AI Lab has stepped in again, open-sourcing a brand-new "gaming agent" that uses a computer-use agent (CUA) to play platformer and puzzle games in real time.

The results show that Claude 3.7 Sonnet survived Super Mario for a full 90 seconds, handily beating OpenAI's models, Gemini, and its own predecessors, while GPT-4o died right at the start...

Google's Gemini 1.5 Pro lost its first battle, jumping in a rigid pattern of one jump every two steps. Gemini 2.0 made it a few steps farther, but still ended up falling into a pit.

The GamingAgent project code has been open-sourced, and you can download and install it to watch the AI game showdown.
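To give a feel for how such a computer-use gaming agent can be structured, here is a minimal Python sketch, not GamingAgent's actual code: capture a screenshot, ask a vision LLM for the next move, and translate the reply into a keypress. The `query_model` stub, the action names, and the key bindings are all hypothetical placeholders for whatever model API and emulator you use.

```python
# A minimal perceive-decide-act sketch of a computer-use gaming agent
# (illustrative only; see the GamingAgent repo for the real implementation).
import base64
import io
import time

import pyautogui  # pip install pyautogui

ACTIONS = {"left", "right", "jump", "wait"}           # moves the model may choose
KEYMAP = {"left": "left", "right": "right", "jump": "x"}  # assumed emulator bindings

def screenshot_b64() -> str:
    """Capture the screen and encode it for a vision-model prompt."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def query_model(image_b64: str) -> str:
    """Hypothetical stand-in: send the frame to a vision LLM and return
    one word from ACTIONS. Swap in your Claude / GPT / Gemini call here."""
    raise NotImplementedError

def agent_loop(seconds: float = 90.0) -> None:
    """Run the loop for a fixed time budget. The model's latency between
    frames is exactly what makes slow models 'pause' mid-level."""
    deadline = time.time() + seconds
    while time.time() < deadline:
        action = query_model(screenshot_b64()).strip().lower()
        if action in KEYMAP:
            pyautogui.press(KEYMAP[action])
```

Because the game keeps running while the model thinks, reaction latency directly shows up as gameplay: a slow model literally stands still between decisions.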

GPT-4.5 is slow to react, and GPT-4o is always killed by the first enemy

GPT-4o is always killed by the first enemy, just like a gaming noob who gets flamed by teammates.

The game ends in just 20 seconds.

In comparison, GPT-4.5 performs much better; at least it doesn't get stuck at the first enemy.

But its reactions are still very slow, taking roughly two steps and then pausing.

Before jumping over a low pipe, it hesitates for a moment, as if it had just learned the controls and is still finding its feet.

For a slightly higher pipe, it tried 7 times and took 10 seconds to jump over it.

After finally jumping over, it ran into an enemy and died. The first round ended like this.

Even funnier, in the second round GPT-4.5 again stumbled at the first enemy. After all, it and GPT-4o belong to the same OpenAI family, so their gameplay is equally poor (just kidding).

The third round was also mediocre, worse than the first: it got stuck at the first low pipe for nearly 10 seconds before remembering to jump.

It did clear the second pipe smoothly, but was again killed by an enemy, falling short of the first round, where it at least cleared the third pipe before dying.

Gemini 1.5 jumps every two steps, 2.0 falls into a pit

On the Google side, Gemini 1.5 Pro didn't fare well in the first battle either, unable to escape the clutches of the first enemy.

In the second round, Gemini 1.5 managed to evade the first enemy, even encountering a question mark box and getting a mushroom.

Interestingly, unlike GPT-4.5's two-step-one-pause, Gemini 1.5 is "two-step-one-jump".

In that short stretch, it jumped a total of nine times: on the ground, onto the pipes.

In the end, it cleared the third pipe and even almost made it over the fourth, going farther than GPT-4.5.

As for the updated Gemini 2.0 Flash, its performance is unsurprisingly much better.

First, it jumps more boldly; second, its jumps are more fluid.

It reached platforms that its "predecessors" had never set foot on, and easily jumped over the first three pipes in 10 seconds.

In the second round, it also fell victim to the first enemy.

But in the end it went farther than the OpenAI models and Gemini 1.5: it cleared the fourth pipe, only to fall into a pit too wide to jump out of.

Claude 3.7 Sonnet discovers hidden rewards

In comparison, Anthropic's Claude is much more impressive.

Compared with Gemini's two-steps-one-jump routine, Claude 3.7's play is more fluid, and it gets much farther.

Its jump timing in particular seems more methodical: it jumps only when encountering pipes or pits.

In addition, it makes deliberate jumps to dodge the smaller enemies.

Twice clearing the pit that Gemini 2.0 Flash couldn't jump over, Mario under Claude's control went on to collect coins, finally met an enemy other than the mushroom-like Goombas (the turtle-like Koopa), and even found a hidden reward: the Super Star.

Finally, it fell into the gap between the staircase platforms, ending the run.

AIs take on the 2048 puzzle game, and GPT-4o can't show off

Next, let's take a look at another puzzle game, 2048.

Some may not be familiar with this game: you slide tiles around a grid, tiles with the same number merge when they collide, and the player tries to build the highest-value tile possible.
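To make the rule concrete, here is a minimal sketch of how one row behaves when slid left; this is the standard 2048 merge algorithm, not code taken from the benchmark.

```python
# Core 2048 rule: sliding compresses out gaps, then merges equal adjacent
# tiles, with each tile merging at most once per move.
def slide_row_left(row: list[int]) -> list[int]:
    """Slide one row to the left; 0 represents an empty cell."""
    tiles = [v for v in row if v != 0]        # 1) remove gaps
    merged: list[int] = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)        # 2) merge one equal pair
            i += 2                             #    a merged tile can't merge again
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))  # 3) pad with empties

assert slide_row_left([2, 2, 4, 0]) == [4, 4, 0, 0]
assert slide_row_left([2, 2, 2, 2]) == [4, 4, 0, 0]
```

Playing well means choosing a slide direction each turn so merges compound instead of filling the board, which is why long "thinking" pauses and short-sighted moves both end runs quickly.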

During the challenge, GPT-4o got stuck due to prolonged thinking.

Claude 3.7 lasted a few more moves and was clearly stronger than GPT-4o, but its run still ended in failure.

Tetris, intelligence on display

So how does Claude 3.7 fare at Tetris?

Anthropic's Head of Developer Relations, Alex Albert, praised, "Very cool! We need to turn every video game into an assessment tool".

Netizens in the comments are already hoping Grok 3 will join the fray.

It seems LLM evaluation is about to open up a whole new path.

References:

https://x.com/haoailab/status/1895557913621795076

https://x.com/haoailab/status/1895605453461340472

https://lmgame.org/#/aboutus

This article is from the WeChat public account "New Singularity", author: New Singularity, authorized by 36Kr for release.
