Google's new Gemini leaks on LMArena: the only AI that can read a clock, while GPT-5 answers at random

Google's Gemini 3.0 appears to have surfaced on LMArena! Many hands-on tests have leaked ahead of release, but the results are hard to judge.

Gemini 3.0 has been rumored for a long time, and now it has finally shown up.

Once again on LMArena, two "disguises" of Gemini 3.0 have been exposed.

  • Gemini 3.0 Pro alias: lithiumflow
  • Gemini 3.0 Flash alias: orionmist

This has become a familiar routine: every time a new model is about to launch, it first shows up on LMArena to build momentum.

However, judging by the hands-on results from the arena, Gemini 3 does have real substance. I hope Google will not follow OpenAI's example this time and will show some real strength!

Some front-end demos of Gemini 3 leaked earlier, and netizens claimed that Google's next-generation flagship model will be released on October 22.

Some developers with internal testing access have also published demos.

But this time the model went straight onto LMArena.

Users who were lucky enough to be served the Gemini 3 aliases shared their experiences. If you get lucky too, please share whether Gemini 3's performance has improved significantly.

AI understands clocks for the first time

Hands-on test: "reading a clock" has always been a major challenge for AI. It involves many factors, including the style of the clock, the length and direction of the hands, and judging the minute intervals.

However, hands-on testing with Gemini 3 Pro (lithiumflow) shows that the model can read the time accurately down to the hour (6), minute (02), and second (30).

On the same image, GPT-5 Thinking went astray and read the time as 12:30, confusing the hour and minute hands.

The same thing happened when I tested Gemini 2.5 Pro. Reading a clock really is hard for these models.

By comparison, the non-top-tier models on LMArena are even further off.
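The 6:02:30 versus 12:30 mix-up can be made concrete with some hand-angle arithmetic. The sketch below is only an illustration and not how any of these models work: it assumes an ideal analog clock face, and the helper names `hand_angles` and `read_time` are hypothetical. It computes where the hour and minute hands point at 6:02:30, then reads the time back once correctly and once with the two hands swapped.

```python
def hand_angles(hour, minute, second):
    """Angles of the hour and minute hands, in degrees clockwise from 12."""
    # Minute hand: 6 degrees per minute, plus 0.1 degree per second.
    minute_angle = minute * 6.0 + second / 10.0
    # Hour hand: 30 degrees per hour, plus 0.5 degree per minute
    # and 1/120 degree per second.
    hour_angle = (hour % 12) * 30.0 + minute * 0.5 + second / 120.0
    return hour_angle, minute_angle


def read_time(hour_angle, minute_angle):
    """Naive reading of (hour, minute) from the two hand angles."""
    hour = int(hour_angle // 30) % 12 or 12  # display 0 as 12
    minute = int(minute_angle // 6) % 60
    return hour, minute


hour_angle, minute_angle = hand_angles(6, 2, 30)
correct = read_time(hour_angle, minute_angle)
swapped = read_time(minute_angle, hour_angle)  # hands mistaken for each other
print(correct, swapped)
```

Under these assumptions the correct reading is (6, 2), while swapping the hands gives (12, 30), which matches the kind of error described above: a hand near 12 is taken for the hour hand and a hand near 6 for the minute hand.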

In addition, I tested many times on LMArena and never once encountered the Gemini 3 aliases.

If the capabilities shown by the Gemini 3 aliases in the arena are real, then Gemini 3 is indeed worth looking forward to!

SVG: A Pelican Riding a Bicycle

Every time a new model comes out, SVG testing is inevitable.

The SVG test results of Gemini 3 Pro look very good at first glance.

The image quality is improved over previous versions, and the output even shows a somewhat "abstract" style.

Of course, the pelican on a bicycle is unavoidable, but at least the bicycle is drawn really well this time.

One complaint, though: the "Pelican on a Bicycle" internet meme may have become a stock joke for testing new models.

As a result, every model seems to have been quietly fine-tuned on this prompt.

For example, the following two arena examples do not emphasize the use of SVG.

Even when SVG is explicitly requested, the results are still "perfect". By comparison, the one drawn by Gemini 3 is not especially attractive, and the result is only average.

The first model that composes decently

Another major update is that Gemini 3 Pro can compose music.

It can imitate musical styles, hold a beat over long stretches, and bring some energy and variation.

What do you think of this music?

Currently, most of the actual tests are still done on LMArena.

(By the way, I ran almost 100 prompts and still didn't encounter Gemini 3.)

So why do we believe these two aliases are the test codenames of Gemini 3?

Some people say that "Orion" itself may be related to Gemini 3, and that the two-word compound style of "orionmist" is the naming pattern Google uses.

Before Gemini 3 appeared on LMArena, various internal tests had already shown it to be very powerful.

A single HTML file can even reproduce the UI interactions of the whole macOS and Windows desktops.

In just 1 minute, Gemini 3 Pro can even create an entire stylized animation using SVG.

I captured part of the animation, and it looks quite impressive.

However, some people have encountered unsatisfactory test results.

It’s been almost a year since Google released Gemini 2.5, and now all the big tech companies are keeping an eye on OpenAI’s moves.

After OpenAI released GPT-5 and the new Sora 2, Google only followed up with Veo 3.1.

This appearance on LMArena is probably a pre-release test, and Gemini 3 should be coming soon!

In general, although the models have become much more powerful, such as being able to read clocks, draw SVG, and compose music, the routine of the whole AI circle is becoming increasingly fixed.

First the rumor spreads, then the model goes on LMArena, and then a crowd tries to spot the real model and runs SVG tests to see whose output looks more convincing.

It gets a bit boring after seeing it so many times.

After all, whether it is Gemini 3, GPT-5 or the new version of Claude, in the end it is still the same set of "actual screenshots + prompt comparison + picture description".

Models are getting smarter, but our evaluation methods seem to be stuck in the old ways.

I hope that next time, not only will the model be stronger, but we can also come up with some new tricks.

References

https://x.com/synthwavedd/status/1979969871921225881

https://x.com/ai_for_success/status/1979980654713696340

https://x.com/scaling01/status/1979996937743954101

This article comes from the WeChat public account "Xinzhiyuan", author: Dinghui, and is authorized to be published by 36Kr.
