Google's Gemini 3 Deep Think has undergone a major upgrade: its reasoning capabilities surpass Opus 4.6 and GPT-5.2, aiming to become "the most research-savvy AI."


Google today (the 13th) released a major upgrade to Gemini 3 Deep Think. On the ARC-AGI-2 test (a reasoning benchmark specifically designed to prevent AI from memorizing question banks; it does not test how much you know, but whether you can deduce rules from a few examples), Gemini 3 Deep Think scored 84.6%.
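The flavor of ARC-style tasks can be shown with a toy example (this is an illustration of the idea, not an actual ARC-AGI-2 task): given a few input-output grid pairs, the solver must infer the transformation rule and apply it to a new input. A minimal sketch in Python, inferring a simple per-cell color mapping:

```python
# Toy illustration of ARC-style rule induction (not a real ARC-AGI-2 task):
# infer a color -> color mapping from a few example grids, then apply it.

def infer_mapping(examples):
    """Infer a color -> color mapping consistent with all (input, output) grid pairs."""
    mapping = {}
    for inp, out in examples:
        for row_in, row_out in zip(inp, out):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    raise ValueError("no consistent per-cell mapping")
    return mapping

def apply_mapping(mapping, grid):
    """Apply the inferred rule to an unseen grid."""
    return [[mapping[c] for c in row] for row in grid]

# One demonstration pair; the hidden rule here is 1 -> 3, 2 -> 4.
examples = [([[1, 2], [2, 1]], [[3, 4], [4, 3]])]
rule = infer_mapping(examples)
print(apply_mapping(rule, [[2, 2], [1, 1]]))  # [[4, 4], [3, 3]]
```

Real ARC tasks involve far richer transformations (symmetry, object counting, spatial reasoning), which is exactly why memorized knowledge does not help.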

For reference, Claude Opus 4.6 (Thinking Max mode) achieved 68.8%, GPT-5.2 (Thinking xhigh mode) achieved 52.9%, while the human average is about 60%.

Even more astonishingly, Deep Think achieved a score of 96% on the original ARC-AGI-1 benchmark, essentially pushing this benchmark, once considered "one of the most difficult AI exams," to its limit.

Deep Think is currently available to Google AI Ultra subscribers, with API access open to enterprises via an early-access program.

Not just taking exams, but catching mistakes too

Beyond benchmark scores, Google highlighted one detail in its announcement: while reviewing a mathematics paper that had already passed human peer review, Deep Think uncovered a logical flaw that every previous reviewer had missed. The flaw was confirmed by mathematicians at Rutgers University.

The significance of this case lies not in the model's performance in standardized tests, but in its capabilities demonstrated in real-world, open-ended scientific scenarios. Peer review is the core quality control mechanism in academia, and if AI can consistently provide valuable assistance in this process, its accelerating effect on scientific research will far exceed what any benchmark test can measure.

Deep Think achieved gold-medal level in the written sections of both the 2025 International Physics Olympiad and the International Chemistry Olympiad, and earned a Codeforces Elo rating of 3,455, corresponding to the "Legendary Grandmaster" tier, a level only a handful of human competitive programmers worldwide have reached.

On "Humanity's Last Exam," a benchmark written by experts across fields deliberately to be hard for AI to answer, Deep Think scored 48.4% without using tools, setting a new record.

Tectonic shifts in market share

The technological race among the three AI giants is reshaping the market landscape. ChatGPT's market share has fallen from a peak of 87% to about 68%, while Gemini has surged from less than 5% to over 18%, and Anthropic's Claude is steadily eroding the enterprise market.

Google's unique advantage in this race is its distribution capabilities. Gemini is built into Android, the Chrome browser, Google Workspace, and the search engine, meaning that even if Google's model capabilities are on par with its competitors, it can still win users through its channel advantage.

However, the distribution advantage is a double-edged sword. If Gemini's user experience falls short, Google may lose user trust faster than any competitor, because its users are "passively exposed" rather than "actively choosing." OpenAI's users chose to pay, which naturally makes them more tolerant and more sticky.

The ripple effect on the crypto industry

Each escalation of the AI arms race drives up demand for computing infrastructure. The cost of the GPU clusters required to train a cutting-edge model has ballooned from hundreds of millions of dollars in 2024 to billions of dollars in 2026. This has two direct knock-on effects.

First, the transformation path of Bitcoin miners. With mining margins squeezed (JPMorgan estimated this week that the production cost of BTC has dropped to $77,000, while BTC trades around $66,000), miners with large-scale computing infrastructure are accelerating their pivot to AI compute services.
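The squeeze can be made concrete with back-of-the-envelope arithmetic from the figures cited above (illustrative only; real miner economics vary widely with power costs and hardware efficiency):

```python
# Back-of-the-envelope miner margin using the figures cited in the article.
production_cost_per_btc = 77_000  # JPMorgan estimate cited above
btc_price = 66_000                # approximate spot price cited above

margin_per_btc = btc_price - production_cost_per_btc
margin_pct = margin_per_btc / production_cost_per_btc * 100

print(f"Margin per BTC mined: ${margin_per_btc:,}")  # Margin per BTC mined: $-11,000
print(f"Margin: {margin_pct:.1f}%")                  # Margin: -14.3%
```

At these numbers, every mined coin loses roughly $11,000 against its estimated production cost, which is the economic pressure behind the pivot to AI compute contracts.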

High-cost mining companies are not so much "exiting" as "changing careers": shifting from mining Bitcoin to earning revenue from contracts that supply AI computing power.

Second, the narrative of AI tokens. Whenever Google, OpenAI, or Anthropic releases a major upgrade, on-chain AI-related tokens (such as decentralized computing protocols) typically experience short-term hype.

But the fundamental problem with these tokens remains unchanged: decentralized computing still has a long way to go on latency and throughput before it can serve enterprise-grade AI training. Narratives move fast; infrastructure cannot keep up with them.

The battle for science has only just begun

The Deep Think upgrade has put Google back at the forefront of the AI race, at least in reasoning and science. But look closely at the wording of Google's announcement and you'll notice a subtle shift in positioning: it no longer emphasizes "the smartest general AI," but repeatedly mentions "born for science."

As benchmarks for general AI become increasingly crowded and differentiation increasingly difficult, "My AI can help you with scientific research" is a more compelling value proposition than "My AI has the highest benchmark score." If Deep Think can truly and consistently assist peer review, accelerate drug discovery, or find solutions that humans have missed in physics simulations, this is more meaningful than any benchmark list.

The problem is that the distance between "being able to score high on benchmark tests" and "being able to reliably assist humans in real scientific scenarios" may be farther than Google suggests, since benchmark tests have standard answers, but science does not.
