Google officially launches Gemini 3, its most powerful agentic and vibe-coding large language model to date

ABMedia
11-19

Google today officially announced its new-generation large language model, Gemini 3, and simultaneously rolled out Gemini 3 Pro across several services, including the Gemini app, AI Mode in Search, AI Studio, and Vertex AI. Google says Gemini 3 is its most powerful multimodal and reasoning model to date, significantly outperforming its predecessor on major AI benchmarks covering science, mathematics, visual understanding, and long-horizon planning. Beyond vibe coding, Google also describes Gemini 3 as its most powerful agentic model, meaning the AI can proactively complete tasks on your behalf.

Google CEO: Gemini's understanding has evolved from reading text and images to "reading the room"

Google CEO Sundar Pichai pointed out that since the Gemini series launched nearly two years ago, Google's AI products have reached two billion users worldwide. Each generation of Gemini builds on the previous one. Gemini 1 brought breakthroughs in native multimodality and long context, allowing it to handle larger and more complex inputs. Gemini 2 laid the foundation for agentic capabilities and pushed the limits of reasoning and thinking.

Now comes Gemini 3, which Google calls its most intelligent model generation, bringing together all the capabilities of the Gemini series so that users can bring any idea to life. Google says it achieves state-of-the-art (SOTA) reasoning with both depth and nuance, whether it is picking up subtle cues in a creative idea or untangling a complex, multi-layered problem.

Gemini 3 can also better understand the context and intent behind a request, so you get the answer you actually need without elaborate prompting. In just two years, AI has gone from reading text and images to understanding scenes and situations (reading the room).

Breakthrough in Reasoning Ability: Gemini 3 Takes Top Scores on LMArena, Science Reasoning, and Math Tests

Gemini 3 Pro set new highs on several benchmarks in the latest evaluations:

  • LMArena: tops the leaderboard with an Elo score of 1501.
  • Humanity's Last Exam (Academic Reasoning): 37.5% (without tools).
  • GPQA Diamond (Scientific Reasoning): 91.9%.
  • MathArena Apex (Advanced Mathematics): 23.4%.
  • MMMU-Pro: 81%
  • Video-MMMU: 87.6%
  • SimpleQA Verified: 72.1% (factual accuracy)

According to Google, these results show that Gemini 3 Pro is highly reliable at scientific, mathematical, and multimodal reasoning and can handle extremely complex problems.

Google also introduced the Gemini 3 Deep Think reasoning mode, which scores 45.1% on ARC-AGI-2 (with code execution) and further strengthens the model's reasoning. Other Deep Think results include:

  • Humanity's Last Exam: 41.0%
  • GPQA Diamond: 93.8%

Gemini 3: Enhanced learning, execution, and planning capabilities

Google describes Gemini 3 as its most powerful vibe-coding and agentic-coding model to date. Specific scores include:

  • WebDev Arena: 1487 Elo (highest)
  • Terminal-Bench 2.0: 54.2% (tool use in a terminal)
  • SWE-bench Verified: 76.2% (large programming tasks)

Gemini 3 also powers Google Antigravity, Google's new agent-based development platform, where the AI can autonomously carry out multi-step tasks: planning, writing code, operating the terminal, verifying the result, and controlling a browser. Agentic AI refers to AI systems that can take action on their own initiative, plan multi-step tasks, and operate tools autonomously. The core idea is that the AI no longer just provides answers but completes tasks like an assistant.

For example, if I type "Help me get today's ETH price and update the Google Sheet," an agentic AI would query a price API on its own and then write the result into the Google Sheet.
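The ETH example can be sketched as a tiny tool-running loop. This is only an illustration of the agentic pattern, not how Gemini or Antigravity is implemented: `FakeSheet`, `fetch_eth_price`, `run_agent`, and the hard-coded plan are all hypothetical stand-ins, and the price is a placeholder rather than a real quote.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FakeSheet:
    """In-memory stand-in for a spreadsheet (hypothetical, not the Google Sheets API)."""
    rows: list[list[Any]] = field(default_factory=list)

    def append_row(self, row: list[Any]) -> None:
        self.rows.append(row)


def fetch_eth_price() -> float:
    """Stub tool: a real agent would query a price API here."""
    return 3000.0  # placeholder value, not a real quote


def run_agent(plan: list[str], tools: dict[str, Callable[[dict], Any]]) -> dict:
    """Execute each planned step in order, sharing results through `state`."""
    state: dict[str, Any] = {}
    for step in plan:
        state[step] = tools[step](state)
    return state


sheet = FakeSheet()
tools = {
    "get_price": lambda state: fetch_eth_price(),
    "update_sheet": lambda state: sheet.append_row(["ETH", state["get_price"]]),
}
# Assumption: in a real agentic system the model derives this plan from the request.
plan = ["get_price", "update_sheet"]
result = run_agent(plan, tools)
print(sheet.rows)
```

The point of the sketch is the division of labor: the model decides which steps to take and in what order, while ordinary functions do the actual work.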

Large language models are non-deterministic: the same user input can produce very different outputs from one run to the next. Gemini 3, however, maintains consistent decision-making across a full simulated year in Vending-Bench 2, which means it can reliably assist you with tasks such as:

  • Booking local services
  • Organizing Gmail
  • Handling multi-step workflows

Starting today, Gemini Agent is available to Google AI Ultra subscribers. Google states that Gemini 3 is its most thoroughly security-tested model to date, with stronger resistance to sycophancy, prompt injection, and cyberattacks. Deep Think mode will be made available to Google AI Ultra subscribers once additional safety testing is complete.

Risk Warning

Investing in cryptocurrencies carries a high degree of risk; prices can fluctuate wildly, and you could lose all of your principal. Please carefully assess the risks.

xAI announced on November 17 that its latest model, Grok 4.1, is now officially available to all users on grok.com, X (formerly Twitter), and the iOS and Android apps. xAI stated that this upgrade focuses on "real-world usability," including stronger emotional understanding, a more natural personality, greater creativity, and a lower hallucination rate, while retaining the reasoning ability and stability of the previous Grok 4.

Grok 4.1 wins nearly 65% of blind comparisons in silent testing and is confirmed for full release

xAI ran a two-week silent test from November 1 to November 14, routing a small share of real traffic on grok.com, X, and the mobile apps to a Grok 4.1 beta and comparing it directly with the previous Grok 4 model in blind pairwise tests.

xAI stated that in these blind comparisons on real traffic, Grok 4.1 was preferred over Grok 4 64.78% of the time, and it announced general availability on November 17. Grok 4.1 is used automatically when Auto mode is enabled, and users can also select it manually from the model menu.

Grok 4.1: Three Key Technical Highlights

Grok 4.1 Technical Highlight 1: New reinforcement learning methods make responses more natural and more human-like.

The core upgrade in Grok 4.1 builds on the same "large-scale reinforcement learning infrastructure" used for Grok 4, but adds new methods that let the model's responses be optimized automatically at a larger scale. This training targets non-verifiable aspects of response quality, such as tone, persona consistency, emotional interaction, and understanding of intent, which cannot be scored directly against a ground-truth answer.

To address this, xAI used "cutting-edge reasoning models" as reward models. Models with strong reasoning capabilities automatically evaluate Grok 4.1's responses, and through a large number of comparisons the model being trained learns which answers are better and closer to what people expect, then adjusts accordingly. As a result, Grok 4.1 shows significant improvements in tone, personality, emotion, and naturalness of interaction while maintaining its original reasoning ability and stability.
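The general idea of using a judge model to turn comparisons into a reward signal can be sketched as follows. xAI has not published its training code, so this is a generic illustration under stated assumptions: `toy_judge` is a hypothetical keyword heuristic standing in for a reasoning model, and `pairwise_rewards` is an illustrative helper rather than xAI's method.

```python
from itertools import combinations


def toy_judge(prompt: str, a: str, b: str) -> str:
    """Hypothetical stand-in for a reasoning model acting as judge.

    Prefers the response with more empathetic keywords; a real judge
    would be a call to a language model.
    """
    def score(text: str) -> int:
        keywords = ("sorry", "glad", "happy", "understand")
        return sum(word in text.lower() for word in keywords)

    return a if score(a) >= score(b) else b


def pairwise_rewards(prompt: str, candidates: list[str], judge=toy_judge) -> dict[str, float]:
    """Reward for each candidate = fraction of its pairwise comparisons it wins."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[judge(prompt, a, b)] += 1
    n_opponents = len(candidates) - 1
    return {c: wins[c] / n_opponents for c in candidates}


candidates = ["No.", "I understand, and I'm glad to help.", "Sorry, I can't."]
rewards = pairwise_rewards("Can you help me?", candidates)
print(rewards)
```

In a reinforcement learning setup, scores like these would then be used to nudge the model toward the preferred kind of response; that update step is omitted here.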

Grok 4.1 Technical Highlight 2: Top of the blind-test leaderboards, with significant upgrades in emotional understanding and creativity.

xAI also released several test results, showing that Grok 4.1 has made significant improvements in multiple capability tests.

  • On LMArena, the global blind-comparison platform:

    • Grok 4.1 Thinking ranks first in the world with an Elo score of 1483.

    • Grok 4.1 Non-Thinking ranks second with an Elo score of 1465, ahead of other models even in their full reasoning modes.

  • Emotional Understanding Test (EQ-Bench 3): This test uses 45 challenging scenarios, each with three turns of interaction, and is scored by Claude 3.7 Sonnet. Grok 4.1 showed significant improvement in empathy, emotional insight, and interpersonal understanding.

  • Creative Writing v3: In a writing test with 32 prompts and 3 rounds each, Grok 4.1 scored higher on writing style, narrative quality, and story flow. xAI's announcement includes several sample responses.

Overall, Grok 4.1 not only improves reasoning ability, but also shows significant upgrades in "emotional interaction" and "creative ability".

As shown in the figure, Grok 4.1 ranks in the top three for the overall ranking of reasoning models, emotional understanding, and creative writing.

(Note: Elo refers to Grok 4.1's rating on the global blind-testing platform LMArena, which applies the Elo rating system, originally developed for chess, to evaluate the quality of model responses.)
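To give a feel for what an Elo gap means, here is a minimal sketch of the standard Elo expected-score formula applied to the two Grok 4.1 ratings above. It assumes the classic chess scale of 400 points; LMArena's exact rating procedure may differ, so the result is indicative only. The function name `expected_score` is illustrative.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score: probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Grok 4.1 Thinking (1483) vs. Grok 4.1 Non-Thinking (1465)
p = expected_score(1483, 1465)
print(f"Expected preference rate for the higher-rated model: {p:.1%}")
```

Under that assumption, an 18-point gap corresponds to the higher-rated model being preferred only slightly more than half the time, so small Elo differences near the top of a leaderboard reflect narrow margins.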

Grok 4.1 Technical Highlight 3: Hallucinations cut to roughly a third, with more reliable information.

For common information-seeking questions, xAI specifically highlights the sharp reduction in Grok 4.1's hallucination rate. Previously, Grok's fast mode (Non-Reasoning) was prone to hallucinations because of its limited reasoning depth, and xAI explicitly targeted this issue in the post-training of Grok 4.1. xAI's verification methods include:

  • Sampling real questions that users actually asked on the platform.

  • Comparing the responses of Grok 4.1 with those of the older model.

  • Evaluating performance on FActScore.

The results showed that the new version hallucinates significantly less when looking up facts and answering informational questions, and that its answers are more stable and credible. This makes Grok 4.1 more practical and accurate than its predecessor for quick answers and information lookups.

As shown in the graph, Grok 4.1's hallucination rate fell from 12.09% to 4.22%, roughly a third of its previous level. The figure reported on FActScore, where lower is better, also fell from 9.89% to 2.97%, indicating a significant improvement in Grok 4.1's accuracy.
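The size of these drops can be checked with a few lines of arithmetic. The sketch below uses only the four percentages quoted above; the helper name `reduction` is illustrative.

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (fold reduction, relative reduction in percent)."""
    return before / after, (before - after) / before * 100


for name, before, after in [
    ("hallucination rate", 12.09, 4.22),
    ("FActScore figure", 9.89, 2.97),
]:
    fold, pct = reduction(before, after)
    print(f"{name}: {fold:.2f}x lower, a {pct:.1f}% relative reduction")
```

This gives about 2.86x (a 65.1% relative reduction) for the hallucination rate and about 3.33x (70.0%) for the FActScore figure, so "three times" is a rounded description of both.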

(Note: FActScore is a public benchmark consisting of 500 biographical questions about real people, used to evaluate a model's factual accuracy, judgment, and answer consistency.)



Disclaimer: The content above is only the author's opinion which does not represent any position of Followin, and is not intended as, and shall not be understood or construed as, investment advice from Followin.