Google and OpenAI are locked in a fierce battle, bombarding each other with new products.
Last night, OpenAI avenged its loss to Gemini 3 with the expert-grade GPT-5.2.
More than an hour before GPT-5.2's release, Google launched a brand-new version of the Gemini Deep Research Agent.
Google has reimagined Gemini Deep Research, making it more powerful than ever before.
The new version of the Deep Research Agent is built on Gemini 3 Pro;
it improves accuracy and reduces hallucinations through multi-step reinforcement-learning training;
and it can handle massive amounts of context while providing source verification for every claim it makes.
Alongside the Deep Research Agent update, two other new capabilities were released:
DeepSearchQA, a new open-source benchmark for web research agents that measures how comprehensively an agent covers a research task;
and a brand-new Interactions API.
Although GPT-5.2 had only just been released and direct comparisons were not yet possible, Lukas Haas, a product manager at Google DeepMind, revealed on X:
The latest Gemini Deep Research Agent scored 46.4% on Google's new benchmark and matched GPT-5 Pro on BrowseComp at roughly an order of magnitude lower cost.
Deep research, even more "in-depth"
Gemini Deep Research is an agent optimized for long-term context gathering and synthesis tasks.
The agent's reasoning core uses Gemini 3 Pro, the most factually accurate Gemini model to date, specially trained to reduce hallucinations and maximize report quality on complex tasks.
By extending the application of multi-step reinforcement learning in search, this agent is able to autonomously navigate complex information environments with high precision.
Gemini Deep Research achieved a leading 46.4% on the full Humanity's Last Exam (HLE) test set, an excellent 66.1% on DeepSearchQA, and a high score of 59.2% on the BrowseComp test.
Deep Research employs an iterative research-planning loop: it formulates queries, reviews results, identifies knowledge gaps, and searches again.
This version significantly improves web search, enabling the agent to delve deeper into websites to retrieve specific data.
The agent has been optimized to generate well-researched reports at a lower cost.
Unlike traditional chatbots, Deep Research is designed as a long-running system, with its core competency lying in handling complex tasks that are not instantaneous.
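The iterative loop described above (formulate queries, review results, identify gaps, search again) can be sketched roughly as follows. The function names and data shapes are illustrative, not Google's actual internals:

```python
def deep_research(question, search_fn, max_rounds=5):
    """Iteratively gather findings until no knowledge gaps remain.

    search_fn is any callable that takes a query string and returns a
    dict with the result text and a "gaps" list of follow-up questions.
    """
    findings = {}          # query -> search result
    queries = [question]   # the initial plan starts from the user's question
    for _ in range(max_rounds):
        for q in queries:
            if q not in findings:
                findings[q] = search_fn(q)
        # Review every result so far; any follow-up question a result
        # raises becomes a new query in the next round.
        queries = [gap
                   for result in findings.values()
                   for gap in result.get("gaps", [])
                   if gap not in findings]
        if not queries:    # no remaining gaps: research is complete
            break
    return findings
```

The loop terminates either when a round produces no new gaps or when the round budget is exhausted, mirroring the "long-running, not instantaneous" design the article describes.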
A brief aside on deep research
Deep research is arguably the most frequently used capability among everyday AI tools.
After all, for $20 a month you get the equivalent of multiple PhD-level assistants, so why not?
My view is that deep research is the AI capability ordinary people can leverage most effectively, easily outclassing paid knowledge services.
Tools like Deep Research derive their intelligence not from the brute-force compute of a single model but from a sophisticated agentic workflow.
This workflow simulates the cognitive behavior of human experts when facing unfamiliar domains, and mainly includes four closed-loop stages: planning, execution, reasoning, and reporting.
When a user submits a vague macro-level instruction (such as "analyze the commercialization path of quantum sensors by 2030"), DeepResearch first activates its planning module.
Building on Gemini 3 Pro's reasoning capabilities, the system does not search immediately. Instead, it uses step-back prompting to decompose the macro-level question into multiple research sub-dimensions, such as technology maturity, supply-chain bottlenecks, the policy and regulatory environment, and analysis of major competitors.
This planning process is dynamic. In traditional chain-of-thought reasoning the path is typically linear; in Deep Research, the planning tree is expandable.
If an unforeseen new concept is discovered during the initial search, the system will modify the research plan in real time, adding new branches for in-depth exploration.
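A minimal sketch of such an expandable planning tree, assuming a simple node structure (an illustration, not Google's implementation):

```python
class PlanNode:
    """One topic in the research plan; leaves are queries still to run."""

    def __init__(self, topic):
        self.topic = topic
        self.children = []

    def expand(self, subtopics):
        """Graft new research branches onto this node mid-run."""
        self.children.extend(PlanNode(t) for t in subtopics)

    def leaves(self):
        """Collect the concrete, still-unexpanded query topics."""
        if not self.children:
            return [self.topic]
        return [leaf for c in self.children for leaf in c.leaves()]


plan = PlanNode("quantum sensor commercialization by 2030")
plan.expand(["technology maturity", "supply chain bottlenecks",
             "policy environment", "competitor analysis"])
# Mid-run discovery: a search under "supply chain bottlenecks" surfaces
# an unforeseen concept, so the plan grows a new branch in real time.
plan.children[1].expand(["cryogenic component suppliers"])
```

Unlike a fixed linear chain, the set of pending queries (`plan.leaves()`) changes as branches are added, which is exactly the real-time plan revision the article describes.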
DeepSearchQA: Benchmarking for Deep Research Agents
In the benchmark results above, you may have noticed something called DeepSearchQA.
This is a brand-new benchmark Google developed specifically for deep research agents, evaluating their performance on complex, multi-step information-retrieval tasks.
DeepSearchQA includes 900 hand-designed causal chain tasks covering 17 domains, where each step relies on previous analysis.
Unlike traditional fact-lookup tests, DeepSearchQA assesses research completeness by requiring agents to produce an exhaustive set of answers, while also testing precision and recall.
DeepSearchQA can also serve as a diagnostic tool for test-time compute efficiency.
In internal evaluations, Google found that performance improved significantly when the agent was allowed more search and reasoning steps.
Comparing the results of pass@8 and pass@1 demonstrates the value of allowing the agent to verify the answer by exploring multiple trajectories in parallel.
These results were computed on a subset of 200 prompts from DeepSearchQA.
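The pass@k comparison can be computed with the standard unbiased estimator from Chen et al. (2021). Whether Google used exactly this formula is not stated, so treat this as the conventional approach rather than their documented method:

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n trajectories, of which
    c are correct, solves the task.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer incorrect trajectories than draws: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 8 trajectories of which 2 are correct, `pass_at_k(8, 2, 1)` gives 0.25 while `pass_at_k(8, 2, 8)` gives 1.0, which is why letting the agent explore multiple trajectories in parallel and verify answers lifts the score.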
Interactions API: built for agent application development
The Interactions API provides a set of interfaces designed specifically for agent applications. They efficiently handle complex context-management tasks: interleaved messages, chains of thought, tool calls, and their state.
Beyond the Gemini model family, the Interactions API also ships with its first built-in agent: Gemini Deep Research.
Next, Google plans to expand its set of built-in agents and let developers bring their own, so that Gemini models, Google's built-in agents, and custom agents can all be connected through a single API.
The Interactions API exposes a single RESTful endpoint for interacting with models and agents.
The Interactions API extends the core functionality of generateContent, providing the features needed for modern intelligent agent applications, including:
Optional server-side state: The ability to offload history management to the server. This simplifies client-side code, reduces context management errors, and may reduce costs by improving cache hit rates.
An interpretable, composable data model: a clear schema designed for complex agent histories. You can debug, manipulate, stream, and reason over interleaved messages, thought processes, tool calls, and their results.
Background execution: The ability to offload long-running inference loops to the server without maintaining client connections.
Remote MCP tool support: The model can directly call the Model Context Protocol (MCP) server as a tool.
With the launch of the Interactions API , Google is attempting to redefine how developers build AI applications, shifting from a "stateless request-response" model to a "stateful agent interaction" model.
Most current LLM APIs are stateless. Developers must maintain the entire conversation history on the client side and send the context of tens of thousands of tokens back to the server with each request.
This not only increases latency and bandwidth costs, but also makes building complex, multi-step agents extremely cumbersome.
The Interactions API introduces server-side state management.
Developers only need to create a session through the /interactions endpoint, and Google's servers will automatically maintain all the context of that session, the results of tool calls, and the internal thought state of the Agent.
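A rough sketch of what such a stateful call might look like. The `/interactions` endpoint and the agent name come from Google's announcement, but the base URL and JSON field names below are assumptions for illustration, not the documented request schema:

```python
import json

# Assumed base URL, for illustration only.
API_BASE = "https://generativelanguage.googleapis.com/v1beta"


def build_interaction_request(user_message, agent=None, previous_id=None):
    """Build the URL and body for one Interactions API call.

    Passing previous_id (a hypothetical field) lets the server, rather
    than the client, carry the conversation state forward, so the client
    never resends the full history.
    """
    body = {"input": user_message}
    if agent:
        body["agent"] = agent              # e.g. a built-in research agent
    if previous_id:
        body["previous_interaction_id"] = previous_id
    return f"{API_BASE}/interactions", json.dumps(body)


url, payload = build_interaction_request(
    "Survey the 2030 quantum sensor market",
    agent="deep-research-pro-preview-12-2025",
)
```

The client-side win is visible in the payload: each turn carries only the new input plus an interaction ID, instead of tens of thousands of tokens of replayed history.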
That's what I find terrifying about Google's latest API.
The most revolutionary feature of the Interactions API is that it allows developers to directly invoke Google's pre-trained high-level agents, not just the base model.
For example, developers can embed Google's top research capabilities into their own ERP, CRM, or research software through a simple API call (specifying agent=deep-research-pro-preview-12-2025).
Given that a single DeepResearch task may consume hundreds of thousands of tokens in reads and generation, the cost of a single deep research study could reach several dollars.
However, that price still represents an exceptional return on investment compared with the hours or even days a junior human analyst would need for the same work.
DeepMind partners with the UK government
Finally, there is one more piece of news worth noting.
While Google and OpenAI are locked in a fierce battle, Google DeepMind has already begun collaborating at the national level.
DeepMind, the AI giant born in London, is running an unprecedented experiment in "AI-driven governance" with the British government, built on Deep Research and its underlying technologies.
This collaboration extends beyond scientific exploration to the very core of public administration, achieving groundbreaking progress, particularly in addressing the UK’s long-standing housing crisis and planning inefficiencies.
Project Extract: Breaking Down the "Data Silos" in Urban Planning
The UK’s urban planning system has long been considered a bottleneck hindering economic growth and housing construction.
Each year, local councils need to process about 350,000 planning applications, and a large number of historical planning files still exist in the form of paper, scanned PDFs, or hand-drawn maps.
Planners often have to spend hours searching through dusty archives for underground pipelines or protected area boundaries drawn decades ago.
To address this pain point, DeepMind partnered with the UK government's AI incubator (i.AI) to develop the Extract tool.
This is not simple OCR software but a sophisticated geospatial intelligence system built on Gemini's multimodal reasoning capabilities.
Unstructured information understanding:
Extract first utilizes Gemini's visual language capabilities to read low-quality scanned documents. It can not only recognize text but also understand the semantics of handwritten annotations (e.g., recognizing "approval date" instead of "application date" in a side note), achieving a date recognition accuracy of 94%.
Visual reasoning and polygon extraction:
This is the core technological breakthrough. Gemini can understand the visual symbol language of maps, such as distinguishing property boundaries drawn as red solid lines from drainage ditches drawn as blue dashed lines. Once the target area is identified, the system calls computer vision tools such as OpenCV and SAM to extract geographic polygons from the pixel image with surgical precision, achieving a shape-match score (IoU) of 90%.
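For readers unfamiliar with the metric, intersection-over-union (IoU) measures how well an extracted shape overlaps the ground truth. A toy version for axis-aligned rectangles (real planning polygons are arbitrary shapes, so this is only illustrative of the arithmetic):

```python
def rect_iou(a, b):
    """IoU of two axis-aligned rectangles given as (x1, y1, x2, y2)."""
    # Intersection box: the overlap of the two coordinate ranges.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Union = sum of areas minus the double-counted overlap.
    return inter / (area_a + area_b - inter)
```

An IoU of 1.0 means a perfect match and 0.0 means no overlap, so the 90% figure indicates the extracted polygons almost coincide with the hand-drawn originals.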
Spatiotemporal feature matching:
Historical maps often use different scales and reference systems than modern satellite maps. Extract uses the LoFTR algorithm to find common feature points (such as old churches and intersections) between old and modern maps, calculates an accurate transformation matrix, and precisely maps hand-drawn red lines from decades ago onto today's digital map coordinate system.
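Once a matcher such as LoFTR yields corresponding landmarks, the georeferencing step reduces to solving for a transformation matrix. A minimal least-squares fit of an affine transform, standing in for the full pipeline (the helper names are mine, not Extract's):

```python
import numpy as np


def fit_affine(src, dst):
    """Solve dst ≈ A @ src + t from matched point pairs (N x 2 arrays).

    Needs at least 3 non-collinear matches; returns a 2 x 3 matrix [A | t].
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous design matrix: one [x, y, 1] row per matched point.
    X = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)  # shape (3, 2)
    return params.T                                   # shape (2, 3)


def apply_affine(M, pts):
    """Map points through the fitted transform: A @ p + t."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]
```

With the matrix in hand, every hand-drawn red line from the old map can be re-projected point by point into the modern coordinate system. (A full homography or thin-plate warp would handle more severe distortion; the affine case keeps the idea visible.)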
Full-process automation:
Through this pipeline, Extract cuts the processing time for a complex planning document from an average of 2 hours to between 40 seconds and 3 minutes. A local council can therefore digitize hundreds of backlogged documents per day, a roughly hundredfold efficiency gain.
Currently, Extract is being piloted in four areas, including Westminster and Hillingdon.
The UK government plans to extend it to all local councils across the country in the spring of 2026.
This will not only free up thousands of hours of administrative manpower, but more importantly, it will build a unified national digital planning database, providing a data foundation for the UK government's promised plan to build 1.5 million new homes.
This is a prime example of DeepResearch technology being applied in a vertical industry—transforming general multimodal reasoning capabilities into concrete administrative productivity.
New Scientific Infrastructure: From AlphaFold to Automated Materials Labs
In the field of basic science, DeepMind's collaboration with the UK government aims to accelerate the flywheel effect of scientific discovery through AI.
DeepMind has announced plans to establish its first automated AI science lab in the UK in 2026.
Closed-loop discovery system: The lab will run a closed-loop system driven by Gemini and GNoME (Graph Networks for Materials Exploration). AI will be responsible for designing new crystal structures based on quantum chemical principles and predicting their stability.
Robotic synthesis: These design instructions are sent directly to a fully automated robotic platform, which is responsible for ingredient mixing, synthesis, sintering, and testing.
Data feedback: Experimental results are fed back to AI in real time to revise predictions for the next round. The goal is to shorten the discovery cycle of new materials (such as room-temperature superconductors and high-efficiency battery electrolytes) from decades to months or even days. This initiative directly serves the UK's NetZero strategy and energy security.
In addition to its hardware labs, DeepMind has also opened a range of cutting-edge AI models to British scientists.
National security and the digital immune system
In the security field, the focus of cooperation has shifted from "offensive capabilities" to "defensive resilience".
DeepMind has partnered with the UK AI Security Institute to deploy network-defense tools built on Deep Research technology.
Big Sleep (formerly Project Naptime): an intelligent agent that uses a large language model (LLM) to find hidden vulnerabilities in large codebases. It has already discovered memory-safety vulnerabilities in core open-source infrastructure such as SQLite that human experts had missed.
CodeMender, working in tandem with Big Sleep, not only discovers vulnerabilities but also automatically generates patch code to fix them. This automated discover-and-remediate loop aims to give the UK's Critical National Infrastructure (CNI) a real-time "digital immune system" against increasingly sophisticated cyberattacks.
That wraps up Google's answer to GPT-5.2.
Personally, I still think Google is the strongest.
Although GPT-5.2 got the better of Gemini 3 last night, it still lags slightly in multimodal capability; perhaps a product that can rival Nano Banana Pro will appear by the end of the year.
And judging from the latest agent research and DeepMind's strategic positioning in the UK, Google is even further ahead.
This leading position shows us a clear picture of the development of AI technology:
The prototype of artificial general intelligence (AGI) is emerging from the chat box and evolving into agents that can perceive, plan, and change the physical and digital worlds.
References:
https://blog.google/technology/developers/deep-research-agent-gemini-api/
https://x.com/GoogleDeepMind/status/1999165701811015990
https://deepmind.google/blog/strengthening-our-partnership-with-the-uk-government-to-support-prosperity-and-security-in-the-ai-era/
This article is from the WeChat official account "New Intelligence", author: Ding Hui, published with authorization from 36Kr.