Do you think large language models can easily "browse the internet"?
The new benchmark test BrowseComp-ZH directly challenges mainstream AI.
BrowseComp-ZH is a new benchmark jointly released by HKUST (Guangzhou), Peking University, Zhejiang University, Alibaba, ByteDance, NIO, and other institutions, on which more than 20 mainstream large models, domestic and international, collectively "fail":
GPT-4o's accuracy on the test is only 6.2%; most models, domestic and international, score below 10%; even the current best performer, OpenAI DeepResearch, manages only 42.9%.
All data from BrowseComp-ZH has already been open-sourced, and the research team is blunt about why such a benchmark is needed.
Why do we need a Chinese web page capability test?
Today's large models are increasingly adept at "using tools": they can connect to search engines, call plugins, and "view web pages".
However, most existing evaluations are built for English contexts, with little consideration for the Chinese language, Chinese search engines, or Chinese platform ecosystems.
Moreover, Chinese internet information is highly fragmented, with diverse search entry points and complex language expressions.
How hard is the Chinese web to navigate? A few examples make it clear:
Information is fragmented, scattered across Baidu Baike, Weibo, local government websites, video channels, and other platforms
Everyday phrasing is full of omissions, allusions, and indirect references, so keyword searches often go off track
Search engine quality varies, and relevant information is frequently buried or simply lost
Therefore, simply translating English test sets is not enough. A benchmark must be designed natively for the Chinese context to truly measure whether large models can understand, find, and reason on Chinese web pages.
How was BrowseComp-ZH created?
The research team used a "reverse design" method: starting from a clear, verifiable factual answer (such as a painting genre, an institution, or a film or TV title), they constructed complex, multi-constraint questions in reverse, ensuring three things:
The answer does not appear directly on the first results page of Baidu, Bing, or Google
Multiple mainstream large models cannot answer correctly even in retrieval mode
Each question is manually verified to have a clear structure and exactly one answer
Ultimately, they constructed 289 high-difficulty Chinese multi-hop retrieval questions, covering 11 major domains including film and television, art, medicine, geography, history, and technology.
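The three construction criteria above can be sketched as a simple filter. This is a minimal illustration only: `serp_top_page`, `model_answers_with_search`, and `human_verify` are hypothetical stubs standing in for live search-engine queries, model API calls, and manual review; they are not the team's actual tooling.

```python
def serp_top_page(engine: str, query: str) -> str:
    """Stub: text of the first results page (here, just a canned string)."""
    return f"[{engine} results for: {query}]"

def model_answers_with_search(question: str) -> list[str]:
    """Stub: answers from several retrieval-enabled mainstream models."""
    return ["unknown", "unknown", "unknown"]

def human_verify(question: str, answer: str) -> bool:
    """Stub: a human confirms clear structure and exactly one valid answer."""
    return True

def passes_reverse_design(question: str, answer: str) -> bool:
    # 1. The answer must not surface on the first results page of
    #    mainstream search engines.
    for engine in ("baidu", "bing", "google"):
        if answer in serp_top_page(engine, question):
            return False
    # 2. Several mainstream models in retrieval mode must fail to answer it.
    if any(ans == answer for ans in model_answers_with_search(question)):
        return False
    # 3. Manual verification: clear structure, unique answer.
    return human_verify(question, answer)

print(passes_reverse_design("Which 1990s film ... ?", "some-title"))  # True
```

A candidate question survives only if all three checks pass; anything a first-page search or a retrieval-enabled model can already answer is discarded.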
Large models collectively "crash"? DeepResearch barely breaks 40%, with the vast majority not even reaching 10%
Under the BrowseComp-ZH test, multiple domestic and international mainstream large models collectively "crashed":
Despite these models already showing strong capabilities in dialogue understanding and generative expression, their accuracy in complex Chinese internet retrieval tasks is surprisingly low:
Most models' accuracy is below 10%, with only a few breaking through 20%
OpenAI DeepResearch ranks first with 42.9%, still far from "passing"
The researchers point out that this result shows models must not only look up information but also perform multi-hop reasoning and information integration to actually find answers on the Chinese internet.
Four major findings reveal the "model blind spots" of Chinese web page tasks
1. Memory alone is not enough, real skills are needed
Models relying purely on parameter memory (without search) often have an accuracy rate below 10%, showing that "rote memorization" is not reliable.
2. Models with reasoning perform better
DeepSeek-R1 (23.2%) scores 14.5 percentage points higher than DeepSeek-V3 (8.7%), and Claude-3.7 improves on Claude-3.5 by 12.2 percentage points; reasoning ability is becoming a key variable.
3. More searches ≠ More accurate searches, multi-round strategies are key
AI search products with multi-round retrieval capabilities outperform across the board:
DeepResearch: 42.9%
Doubao Deep Search: 26.0%
Perplexity Research mode: 22.6%
By comparison, products that search only once (such as Kimi and Yuanbao) score in the single digits.
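The gap between single-shot and multi-round retrieval can be illustrated with a toy example. This is a hedged sketch, not any product's actual pipeline: `WEB` is a tiny stand-in for a search index, and a real agent would use the LLM itself to reformulate each follow-up query from the previous finding.

```python
# Toy "web": each query resolves exactly one hop of a multi-hop question.
WEB = {
    "director of film X": "directed by Y",
    "university of Y": "Y studied at Z",
}

def search(query: str) -> str:
    """Stub search engine over the toy index."""
    return WEB.get(query, "no result")

def single_shot(question: str) -> str:
    # One query, no follow-up: multi-hop questions usually dead-end here.
    return search(question)

def multi_round(hops: list[str]) -> list[str]:
    # One query per hop, each finding informing the next step.
    findings = []
    for query in hops:
        findings.append(search(query))
    return findings

print(single_shot("which university did the director of film X attend?"))
print(multi_round(["director of film X", "university of Y"]))
```

The single-shot call dead-ends because the compound question matches nothing directly, while the multi-round loop recovers the answer hop by hop; this mirrors why multi-round products top the leaderboard above.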
4. Search function "crashes"? Adding it actually makes performance worse
The most typical example is DeepSeek-R1, whose accuracy plummeted from 23.2% to 7.6% after enabling search functionality.
The researchers attribute this to the model's failure to integrate retrieved web content with its existing knowledge; instead, it is misled by noisy results.
Dataset open! Developers are welcome to challenge it
All data from BrowseComp-ZH is now open-source.
The researchers hope this benchmark becomes a touchstone for deploying LLMs in Chinese-language information environments, helping to build truly "internet-savvy" intelligent agents.
Next, they plan to expand sample size, extend question formats, and conduct in-depth analysis of model reasoning paths and failure cases.
Paper address: https://arxiv.org/abs/2504.19314
Code address: https://github.com/PALIN2018/BrowseComp-ZH
This article is from the WeChat public account "Quantum Bit", authored by the BrowseComp-ZH team, published by 36Kr with authorization.