HLE's "Humanity's Final Exam" exceeded 60 points for the first time. Eigen-1, based on DeepSeek V3.1, significantly outperformed Grok4 and GPT-5.


For the first time, a system has broken the 60-point mark on the expert-verified subset of HLE ("Humanity's Last Exam")!

Just recently, the Eigen-1 multi-agent system, jointly developed by Tang Xiangru and Wang Yujie from Yale University, Xu Wanghan from Shanghai Jiao Tong University, Wan Guancheng from UCLA, Yin Zhenfei from Oxford University, and Jin Di and Wang Hanrui from Eigen AI, achieved a historic breakthrough.

On the HLE Bio/Chem Gold test set, Pass@1 accuracy reached 48.3% and Pass@5 accuracy climbed to 61.74%, crossing the 60-point mark for the first time. This result far surpasses Google Gemini 2.5 Pro (26.9%), OpenAI GPT-5 (22.82%), and Grok 4 (30.2%).
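For context, Pass@k is the fraction of questions answered correctly by at least one of k sampled attempts. Below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021); whether Eigen-1's evaluation uses exactly this formulation is an assumption, so the snippet is for orientation only.

```python
# Standard unbiased pass@k estimator; n samples per question, c of them correct.
# Shown for context only -- the paper's exact evaluation script is not specified here.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples is correct."""
    if n - c < k:  # fewer incorrect samples than k: a correct one is always drawn
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a question solved in 2 of 5 sampled attempts
print(round(pass_at_k(n=5, c=2, k=1), 3))  # 0.4
print(pass_at_k(n=5, c=2, k=5))            # 1.0
```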

What is most exciting is that this achievement does not rely on any closed-source large model; it is built entirely on the open-source DeepSeek V3.1.

On this open-source foundation, the research team achieved a qualitative leap by layering three innovative mechanisms: Monitor-based RAG (implicit knowledge augmentation), HSR (hierarchical solution refinement), and QAIR (quality-aware iterative reasoning).

The details are as follows.

Technological innovation: three pillars support the 60-point breakthrough

As AI begins to challenge the ultimate boundaries of human knowledge, an unprecedented contest is unfolding.

As large models push toward scores of 90 on traditional benchmarks such as MMLU and GPQA, those tests gradually lose their discriminative power. To track AI's real progress at the frontier of scientific reasoning, the Center for AI Safety and Scale AI jointly launched "Humanity's Last Exam" (HLE).

It comprises 3,000 doctoral-level problems spanning more than 100 fields, including mathematics, the natural sciences, engineering, and the humanities and social sciences, and is regarded as the ultimate test of AI knowledge and reasoning.

HLE Bio/Chem Gold is the gold-standard subset of HLE, which contains 149 questions that have been manually reviewed and corrected by domain experts.

Compared to the original HLE dataset, this subset excludes questions that may have ambiguous or incorrect answers, ensuring the accuracy and reliability of the labels, making it the most reliable benchmark for evaluating AI scientific reasoning capabilities.

It was on the HLE Bio/Chem Gold subset that Eigen-1 crossed the 60-point mark for the first time, a result that rests on its three key innovations.

1. Monitor-based RAG: implicit retrieval augmentation that says goodbye to the "tool tax"

Traditional retrieval-augmented generation (RAG) systems are like a video player that is constantly being paused: each time external knowledge is needed, the system must interrupt the reasoning process, construct a query, process the results, and then re-integrate the context.

The research team calls this overhead the "tool tax": every tool call interrupts the thinking process and causes a loss of context.

The "tool tax" problem of traditional RAG systems is vividly demonstrated in the population genetics example shown in the figure below. The left side shows the model overconfidently using an incorrect formula, while the right side shows that even if the correct formula is obtained through explicit RAG, the interruption in the reasoning process prevents the model from reintegrating the knowledge into the original problem.

Eigen-1's Monitor-based RAG completely changes this paradigm with three components (a minimal code sketch follows the list):

Implicit monitoring: the Monitor continuously scans the reasoning trajectory for uncertainty, like a careful assistant silently watching for the moments when help is needed, and triggers retrieval only when uncertainty actually arises.

Precise querying: when uncertainty is flagged, the Querier extracts a minimal set of keywords, avoiding unnecessary expansion of the search space.

Seamless injection: the Injector weaves the retrieved knowledge into the reasoning flow, like naturally supplying background information mid-conversation rather than rigidly inserting references.
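Taken together, the three components form a single reasoning loop in which retrieval is triggered from inside the trajectory rather than scheduled as an explicit tool call. The sketch below illustrates the idea; the uncertainty heuristic, the keyword filter, and the generate_step/retrieve interfaces are assumptions for illustration, not the authors' implementation.

```python
# Illustrative Monitor/Querier/Injector loop (assumed interfaces, toy heuristics).

HEDGE_MARKERS = ("not sure", "approximately", "assuming", "might be", "need to check")

def monitor_flags_uncertainty(step: str) -> bool:
    """Monitor: flag reasoning steps that sound uncertain (toy keyword heuristic)."""
    text = step.lower()
    return any(marker in text for marker in HEDGE_MARKERS)

def build_query(step: str, max_keywords: int = 5) -> str:
    """Querier: keep a minimal set of content words as the search query."""
    words = [w.strip(".,();:") for w in step.split()]
    keywords = [w for w in words if len(w) > 4][:max_keywords]
    return " ".join(keywords)

def solve_with_implicit_rag(problem: str, generate_step, retrieve, max_steps: int = 16):
    """Generate reasoning steps; retrieval fires only when the Monitor flags uncertainty."""
    context = [problem]
    for _ in range(max_steps):
        step = generate_step(context)            # assumed LLM interface: context -> next step
        if monitor_flags_uncertainty(step):
            facts = retrieve(build_query(step))  # assumed retriever: query -> list of facts
            # Injector: weave facts into the context as plain background statements
            context.extend(f"Background: {fact}" for fact in facts)
        context.append(step)
        if step.lower().startswith("final answer"):
            break
    return context
```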

Experimental data show that compared with explicit RAG, Monitor-based RAG reduces token consumption by 53.5% and workflow iterations by 43.7%, while maintaining higher accuracy.

As shown in the figure below, in the haplotype-counting case the Monitor detects uncertainty about the recombination constraint, the Querier generates a targeted query, and the Injector injects two key facts, enabling the model to rule out invalid cases and arrive at the correct answer of 30 haplotypes.

2. Hierarchical Solution Refinement (HSR): From “Democratic Voting” to “Hierarchical Refinement”

In addition to implicit knowledge enhancement, Eigen-1 also revolutionizes the multi-agent collaboration model.

Traditional multi-agent systems use a "democratic voting" mechanism that treats all candidate solutions equally, which easily "dilutes" the optimal solution.

The Hierarchical Solution Refinement (HSR) introduced in Eigen-1 breaks this assumption. HSR adopts an "anchor-repair" structure: one candidate serves as the anchor, and the remaining candidates serve as references used in turn to revise it, forming a hierarchical collaboration.

In the HSR framework, each candidate solution takes turns serving as an "anchor," while other solutions serve as "references" to provide targeted corrections. This design allows strong solutions to absorb valuable insights from weaker solutions, rather than simply averaging them.

Specifically, it includes four repair dimensions: logical completion (filling in missing reasoning steps), numerical correction (correcting calculation errors), method replacement (replacing weaker methods with better strategies), and expression optimization (improving clarity without changing the essence).

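To make the anchor-and-reference structure concrete, here is a minimal sketch, assuming a generic text-in/text-out LLM call; the prompt wording, function names, and round-robin schedule are illustrative, not the authors' implementation.

```python
# Illustrative hierarchical solution refinement: each candidate takes a turn as
# the anchor and is repaired against the other candidates (assumed llm interface).

REPAIR_DIMENSIONS = [
    "logical completion (fill in missing reasoning steps)",
    "numerical correction (fix calculation errors)",
    "method replacement (swap weaker methods for better strategies)",
    "expression optimization (improve clarity without changing the substance)",
]

def refine_anchor(anchor: str, references: list[str], llm) -> str:
    """Revise the anchor once per reference, restricted to the four repair dimensions."""
    for ref in references:
        prompt = (
            "Repair the anchor solution using the reference solution.\n"
            f"Anchor:\n{anchor}\n\nReference:\n{ref}\n\n"
            "Apply only these repairs where warranted: "
            + "; ".join(REPAIR_DIMENSIONS)
            + ".\nReturn the repaired anchor."
        )
        anchor = llm(prompt)  # assumed LLM call: prompt in, revised solution text out
    return anchor

def hierarchical_refinement(candidates: list[str], llm) -> list[str]:
    """Round-robin: every candidate serves as anchor while the rest act as references."""
    return [
        refine_anchor(anchor, candidates[:i] + candidates[i + 1:], llm)
        for i, anchor in enumerate(candidates)
    ]
```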

The figure below illustrates how HSR works on an image recognition task.

In this task of insect identification and flower counting, the anchor solution initially selected ResNet (option C) but was thrown off by a miscalculated deployment time. Using the other solutions as references, the system applied four targeted corrections.

3. Quality-Aware Iterative Reasoning (QAIR): Quality-Driven Iterative Optimization

Quality-aware iterative reasoning (QAIR) adaptively adjusts the iteration depth according to solution quality: high-quality solutions converge earlier, while low-quality solutions trigger further exploration, striking a balance between efficiency and accuracy.

This mechanism evaluates each solution along three dimensions: logical soundness, answer correctness, and completeness of explanation. Only solutions that fail these checks proceed to the next round of revision, so computing resources are not wasted on candidates that do not need further refinement.
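A minimal sketch of such a quality-gated loop is below; the three-way scoring, the 0.8 threshold, and the judge/revise interfaces are assumptions chosen for illustration rather than the system's actual parameters.

```python
# Illustrative quality-aware iteration: only candidates that fail the quality
# check are revised again; thresholds and interfaces are assumed, not the paper's.
from typing import Callable, Dict, List

def qair(candidates: List[str],
         judge: Callable[[str], Dict[str, float]],   # scores for logic / answer / explanation
         revise: Callable[[str], str],               # assumed reviser: solution -> improved solution
         threshold: float = 0.8,
         max_rounds: int = 3) -> List[str]:
    for _ in range(max_rounds):
        failing = []
        for sol in candidates:
            scores = judge(sol)  # e.g. {"logic": 0.9, "answer": 1.0, "explanation": 0.7}
            if min(scores["logic"], scores["answer"], scores["explanation"]) < threshold:
                failing.append(sol)          # low quality: send back for another revision
        if not failing:
            break                            # every candidate passed: converge early
        revised = [revise(sol) for sol in failing]
        candidates = [s for s in candidates if s not in failing] + revised
    return candidates
```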

Winning across the board: more than just HLE

The advantages of Eigen-1 are not limited to HLE:

1. HLE Bio/Chem Gold (149 questions)

Pass@1: 48.30% (13.4 percentage points ahead of SciMaster)

Pass@5: 61.74% (first time breaking 60%)

2. SuperGPQA Biology (hard subset)

Pass@1: 69.57%

Pass@5: 78.26%

3. TRQA (literature understanding)

Pass@1: 54.65%

Pass@5: 79.07%

Deep Insight: The Laws Behind Success

Error pattern analysis

The pie chart in Figure 7 reveals a key insight: 92.78% of errors involve reasoning problems, and 88.66% involve knowledge application problems, with a significant overlap between the two.

This shows that the core challenge of scientific reasoning lies not in simple knowledge retrieval or logical reasoning, but in how to seamlessly integrate knowledge and reasoning.

In contrast, execution compliance errors (13.40%) and comprehension errors (9.28%) accounted for a relatively small proportion, indicating that the model is relatively mature in terms of instruction understanding and execution.

Accurate quantification of component contributions

The team precisely quantified the contribution of each component through incremental construction and ablation experiments.

Without any external knowledge, the baseline system achieved only 25.3% accuracy while consuming 483.6K tokens. Adding explicit RAG improved accuracy to 41.4%, but at the cost of workflow steps surging from 43.4 to 94.8, a clear example of the "tool tax."

After the Monitor component was introduced, accuracy dipped slightly to 34.5%, but token consumption fell sharply to 218.4K and workflow steps dropped to 51.3.

With the addition of Querier and Injector, the accuracy returned to 40.3%. The introduction of HSR increased the accuracy to 43.7%. Finally, QAIR pushed the accuracy of the complete system to 48.3%, while maintaining efficient resource utilization (218.9K tokens, 53.4 steps).

Ablation experiments validated the necessity of each component from another perspective. Removing the Monitor caused token consumption to surge to 461.3K and the number of workflow steps to increase to 95.3, demonstrating the significant value of implicit enhancement.

Removing HSR or QAIR causes accuracy to drop to 44.8% and 43.7%, respectively, demonstrating the important roles of hierarchical refinement and quality-aware iteration.

The delicate balance between diversity and consensus

Through scatter plots and regression analysis, the authors reveal a counterintuitive but instructive finding.

In the information-retrieval tasks (339 samples), agreement among solutions shows only a weak positive correlation with accuracy (slope 0.369): different retrieval paths and perspectives contribute complementary information, so diversity is beneficial.

In the reasoning tasks (392 samples), the picture is exactly the opposite: agreement and accuracy show a strong positive correlation (slope 0.851), indicating that when multiple reasoning paths reach the same conclusion, that conclusion is very likely correct.

Therefore, retrieval tasks should encourage solution diversity and parallel routes, while pure reasoning tasks should favor early consensus and convergence.
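One way to operationalize this guideline is a task-adaptive stopping rule like the sketch below; the agreement threshold and task labels are assumptions, and only the two regression slopes come from the article.

```python
# Hedged sketch of a task-adaptive consensus policy: stop early on reasoning tasks
# once answers converge, keep exploring on retrieval tasks. Threshold is assumed.
from collections import Counter

def should_stop_early(task_type: str, answers: list[str], threshold: float = 0.8) -> bool:
    if not answers:
        return False
    top_count = Counter(answers).most_common(1)[0][1]
    agreement = top_count / len(answers)
    if task_type == "reasoning":
        return agreement >= threshold  # strong agreement-accuracy link (slope 0.851)
    return False                       # retrieval: weak link (slope 0.369), keep routes diverse
```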

This finding provides important guidance for the task-adaptive design of future intelligent agent systems.

Precise quantification of the "tool tax"

Finally, the authors visualize the advantage of implicit augmentation over explicit RAG by plotting accuracy improvement against token reduction.

The traditional baseline + explicit RAG approach does improve accuracy, but at the cost of huge computational overhead, appearing toward the upper right of the figure (higher accuracy, but more tokens).

Eigen-1, however, sits in the upper-left quadrant: it significantly improves accuracy while cutting token consumption by 53.5%, and workflow iterations drop from 94.8 to 53.4, a 43.7% reduction. Achieving both at once is the essence of the architectural innovation.
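As a quick sanity check, the 43.7% figure follows directly from the reported iteration counts (the token total for the explicit-RAG baseline is not stated here, so the 53.5% figure cannot be reproduced from this article alone):

```python
# Reproducing the reported 43.7% drop in workflow iterations from the stated counts.
baseline_rag_steps = 94.8   # explicit-RAG pipeline
eigen1_steps = 53.4         # full Eigen-1 system
reduction = 1 - eigen1_steps / baseline_rag_steps
print(f"{reduction:.1%}")   # 43.7%
```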

Significance: A new paradigm for scientific AI

The significance of Eigen-1's breaking 60 points for the first time goes far beyond a benchmark result: it heralds a new paradigm for AI-assisted scientific research.

When AI can truly understand and reason about complex problems at the forefront of human knowledge, it will become a powerful assistant to scientists, accelerating the entire process from basic research to applied transformation.

The research team stated that they will continue to optimize the architecture design, explore expansion into other scientific fields, and study how to integrate these technologies into a wider range of scientific workflows. As more researchers join this open source ecosystem, we have reason to expect that scientific AI will usher in even faster development.

As the team said: "HLE may be an important test we need to conduct on our models, but it is far from the last benchmark for AI." As the open source community works together to advance, a new era of collaborative exploration of the unknown between humans and AI is accelerating.

Paper link: https://arxiv.org/pdf/2509.21193v1

Project address: https://github.com/tangxiangru/Eigen-1

This article comes from the WeChat public account "Quantum Bit"; the author is the Eigen-1 team, and it is published by 36Kr with authorization.
