GPT-5 Controversy, Open Source Catching Up, and Leap in Capabilities: Epoch AI's Year-End Report Reveals Accelerated AI Capabilities


On December 25, Epoch AI, a non-profit organization focused on artificial intelligence benchmarking, released its year-end report, showing that overall, the capabilities of AI models are rapidly improving.

Top international models such as GPT and Gemini perform exceptionally well on the expert-level mathematics benchmark FrontierMath, but they still fall short of perfect scores on the truly hard problems, indicating there is still room to improve their reasoning. Meanwhile, advances in reasoning models and reinforcement learning have nearly doubled the rate of capability growth, while costs have fallen sharply, allowing many capable models to run on consumer-grade hardware.

Against this backdrop, Chinese open-source large models have made progress, but a meaningful gap to the top international models remains. On FrontierMath's hardest problems, the vast majority of Chinese models scored essentially zero, with only DeepSeek-V3.2 reaching roughly 2%. Chinese models are catching up, but they still struggle with truly complex problems.

01 China's "Seven-Month Gap": Open-Source Power Is Reshaping the Landscape

The best Chinese model still lags the global frontier by about seven months.

In Epoch AI's latest FrontierMath results, Chinese open-source models delivered a notable performance. FrontierMath is a highly challenging mathematics benchmark designed by expert mathematicians, covering major branches of modern mathematics such as number theory, real analysis, algebraic geometry, and category theory. The full dataset contains 350 problems: 300 in the basic set (levels 1-3) and 50 extremely difficult problems (level 4). Solving them typically takes expert researchers hours or even days.

FrontierMath Problem Set

FrontierMath problems are split into public and private sets. Of the 300 level 1-3 problems, 10 are public and the remaining 290 are private; of the 50 level 4 problems, 2 are public and the remaining 48 are private.

The evaluation results show that on the level 1-3 problems, the best Chinese model still lags the global frontier by about seven months. That may sound like a lot, but in the history of AI development it means Chinese models are closing the gap with top labs like OpenAI and Anthropic at an astonishing pace. Just two years ago, the gap between open-source models and closed-source frontier models was measured in years; now, the gap between the best open-source model that runs on consumer GPUs and the absolute frontier is under a year.

Even more noteworthy is the fourth level of the problem set: 50 extremely difficult problems that "take several days to solve." DeepSeek V3.2 (Thinking) was the only Chinese model to score above zero on this tier, correctly answering one question (roughly 2%). Small as that seems, it is symbolically important: it shows Chinese models have the potential to tackle top-tier mathematical problems. Even OpenAI's o3 and o3-mini only reach single-digit accuracy on these questions.

Technically, DeepSeek achieved pre-training performance comparable to Meta's Llama 3 using only one-tenth of the compute, through innovations such as Multi-Head Latent Attention (MLA), a Mixture-of-Experts (MoE) architecture, and multi-token prediction. Its subsequent reasoning model, R1, rivaled OpenAI's o1 in performance at a fraction of the latter's development cost. This supports Epoch AI's view that the main driver of falling AI training costs is not cheaper hardware but algorithmic optimization and better data.

Epoch AI runs its evaluations through third-party APIs (Fireworks for DeepSeek, Together for the remaining models) to protect the confidentiality of the FrontierMath question bank. Epoch AI's analysis indicates that third-party serving can slightly depress model scores, with newly released models affected most. This suggests the actual capabilities of Chinese models may be somewhat stronger than these scores show.

FrontierMath's submission protocol is also worth understanding: the model submits a Python function `answer` that returns the final answer, typically an integer or a sympy object. The model can think, run Python code, and submit when confident. Each problem has a strict token limit (a hard cap of 1,000,000 tokens), and the harness records and scores submissions. Code run via the Python tool is limited to 30 seconds, keeping the evaluation reproducible on commodity hardware.
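
To make the submission format concrete, here is a hedged sketch of what a final submission might look like for a toy problem. The harness itself is private; only the `answer` function returning an integer or sympy object is described by Epoch, so the example problem and the checking line are purely illustrative.

```python
import sympy as sp

def answer():
    # Toy stand-in for a FrontierMath-style final submission: the harness
    # expects a zero-argument function returning an int or a sympy object.
    # Illustrative problem (not from the benchmark): the sum of the squares
    # of the first 100 positive integers.
    n = sp.Symbol("n")
    return sp.summation(n**2, (n, 1, 100))  # exact sympy Integer, 338350

# A grader along these lines (the real one is private) would compare the
# returned object against a reference answer:
assert answer() == 338350
```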

The data also reveals a trend: any frontier AI capability now goes from first appearance to widespread availability in under a year. For Chinese models, that is both an opportunity and a challenge: the frontier itself keeps advancing rapidly, so the chase never ends.

02 The "Arms Race" of Cutting-Edge Global Models: From GPT-5 to Gemini 3

When GPT-5 was released in 2025, it met with some market "disappointment": compared with intermediate releases such as Claude 3.7 and Gemini 2.5, the improvement seemed limited. Yet Epoch AI's data shows that GPT-5's leap over GPT-4 is almost as large as GPT-4's leap over GPT-3:

• MMLU: +43%
• MATH: +37%
• TruthfulQA: +40%
• HumanEval: +67%
• GPQA Diamond: +55%
• MATH Level 5: +75%
• Mock AIME 24-25: +84%

The reduced "impact" comes down to release cadence: GPT-3 to GPT-4 took about two years, while GPT-4 to GPT-5 took only one. The market had already been primed by intermediate models such as Claude 3.7, Gemini 2.5, and o1, so expectations for GPT-5 naturally rose.

Gemini 3 Pro also ran into trouble on FrontierMath, mainly due to API stability issues. On the level 1-3 problems its accuracy was 38%, but API errors cost it points on 10 questions; on the level 4 problems its accuracy was 19%, with 3 questions affected by API errors. Epoch AI retried failing queries at least 10 times to keep the evaluation rigorous. API stability has become a real constraint on evaluating frontier models.

xAI's Grok 4 hit even more severe network and timeout issues: 8 of the 48 level 4 questions failed to score properly. Epoch AI applies consistent, documented rules to handle such failures, keeping the evaluation process transparent.

Furthermore, OpenAI's R&D spending reveals the true cost structure: of its roughly $5 billion compute budget in 2024, about 90% went to experimental training runs and basic research rather than the final training of GPT-4.5 or other released models. The core cost of building top models is not "making the model" but "figuring out how to make it." DeepSeek's ability to match performance at far lower cost therefore owes much to standing on the shoulders of frontier labs.

03 AI Capabilities Accelerate: The Pace of Frontier Progress Has Doubled

The capabilities of AI models are improving at an unprecedented rate.

According to Epoch AI's Epoch Capabilities Index (ECI), since April 2024 top models have improved at almost twice the rate of the previous two years across a wide range of benchmarks. Specifically, annual capability gains were roughly 8 points before the breakpoint and roughly 15 points after it, a marked acceleration.

This acceleration coincides with several important shifts: the rapid rise of reasoning models (such as OpenAI's o1 and DeepSeek R1) and heavier investment in reinforcement learning by leading labs. It signals a change in how AI is developed: no longer relying solely on large-scale pre-training, but combining pre-training, inference-time compute, and reinforcement learning to push capabilities.

Global Major Model ECI Ranking

Epoch AI's report tracked 149 frontier models from late 2021 to late 2025, including all the core frontier releases. The analysis fit a piecewise linear model to the top models' capability over time, identifying the optimal breakpoint as April 2024. Growth rates before and after the breakpoint were 8.2 and 15.3 points per year respectively, an acceleration of roughly 1.86x. Statistical analysis shows this acceleration signal is robust and significant, and that it reflects the actual pace of progress better than a single linear trend.
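
Epoch's code and data are not reproduced here, but the technique it describes, a segmented (hinge) regression with a scanned breakpoint, is easy to sketch. The minimal numpy version below is illustrative: the function name, the hinge design matrix, and the breakpoint grid are assumptions, not Epoch's actual pipeline.

```python
import numpy as np

def piecewise_rate_fit(t, y, candidate_breaks):
    """Fit y ~ a + b1*t before the break, continuing with slope b2 after,
    scanning candidate breakpoints and keeping the least-squares best."""
    best = None
    for tb in candidate_breaks:
        # Design matrix: intercept, base slope, extra slope after the break.
        X = np.column_stack([
            np.ones_like(t),
            t,
            np.maximum(t - tb, 0.0),  # hinge term, zero before tb
        ])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ coef) ** 2))
        if best is None or sse < best[0]:
            best = (sse, tb, coef)
    _, tb, (a, b1, db) = best
    return tb, b1, b1 + db  # breakpoint, pre-break slope, post-break slope
```

On data shaped like Epoch's (t in years, y the running-best ECI score), a fit of this kind would recover the reported pattern: a breakpoint near April 2024 with slopes of roughly 8.2 and 15.3 points per year.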

This means that since 2024, frontier models have not only improved in absolute terms but have also iterated faster. The investments leading labs make in compute, algorithms, and training data will directly determine whether they keep their lead. It also raises the bar for open-source teams: catching up with closed-source models in a shorter window requires continuous optimization of algorithms and training strategies.

In short, the pace of AI capability improvement is accelerating, and the global AI race is being compressed accordingly, making it difficult to maintain a leading advantage in the long term.

04 Top 10 AI Trends of 2025: Technological, Economic, and Social Impacts

In 2025, Epoch AI published 36 data insights and 37 newsletters, more than 70 short studies of AI in total. Which drew the most readers? The year-end review uses readership and engagement data from these pieces to identify ten core trends.

Among the most-read pieces, the top five carry the most substantive data insights, covering core industry trends such as advances in AI capability, the distribution of compute, and changing costs. The next five reflect trends in policy, social applications, and industry practice.

In other words, this year's top ten trends were not set by researchers alone; they combine reader attention with data insights, presenting an AI panorama that is both rigorous and close to market and public perspectives.

Trend 1: Inference costs have plummeted, but differences across tasks remain large.

From April 2023 to March 2025, inference costs at constant performance fell exponentially:

• Slowest task: roughly 9x cheaper per year
• Median task: roughly 40x cheaper per year
• Fastest task: roughly 900x cheaper per year

Two forces drive the declines: greater market competition (more API providers, more transparent pricing) and efficiency gains (better inference algorithms, higher hardware utilization). But tasks benefit at very different rates: simple tasks (such as text classification) are now almost free, while complex tasks (such as PhD-level scientific reasoning) see slower cost declines. The democratization of AI is not economically equal across tasks, and businesses and developers still need to optimize for their specific applications.
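
A quick way to feel what these decline rates mean is to project them forward. The starting price below is hypothetical; only the 9x/40x/900x annual factors come from the figures above.

```python
# Project inference cost at constant capability under exponential decline.
def cost_after(c0_usd, annual_factor, years):
    """Cost after `years` if it falls by `annual_factor` each year."""
    return c0_usd / (annual_factor ** years)

# Hypothetical $10 per million tokens today, projected two years out:
for label, factor in [("slowest", 9), ("median", 40), ("fastest", 900)]:
    print(f"{label}: ${cost_after(10.0, factor, 2):.6f} per M tokens")
# slowest: ~$0.12, median: ~$0.006, fastest: ~$0.00001
```

Even the slowest lane cuts costs by about 99% over two years; the fastest makes them effectively free.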

Trend 2: The gap between consumer hardware and frontier models has narrowed to 7 months.

Epoch AI found that the gap between the best open-source model that runs on a single consumer-grade GPU (such as an RTX 4090 or RTX 5090) and the absolute frontier model has narrowed to about 7 months.

This means that billions of users can run near-cutting-edge AI on their personal computers; companies that rely solely on fixed model capabilities will find it difficult to maintain a competitive advantage in the long term; and in terms of policy, "technology blockades" are unlikely to prevent the diffusion of capabilities.

This trend highlights the disruptive impact of open-source AI: cutting-edge capabilities are rapidly becoming widespread, the window of opportunity for market competition is shortening, and innovative advantages need to rely on continuous iteration and overall service capabilities, rather than the performance of a single model.

Trend 3: OpenAI's compute goes mainly to experiments, with R&D costs far exceeding final training costs.

Epoch AI's data shows that most of OpenAI's 2024 compute was not used for model inference or final training runs, but for experimentation and R&D. The spending breaks down as follows (all figures are cloud compute costs):

• Basic research and experimental compute: roughly $4.5 billion, covering basic research, experimental and de-risking runs (in preparation for final training), and unreleased models
• GPT-4.5 final training: roughly $400 million (90% confidence interval: $170 million-$890 million)
• Other model training: roughly $80 million (including GPT-4o, GPT-4o mini, Sora Turbo, GPT-4o updates, and post-training of the o series; 90% confidence interval: $24 million-$435 million)
• Total R&D compute: $5 billion
• Inference compute: $2 billion (excluding Microsoft's cost of running OpenAI models for its own products)

This shows how capital-intensive AI development is: leaders must devote enormous compute to exploration and experimentation, not just final training and deployment. Most of the spending goes to "figuring out how," not to directly producing a model. It also explains why some open-source or newer entrants can approach frontier performance at much lower cost: they stand on the shoulders of frontier labs and skip much of the trial and error.

In other words, OpenAI’s computing power utilization strategy demonstrates the immense value of research and development itself: experimentation is the core of driving breakthroughs in AI capabilities, while training and deployment are only part of the outcome.

Trend 4: NVIDIA's installed computing power doubles every 10 months

Since 2020, globally installed NVIDIA AI computing power has grown about 2.3x per year, with each new flagship chip accounting for the majority of the installed base within three years of release.
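
The headline's "10 months" follows directly from the 2.3x annual growth rate; the check below is just that arithmetic, not new data.

```python
import math

# If installed compute grows 2.3x per year, the doubling time t solves
# 2.3**t = 2, i.e. t = log(2) / log(2.3).
growth_per_year = 2.3
doubling_years = math.log(2) / math.log(growth_per_year)
print(f"{doubling_years * 12:.1f} months")  # -> 10.0 months
```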

The H100 was released in 2022 and became mainstream by 2025. Next-generation chips such as the H200 and B100 will take over in 2026-2028.

The exponential growth of computing power is a prerequisite for maintaining the progress of AI capabilities, but it also raises supply chain pressures: chip shortages or logistical disruptions will directly affect model training and inference capabilities. Epoch AI emphasizes that this "computing power arms race" will continue and is a core support for the speed of AI development.

Trend 5: GPT-5 delivers another leap on benchmarks, but the market impact is muted.

Epoch AI's data shows that both GPT-4 and GPT-5 improved dramatically over their predecessors on major benchmarks. On key tests such as MMLU, MATH, TruthfulQA, HumanEval, GPQA Diamond, MATH Level 5, and Mock AIME 24-25, GPT-4's gains over GPT-3 ranged from 37% to 84%, and GPT-5's gains over GPT-4 were nearly as large, cementing its lead among frontier models.

While GPT-5 is a major step up from GPT-4, some market participants felt it lacked a "wow factor." Epoch AI attributes this mainly to the faster release cadence of the past two years rather than any slowdown in capability growth: GPT-3 to GPT-4 took roughly two years, GPT-4 to GPT-5 only one, which raised public expectations even though the actual leap remains substantial.

This trend suggests AI capabilities are still growing rapidly, but frequent intermediate releases can easily open a gap between the public's perception of progress and the actual pace.

Trend 6: A single ChatGPT query consumes less energy than running a light bulb for five minutes.

Epoch AI's Josh You estimated the average energy consumption of a single GPT-4o query, finding it lower than the energy needed to run a light bulb for five minutes. Figures later published by Sam Altman were consistent with this estimate, as is the per-query energy data for Google's Gemini models.
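
The comparison is easy to verify with back-of-envelope numbers. The figures below are assumptions, not from this article: Epoch's published estimate of roughly 0.3 Wh per typical GPT-4o query, and a common 10 W LED bulb.

```python
# Back-of-envelope: one chatbot query vs. a light bulb for five minutes.
QUERY_WH = 0.3                     # assumed energy per query, watt-hours
BULB_W = 10                        # assumed LED bulb power, watts
bulb_5min_wh = BULB_W * (5 / 60)   # 10 W for 5 min ~= 0.83 Wh
print(f"bulb: {bulb_5min_wh:.2f} Wh vs. query: {QUERY_WH} Wh")
# -> the query uses well under half the bulb's five-minute draw
```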

AI energy consumption has always been a focus of public attention. This data helps quantify costs by comparing AI's energy consumption within the context of everyday household activities: the energy consumption of a single query is relatively small. However, with the exponential growth in global usage, the overall energy consumption of AI continues to rise and may become a more significant problem in the future.

Trend 7: DeepSeek optimizes the Transformer architecture to achieve low cost and high performance.

In 2025, the DeepSeek team presented three key techniques in its V3 paper that let its open-source pre-trained model reach then state-of-the-art performance using only about one-tenth the compute of the next-best open-source model, Llama 3. The techniques are:

• Multi-Head Latent Attention (MLA) – reduces inference memory footprint and improves computational efficiency
• Mixture-of-Experts (MoE) architecture innovations – improve parameter utilization (see the sketch after this list)
• Multi-token prediction – accelerates training and improves learning efficiency
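
To make the MoE idea concrete, here is a generic top-k routing sketch in PyTorch. It is not DeepSeek's implementation (DeepSeekMoE adds fine-grained and shared experts, among other refinements); every name and dimension below is illustrative.

```python
import torch
import torch.nn.functional as F

class TopKMoE(torch.nn.Module):
    """Generic top-k mixture-of-experts layer: each token is routed to its
    k highest-scoring expert MLPs, and their outputs are blended by the
    softmax-normalized gate scores."""
    def __init__(self, d_model=64, n_experts=8, k=2, d_ff=256):
        super().__init__()
        self.k = k
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, d_ff),
                torch.nn.GELU(),
                torch.nn.Linear(d_ff, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.gate(x)                   # (tokens, n_experts)
        weights, idx = scores.topk(self.k, -1)  # route to top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens whose slot picked e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(1)
                    out[mask] += w * expert(x[mask])
        return out

# Only k of the n_experts MLPs run per token, so parameter count grows
# with n_experts while per-token compute stays roughly constant.
y = TopKMoE()(torch.randn(5, 64))
print(y.shape)  # torch.Size([5, 64])
```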

Just weeks later, DeepSeek released its reasoning model R1, which matched OpenAI's o1 in performance at a fraction of the cost.

This case illustrates the trend in AI training efficiency: through algorithmic innovation and data optimization, the compute needed for a given capability falls by roughly 3x per year. In other words, as training techniques and data improve, new models can quickly match top-lab results without extreme compute. This gives open-source models a viable path and drives a step change in efficiency and cost across the industry.

Trend 8: Reasoning-model scaling may have only 1-2 years of headroom left.

Josh You analyzed the growth of compute devoted to reinforcement-learning (RL) training for reasoning models. Leading labs such as OpenAI and Anthropic noted in early 2025 that this pace of scaling cannot be sustained long-term and may reach the limits of compute infrastructure within 1-2 years.

Reasoning ability has become a core driver of model performance, particularly on math, programming, and complex reasoning tasks. But further scaling faces hardware and cost bottlenecks, meaning the explosive growth of 2024-2025 may slow. To stay competitive, companies will need new growth paths: more efficient data use, better architectures, or performance breakthroughs via recursive "AI-assisted AI R&D."

The limited headroom for scaling reasoning training is a reminder that compute is not limitless and performance gains have ceilings. Future competition will lean more on algorithmic innovation, data optimization, and R&D strategy than on simply adding compute.

Trend 9: An "AI Manhattan Project" Would Have Staggering Scale

Epoch AI's analysis suggests that if the United States launched a national AI project on the scale of the Manhattan Project or the Apollo Program, its training runs could be roughly 10,000 times larger than GPT-4's.
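
The scale arithmetic is simple once you pick a GPT-4 baseline. The baseline here is an assumption: Epoch has publicly estimated GPT-4's training run at roughly 2e25 FLOP.

```python
# Back-of-envelope scale of a "10,000x GPT-4" national training run.
gpt4_training_flop = 2e25          # assumed Epoch estimate, not from this article
national_project_flop = gpt4_training_flop * 1e4
print(f"~{national_project_flop:.0e} FLOP")  # -> ~2e+29 FLOP
```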

In November 2024, the U.S.-China Economic and Security Review Commission recommended that Congress "establish and fund AI projects similar to the Manhattan Project to compete for general artificial intelligence capabilities." Centralized national investment could, in theory, reach unprecedented AI compute scale, but it raises two big questions. First, investment and return: whether hundreds of billions of dollars would actually buy an AGI breakthrough is uncertain. Second, technical and managerial challenges: training at this scale requires not just compute but data, algorithmic optimization, hardware support, and inter-agency coordination.

This trend reveals the extreme potential for expanding AI capabilities, while reminding policymakers and the public that while national-level projects have potential, their feasibility and risks must be carefully assessed.

Trend 10: The value of AI comes primarily from widespread automation, rather than from accelerating scientific research.

Many narratives about the explosive growth of AI, such as those put forward by Sam Altman, Demis Hassabis, and Dario Amodei, argue that automating research and development is a key lever driving the rapid advancement of AI. This implies that AI could quickly and noticeably impact specific areas, such as automating the final stages of research, thereby leading to rapid breakthroughs within AI companies.

However, it is more likely that AI's impact on society will unfold in a dispersed and gradual manner: as different organizations adopt AI to improve efficiency, its effects will gradually emerge over years or even decades. This suggests that policymakers and business decision-makers should focus on the widespread application and efficiency improvements of AI across industries, rather than simply hoping for short-term scientific miracles.

Overall, AI capabilities are still accelerating, with computing power, algorithms, data, and reinforcement learning continuously driving model progress; costs are continuing to decline, providing opportunities for open source and small and medium-sized teams to catch up; however, energy consumption, computing power bottlenecks, evaluation differences, and capability ceilings remain realities that the industry must face.

The future development of AI will exhibit two characteristics: on the one hand, capabilities and efficiency will continue to improve, and cutting-edge laboratories will constantly push the limits; on the other hand, accelerated iteration, market expectations, and uncertainties in policies and regulations will create a highly dynamic competitive environment for the entire industry.

As Epoch AI demonstrates, the AI industry is constantly rewriting its story between fervor and rationality: from "larger models" to "better algorithms," from "closed-source monopolies" to "open-source frenzy," and from a "computing arms race" to an "efficiency revolution." Only through data and analysis can the public remain clear-headed amidst the deluge of information and understand the true pace and potential impact of AI development.

This article is from Tencent Technology, translated by Wuji, edited by Boyang, and published with authorization from 36Kr.
