Nvidia's small-scale models continue to win.
The latest ARC-AGI 2 results show that NVARC, a 4B small model, topped the public leaderboard with a score of 27.64%, surpassing GPT-5 Pro's 18.3%.
Moreover, it costs only about 20 cents per task, roughly 1/36 of GPT-5 Pro's per-task cost of over $7.
According to the official analysis, the highlight of NVARC's win is its zero-pretraining deep learning approach: it does not rely on large-scale general-purpose datasets for pretraining, which avoids problems such as the domain bias and data dependence of pretrained models.
ARC-AGI 2 is a deliberately harder benchmark that eliminates overlap with public training data; its main purpose is to test whether a model can efficiently acquire new skills beyond what it was trained on.
After the results were released, the official team interviewed Jean-Francois Puget and Ivan Sorokin from the NVARC team for a technical analysis.
So how was this "king of cost-effectiveness" made?
Stacking data, not parameters
Nvidia's strategy is to move complex inference to an offline synthetic data pipeline and train smaller models that can run quickly at evaluation time.
Simply put, the recipe is to synthesize high-quality data at scale, optimize existing models, and move the expensive computation offline.
Because Kaggle competitions impose very strict limits on compute, the team realized they could not directly use large LLMs that require supercomputer-scale power to perform complex, step-by-step reasoning and code generation.
So they changed their approach and moved the most expensive computation offline: for example, they used GPT-OSS-120B to create high-quality synthetic puzzles at scale.
The team collected existing ARC puzzle data from the H-ARC and BARC datasets, then combined simple puzzles to generate more complex new ones.
To ensure data quality, they broke the complex reasoning pipeline into separate stages, each of which could be verified independently.
In this way, they built a synthetic dataset of more than 3.2 million augmented samples, each with up to 7 input/output pairs.
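The composition-plus-verification idea can be sketched as follows. This is a minimal illustrative sketch, not NVARC's actual pipeline: the grid transformations and the `compose`/`verify_stage` helpers are hypothetical stand-ins for combining simple puzzle rules into harder ones while checking each stage independently.

```python
# Illustrative sketch: compose simple ARC-style grid rules into a harder rule,
# verifying each stage on its own before chaining them together.
# (Transformation names are hypothetical, not NVARC's actual operations.)

def flip_horizontal(grid):
    # Mirror each row left-to-right.
    return [row[::-1] for row in grid]

def transpose(grid):
    # Swap rows and columns.
    return [list(col) for col in zip(*grid)]

def compose(*stages):
    """Chain simple rules into one more complex puzzle rule."""
    def rule(grid):
        for stage in stages:
            grid = stage(grid)
        return grid
    return rule

def verify_stage(stage, examples):
    """Check a single stage independently against known input/output pairs."""
    return all(stage(inp) == out for inp, out in examples)

# Each simple stage is verified in isolation, then composed into a harder rule.
assert verify_stage(flip_horizontal, [([[1, 2]], [[2, 1]])])
hard_rule = compose(flip_horizontal, transpose)
sample = [[1, 2], [3, 4]]
# flip -> [[2, 1], [4, 3]]; transpose -> [[2, 4], [1, 3]]
assert hard_rule(sample) == [[2, 4], [1, 3]]
```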
I can't help but note here: Hassabis just emphasized the importance of scaling laws, so why doesn't scaling synthetic data count (doge)?
Getting back to the main point: NVARC's core reasoning module is based on an improved version of the ARChitects method, uses the small-parameter model Qwen3-4B, and simplifies puzzle understanding through conversational templates.
During training, supervised fine-tuning is performed using the NeMo RL framework and the Megatron backend.
However, a key step behind the model's strong results is test-time fine-tuning (TTFT).
Since each ARC-AGI-2 task introduces a completely new rule, NVARC applies LoRA fine-tuning separately to each problem, letting the model adapt quickly before attempting to solve it.
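The core LoRA idea behind this per-task adaptation can be sketched in plain NumPy. Everything here is illustrative: the matrix sizes, the rank, and the `lora_forward` helper are toy stand-ins, not Qwen3-4B's real dimensions or NVARC's training code.

```python
import numpy as np

# Toy sketch of the LoRA mechanism used for per-task test-time fine-tuning:
# the base weight stays frozen, and only a low-rank update A @ B would be
# trained on each puzzle's demonstration pairs. Sizes are illustrative.

rng = np.random.default_rng(0)
d, r = 1024, 8                       # hidden size and LoRA rank (toy values)
W0 = rng.normal(size=(d, d))         # frozen base weight
A = np.zeros((d, r))                 # A starts at zero, so the adapter is a
B = rng.normal(size=(r, d)) * 0.01   # no-op before any fine-tuning steps

def lora_forward(x, W0, A, B):
    # Effective weight is W0 + A @ B; only A and B would receive gradients.
    # Computing (x @ A) @ B avoids ever materializing the d x d product.
    return x @ W0 + (x @ A) @ B

x = rng.normal(size=(4, d))
# Before any task-specific steps, the adapter changes nothing:
assert np.allclose(lora_forward(x, W0, A, B), x @ W0)

# Per task, only 2*d*r adapter parameters are updated instead of d*d:
trainable, frozen = 2 * d * r, d * d
print(trainable, frozen)  # 16384 vs 1048576, a 64x reduction
```

This is why per-problem fine-tuning stays cheap enough for a compute-limited competition: each task only touches the small adapter, never the full weight matrices.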
Their improvement to the ARChitects method is a batched version of the DFS algorithm used in the decoding stage, which also fixes a source of nondeterministic results.
At the same time, the eight candidate solutions produced under different data-augmentation operations were pooled into a unified evaluation; the final score was 27.64% on the public leaderboard.
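Augmentation-based candidate pooling of this kind can be sketched as follows. The eight dihedral transforms of the square are a common choice for grid puzzles, but the `vote` helper and the actual set of NVARC's augmentation operations are assumptions for illustration.

```python
import numpy as np
from collections import Counter

# Illustrative sketch: solve a puzzle under several augmented views, map each
# answer back to the original frame, and pick the most common grid.
# (The eight symmetries of the square are an assumed augmentation set.)

# Eight dihedral transforms: 4 rotations, each with an optional flip.
TRANSFORMS = [(k, f) for k in range(4) for f in (False, True)]

def apply_t(grid, k, f):
    g = np.rot90(grid, k)
    return np.fliplr(g) if f else g

def invert_t(grid, k, f):
    # Undo apply_t: reverse the flip first, then rotate back.
    g = np.fliplr(grid) if f else grid
    return np.rot90(g, -k)

def vote(solver, grid):
    """Pool the solver's answers across all augmented views into one vote."""
    ballots = Counter()
    for k, f in TRANSFORMS:
        pred = solver(apply_t(grid, k, f))            # solve augmented view
        ballots[invert_t(pred, k, f).tobytes()] += 1  # vote in original frame
    return ballots.most_common(1)[0]

# Toy solver: identity (a real one would be the fine-tuned model's decoder).
grid = np.array([[1, 2], [3, 4]])
answer, votes = vote(lambda g: g, grid)
assert votes == 8 and answer == grid.tobytes()
```

A consistent solver wins all eight ballots; an unstable one is outvoted by whichever answer survives the most augmented views, which is what makes pooling candidates more robust than trusting a single decode.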
Later in the competition, the team also tried the "less is more" TRM method, integrating it with Qwen3-4B to squeeze out additional points. It brought some improvement, but not a significant one, due to various constraints.
Here comes the inevitable question: isn't a small model trained this way just a problem-solving machine? How could it compare with a fully capable super-large model?
But what deserves more attention is not the model itself, but the method of achieving the breakthrough.
On domain-specific tasks, small models with targeted optimization are not inferior in performance, and with their advantages in cost, speed, adaptability, and domain focus, they have already begun to stand out in many scenarios.
Using the right methods in the right places will yield greater value.
To borrow this netizen's words, the model should perhaps be designed to be more "agile".
Paper link: https://drive.google.com/file/d/1vkEluaaJTzaZiJL69TkZovJUkPSDH5Xc/view
Reference link:
[1]https://developer.nvidia.com/blog/nvidia-kaggle-grandmasters-win-artificial-general-intelligence-competition/
[2]https://arcprize.org/blog/arc-prize-2025-results-analysis
[3]https://www.kaggle.com/competitions/arc-prize-2025/writeups/nvarc
This article is from the WeChat public account "Quantum Bit", author: Wen Le, and is published with authorization by 36Kr.