Game AI is here! NVIDIA's new model learns almost any game by watching live streams, and GPT-5.2 cracks a Zelda puzzle in seconds.


[Introduction] Nvidia has enabled AI to learn general game controls simply by "watching live streams." The virtual world has become a "Matrix" for physical intelligence: after watching 40,000 hours of live streams, it can learn almost any game!

As everyone knows, Tesla's FSD is regarded as a masterpiece because of its hardcore "end-to-end" logic.

The car no longer relies on rigid, high-precision maps or sensors, but instead acts like a seasoned driver:

Eyes on the road (visual input), feet on the accelerator, hands on the steering wheel (action output).

So here's the question: what would happen if we applied this logic to a game scenario and let AI learn it?

The principle is exactly the same! In the past, when AI played games, it had to rely on reading the game's internal data, or even on outright "cheats," to know where the enemy was.

But what are real human players like?

We stare at the pixels on the screen (visual perception), our brains do the thinking, and our fingers tap the keyboard or work the controller (action output).

For example, Faker's screen switching is about as fast as human reactions get.

Going straight from the screen to mouse and keyboard input: this is the "FSD" of the gaming world.

Nvidia recently pulled off something seriously impressive!

They released a new model called NitroGen, and it does not play by the usual rules at all.

  • Project address: https://nitrogen.minedojo.org/assets/documents/nitrogen.pdf

This model didn't grow up reading game code; it grew up camped out on YouTube and Twitch:

It watched 40,000 hours of gameplay footage that came with controller input!

It's like an extremely studious "cloud gamer" who learns how to move and use basic attacks in various games by observing how humans operate.

It can handle both RPGs and side-scrolling platformers.

You might ask: how can it learn the controls just by watching videos, when the videos don't say which button the streamer pressed?

This is where you have to admire the creativity of Nvidia's researchers.

They specifically dug up videos on YouTube and Twitch that feature "controller overlays".

Yes, it's the kind of video where the streamer puts a small controller in the corner of the screen, and when they press a button, the controller on the screen lights up.

NitroGen stared at the 40,000 hours of video footage, watching what happened in the game (such as Link swinging his sword) and which button on the controller in the corner lit up (such as pressing the X button).

It's like someone who wants to learn guitar watching tens of thousands of concert videos, without looking at sheet music, and somehow managing to match "hearing" with "finger movements"!

Only AI can do this job.
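To make the overlay trick concrete, here is a minimal sketch of the idea, not NVIDIA's actual pipeline: if you know where the overlay sits on screen, you can read which button indicator lights up in each frame and pair that pseudo-label with the game image. All regions, thresholds, and function names below are hypothetical.

```python
import numpy as np

# Hypothetical screen regions (row slice, column slice) where each button
# indicator sits inside the streamer's controller overlay.
BUTTON_REGIONS = {
    "A": (slice(700, 715), slice(1200, 1215)),
    "B": (slice(700, 715), slice(1220, 1235)),
    "X": (slice(680, 695), slice(1200, 1215)),
    "Y": (slice(680, 695), slice(1220, 1235)),
}
LIT_THRESHOLD = 180  # mean brightness above which an indicator counts as "pressed"

def frame_to_action(frame: np.ndarray) -> dict:
    """Read the overlay region of one video frame (H x W x 3, uint8) and
    return which buttons appear to be lit, i.e. the pseudo-label."""
    gray = frame.mean(axis=2)  # crude grayscale
    return {name: bool(gray[rows, cols].mean() > LIT_THRESHOLD)
            for name, (rows, cols) in BUTTON_REGIONS.items()}

def video_to_pairs(frames):
    """Turn an unlabeled sequence of frames into (observation, action) pairs:
    the game image is the observation, the decoded overlay is the action."""
    return [(frame, frame_to_action(frame)) for frame in frames]

# Usage with a dummy 720p clip of ten black frames:
clip = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(10)]
pairs = video_to_pairs(clip)
print(pairs[0][1])  # -> {'A': False, 'B': False, 'X': False, 'Y': False}
```

Scaled up, this is what turns ordinary gameplay videos into the supervised vision-to-action pairs discussed later in the article.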

Reject specialization and become a versatile all-rounder.

In the past, game AI was usually a "specialist": a model that could play "Honor of Kings" would be completely unable to play "Super Mario".

But NitroGen's main selling point is that it's a "generalist".

It has learned from over 1,000 different games.

This could mean it has developed a kind of "game intuition"!

It's like when we humans play games: if you've played a punishing game like Elden Ring and then try a new action game like Black Myth: Wukong, even one you've never seen before, you'll probably already know that the left stick moves you and the buttons on the right are for attacking.

Test data shows that when NitroGen is put into a new game it has never seen before, it outperforms models trained from scratch by 52%.

Whether it's an action RPG, platformer, or roguelike, it's easy to pick up and play.

Next step: From Hyrule to the real world

Is Nvidia's move simply to create a stronger NPC to play with us?

Nvidia's ambitions go well beyond that!

Let's take a look at the recent performance of AI in games.

A recent piece from The Decoder found that AI is even beginning to show complex reasoning abilities.

Researchers conducted a unique "stress test" on the reasoning capabilities of top-tier large-scale models using a classic color-changing puzzle from The Legend of Zelda.

The test requires the model to plan six moves ahead and solve the puzzle from screenshots alone, with no internet access.

The results clearly show the differences between the models:

  • GPT-5.2-Thinking was astonishingly dominant, solving the puzzle quickly and accurately;
  • Google's Gemini 3 Pro can solve the problem, but it sometimes gets stuck in lengthy trial-and-error loops, with its reasoning text running to 42 pages;
  • Claude Opus 4.5, however, faltered in visual understanding and needed the help of mathematical formulas.

The author believes that this powerful reasoning ability, combined with autonomous agent technologies such as NVIDIA NitroGen, foreshadows:

The era of humans writing game guides and software documentation is coming to an end; AI will completely change the way we obtain guidance information.

For example, in The Legend of Zelda, color-changing puzzles that require more than six steps of prediction can now be solved by AI models as easily as solving a math problem.

NitroGen goes even further: it can be used not only for playing, but also for recording and reviewing gameplay.

Imagine a future where AI plays a game once and can effortlessly write out a "platinum trophy guide" for you, or even automatically fix game bugs. What more could you ask for?

(It seems highly likely that Game Science's "Black Myth: Zhong Kui" will incorporate AI technology.)

But Jensen Huang's true ambition is hidden in the code: NitroGen is built on NVIDIA's GR00T (robot foundation model).

This guy has huge ambitions!

  • In the game, it learns: see a cliff -> know it will fall -> press the controller to jump across.
  • In reality, this corresponds to: see a puddle on the ground -> know the robot will slip -> drive the robot's legs to step over it.

The virtual world is actually the most efficient "training ground" for the physical world.

Nvidia is using millions of rounds of trial and error in games to build a "general brain" for the robots that will one day enter our homes, one capable of handling all kinds of real-world chaos.

Perhaps one day, when you marvel at your teammate's amazing skills, the person sitting on the other side of the screen might not actually be human.

It's actually a real robot holding a controller and playing a game with you!

Games are reality

Video games have evolved from being a simple benchmark for testing AI into a training ground for physical intelligence.

This is not only a victory for game AI, but also a key turning point for robotics in overcoming "Moravec's paradox".

A leap from "brain" to "body"

Over the past decade, the field of artificial intelligence has experienced a leap from perceptual intelligence to cognitive intelligence.

However, while large language models can write poetry, code, and even pass the bar exam, they often prove clumsy when faced with the physical world.

An AI that can pass the Turing test may not be able to control a robotic arm to complete the simplest task of "putting a cup into a dishwasher".

This is the famous "Moravec's paradox": for computers, higher-order intelligence such as logical reasoning requires relatively little computation, while lower-order skills such as perception and motor control demand enormous computing resources.

Embodied intelligence aims to solve this problem. It requires intelligent agents to not only "think" but also have a "body" and be able to physically interact with their environment.

For a long time, the development of embodied intelligence has been limited by two major bottlenecks:

  1. Data scarcity

The internet is filled with trillions of tokens of text, but there is no robot data with precise action labels at anything like that scale.

  2. Generalization difficulties

Traditional reinforcement learning (RL) algorithms typically perform well only in specific environments (such as a Go board or a specific factory assembly line), and the model fails once the environment changes even slightly.

Games as simulators of reality

In 2025, we saw a completely new path to overcome the aforementioned bottlenecks: using video games as a bridge to the physical world.

Games offer rich visual environments, complex physical rules, and clear objectives, and they are inherently digital and scalable. More importantly, the "perception-decision-action" closed loop in a game world is completely isomorphic to that of a physical robot.

For embodied intelligent agents to survive in the complex and unpredictable real world, conditioned reflexes alone are not enough.

It must possess deep reasoning and planning abilities.

The Zelda Color Ball Puzzle Challenge

This puzzle comes from The Legend of Zelda series; the rules look simple, but solving it takes a good deal of logical thinking.

  • Scene

A grid consisting of red and blue spheres.

  • Rule

Clicking on a sphere will change the color of the sphere itself and the spheres above, below, left, and right (red to blue, blue to red).

  • Target

Turn all the spheres blue by clicking the spheres in the right sequence.

The essence of this puzzle is a constraint satisfaction problem or a graph theory problem.

Its complexity lies in the combinatorial explosion of the state space and the irreversibility of operations.

Players cannot focus solely on the gains of the current move; they must anticipate changes in the state over the next few moves.

This requires strong forward-looking planning ability: building a "decision tree" in your mind and working out the consequences of each branch. This is exactly the "System 2" thinking described in cognitive psychology: slow, deliberate, logical thought.
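To see why this is a planning problem rather than a reflex, here is a minimal solver sketch, assuming the puzzle follows the classic "Lights Out" rule described above (a click toggles a sphere and its four orthogonal neighbours). The breadth-first search below literally builds the decision tree the text describes and returns the shortest click sequence that turns the whole board blue; the board encoding and function names are illustrative, not taken from the game or the benchmark.

```python
from collections import deque

def solve_color_puzzle(grid):
    """Breadth-first search over board states for a Lights-Out-style puzzle.

    grid: tuple of tuples, 0 = blue, 1 = red. Clicking cell (r, c) toggles
    that cell and its four orthogonal neighbours. Returns the shortest list
    of clicks that turns every cell blue, or None if no solution exists.
    """
    rows, cols = len(grid), len(grid[0])
    start = tuple(cell for row in grid for cell in row)
    goal = (0,) * (rows * cols)

    def click(state, r, c):
        s = list(state)
        for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                s[nr * cols + nc] ^= 1
        return tuple(s)

    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for r in range(rows):
            for c in range(cols):
                nxt = click(state, r, c)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [(r, c)]))
    return None

# Example: a 3x3 board where the centre and its four neighbours are red.
board = ((0, 1, 0),
         (1, 1, 1),
         (0, 1, 0))
print(solve_color_puzzle(board))  # -> [(1, 1)]: one click on the centre
```

Solving the puzzle "in your head" amounts to a pruned version of exactly this search; the combinatorial explosion of the state space mentioned above is why looking six moves ahead is hard.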

According to The Decoder's in-depth review:

The current top AI models have shown significant generational differences in their ability to meet this challenge, which directly reflects their potential as the "brain" of an embodied intelligent agent.

The success of GPT-5.2-Thinking lies not only in its solution to the puzzle, but also in its demonstration of a trend toward algorithmic internalization.

For example, when the robot faces a table piled with clutter, it can mentally rehearse, much like solving a Zelda puzzle: "If I take the book at the bottom first, the cup on top will tip over; so I must move the cup first."

This capability is key to the transition from "automated machines" to "autonomous intelligent agents".

If GPT-5.2 solved the problem of "what to think", then NVIDIA's NitroGen model solved the problem of "how to do it".

The release of NitroGen marks the beginning of an "ImageNet moment" for robot learning: leveraging internet-scale data to train general motion-control policies.

The NitroGen team proposed an extremely ingenious "data mining" strategy: utilizing input overlays commonly found in game live streaming.

The brilliance of this strategy lies in its ability to instantly transform unsupervised video data into supervised visual-action pairs.

NVIDIA used this technology to build the NitroGen dataset, which contains 40,000 hours of data covering more than 1,000 games.

This is an unprecedented scale in the field of robot learning.
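As a rough illustration of what "supervised visual-action pairs" buy you, here is a toy behavior-cloning setup, assuming PyTorch and a made-up tiny network; it is not NitroGen's architecture. A game frame goes in, a multi-label prediction over controller buttons comes out, and the overlay-derived button states serve as the training targets.

```python
import torch
import torch.nn as nn

N_BUTTONS = 12  # hypothetical number of controller buttons to predict

class TinyPolicy(nn.Module):
    """Frame -> one logit per button. Sizes are illustrative only."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(32 * 10 * 10, N_BUTTONS)  # matches 96x96 input

    def forward(self, frames):  # frames: (B, 3, 96, 96) in [0, 1]
        return self.head(self.encoder(frames))

policy = TinyPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
loss_fn = nn.BCEWithLogitsLoss()  # multi-label: several buttons can be pressed at once

# One gradient step on a dummy batch of (frame, button-press) pairs,
# i.e. exactly the kind of supervision the overlay trick provides.
frames = torch.rand(8, 3, 96, 96)
actions = torch.randint(0, 2, (8, N_BUTTONS)).float()
optimizer.zero_grad()
loss = loss_fn(policy(frames), actions)
loss.backward()
optimizer.step()
print(float(loss))
```

Scaled from this toy to 40,000 hours of footage across 1,000+ games, the same recipe is what the article calls a general motion-control policy.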

Simulation Layer: The World Model as the Robot's "Matrix"

In the movie The Matrix, Neo learns kung fu in a virtual world.

For robots, world models are their "Matrix."

If robots can run thousands of rounds of trial and error per second in a highly realistic virtual world, their rate of evolution will far outstrip the limits of physical time.

Based on the above analysis, the path to realizing a general intelligent agent through games is not only feasible, but has also begun to take shape.

This path can be summarized as: "Learn control in games, learn physics in simulations, and learn to adapt in reality."

Future general-purpose intelligent agents will inevitably have a layered architecture:

  • Top level (brain)

A reasoning model similar to GPT-5.2 is responsible for handling long-range planning, logic puzzles, and understanding human instructions.

  • Middle layer (cerebellum)

A general policy model along the lines of NitroGen, responsible for translating high-level instructions into concrete motion trajectories, drawing on the "motor intuition" distilled from massive amounts of video data.

  • Bottom layer (spinal cord)

A high-frequency whole-body controller based on GR00T, responsible for concrete motor torque output and balance maintenance.
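The three layers can be read as one control loop running at three very different speeds. The sketch below is a purely hypothetical skeleton of that idea, with made-up class names standing in for the reasoner, the video-trained policy, and the whole-body controller; it is not an actual NVIDIA API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Plan:
    steps: List[str]  # e.g. ["walk to sink", "grasp cup", "place cup in rack"]

class Brain:  # GPT-5.2-like reasoner: slow, long-horizon planning
    def plan(self, instruction: str) -> Plan:
        return Plan(steps=[f"subgoal: {instruction}"])

class Cerebellum:  # NitroGen-like policy: turns a subgoal into a motion target
    def motion_for(self, subgoal: str) -> List[float]:
        return [0.0, 0.1, 0.0]  # a placeholder joint-velocity vector

class SpinalCord:  # GR00T-like whole-body controller: high-frequency torques
    def apply(self, motion: List[float]) -> None:
        print(f"torque command: {motion}")

def run(instruction: str) -> None:
    brain, cerebellum, spine = Brain(), Cerebellum(), SpinalCord()
    for subgoal in brain.plan(instruction).steps:   # roughly once per task
        motion = cerebellum.motion_for(subgoal)     # roughly tens of Hz
        for _ in range(3):                          # roughly hundreds of Hz
            spine.apply(motion)

run("put the cup into the dishwasher")
```

The design point is simply that the slow planner never has to reason about torques, and the fast controller never has to understand instructions.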

Despite the bright prospects, several key issues still need to be addressed:

  1. Lack of tactile feedback

Games and videos are primarily visual and auditory, lacking tactile feedback. NitroGen cannot learn "how heavy an object is" or "how slippery a surface is."

  1. High-precision operation

Current vision-motion models perform well on coarse movements (such as walking and grasping large objects), but they still fall short in operations requiring millimeter-level precision (such as threading a needle and precision assembly). This may require higher-resolution visual encoders or specialized fine-machining strategies.

  3. Safety and Ethics

When robots have autonomous planning capabilities, how can we ensure that their objective function aligns with human values? The "wash dishes" command should not cause the robot to "break the plates and empty the sink as quickly as possible."

Games are no longer just for entertainment; they are the cradle that humans have built for AI.

In this cradle, AI learned planning (Zelda), control (NitroGen), and the physical laws of the world (Cosmos).

When they leave their cradle and enter the body of Project GR00T, we will witness the birth of true physical intelligence.

This is not only a victory for technology, but also the ultimate manifestation of the various possibilities for humanity to give back to the real world by creating virtual worlds.

References:

https://the-decoder.com/a-zelda-puzzle-proves-ai-models-can-crack-gaming-riddles-that-require-thinking-six-moves-ahead/

https://the-decoder.com/nvidia-wants-to-create-universal-ai-agents-for-all-worlds-with-nitrogen/

This article is from the WeChat official account "New Intelligence", edited by Ding Hui, and published with authorization from 36Kr.
