Video reasoning's R1 moment: 7B model surpasses GPT-4o as CUHK and Tsinghua launch the first Video-R1

36kr
04-16

The CUHK and Tsinghua team has launched Video-R1, the first model to apply the R1 reinforcement-learning paradigm to video reasoning. With the upgraded T-GRPO algorithm and a hybrid image-video training dataset, Video-R1 surpasses GPT-4o on video spatial reasoning benchmarks, demonstrating strong reasoning capability. All code and datasets are now open-sourced.

Just as reasoning in language models hit its stride, video AI is entering the race.

This time, the CUHK + Tsinghua team brought the R1 reinforcement-learning approach directly into the video domain, creating the world's first video-version R1 model: Video-R1.

Despite having only 7B parameters, it surpasses GPT-4o on VSI-Bench, the spatial-reasoning benchmark proposed by Fei-Fei Li's team!

This is not simple fine-tuning. It pairs a new time-aware algorithm, T-GRPO, with hybrid image + video training on two high-quality datasets, drawing out the model's video reasoning ability so that it doesn't just "see" but also "thinks".
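The core idea behind a time-aware reward can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the paper's exact formulation: a group of rollouts answers the same question twice, once with frames in temporal order and once shuffled, and a temporal bonus is granted only when the group is more accurate on ordered frames. The function name, bonus value, and reward shaping are all assumptions for illustration.

```python
def t_grpo_rewards(answers_ordered, answers_shuffled, correct, alpha=0.3):
    """Hypothetical sketch of a T-GRPO-style temporal reward.

    answers_ordered:  group answers when frames are in temporal order
    answers_shuffled: the same group's answers with frames shuffled
    correct:          ground-truth answer
    alpha:            temporal bonus (illustrative value, not from the paper)
    """
    acc_ordered = sum(a == correct for a in answers_ordered) / len(answers_ordered)
    acc_shuffled = sum(a == correct for a in answers_shuffled) / len(answers_shuffled)
    # Grant the bonus only if ordered frames help: this pressures the
    # model to actually exploit temporal information, not ignore it.
    temporal_bonus = alpha if acc_ordered > acc_shuffled else 0.0
    # Per-rollout reward: 1.0 accuracy reward for a correct answer,
    # plus the temporal bonus on correct ordered rollouts.
    return [
        (1.0 + temporal_bonus) if a == correct else 0.0
        for a in answers_ordered
    ]

rewards = t_grpo_rewards(["B", "B", "C"], ["C", "A", "C"], correct="B")
# Ordered accuracy 2/3 beats shuffled 1/3, so correct rollouts get 1.0 + 0.3
```

In a full GRPO-style pipeline these per-rollout rewards would then be normalized within the group to form advantages; the sketch stops at the reward itself.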

Moreover, the entire model, code, and datasets have been open-sourced!

The "reasoning moment" for video large models has begun.


The training dynamics are also revealing. As reinforcement learning progresses, the model's accuracy reward and temporal reward both keep rising, suggesting that it is not only answering questions better but also grasping temporal logic.

Interestingly, the model's responses became shorter early in training, as it actively discarded the suboptimal reasoning patterns picked up during SFT; as training progressed, output length recovered and stabilized into a more efficient, more logical style of expression.

In Conclusion

Video-R1 demonstrates that reinforcement learning is not the privilege of NLP alone: video large models can reason too.

It doesn't rely on brute-force scaling of data and compute, but on mechanism design and training strategy, and the entire suite is open-source.

R1's reasoning paradigm is carrying the next AI revolution from the text world into every frame of video.

The era of video reasoning has truly arrived.

References:

https://arxiv.org/abs/2503.21776

This article is from the WeChat public account "New Intelligence", authored by New Intelligence, published with authorization from 36Kr.
