ByteDance has open-sourced the first spatiotemporal reasoning video model, with a fully transparent thought process and performance surpassing GPT-4o.


AI can now highlight key points in videos!

It can not only answer "what happened", but also indicate "when and where" it happened.

A joint team from Peking University and ByteDance has launched Open-o3 Video, the first open-source model to embed explicit spatiotemporal evidence throughout the video reasoning process. This allows the AI not only to answer questions correctly, but also to intuitively mark the specific locations it relies on during its thought process, truly achieving traceable video reasoning.

Meanwhile, the model adopts a non-agent architecture, avoiding complex tool calls and multi-turn inference, and directly completes the "see-think-prove-answer" loop in a single response.

Across multiple video reasoning benchmarks, key metrics improve by up to 24.2%, outperforming closed-source models such as GPT-4o and Gemini-2-Flash.

More details are below.

Research Background

Video understanding is one of the most complex tasks for multimodal large language models (MLLMs).

Unlike static images, videos simultaneously convey dynamic changes in the time dimension and scene interactions in the spatial dimension.

This means that the model must not only identify the objects and actions in the frames (What), but also determine when they appear (When) and where they occur (Where).

Recently, models such as Video-R1 and VideoRFT have significantly improved the logical consistency of video understanding through reinforcement learning. However, their thought processes are still purely textual: the model may answer questions correctly, but it cannot point out the specific frames that support its answers.

This "black box reasoning" makes the model's judgments both difficult to explain and difficult to verify.

In addition, OpenAI's o3 model first proposed the concept of "Thinking with Images", which lets a model naturally reference visual cues within its reasoning chain by embedding image operations (such as selecting regions and zooming in on local areas), thereby achieving "evidence-based reasoning".

However, extending this concept to the video domain, namely, enabling models to provide evidence in both time and space during inference, is even more challenging:

1. It is difficult to maintain consistency between the reasoning text, timestamps, and object bounding boxes.

The model needs to precisely align events with their time points across tens or hundreds of frames; any drift leads to errors in the reasoning logic and makes training difficult.

Furthermore, the position of the same object changes drastically across frames, requiring continuous tracking of its spatial position through the temporal dynamics.

2. Spatiotemporal coupling supervision is severely lacking.

Existing data either only provides temporal grounding or only single-frame spatial boxes, lacking a unified spatiotemporal annotation and corresponding thought process.

Model training process

Closing the data gap

The most fundamental bottleneck for video reasoning grounded in spatiotemporal cues therefore lies in the data.

Existing video understanding datasets often only have annotations in the time or space dimensions, lacking spatiotemporally coupled thought chain data, resulting in a disconnect between modalities.

Therefore, the team constructed the first unified corpus for explicit spatiotemporal grounded reasoning: STGR (Spatio-Temporal Grounded Reasoning), which consists of two parts, STGR-CoT-30k and STGR-RL-36k.

The former is used for supervised fine-tuning (SFT) to help the model learn reasoning formats and output structures with spatiotemporal annotations; the latter is used in the reinforcement learning (RL) stage to provide high-quality reward signals that continuously optimize the model's spatiotemporal alignment and evidence generation capabilities.

Both datasets cover four types of tasks: temporal localization, spatial localization, spatiotemporal localization, and video question answering; the data distribution is shown in the figure.

Among them, 5.9k high-quality spatiotemporal samples were labeled by the team following the data pipeline shown in the figure. The specific process is as follows:

1. Initial annotations were performed with Gemini 2.5 Pro on two data sources (temporal grounding and plm-rdcap), generating question-answer pairs, initial keyframes, object detection boxes, and the reasoning process; the spatiotemporal localization format is as follows (a parsing sketch is given after this list):

"<obj>object_name</obj><box>[x min, y min, x max, y max]</box>at<t>timestamp</t>s"

2. Because the quality of the bounding boxes labeled by the large model is limited, the team used two filtering methods:

Remove invalid boxes that cover too large an area (more than 80% of the frame);

Use Qwen2.5-VL-7B to verify whether the target category matches, for example by asking "Is this a dog?" to confirm the content of the detection box.

3. Consistency check: rewrite the reasoning chain to ensure that the question-answer pair, timestamps, object names, bounding boxes, and reasoning chain correspond one-to-one, and delete redundant or inconsistent samples.
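To make the format concrete, here is a minimal parsing and filtering sketch (our own illustration based on the format and the 80% rule described above, not the team's released code; pixel coordinates and function names are assumptions):

```python
import re

# Matches evidence of the form:
# "<obj>name</obj><box>[x_min, y_min, x_max, y_max]</box> at <t>timestamp</t>s"
EVIDENCE_RE = re.compile(
    r"<obj>(?P<name>[^<]+)</obj>\s*"
    r"<box>\[(?P<coords>[^\]]+)\]</box>\s*at\s*"
    r"<t>(?P<time>[\d.]+)</t>s"
)

def parse_evidence(text):
    """Extract (object name, box, timestamp) triples from a reasoning trace."""
    triples = []
    for m in EVIDENCE_RE.finditer(text):
        box = [float(v) for v in m.group("coords").split(",")]
        triples.append((m.group("name").strip(), box, float(m.group("time"))))
    return triples

def keep_box(box, frame_w, frame_h, max_area_ratio=0.8):
    """Drop boxes covering more than 80% of the frame (assuming pixel coordinates)."""
    x_min, y_min, x_max, y_max = box
    area = max(0.0, x_max - x_min) * max(0.0, y_max - y_min)
    return area / (frame_w * frame_h) <= max_area_ratio

trace = "The dog turns left <obj>dog</obj><box>[120, 80, 360, 300]</box> at <t>12.5</t>s."
for name, box, t in parse_evidence(trace):
    print(name, box, t, keep_box(box, frame_w=1280, frame_h=720))
```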

Two-stage training method

After laying the foundation with high-quality spatiotemporal corpora, the key question becomes how to enable the model to truly learn to "think in the video".

The team found that supervised fine-tuning alone was insufficient to achieve satisfactory results. This is because, during the supervised phase, the model primarily mimics the language patterns of human annotators rather than truly understanding the logical relationship between visual cues and reasoning structures.

Therefore, in order for the model to actively discover and cite key evidence, a self-correcting reinforcement learning mechanism must be used, so that the reward signal directly constrains "which frame to look at, which area to pay attention to, and what to think".

This concept forms the core of Open-o3 Video's training: a two-stage learning mechanism consisting of cold-start supervised fine-tuning and GSPO-based reinforcement learning.

During the cold start phase, the model is first fine-tuned under supervision using STGR-CoT-30k data.

The goal of this stage is for the model to master the reasoning format and output specification, that is, how to generate structured evidence tags in its answers (such as the <obj>, <box>, and <t> tags in the format above), and to learn to match the reasoning chain to the video content.

This stage is equivalent to "teaching the model to speak": it learns how to describe visual evidence in language, but has not yet formed a spontaneous evidence-selection strategy.

In other words, the cold start phase gives the model the ability to generate traceable answers, and the next phase is to make this ability accurate, stable, and generalizable.

In the second phase, the team introduced the reinforcement learning framework GSPO.

Compared with the widely used GRPO, GSPO optimizes at the sequence level, which is more conducive to stable long-horizon training and avoids collapse of the thought chain.
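For background (a sketch based on the GSPO paper itself, not on details given in this article): GSPO replaces GRPO's per-token importance ratios with a length-normalized, sequence-level ratio that is clipped per response, roughly

```latex
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|},
\qquad
J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\Big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right]
```

where \hat{A}_i is the group-normalized advantage of response y_i. Clipping whole sequences rather than individual tokens is what makes long rollouts more stable.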

In this stage, the model is required to generate complete spatiotemporal reasoning sequences in open video scenes and then self-correct via a reward function. The reward function consists of three parts:

r_acc measures the correctness of the answer; r_thk reflects the rationality and completeness of the reasoning chain, encouraging the model to make full use of visual evidence when generating its thought text (for example, by computing temporal IoU and spatial IoU against the ground-truth evidence); r_fmt evaluates whether the reasoning format conforms to the specification.
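The evidence-related terms rely on how well the cited timestamps and boxes overlap with the ground truth. A minimal sketch of the two overlap measures mentioned here (function and variable names are ours, for illustration only):

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals (start, end), in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def spatial_iou(box_a, box_b):
    """IoU of two boxes given as [x_min, y_min, x_max, y_max]."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```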

The team emphasized that a single accuracy reward cannot support multimodal interpretable reasoning, because the model may "guess" the answer but ignore key details; only when the reasoning process itself is incorporated into the optimization objective will the model truly learn how to think in the visual world.

However, it is very challenging to optimize localization capabilities in both temporal and spatial dimensions using reinforcement learning, especially since spatial reward (IoU) depends on the accuracy of temporal prediction.

Specifically, if the temporal prediction is wrong, even a correctly placed spatial box cannot be matched to the ground truth on that frame. In other words, accurate temporal prediction is a prerequisite for training stability.

However, if a strict time constraint is applied directly to the temporal reward, the model often receives no reward in the early stages of training, leading to learning stagnation. If a loose constraint is always used, the model can receive rewards, but the temporal reward easily saturates, the prediction cannot gradually converge to the precise position, and the spatial reward calculation remains inaccurate.

Therefore, the team proposed an adaptive temporal proximity mechanism, which gradually tightens the tolerance range of the temporal reward during training.

As training progresses, the standard deviation is gradually reduced from a large value to achieve convergence from "coarse localization" to "fine localization".
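The formula figure from the original article is not reproduced here; one form consistent with this description (an assumption on our part, not necessarily the paper's exact expression) is a Gaussian-shaped proximity term whose width is annealed over training:

```latex
r_{\mathrm{time}} = \exp\!\left(-\frac{(t_{\mathrm{pred}} - t_{\mathrm{gt}})^2}{2\,\sigma_k^2}\right),
\qquad
\sigma_k \ \text{decreasing from } \sigma_{\max} \ \text{to } \sigma_{\min} \ \text{as training step } k \ \text{grows}
```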

Meanwhile, the team proposed a temporal gating mechanism, which checks whether the predicted timestamp falls near the ground-truth timestamp before calculating the spatial reward: only when the temporal prediction is close to the true value (within a set threshold) is the IoU between the predicted box and the ground-truth box on the corresponding frame calculated; otherwise, the spatial reward is 0.
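A minimal sketch of such a gate (the threshold value and function names are illustrative; spatial_iou is the helper sketched above):

```python
def gated_spatial_reward(t_pred, t_gt, box_pred, box_gt, gate=1.0):
    """Grant the spatial IoU reward only when the predicted timestamp is close
    enough to the ground-truth timestamp; otherwise return 0."""
    if abs(t_pred - t_gt) > gate:  # gate in seconds, illustrative value
        return 0.0
    return spatial_iou(box_pred, box_gt)
```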

Through this training method and reward design, the model can be trained in a more stable and efficient way.

Reasoning Enhancement

The spatiotemporal evidence proposed by the team can also serve as a verifiable signal for test-time scaling.

Specifically, during the inference phase, the model generates multiple independent reasoning chains, each containing spatiotemporal evidence.

The corresponding keyframe regions are extracted from each reasoning chain and fed into the model again to score their relevance to the question (0, 1, or 2 points, indicating not relevant, possibly helpful, or very helpful for answering the question).

Each answer is weighted according to its score, and the answer with the highest confidence level is output.

This mechanism effectively prevents voting from being misled by low-quality thought chains, improving the accuracy and robustness of reasoning.
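A minimal sketch of this confidence-weighted selection (score_evidence stands in for the second model pass and is an assumption; the 0/1/2 scale follows the description above):

```python
from collections import defaultdict

def select_answer(chains, score_evidence):
    """chains: list of (answer, evidence) pairs from independent reasoning passes.

    score_evidence(evidence) is assumed to re-prompt the model with the cropped
    keyframe regions and return 0, 1, or 2 (not relevant / possibly helpful /
    very helpful for the question).
    """
    weights = defaultdict(float)
    for answer, evidence in chains:
        weights[answer] += score_evidence(evidence)
    if not any(weights.values()):
        # All chains scored 0: fall back to plain majority voting.
        for answer, _ in chains:
            weights[answer] += 1.0
    return max(weights, key=weights.get)
```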

Experimental results

Open-o3 Video achieves strong performance on multiple video reasoning and understanding benchmarks.

First, the team tested the model on V-STAR, a benchmark for spatiotemporal reasoning that comprehensively examines the model's performance across three dimensions: "what," "when," and "where."

As can be seen, Open-o3 Video achieves significant improvements in both Temporal IoU (temporal alignment) and Visual IoU (spatial alignment), with an overall improvement of +14.4% in mAM and +24.2% in mLGM, surpassing large closed-source models such as GPT-4o and Gemini-2-Flash and fully demonstrating its advantages in joint spatiotemporal localization and reasoning consistency!

Furthermore, in four benchmark tests—VideoMME, WorldSense, VideoMMMU, and TVGBench—Open-o3 Video consistently outperformed baseline models and numerous video reasoning models.

It achieved a significant improvement of 4.1% on the VideoMME-Long subtask, reaching 54.9%. On perception-related tasks in WorldSense and VideoMMMU, it improved by more than 3% over the baseline model. On TVGBench, it achieved an mIoU of 20.8, also an improvement of 4.5%.

These results demonstrate that Open-o3 Video not only excels in spatiotemporal tasks requiring complex reasoning, but also exhibits strong generalization capabilities in traditional video recognition and temporal localization tasks.

More importantly, thanks to its explicit chain of evidence design, the model-generated answers are verifiable, providing higher interpretability and reliability at the same accuracy.

To further verify the impact of different training stages, data composition, and reward mechanisms on model performance, the team conducted a systematic ablation study.

The experimental results are shown in the tables, which comprehensively evaluate the contributions of training strategy, reward design, data type, and data scale to spatiotemporal reasoning performance.

As can be seen from Table 3, the two-stage training mechanism (SFT + RL) is crucial to improving model performance.

With supervised fine-tuning alone (pure SFT), the model can learn the reasoning format with spatiotemporal labels, but its overall performance is still limited by imitation of fixed annotations.

While pure reinforcement learning (GSPO) can improve temporal and spatial consistency, its performance improvement is limited without training on CoT data.

When the two are combined, the model improves to 33.7% and 46.6% on mAM and mLGM, respectively.

This indicates that the structured supervision during the cold-start phase provides the necessary reasoning template, while the GSPO-based reinforcement phase further optimizes the model's spatiotemporal alignment and evidence orientation, thereby achieving stable and interpretable reasoning capabilities.

Table 4 illustrates the roles of two key reward mechanisms: Adaptive Temporal Proximity and Temporal Gating.

Removing the adaptive temporal proximity mechanism (w/o Ada.) decreases the model's mLGM by 1.4%; removing the gating mechanism (w/o Gat.) decreases performance by 1.7%.

This confirms the team's original design intention: the proximity mechanism alleviates the sparse-reward problem in the early stage of training, while the gating strategy prevents the model from being rewarded for matching irrelevant objects at the wrong time.

The combination of the two effectively ensures the density and accuracy of the reward signal, enabling the model to gradually converge to a true spatiotemporally consistent reasoning mode.

Table 5 further validates the importance of spatiotemporal annotation data.

With spatiotemporally labeled samples removed (w/o spatio-temporal data), model performance dropped significantly to mAM 28.3 / mLGM 36.2; introducing the existing VideoEspresso data brought a slight improvement, but it was still not as good as the high-consistency corpus built by the team.

When using the complete STGR-annotated data, mLGM reached 46.6, indicating that the model indeed learned robust localization and inference capabilities from unified spatiotemporal supervision. This also indirectly verifies the value of STGR data in terms of consistency across language, space, and time.

Table 6 explores the impact of general video question-answering data volume on the overall performance of the model.

Experiments show that a moderate amount of general QA samples can effectively balance the model's language generation and evidence localization capabilities. When an additional 15k general VideoQA samples are added, the model achieves the optimal balance.

If the data scale is further expanded, the performance actually decreases slightly, indicating that too many general samples will dilute the supervision signal of spatiotemporal annotation.

Therefore, the team ultimately adopted a mixed data configuration of 15k data points to achieve the optimal trade-off between interpretable reasoning and general question answering.

In summary, the ablation experiments fully validated the three core design principles of Open-o3 Video: unified spatiotemporal data, a two-stage training mechanism, and an adaptive reward strategy, demonstrating their significant contributions to improving model interpretability and reliability.

These designs enable the model to stably generate traceable reasoning chains in complex video scenarios, achieving truly evidence-based multimodal reasoning.

As shown in Table 7, on both the WorldSense and VideoMMMU benchmarks, the confidence-based test-time scaling strategy delivers a steady improvement, outperforming both single-pass reasoning (Base) and simple majority voting.

This indicates that explicit spatiotemporal evidence can not only provide supervision signals during the training phase, but also serve as a reliable confidence metric during the inference phase, helping the model make more robust judgments among diverse thought processes.

However, when generating multiple responses in parallel, the team also observed that the current model produces relatively few high-quality reasoning trajectories when faced with harder problems.

This means that the model's spatiotemporal evidence extraction still needs further improvement, especially in longer videos and more complex and varied scenarios. This is also an important direction that the open-source community should explore in depth in the future.

Visualization results

Open-o3 Video can provide temporal and spatial evidence (timestamps and bounding boxes) during inference to support its reasoning process and final answer, as illustrated in the following visualization examples:

These examples demonstrate Open-o3 Video's outstanding performance in object appearance recognition, motion intent analysis, and weather inference.

The model performs on par with other reasoning models while also providing evidence to support its claims, making its responses more intuitive, reliable, and easy to verify.

Let's take a look at the demo.

The team believes that Open-o3 Video will drive video multimodal models from "being able to answer correctly" to "being able to locate and explain," enabling machines to truly have the ability to perform traceable reasoning in the spatiotemporal dimension.

In the future, the team will continue to improve the spatiotemporal reasoning data and post-training mechanism to provide strong spatiotemporal evidence support for question answering in longer videos and more complex scenarios.

In addition, the team's paper, code, and models are all open source, and everyone is welcome to exchange ideas and discuss!

Paper link: https://huggingface.co/papers/2510.20579

Code link: https://github.com/marinero4972/Open-o3-Video

Model link: https://huggingface.co/marinero4972/Open-o3-Video

This article is from the WeChat public account "Quantum Bit" , authored by the Open-o3 Video team, and published with authorization from 36Kr.
