Interesting.
DeepSeek just published a new paper on inference-time scaling, which immediately set off speculation about whether R2 is coming soon.
However... Altman posted a "change of plans" message:
Plan change: We might release o3 and o4-mini first in a few weeks.
As for the long-awaited GPT-5, Altman stated:
It will arrive in a few months, and it will be better than we originally thought.
Altman also explained the reason.
Essentially, integrating everything smoothly turned out to be much harder than they expected, and they want to make sure they have enough capacity to support the anticipated demand.
Let's just say that whenever DeepSeek makes even a small move, OpenAI feels compelled to respond.
DeepSeek's New Paper
After this small episode, let's focus on DeepSeek's new paper.
The paper, titled Inference-Time Scaling for Generalist Reward Modeling, is a joint work by DeepSeek and Tsinghua University.
The core highlight of this research is a method called SPCT (Self-Principled Critique Tuning), which the team presents as the first to achieve inference-time scaling by using online reinforcement learning (RL) to optimize principle and critique generation.
The motivation for this research is that reward models (RMs) are used in RL to generate reward signals for large language models.
However, existing RMs show limited performance in general domains, especially when facing complex and diverse tasks.
Therefore, two key challenges emerged.
One is that a generalist RM needs both flexibility (supporting scoring of single and multiple responses) and accuracy (producing high-quality rewards across domains).
The other is that existing RMs (such as scalar and semi-scalar RMs) scale poorly at inference time and cannot significantly improve performance just by spending more compute.
To address these challenges, the DeepSeek and Tsinghua University team proposed SPCT.
Overall, this research mainly includes three core technical points.
First is the Generative Reward Model (GRM).
It adopts a pointwise GRM, generating rewards in text form (such as critiques) rather than a single scalar value, which supports flexible inputs (single or multiple responses) and inference-time scaling.
In the paper's formulation, C is the generated critique, and a function f_extract reads the numeric score for each response out of it.
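To make the pointwise-GRM idea concrete, here is a minimal Python sketch (not the paper's code): generate_critique stands in for the GRM, and a hypothetical extract_scores plays the role of f_extract, reading per-response scores back out of the critique text.

```python
import re
from typing import Callable, Dict, List

def extract_scores(critique: str, num_responses: int) -> List[int]:
    """Hypothetical f_extract: pull per-response scores out of a critique,
    assuming the GRM ends with lines like "Response 2: 7/10"."""
    scores = [0] * num_responses
    for idx, val in re.findall(r"Response\s+(\d+)\s*:\s*(\d+)\s*/\s*10", critique):
        i = int(idx) - 1
        if 0 <= i < num_responses:
            scores[i] = int(val)
    return scores

def pointwise_grm_reward(
    query: str,
    responses: List[str],
    generate_critique: Callable[[str, List[str]], str],
) -> Dict[str, object]:
    """Sketch of a pointwise GRM call: the model writes a text critique C,
    and scores are read back out of it (rather than predicting one scalar)."""
    critique = generate_critique(query, responses)
    return {"critique": critique, "scores": extract_scores(critique, len(responses))}

# Toy stand-in for the GRM so the sketch runs without a model.
def fake_grm(query: str, responses: List[str]) -> str:
    return (
        "Principle: prefer answers that are factually correct and concise.\n"
        "Critique: Response 1 is correct but verbose; Response 2 is wrong.\n"
        "Response 1: 8/10\nResponse 2: 3/10"
    )

print(pointwise_grm_reward("What is 2+2?", ["4, because ...", "5"], fake_grm))
```

The point of keeping the reward in generated text is that the same model can score one response or several, and sampling more critiques yields more signal to aggregate at inference time.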
Next is the key SPCT.
It trains the GRM with online reinforcement learning (RL) to dynamically generate high-quality principles and critiques, thereby improving reward quality.
Overall, SPCT is a two-stage process:
- Rejective Fine-Tuning: the cold-start phase, which generates initial data via sampling and a rejection strategy.
- Rule-Based Online RL: uses a regularized reward function to optimize principle and critique generation, encouraging the model to single out the best responses (a sketch of this reward rule follows the list).
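In the online stage, a rollout is rewarded only when the scores it assigns pick out the response known to be best. Below is a hedged Python sketch of such a rule-based reward; the ±1 values and the tie handling are illustrative assumptions, not the paper's exact formula.

```python
from typing import List

def rule_based_reward(pointwise_scores: List[float], best_index: int) -> float:
    """Hedged sketch of the rule-based RL signal: the GRM's rollout
    (principles + critique) is rewarded when the scores it assigns pick out
    the response known to be best, and penalized otherwise."""
    if not pointwise_scores:
        return -1.0
    predicted_best = max(range(len(pointwise_scores)), key=lambda i: pointwise_scores[i])
    # Require a unique argmax so ties don't count as correct.
    is_unique = pointwise_scores.count(pointwise_scores[predicted_best]) == 1
    return 1.0 if (predicted_best == best_index and is_unique) else -1.0

# Example: scores extracted from one sampled critique, ground-truth best is response 0.
print(rule_based_reward([8.0, 3.0], best_index=0))   # 1.0
print(rule_based_reward([5.0, 5.0], best_index=0))   # -1.0 (tie, not rewarded)
```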
Building on this, the third technical point is the inference-time scaling technique.
First, diverse principles and critiques are generated via repeated sampling, and the resulting rewards are aggregated by voting, which expands the reward space.
Then an auxiliary reward model is trained to filter out low-quality samples, further improving the scaling effect.
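As a rough Python sketch of the voting-plus-filtering idea (the sum-based aggregation, the keep_top cutoff, and all names here are illustrative assumptions, not the paper's implementation):

```python
from typing import List, Optional

def scale_inference(
    sample_scores: List[List[float]],          # one score vector per sampled critique
    meta_scores: Optional[List[float]] = None, # optional quality score per sample
    keep_top: int = 16,
) -> List[float]:
    """Sketch of inference-time scaling: sum per-response scores across k
    sampled critiques (voting); if an auxiliary model has rated each sample,
    keep only the top-rated samples before voting."""
    samples = sample_scores
    if meta_scores is not None:
        ranked = sorted(range(len(samples)), key=lambda i: meta_scores[i], reverse=True)
        samples = [samples[i] for i in ranked[:keep_top]]
    num_responses = len(samples[0])
    return [sum(s[j] for s in samples) for j in range(num_responses)]

# Example: 4 sampled critiques scoring 2 candidate responses.
votes = [[8, 3], [7, 4], [2, 9], [8, 2]]
quality = [0.9, 0.8, 0.1, 0.7]                      # hypothetical filter scores
print(scale_inference(votes))                        # plain voting: [25, 18]
print(scale_inference(votes, quality, keep_top=3))   # filtered voting: [23, 9]
```

The usage lines show the intent: more samples make the aggregated reward more reliable, and filtering keeps a few bad samples from dragging the vote down.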
Based on these methods, the team also tested the results.
On benchmarks such as Reward Bench, PPE, and RMB, DeepSeek-GRM-27B significantly outperforms baseline methods (such as LLM-as-a-Judge and scalar RMs), and inference-time scaling (sampling 32 times) improves performance further, e.g., Reward Bench accuracy rises from 86.0% to 90.4%.
In summary, this research demonstrates that inference-time scaling is effective for generalist reward models and can even surpass training-time scaling.
One More Thing
Besides posting the "change of plans" message, Altman also took the opportunity to plug two upcoming books he was involved with:
- One is a book about Altman himself by Keach Hagey
- One is a book about OpenAI by Ashlee Vance
Paper Address:
https://arxiv.org/abs/2504.02495
Reference Links:
[1] https://x.com/sama/status/1908167621624856998
[2] https://techcrunch.com/2025/04/04/openai-says-itll-release-o3-after-all-delays-gpt-5/
[3] https://x.com/sama/status/1908163013192069460
This article is from the WeChat public account "Quantum Bit" (ID: QbitAI), author: Jin Lei, published by 36Kr with authorization.





