[Introduction] DreamPRM was developed by a research team at the University of California, San Diego, and took first place on the authoritative multimodal reasoning leaderboard MMMU.
In recent years, large language models (LLMs) have made significant progress in reasoning capabilities. The introduction of the Process Reward Model (PRM) enables the model to obtain supervision at the intermediate steps of the reasoning chain, thereby more robustly selecting reasonable problem-solving paths.
These methods have achieved good results on text reasoning tasks, but they still face two prominent challenges when extended to multimodal scenarios:
- Distribution shift: the multimodal input space is vast, and the training and inference distributions often differ significantly.
- Uneven data quality: large-scale training sets inevitably contain noisy or low-quality samples, which weakens the effective supervision signal.
How to make full use of high-quality samples in multimodal reasoning while suppressing the negative impact of noisy ones has therefore become a pressing problem.
To address this, the researchers designed a new training framework that treats the weight of each data sample (instance weight) as a learnable parameter within a bi-level optimization framework, dynamically adjusting each sample's influence on training.
Paper address: https://arxiv.org/abs/2509.05542
Code address: https://github.com/coder-qicao/DreamPRM-1.5
MMMU Leaderboard
The first author of the paper is doctoral student Qi Cao, and the corresponding author is Pengtao Xie, an associate professor at UC San Diego.
From DreamPRM to DreamPRM-1.5, from "domain weighting" to "sample weighting"
Previously, researchers proposed the DreamPRM framework, which distributes weights between different data subsets through domain reweighting to improve training results.
On this basis, DreamPRM-1.5 refines the weighting granularity down to individual training samples:
- High-quality samples receive greater weight;
- Low-quality or noisy samples are weighted down.
This instance-level reweighting strategy enables the model to fully explore the potential value of each piece of data.
Two methods: Instance Table and Instance Net
Two model architectures of DreamPRM-1.5
To achieve “sample-level weighting”, researchers designed two complementary schemes:
Instance Table
- Assigns each training sample an independent weight parameter;
- Highly flexible, and especially suitable for small-scale datasets;
- Its drawback is that the number of parameters grows with the number of samples, making it hard to scale to large datasets.
Instance Net
- Instead of storing a weight for each sample directly in a table, a small MLP network predicts the weight of each sample;
- The number of parameters is fixed and independent of the dataset size;
- Better suited to large-scale training, with stronger generalization.
This is like two ways of taking study notes: Instance Table is like writing a comment for each question; Instance Net is like summarizing a set of rules for "grading questions based on their answers."
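To make the contrast concrete, below is a minimal PyTorch sketch of the two weighting schemes. It assumes the weights are squashed into (0, 1) with a sigmoid and that Instance Net consumes some per-sample feature vector; the actual inputs and parameterization in DreamPRM-1.5 may differ.

```python
import torch
import torch.nn as nn

class InstanceTable(nn.Module):
    """One learnable weight per training sample (flexible, but parameter count grows with the dataset)."""
    def __init__(self, num_samples: int):
        super().__init__()
        # One raw score per sample; a sigmoid keeps the resulting weight in (0, 1).
        self.logits = nn.Parameter(torch.zeros(num_samples))

    def forward(self, sample_ids: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.logits[sample_ids])

class InstanceNet(nn.Module):
    """A small MLP mapping per-sample features to a weight (parameter count fixed, regardless of dataset size)."""
    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(features)).squeeze(-1)

# Either module produces per-sample weights that rescale the PRM training loss, e.g.:
#   weights = table(sample_ids)            # or: net(sample_features)
#   loss = (weights * per_sample_loss).mean()
```

The trade-off is visible in the constructors: InstanceTable allocates one parameter per sample, while InstanceNet's size depends only on the feature and hidden dimensions.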
Core of the method: Bi-level Optimization
The training process of DreamPRM-1.5 adopts a bi-level optimization framework:
- Lower-level optimization: update the PRM by minimizing a sample-weighted training loss;
- Upper-level optimization: evaluate reasoning performance on the meta dataset and dynamically update the sample weights based on that feedback (see the sketch below).
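In standard bi-level reweighting notation, the two levels can be written as follows (a sketch consistent with the description above, not the paper's exact objectives; here $\theta$ denotes the PRM parameters, $w_i$ the instance weights, $\ell_i$ the per-sample training loss, and $\mathcal{L}_{\text{meta}}$ the loss measured on the meta dataset):

```latex
\begin{aligned}
\text{Lower level:}\quad & \theta^{*}(w) \;=\; \arg\min_{\theta} \; \sum_{i=1}^{N} w_i \,\ell_i(\theta) \\
\text{Upper level:}\quad & \min_{w} \;\; \mathcal{L}_{\text{meta}}\bigl(\theta^{*}(w)\bigr)
\end{aligned}
```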
This design ensures that the weights are not set statically but are driven by downstream reasoning performance and adjusted dynamically, which improves the model's adaptability on complex tasks.
Generative reward model: a scoring mechanism for the reasoning process
In DreamPRM-1.5, researchers used a generative reward model to score each step in the reasoning process. Its core idea is:
- Scoring method: at each step, the model outputs "+" or "-", indicating whether the reasoning at that step is sound;
- Scoring mechanism: the probability of "+" is computed via softmax and used as the confidence of that step;
- Aggregation strategy: the step scores along the entire reasoning chain are aggregated (averaged) and compared against the reference answer to guide the update of the sample weights.
The advantage of this design is that it evaluates the soundness of the reasoning chain step by step while also providing a finer-grained signal for instance reweighting.
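A minimal sketch of this step-scoring scheme, assuming the softmax is taken over just the "+" and "-" token logits at each judgment position (the function names and token ids below are placeholders, not the released implementation):

```python
import torch
import torch.nn.functional as F

def step_confidence(step_logits: torch.Tensor, plus_id: int, minus_id: int) -> torch.Tensor:
    """Confidence of one reasoning step: probability of "+" versus "-".

    `step_logits` are the model's next-token logits at the judgment position, shape (vocab_size,).
    """
    pair = torch.stack([step_logits[plus_id], step_logits[minus_id]])
    return F.softmax(pair, dim=0)[0]  # P("+") among {"+", "-"}

def chain_score(per_step_logits: list[torch.Tensor], plus_id: int, minus_id: int) -> torch.Tensor:
    """Aggregate a reasoning chain by averaging its per-step "+" probabilities."""
    confidences = torch.stack([step_confidence(l, plus_id, minus_id) for l in per_step_logits])
    return confidences.mean()
```

The averaged chain score can then be compared with the reference answer on the meta set to provide the feedback that drives the instance-weight updates.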
Experimental design and implementation details
Model base: InternVL3-1B is used as the base model for the PRM; at inference time, it is evaluated in combination with GPT-5-mini.
Training data: subsets of different sizes (12k and 100k samples) are drawn from VisualPRM-400k to train the Instance Table and Instance Net variants, respectively.
Meta dataset: candidate reasoning chains are generated from the standard MMMU-Pro split (using only test-set data to avoid overlap with the validation set) and serve as the meta set for weight updates.
Training process:
- Cold start: supervised fine-tuning on 20k samples is performed first so that the model can stably output "+"/"-" labels;
- Bi-level optimization: 100k iterations follow, using the AdamW optimizer and a cosine learning-rate schedule.
Computing resources: a single NVIDIA A100 GPU; training completes in about 72 hours.
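For illustration, here is a simplified, one-step-unrolled sketch of the alternating bi-level updates with AdamW and a cosine schedule. The linear stand-in for the PRM, the synthetic batches, and the learning rates are placeholders only; the real pipeline fine-tunes InternVL3-1B after the cold-start SFT described above.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# Toy stand-ins: a linear "PRM" over 16-d features and an Instance Table over 1,000 samples.
prm = nn.Linear(16, 1)
w_logits = nn.Parameter(torch.zeros(1000))
prm_opt = torch.optim.AdamW(prm.parameters(), lr=1e-5)
w_opt = torch.optim.AdamW([w_logits], lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(prm_opt, T_max=100_000)
bce = nn.functional.binary_cross_entropy_with_logits

for step in range(100_000):
    ids = torch.randint(0, 1000, (32,))                                  # which samples this batch holds
    x, y = torch.randn(32, 16), torch.randint(0, 2, (32, 1)).float()     # training batch
    xm, ym = torch.randn(32, 16), torch.randint(0, 2, (32, 1)).float()   # meta batch

    # Upper level: differentiate the meta loss through one simulated lower-level update
    # so the instance weights receive feedback from meta-set performance.
    params = dict(prm.named_parameters())
    per_sample = bce(functional_call(prm, params, x), y, reduction="none").mean(dim=1)
    inner = (torch.sigmoid(w_logits[ids]) * per_sample).mean()
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    virtual = {k: v - 1e-5 * g for (k, v), g in zip(params.items(), grads)}
    meta_loss = bce(functional_call(prm, virtual, xm), ym)
    w_opt.zero_grad()
    meta_loss.backward()
    w_opt.step()

    # Lower level: the actual weighted PRM update, with the refreshed weights detached.
    weighted = (torch.sigmoid(w_logits[ids].detach())
                * bce(prm(x), y, reduction="none").mean(dim=1)).mean()
    prm_opt.zero_grad()
    weighted.backward()
    prm_opt.step()
    sched.step()
```

The upper level simulates one weighted PRM step and backpropagates the meta loss through it to the instance weights; the lower level then performs the real weighted update with those weights held fixed.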
Experimental results on the MMMU benchmark
The researchers systematically evaluated their method on the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark.
The benchmark covers 30 subjects and 183 subfields, with questions involving multimodal inputs such as charts, maps, and chemical structures; it is one of the most challenging reasoning benchmarks currently available.
Key results
- GPT-5-mini w/ thinking (baseline): 80.0%
- DreamPRM-1.5 (Instance Table): 84.6% (+4.6)
- DreamPRM-1.5 (Instance Net): 83.6% (+3.6)
Comparative Analysis
- No Selection: training on the same data without reweighting reaches only 79.1%, which confirms the importance of instance weighting;
- VisualPRM: despite using the full 400k dataset, it achieves only 80.5%, indicating that data size alone cannot compensate for differences in quality;
- Self-consistency: this classic test-time scaling method reaches 81.4%, still below DreamPRM-1.5.
Overall, DreamPRM-1.5 not only significantly surpasses multiple strong baselines based on GPT-5-mini, but also exceeds top closed-source models such as GPT-5 (84.2%) and Gemini 2.5 Pro Deep-Think (84.0%) in accuracy.
Conclusion and Outlook
DreamPRM-1.5 introduces instance-level reweighting into multimodal reasoning training, dynamically adjusting sample weights through two-layer optimization, enabling the model to better recognize and utilize high-quality data.
The main contributions are:
- Proposes an instance-level reweighting framework, breaking through the limitation of weighting only at the domain level;
- Designs two complementary implementations, Instance Table and Instance Net, covering both small-scale and large-scale training scenarios;
- Achieves new SOTA results on the MMMU benchmark, surpassing multiple closed-source large models.
This result suggests that making fine-grained use of data quality is an aspect that deserves attention in future reasoning-model research.
Smarter sample weighting and process-scoring methods are likely to become key directions for advancing multimodal reasoning.
References:
https://arxiv.org/abs/2505.20241v2
This article is from the WeChat public account "Xinzhiyuan" , edited by LRST, and published by 36Kr with authorization.