UCSD's new method surpasses GPT-5 and Gemini in multimodal reasoning


[Introduction] DreamPRM-1.5, developed by a research team at the University of California, San Diego, took first place on the authoritative multimodal reasoning leaderboard MMMU.

In recent years, large language models (LLMs) have made significant progress in reasoning. The introduction of process reward models (PRMs) provides supervision at the intermediate steps of a reasoning chain, allowing models to select sound solution paths more robustly.

These methods have achieved good results in text reasoning tasks, but two prominent challenges remain when extending them to multimodal scenarios:

  • Distribution shift: the multimodal input space is vast, and training and inference distributions often differ significantly.
  • Uneven data quality: large-scale training sets inevitably contain noisy or low-quality samples, which weakens the effective supervision signal.

How to make full use of high-quality samples while suppressing the negative impact of noisy ones has therefore become a pressing problem in multimodal reasoning.

To address this, the researchers designed a new training framework that treats the weights of individual data samples (instance weights) as learnable parameters within a bi-level optimization scheme, dynamically adjusting each sample's influence during training.

Paper address: https://arxiv.org/abs/2509.05542

Code address: https://github.com/coder-qicao/DreamPRM-1.5

MMMU Leaderboard

The first author of the paper is doctoral student Qi Cao, and the corresponding author is associate professor Pengtao Xie, both at UC San Diego.

From DreamPRM to DreamPRM-1.5, from "domain weighting" to "sample weighting"

Previously, the researchers proposed the DreamPRM framework, which assigns weights to different data subsets through domain reweighting to improve training.

Building on this, DreamPRM-1.5 refines the weighting granularity down to individual training samples:

  • High-quality samples receive greater weight;
  • Low-quality or noisy samples are weighted down.

This instance-level reweighting strategy enables the model to fully exploit the potential value of each piece of data.

Two methods: Instance Table and Instance Net

The two model architectures of DreamPRM-1.5

To achieve “sample-level weighting”, researchers designed two complementary schemes:

Instance Table

  • Assigns each training sample an independent weight parameter;
  • Highly flexible, and especially well suited to small-scale datasets;
  • The drawback is that the number of parameters grows with the number of samples, making it hard to scale to large datasets.

Instance Net

  • Instead of storing a weight for every sample in a table, a small MLP network predicts the weight of each data item;
  • The number of parameters is fixed and does not depend on the dataset size;
  • Better suited to large-scale training, with stronger generalization.

This is like two ways of taking study notes: Instance Table writes a separate comment for every question, while Instance Net summarizes a general set of rules for grading questions. A minimal code sketch of both schemes follows.
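The sketch below shows how the two schemes could be parameterized, assuming a PyTorch implementation; the class names, the sigmoid squashing, and the idea of feeding Instance Net a per-sample feature vector are illustrative assumptions, not details from the released code.

    import torch
    import torch.nn as nn

    class InstanceTable(nn.Module):
        # One learnable weight per training sample, looked up by sample id.
        def __init__(self, num_samples: int):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(num_samples))

        def forward(self, sample_ids: torch.Tensor) -> torch.Tensor:
            # Sigmoid keeps the effective weights in (0, 1).
            return torch.sigmoid(self.logits[sample_ids])

    class InstanceNet(nn.Module):
        # A small MLP maps per-sample features to a weight, so the parameter
        # count stays fixed regardless of how many samples there are.
        def __init__(self, feature_dim: int, hidden_dim: int = 64):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feature_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            return torch.sigmoid(self.mlp(features)).squeeze(-1)

Either module yields a per-sample weight in (0, 1), which can then scale that sample's contribution to the PRM training loss.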

Core of the method: Bi-level Optimization

The training process of DreamPRM-1.5 adopts a bi-level optimization framework:

Lower-level optimization: update the PRM on the training data, with each sample's loss scaled by its instance weight;

Upper-level optimization: evaluate reasoning performance on the meta dataset and update the instance weights based on that feedback.

This design means the weights are not set statically but are driven by reasoning performance and adjusted dynamically, enhancing the model's adaptability on complex tasks.
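Written in standard bi-level notation, the two levels read roughly as follows; the symbols are ours for illustration (w_i is the weight of sample i, θ the PRM parameters), not formulas copied from the paper.

    % Lower level: fit the PRM under the current instance weights
    \theta^{*}(w) = \arg\min_{\theta} \sum_{i} w_i \, \mathcal{L}_{\mathrm{train}}\big(\theta;\, x_i, y_i\big)

    % Upper level: choose weights that minimize the loss on the meta set
    \min_{w} \; \mathcal{L}_{\mathrm{meta}}\big(\theta^{*}(w)\big)

In practice, the dependence of θ*(w) on w is typically approximated, for example with a one-step unrolled update of θ, so that gradients with respect to the weights remain tractable.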

Generative reward model: scoring the reasoning process

In DreamPRM-1.5, the researchers use a generative reward model to score each step of the reasoning process. Its core idea is:

  • Scoring method: at each step, the model outputs "+" or "-" to indicate whether that step's reasoning is sound;
  • Scoring mechanism: the probability of "+" is computed via softmax and used as the confidence of that step;
  • Aggregation strategy: the step scores of the whole reasoning chain are aggregated (averaged) and compared against the reference answer to guide the update of sample weights.

The advantage of this design is that it evaluates the soundness of the reasoning chain step by step while also providing finer-grained signals for instance reweighting.
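A rough illustration of this scoring scheme is sketched below, assuming a HuggingFace-style causal language model interface; the function name, the prompting format, and the use of single "+" and "-" tokens are simplifying assumptions rather than details of the released implementation.

    import torch

    def score_reasoning_chain(model, tokenizer, step_prompts):
        # Score each step as P("+") over {"+", "-"} taken from the next-token
        # logits, then aggregate the chain by averaging (hypothetical helper).
        plus_id = tokenizer.convert_tokens_to_ids("+")
        minus_id = tokenizer.convert_tokens_to_ids("-")
        step_scores = []
        for prompt in step_prompts:
            inputs = tokenizer(prompt, return_tensors="pt")
            with torch.no_grad():
                logits = model(**inputs).logits[0, -1]  # logits for the next token
            pair = torch.stack([logits[plus_id], logits[minus_id]])
            step_scores.append(torch.softmax(pair, dim=0)[0].item())  # confidence of "+"
        return sum(step_scores) / len(step_scores)  # averaged chain score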

Experimental design and implementation details

Model base: InternVL3-1B serves as the base model for the PRM; at inference time, evaluation uses GPT-5-mini as the reasoning model.

Training data: subsets of different sizes (12k and 100k samples) drawn from VisualPRM-400k are used to train the Instance Table and Instance Net variants, respectively.

Meta dataset: candidate reasoning chains generated on the standard split of MMMU-Pro (using only test-set data to avoid overlap with the validation set) serve as the meta set for weight updates.

Training process:

  • Cold start: first run a supervised fine-tuning pass on 20k samples so that the model reliably outputs "+/-" labels;
  • Bi-level optimization: 100k iterations on top of the cold start, using the AdamW optimizer with cosine learning-rate scheduling.

Computing resources: a single NVIDIA A100 GPU; training completes in about 72 hours.
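For reference, a toy version of the reported optimization setup (AdamW with cosine learning-rate decay) might look like the following; the model, data, and hyperparameter values are placeholders, not the paper's.

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 2)  # stand-in for the PRM
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)

    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
    for step in range(3):  # the paper runs on the order of 100k such iterations
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # cosine decay over the full training horizon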

Experimental results on the MMMU benchmark

The researchers systematically evaluated their method on the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark.

The benchmark covers 30 subjects and 183 subfields, with questions built on multimodal inputs such as charts, maps, and chemical structures, making it one of the most challenging multimodal reasoning benchmarks currently available.

Key results

  • GPT-5-mini w/ thinking (baseline): 80.0%
  • DreamPRM-1.5 (Instance Table): 84.6% (+4.6)
  • DreamPRM-1.5 (Instance Net): 83.6% (+3.6)

Comparative Analysis

  • No Selection: training on the same data without reweighting reaches only 79.1%, confirming the importance of instance weighting;
  • VisualPRM: despite using the full 400k dataset, it achieves only 80.5%, showing that data scale alone cannot compensate for differences in quality;
  • Self-consistency: this classic test-time scaling method reaches 81.4%, still below DreamPRM-1.5.

Overall, DreamPRM-1.5 not only significantly surpasses multiple strong baselines based on GPT-5-mini, but also exceeds top closed-source models such as GPT-5 (84.2%) and Gemini 2.5 Pro Deep-Think (84.0%) in accuracy.

Conclusion and Outlook

DreamPRM-1.5 introduces instance-level reweighting into multimodal reasoning training, dynamically adjusting sample weights through bi-level optimization so that the model can better identify and exploit high-quality data.

The main contributions are:

  • Proposes an instance-level reweighting framework, moving beyond the limitation of weighting only at the domain level;
  • Designs two complementary implementations, Instance Table and Instance Net, covering small-scale and large-scale training scenarios;
  • Achieves new state-of-the-art (SOTA) results on the MMMU benchmark, surpassing several closed-source large models.

This result suggests that fine-grained use of data quality deserves attention in future research on reasoning models.

Smarter sample weighting and process scoring methods are expected to become key directions for promoting the further development of multimodal reasoning.


This article is from the WeChat public account "Xinzhiyuan" , edited by LRST, and published by 36Kr with authorization.
