Multimodal version of DeepSeek-R1: its evaluation performance exceeds GPT-4o on some benchmarks, modality penetration feeds back into text reasoning ability, and it was produced and open-sourced by Peking University and the Hong Kong University of Science and Technology

What if the deep reasoning performance of DeepSeek-R1, which shocked Silicon Valley, were applied to multimodal scenarios?

Previously, DeepSeek's own Janus-Pro-7B did not incorporate reasoning ability, but now a research team in China has achieved this combination:

Based on their self-developed multi-modal framework Align-Anything, the Peking University and HKUST team released a multi-modal version of DeepSeek-R1:

Align-DS-V, which outperforms GPT-4o on some visual understanding benchmark datasets.

When asked, via a combined image-and-text prompt, which drink is most suitable for weight loss, Align-DS-V precisely identifies the number of drinks in the image and their names, and concludes that the "low-sugar original soy milk" is the most suitable choice.

It also points out that the original soy milk in the image is suitable to drink during a weight-loss period.

More importantly, in the process of "giving eyes" to DeepSeek-R1, the researchers found that modality penetration enhances the model's reasoning ability in the text modality.

Specifically, while making DeepSeek-R1 multimodal, the team found that after multimodal training the model's performance on text-only tasks improved, with gains in scientific tasks, complex reasoning, and mathematical coding.

Notably, on ARC-Challenge (5-shot), the score improved from 21.4 in the single-modal setting to 40.5 in the multimodal setting.

Based on this, the team believes that current multimodal large models have powerful cross-modal penetration and fusion-perception capabilities and, by combining world knowledge with in-context learning, can achieve efficient reasoning and collaborative output across multiple modalities such as images, text, audio, and video.

Through deep fusion of world knowledge, the model's reasoning boundaries in the text modality have been expanded.

Align-Anything: full-modal alignment with emergent modality penetration

The information humans receive in daily life is often multimodal, so extending "strong reasoning, slow thinking" from the text modality alone to more modalities, and even full-modal scenarios, is a clear trend.

Building on this, how to align full-modal large models with human intentions is an equally forward-looking and crucial challenge.

In single-text modal scenarios, many complex reasoning tasks can be supervised by rule-based rewards as carriers of human intentions and preferences.

When extending from the text modality to multimodal or even full-modal scenarios, a number of problems arise:

As the number of modalities increases, can traditional binary preferences or rule-based rewards capture the multi-dimensional preferences or hierarchical preferences of human intentions?

When expanding to the full-modal space, with more complex modal interactions, what improvements do RL methods need to make?

How to unify modal-specific and modal-shared information in reward signal modeling under different modalities?

...

The input and output distributions become much broader and hallucinations become more severe, making full-modal alignment more complex.

To further promote multi-modal alignment research, the research team proposed the Align-Anything framework, dedicated to aligning full-modal large models with human intentions and values.

Here, full-modal covers any input and output modalities, such as text-to-text, text-to-image, text+image-to-text, and text-to-video.

Overall, the team designed a highly modular, extensible, and user-friendly alignment training framework that supports fine-tuning models of any modality built from the four basic modalities of text, image, video, and audio, and verified the correctness of the framework's alignment algorithms.

The framework has the following features:

High modularity: Abstraction of different algorithm types and carefully designed APIs, allowing users to modify and customize code for different tasks, as well as advanced extension uses such as customized model and dataset registration;

Support for fine-tuning of cross-modal models: Includes the ability to fine-tune large models spanning multiple modalities such as LLaMA3.2, LLaVA, Chameleon, Qwen2-VL, Qwen2-Audio, and Diffusion;

Support for different alignment methods: Supports various alignment algorithms on any modality, including classic algorithms such as SFT, DPO, and PPO, as well as new algorithms such as ORPO, SimPO, and KTO;

Support for open and closed-source alignment evaluation: Supports more than 30 multi-modal benchmark evaluations, including multi-modal understanding evaluations such as MMBench and VideoMME, as well as multi-modal generation evaluations such as FID and HPSv2.

In other words, the Align-Anything team has contributed open-source efforts in four dimensions: data sets, algorithms, evaluations, and code libraries:

Data: A 200k dataset containing human language feedback and binary preferences, covering full modalities including images, text, videos, and speech.

Algorithms: A synthetic data paradigm learned from language feedback, which significantly improves the performance of RLHF post-training methods.

Evaluation: Evaluation of modal interaction and modal selection for full-modal models.

Code library: A code framework supporting full-modal training and evaluation of images, text, videos, and speech.

At the same time, to promote further development of full-modal alignment models, the research team released the first full-modal human preference dataset Align-Anything.

Unlike existing preference datasets, which focus on a single modality and vary in quality, Align-Anything provides high-quality data covering any modality in both input and output.

This is intended to provide detailed human preference annotations and fine-grained language feedback for criticism and improvement, enabling comprehensive evaluation and improvement across modalities.
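
To illustrate how binary preference data of this kind is typically consumed by an alignment algorithm, here is a minimal sketch of the DPO objective, one of the algorithms the framework supports. This is not Align-Anything's actual implementation; the function name, tensor shapes, and beta value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of binary preference pairs.

    Each argument is a 1-D tensor of per-sequence log-probabilities (summed
    over response tokens) under the trainable policy or the frozen reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Dummy batch of 4 preference pairs, only to show the call shape.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```

In practice, the per-sequence log-probabilities would come from the policy being trained and a frozen reference model, evaluated on the chosen and rejected responses of each preference pair.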

Align-DS-V: DeepSeek-R1 for Multimodal Scenarios

Next, the team began to explore the performance of DeepSeek-R1 in multi-modal scenarios.

Drawing on LLaVA's training approach, the Align-Anything team trained a projection layer (projector) to map the vision encoder's output into the language representation space, thereby extending DeepSeek-R1 with a visual modality.

In the Align-Anything library, the team open-sourced the entire training process.

First, based on the DeepSeek-R1 series models, a "text + image -> text" architecture is constructed (the corresponding training scripts are part of the open-sourced Align-Anything repository).

In the new multi-modal model, the input image Xv is passed through the vision encoder to extract features, generating the intermediate representation Zv, which is then mapped through the projection layer to obtain the visual representation Hv.

Meanwhile, the language instruction Xq is processed to generate the language representation Hq.

These visual and language features are then jointly input to the language model, which combines the two types of information for reasoning and ultimately generates the text response.
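
The data flow described above can be summarized with a short, runnable sketch. The class below is a toy stand-in, not the actual Align-DS-V code: the dimensions are arbitrary, and simple linear and transformer-encoder layers substitute for the real vision encoder, projector, and DeepSeek-R1 language model.

```python
import torch
import torch.nn as nn

# Illustrative toy dimensions; the real components are large pretrained
# networks (a vision encoder, a projector, and the DeepSeek-R1 language model).
VISION_DIM, HIDDEN_DIM, VOCAB = 64, 128, 1000


class ToyVisionLanguageModel(nn.Module):
    """Hypothetical sketch of the 'text + image -> text' data flow."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(VISION_DIM, VISION_DIM)  # stands in for a ViT
        self.projector = nn.Linear(VISION_DIM, HIDDEN_DIM)       # maps Z_v to H_v
        self.embed = nn.Embedding(VOCAB, HIDDEN_DIM)             # maps X_q to H_q
        self.language_model = nn.TransformerEncoder(             # stands in for the LLM
            nn.TransformerEncoderLayer(HIDDEN_DIM, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(HIDDEN_DIM, VOCAB)

    def forward(self, image_feats, text_ids):
        z_v = self.vision_encoder(image_feats)        # intermediate representation Z_v
        h_v = self.projector(z_v)                     # visual representation H_v
        h_q = self.embed(text_ids)                    # language representation H_q
        h = torch.cat([h_v, h_q], dim=1)              # joint visual + language sequence
        return self.lm_head(self.language_model(h))   # logits for the text response


model = ToyVisionLanguageModel()
logits = model(torch.randn(1, 16, VISION_DIM), torch.randint(0, VOCAB, (1, 32)))
print(logits.shape)  # torch.Size([1, 48, 1000])
```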

After constructing the modality-expanded DeepSeek-R1 architecture, training proceeds in two steps:

Step 1: freeze all model parameters except the projector, and pre-train the projector to map the visual representations produced by the vision encoder into the language representation space.

Step 2: fine-tune the projector and the large language model simultaneously to activate the language model's multimodal reasoning capability.
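
The freeze-then-unfreeze schedule can be sketched as follows, using toy stand-in modules; the module names, optimizer choice, and learning rates are assumptions for illustration rather than the team's actual training configuration.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components; sizes are illustrative assumptions.
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(64, 64),
    "projector": nn.Linear(64, 128),
    "language_model": nn.Linear(128, 128),
})

# Step 1: freeze everything except the projector and pre-train it so that the
# vision encoder's outputs land in the language representation space.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("projector")
stage1_optimizer = torch.optim.AdamW(model["projector"].parameters(), lr=1e-3)

# Step 2: also unfreeze the language model and fine-tune it jointly with the
# projector to activate multimodal reasoning; the vision encoder stays frozen.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("projector", "language_model"))
stage2_optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)
```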

After training completed, the researchers named the resulting multimodal version of the DeepSeek-R1 series models Align-DS-V.

The following is the performance of Align-DS-V on different visual understanding evaluation sets (compared to GPT-4o).

It can be seen that Align-DS-V outperforms GPT-4o on some evaluation sets (such as llava-bench-coco).

More importantly, the team also observed the effect of modality penetration on improving the model's text-modality reasoning ability.

Specifically, in their attempt to extend DeepSeek-R1 to more modalities, the team found that after multimodal training the model's performance on text-only tasks improved, with gains in scientific tasks, complex reasoning, mathematical coding, and other areas.

Particularly striking, on ARC-Challenge (5-shot) the score rose from 21.4 in the single-modal setting to 40.5 in the multimodal setting.

The team therefore believes that, driven by continuously self-evolving "slow-thinking, strong-reasoning" ability, the model's capabilities have broken through the limitations of a single modality, with significantly deeper cross-modal penetration.

Through deep integration of world knowledge, the model's reasoning boundaries in the text modality have been expanded.

To verify the capability of the full-modal reasoning model in vertical-domain applications, the R&D team localized Align-DS-V for value alignment in Hong Kong, enabling it to handle mixed Cantonese/English/Mandarin input.

This process deeply integrates Hong Kong local-life scenarios such as MTR service updates, typhoon warnings, and Octopus payments.

When faced with image-text math problems containing traditional Chinese characters, Align-DS-V can accurately associate image and text modal information.

As shown in the figure, it works through the solution step by step with rigorous mathematical deduction, demonstrating credible prospects for application in industries such as education.

Co-developed, open-sourced, and maintained by Peking University & HKUST

Align-Anything and Align-DS-V were co-developed by Peking University and the Hong Kong University of Science and Technology.

Currently, both the Align-Anything framework and Align-DS-V, the multimodal version of DeepSeek-R1, have been open-sourced, and the teams will maintain them jointly over the long term (links at the end of the article).

Within the joint research team, the Peking University Alignment Team focuses on the safe interaction and value alignment of artificial intelligence systems.

The team is advised by Assistant Professor Yang Yaodong of the Peking University Institute for Artificial Intelligence.

The Hong Kong Generative AI R&D Center (HKGAI) in the joint research team was established in October 2023, dedicated to promoting the development of Hong Kong's artificial intelligence ecosystem.

The center is led by Guo Yike, Provost of the Hong Kong University of Science and Technology.

Quantum Bit has learned that, building on Align-DS-V, the Peking University-Lingchu joint laboratory has already begun deeper exploration in the field of VLA (Vision-Language-Action) models.

In the VLA model Lingchu is developing, a multimodal large model on the "brain" end is aligned and fine-tuned and outputs action tokens to the "cerebellum"-end controller; the controller then produces concrete robot control commands from the input tokens and other modal information.

Both of these processes require the use of post-training and fine-tuning technologies for multimodal large models.

The Peking University-Lingchu joint laboratory stated that Align-DS-V's multimodal strong-reasoning capability is the core of the VLA model's "brain" end, and that the next research and training plan is to use the cross-modal penetration capability of the multimodal reasoning model to achieve action penetration and ultimately realize a truly efficient VLA model.

The same post-training technology can also be applied to fine-tuning the "cerebellum"-end controller to achieve higher success rates, better generalization, and greater robustness.

Align-Anything framework open source address: https://github.com/PKU-Alignment/align-anything

Align-DS-V open source address: https://huggingface.co/PKU-Alignment/Align-DS-V

This article is from the WeChat public account "Quantum Bit", authored by the Align-DS-V team and authorized for publication by 36Kr.
