Multimodal Large Language Models (MLLMs) are rising rapidly, evolving from understanding a single modality to simultaneously understanding and generating multiple modalities such as images, text, audio, and even video.
In the past, the evaluation of multimodal large models often involved accumulating scores across multiple tasks. However, simply measuring model strength by "higher scores on more tasks" is unreliable, and outstanding performance in certain tasks does not necessarily mean the model is closer to human intelligence across all domains.
Therefore, at this stage of the AI race, designing a scientific evaluation mechanism has become the key to success, a consensus recently sparked by OpenAI researcher Yao Shunyu.
A paper recently accepted to ICML'25 as a Spotlight, "On Path to Multimodal Generalist: General-Level and General-Bench," proposes a brand-new evaluation framework, General-Level, together with a companion benchmark, General-Bench, offering a foundational answer and breakthrough on this topic.
This evaluation framework has already been put into practice: the project team built a super-large-scale benchmark covering over 700 tasks, 5 common modalities, 29 domains, and more than 320,000 test instances, along with the most comprehensive Leaderboard for multimodal generalist models to date, providing the infrastructure for fair, just, and comprehensive comparison of different multimodal generalist large models.
General-Level Assessment Algorithm: Five-Tier Ranking System and Synergy Effect
The General-Level evaluation framework introduces a five-tier ranking system, similar to "rank promotion", to measure the generalist capabilities of multimodal models.
The core of the General-Level assessment is the synergy effect: a model's ability to transfer and amplify knowledge learned from one modality or task to another; simply put, achieving an effect where 1 + 1 > 2.
The ranks, from low to high, are: Level-1 Professional Expert, Level-2 Generalist Newcomer (no synergy), Level-3 Task Synergy, Level-4 Paradigm Synergy, and Level-5 Full-Modality Total Synergy. The higher the rank, the stronger the "general intelligence" the model displays and the higher the level of synergy it achieves.
General-Level determines a model's rank by examining whether, and at what scope, these synergy effects appear; a simplified illustration follows.
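To make the synergy-based rating concrete, here is a minimal sketch of how such a five-tier rule could be implemented. It is an illustration only, assuming a simple "matches or beats the specialist SOTA" test; the field names, the toy level rules, and the thresholds are assumptions and are not the paper's actual quantitative scoring formulas.

```python
from dataclasses import dataclass

# Hypothetical per-task record; field names are illustrative, not General-Bench's schema.
@dataclass
class TaskResult:
    name: str
    modality: str           # e.g. "image", "video", "audio", "3d", "language"
    paradigm: str           # "understanding" or "generation"
    generalist_score: float
    specialist_sota: float  # best specialist model's score on this task

def assign_level(results: list[TaskResult]) -> int:
    """Toy rating rule loosely inspired by General-Level's five tiers."""
    if len(results) <= 1:
        return 1  # evaluated on a single task only: effectively a specialist (Level-1)

    # Synergy proxy: tasks where the generalist matches or beats the specialist SOTA.
    beats_sota = [r for r in results if r.generalist_score >= r.specialist_sota]
    if not beats_sota:
        return 2  # covers many tasks, but shows no synergy effect (Level-2)

    paradigms = {r.paradigm for r in beats_sota}
    modalities = {r.modality for r in beats_sota}
    all_modalities = {r.modality for r in results}

    if (paradigms >= {"understanding", "generation"}
            and modalities == all_modalities and "language" in modalities):
        return 5  # synergy across every evaluated modality and both paradigms, language included
    if paradigms >= {"understanding", "generation"}:
        return 4  # synergy spans both the understanding and generation paradigms
    return 3      # synergy across tasks within a single paradigm
```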
Scope-A: Full Spectrum Heroes Leaderboard: "Comprehensive Multi-Modal Talent" Competition.
This is the most challenging and most comprehensive main leaderboard: participating models must undergo the complete General-Bench test, covering all supported modalities and all task categories.
Scope-A aims to identify truly versatile multimodal foundation models, examining their all-around capabilities in complex, comprehensive scenarios.
Scope-B: Modal Unification Heroes Leaderboard: "Single Modal Talent" Competition.
Scope-B includes several sub-leaderboards, each targeting specific modalities or limited modal combinations.
Specifically, Scope-B divides into 7 parallel leaderboards: 4 single-modality leaderboards (pure vision, pure audio, pure video, and pure 3D) and 3 modality-combination leaderboards (such as image+text and video+text cross-modal combinations).
Participating models only need to complete multi-task evaluation within the selected modal range, without involving data from other modalities.
Scope-C: Understanding/Generation Heroes Leaderboard: "Paradigm Capability" Group Competition.
Scope-C further subdivides evaluation into two major paradigms, understanding tasks and generation tasks, with separate leaderboards per modality. Specifically, the image, video, audio, and text modalities each have an "understanding capability leaderboard" and a "generation capability leaderboard," for a total of 8 leaderboards.
Scope-C evaluation emphasizes cross-task paradigm transfer within the same modality: for example, if a model excels on the visual understanding leaderboard, it can share knowledge across understanding tasks such as visual classification and detection; a high score on the visual generation leaderboard indicates universal capability across generation tasks (such as description and drawing).
By limiting the task paradigm scope, Scope-C has lower resource requirements (three-star difficulty), making it very suitable for lightweight models or teams with limited resources.
Scope-D: Skill Expertise Leaderboard: "Detailed Skill" Arena.
This is the most granular leaderboard with the lowest participation threshold. Scope-D further clusters tasks in General-Bench by specific skills or task types, with each small category forming its own leaderboard.
For example: "Visual Question Answering (VQA) Leaderboard", "Image Captioning Leaderboard", "Speech Recognition Leaderboard", "3D Object Detection Leaderboard", etc., with each leaderboard covering a group of closely related tasks.
Participating models can submit results targeting only one skill category, thus comparing with other models in their most proficient narrow domain.
This skill leaderboard mechanism encourages models to develop progressively: first achieving excellence in a single skill point, then gradually challenging broader multi-task and multi-modal evaluations.
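To make the four scopes concrete, the short sketch below groups a handful of hypothetical task records into the four leaderboard scopes. The task entries and the metadata fields (modality, paradigm, skill) are assumptions for illustration and do not reflect General-Bench's actual schema.

```python
from collections import defaultdict

# Hypothetical task metadata; names and fields are illustrative only.
tasks = [
    {"name": "VQA",               "modality": "image", "paradigm": "understanding", "skill": "visual question answering"},
    {"name": "ImageCaptioning",   "modality": "image", "paradigm": "generation",    "skill": "image captioning"},
    {"name": "SpeechRecognition", "modality": "audio", "paradigm": "understanding", "skill": "speech recognition"},
]

# Scope-A: every task counts toward one comprehensive leaderboard.
scope_a = [t["name"] for t in tasks]

# Scope-B: one sub-leaderboard per modality (or modality combination).
scope_b = defaultdict(list)
for t in tasks:
    scope_b[t["modality"]].append(t["name"])

# Scope-C: within each modality, split understanding vs. generation.
scope_c = defaultdict(list)
for t in tasks:
    scope_c[(t["modality"], t["paradigm"])].append(t["name"])

# Scope-D: one fine-grained leaderboard per skill cluster.
scope_d = defaultdict(list)
for t in tasks:
    scope_d[t["skill"]].append(t["name"])
```

The grouping key narrows from Scope-A to Scope-D, which is why the participation threshold drops: a model can enter a Scope-D leaderboard with results for a single skill cluster, while Scope-A requires results on everything.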
The Leaderboard link is available at the end of the article.
Level-2 (Generalist Newcomer)
In contrast to the specialist models that populate Level-1, some open-source models have flourished through multi-task training and entered the Level-2 tier, such as SEED-LLaMA and Unified-IO. The capabilities of models at this level are concentrated mainly in the image modality, with average single-modality scores of roughly 10-20 points, leaving significant room for improvement.
The current top performers at Level-2 are Unified-io-2-XXL, AnyGPT, and NExT-GPT-V1.5.
Level-3 (Task Synergy)
There are fewer multimodal large models at this level, but they have beaten specialist models on several tasks, demonstrating the performance leap that synergistic learning can bring.
Many new models released from 2024 onward have been promoted to this level, including open-source models such as Sa2VA-26B, LLaVA-One-Vision-72B, and the Qwen2-VL-72B series. These models typically have tens of billions of parameters and have undergone massive multimodal, multi-task training, allowing them to surpass traditional single-task SOTA performance on some Benchmarks.
This proves the value of synergy: unified multi-task training can help models learn more universal representations and mutually promote performance across related tasks.
In contrast, some closed-source large models, such as OpenAI's GPT-4o and GPT-4V and Anthropic's Claude-3.5, do not perform as prominently at Level-3.
The overall average scores of Level-3 models drop further compared with Level-2, indicating that scoring well at this level is considerably harder.
Level-4 (Paradigm Synergy)
Models reaching this level are currently rare.
According to the Leaderboard (as of the December 2024 evaluation), only a few models have been rated Level-4, such as the open-source models Mini-Gemini, Vitron-V1, and Emu2-37B.
These models have made breakthroughs in cross-paradigm reasoning, possessing excellent understanding and generation capabilities, and able to integrate both.
For example, the Mini-Gemini model leads in both image understanding and image generation, ranking at the top of the paradigm synergy score on the Leaderboard.
The emergence of Level-4 means we are a step closer to a truly cross-modal reasoning AI. However, the average score for Level-4 models is currently very low. This reveals the enormous challenge of building a comprehensively collaborative AI across modalities: achieving breakthroughs in both understanding and generation across multiple modalities is extremely difficult.
Level-5 (Full-Modality Total Synergy)
This level remains vacant; no model has yet achieved it.
This is not surprising, as surpassing experts in all modalities and tasks while simultaneously enhancing language intelligence currently exceeds the capabilities of existing technology.
The General-Level team speculates that the next milestone might come from a "multimodal version" of GPT-5, which could potentially first demonstrate full modal collaboration, thus changing the current state of Level-5.
However, before that day arrives, the Level-5 position on the Leaderboard will continue to remain vacant, reminding us that we are still quite far from true AGI.
The current Leaderboard has sparked heated discussion in the AI research community. Many researchers believe that such a unified, multidimensional evaluation platform is urgently needed in the multimodal field: it is unprecedented in scale (covering 700+ tasks), systematically comprehensive (with tiered levels and sub-leaderboards), and open and transparent, providing a reference point for collective progress across the industry.
On social media and forums, people are discussing the Leaderboard results: some are surprised that the open-source model Qwen2.5-VL-72B can defeat many closed-source giants, proving the potential of the open-source community; others analyze the shortcomings of GPT-4V in complex audio-visual tasks and explore how to compensate for them.
The Leaderboard data is also being used to guide research directions: which tasks are weak points for most models, and which modal combinations have not been well resolved are now clear at a glance.
It can be foreseen that as more models join the ranking, the leaderboard will continue to update, which is not just a competition, but also a continuous accumulation of valuable research insights.
The launch of the General-Level evaluation framework and its Leaderboard marks a new stage in multimodal generalist AI research. As the authors hope in their paper, the evaluation system built by this project will become a solid infrastructure to help the industry measure the progress of general artificial intelligence more scientifically.
Through unified standard level evaluation, researchers can objectively compare the advantages and disadvantages of different models and find directions for further improvement; through large-scale multi-task Benchmarks, they can comprehensively examine the capability gaps of models in different domains, accelerate problem discovery, and iterate improvements. All of this is of great significance in promoting the next generation of multimodal foundation models and moving towards true AGI.
Even more valuable, the General-Level project takes an open, sharing-oriented approach and welcomes broad community participation. Whether you have a new model or a unique dataset, you can take part: submit model results to the Leaderboard and compete with the world's top models, or contribute new evaluation data to enrich the task diversity of General-Bench.
Each dataset contribution will be acknowledged on the official website and cited in the technical report.
Project Homepage:
https://generalist.top/
Leaderboard:
https://generalist.top/leaderboard
Paper Address:
https://arxiv.org/abs/2505.04620
Benchmark:
https://huggingface.co/General-Level
This article is from the WeChat public account "Quantum Bit", author: Focus on Frontier Technology, published by 36Kr with authorization.