Qwen2.5 takes the global open-source crown: 72B model beats Llama 3.1 405B, easily outperforming GPT-4o-mini

36kr
09-19

Defeating Llama 3! Qwen2.5 takes the global open-source crown.

Its 72B model surpasses Llama 3.1 405B on multi-task benchmarks with roughly one-fifth of the parameters.

Across a wide range of tasks, it also far outperforms other models in its class.

Compared with the previous generation, it improves almost across the board, especially on general tasks, mathematics, and coding.

Notably, this is arguably Qwen's largest open-source release ever: the base series alone ships in seven parameter sizes, alongside six specialized code and math models.

Models such as the 14B, 32B, and the lightweight Qwen2.5-Turbo outperform GPT-4o-mini.

Except for the 3B and 72B models, all open source models are licensed under the Apache 2.0 license.

Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B and 72B

Qwen2.5-Coder: 1.5B, 7B and 32B (on the way)

Qwen2.5-Math: 1.5B, 7B, and 72B.

The lineup is dazzling enough that some netizens have already started using it.

Qwen2.5 72B is comparable to Llama 3.1 405B

Compared with the Qwen2 series, the Qwen2.5 series has been upgraded in several major aspects.

First, a fuller open-source lineup.

Their research shows strong user interest in models in the 10B–30B parameter range for production and around 3B for mobile applications.

So, on top of the original open-source sizes (0.5B/1.5B/7B/72B), new 14B, 32B, and 3B models have been added.

At the same time, Tongyi also launched Qwen-Plus and Qwen-Turbo, which can be tried through the API service of Alibaba Cloud's large-model platform.
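As a rough illustration of how such API access typically works, here is a minimal sketch that calls Qwen-Plus through an OpenAI-compatible endpoint. The base URL, model name, and API-key placeholder are assumptions and should be checked against Alibaba Cloud's official documentation.

```python
# Minimal sketch: calling Qwen-Plus via an OpenAI-compatible endpoint.
# The base_url and model name below are assumptions; verify them in the
# official Alibaba Cloud docs before use.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",  # issued by Alibaba Cloud's platform (assumption)
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

resp = client.chat.completions.create(
    model="qwen-plus",  # or "qwen-turbo"
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Qwen2.5 release in one sentence."},
    ],
)
print(resp.choices[0].message.content)
```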

More than half of the models support 128K-token contexts and can generate up to 8K tokens of output.

In their comprehensive evaluation, all models have achieved a leap in capabilities compared to the previous generation, for example, Qwen2.5-32B outperforms Qwen2-72B, and Qwen2.5-14B outperforms Qwen2-57B-A14B.

Second, the pre-training dataset is larger and of higher quality, expanding from the original 7 trillion tokens to as many as 18 trillion tokens.

Then come enhancements in many areas: more knowledge, stronger math and coding abilities, and better alignment with human preferences.

In addition, there are significant improvements in instruction following, long-text generation (from 1K to more than 8K tokens), structured data understanding (such as tables), and structured output generation (especially JSON).

Let's look at the actual results.

Table comprehension

Generating JSON Output
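For readers who want to try the structured-output behavior locally, the following is a hedged sketch using Hugging Face transformers. The model ID Qwen/Qwen2.5-7B-Instruct follows the series' naming but is an assumption here, and the prompt is purely illustrative.

```python
# Sketch: asking a Qwen2.5 instruct model for JSON-only output.
# Model ID and prompt are assumptions for illustration.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "Reply with valid JSON only, no extra text."},
    {"role": "user", "content": 'Extract {"name": ..., "params_b": ...} from: "Qwen2.5-72B has 72 billion parameters."'},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(json.loads(text))  # raises if the model strays from pure JSON
```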

In addition, the Qwen2.5 models are generally more robust to diverse system prompts, which improves chatbots' role-play and condition-setting abilities.

Then let’s take a look at the specific model capabilities.

As shown above, the flagship model has improved significantly across a range of tasks.

For small models like 0.5B, 1.5B, and 3B, the performance is roughly as follows:

Notably, the Qwen2.5-0.5B model outperformed the Gemma2-2.6B on a variety of math and coding tasks.

Qwen2.5 also reports performance after instruction tuning: the 72B-Instruct model surpasses the larger Llama-3.1-405B on several key tasks, especially mathematics (MATH: 83.1), coding (LiveCodeBench: 55.5), and chat (Arena-Hard: 81.2).

There are also 32B-Instruct, 14B-Instruct and Qwen2.5-Turbo, which show capabilities comparable to GPT-4o-mini.

The largest open-source release in Qwen's history

In addition to the base models, Qwen also released specialized code and math models.

Qwen2.5-Coder provides three model sizes: 1.5B, 7B, and 32B versions (coming soon).

There are two main improvements: a larger code training dataset and stronger coding capability.

Qwen2.5-Coder is trained on larger-scale code data, including source code, code-related text data, and synthetic data, totaling 5.5 trillion tokens.

It supports 128K contexts and covers 92 programming languages. The open-source 7B version even surpasses larger models such as DeepSeek-Coder-V2-Lite and Codestral, making it one of the most powerful base code models currently available.
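A minimal sketch of plain code completion with the open-source Coder model is below. The checkpoint name Qwen/Qwen2.5-Coder-7B is assumed from the series naming and may differ from the published release.

```python
# Sketch: code completion with an assumed Qwen2.5-Coder-7B base checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "def quicksort(arr):\n    "  # let the base model continue the function
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```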

As for the math models, Qwen2.5-Math mainly supports solving English and Chinese math problems via CoT and TIR.

This family of models is not currently recommended for use with other tasks.

The Qwen2.5-Math series is open source, including the basic model Qwen2.5-Math-1.5B/7B/72B, the instruction tuning model Qwen2.5-Math-1.5B/7B/72B-Instruct, and the math reward model Qwen2.5-Math-RM-72B.

Unlike the Qwen2-Math series, which only supported Chain of Thought (CoT) for English math problems, the Qwen2.5-Math series supports both CoT and Tool-Integrated Reasoning (TIR) for Chinese and English math problems.
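As an illustration of the CoT mode, here is a hedged sketch with Qwen2.5-Math-7B-Instruct. The system-prompt wording and model ID are assumptions based on the series' published materials, and TIR additionally requires a tool-execution loop that is not shown here.

```python
# Sketch: chain-of-thought prompting with an assumed Qwen2.5-Math checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Math-7B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    # Assumed CoT-style system prompt; adjust to the official recommendation.
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": "If 3x + 7 = 22, what is x?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```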

Compared with the previous version, the base-model upgrade came mainly from three things.

Leverage the Qwen2-Math-72B-Instruct model to synthesize additional high-quality math pre-training data.

Collect more high-quality math data, especially Chinese data, from web sources, books, and code, spanning multiple time periods.

Initialize parameters from the Qwen2.5 series base models, which bring stronger language understanding, code generation, and text reasoning.

The result is a clear capability gain: for example, the 1.5B/7B/72B models improved by 3.4, 12.2, and 19.8 points respectively on the Chinese college entrance exam (Gaokao) math test.

That is the Qwen2.5 series: a complete open-source lineup that can fairly be called the "largest in Qwen's history".

It's not called Strawberry, it's called Kiwi

Junyang Lin, head of open source at Alibaba's Tongyi team, also shared some behind-the-scenes details.

He first said that the Qwen2.5 project started the moment Qwen2 was open sourced.

Along the way, they recognized many problems and mistakes.

For pre-training, for example, they focused mainly on improving the quality and quantity of the pre-training data, using many familiar methods.

For instance, a text classifier recalls high-quality data and an LLM scorer rates it, striking a balance between quality and quantity.
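To make the idea concrete, here is a purely illustrative sketch of LLM-based quality scoring for pre-training data. The judge model, prompt, 1–5 scale, and threshold are hypothetical and are not Qwen's actual pipeline.

```python
# Illustrative sketch of LLM-based quality filtering for pre-training data.
# The endpoint, judge model, prompt, and threshold are all hypothetical.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://example.com/v1")  # placeholder endpoint

def score_document(doc: str) -> int:
    """Ask the judge model to rate a document from 1 (junk) to 5 (high quality)."""
    resp = client.chat.completions.create(
        model="qwen-plus",  # assumption: any capable judge model works here
        messages=[
            {"role": "system", "content": "Rate this text's quality as LLM pre-training data on a 1-5 scale. Reply with a single digit."},
            {"role": "user", "content": doc[:4000]},  # truncate long documents
        ],
    )
    return int(resp.choices[0].message.content.strip()[0])

def keep(doc: str, threshold: int = 4) -> bool:
    """Balance quality against quantity by keeping only documents at or above the threshold."""
    return score_document(doc) >= threshold
```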

In addition to creating expert models, the team also used them to generate synthetic data.

In the later stages of training, user feedback helped them fix problems one by one. They are also exploring RLHF methods, especially online learning.

As for future upgrades, he said he was inspired by o1 and believes reasoning ability deserves deeper research.

It is worth mentioning that while teasing Qwen2.5, the team revealed that it would not be called Strawberry, but Kiwi.

Now the "Kiwi" is ready to try.

Reference Links:

[1]https://x.com/JustinLin610/status/1836461575965938104

[2]https://x.com/Alibaba_Qwen/status/1836449414220779584

[3]https://qwenlm.github.io/blog/qwen2.5/

[4]https://qwenlm.github.io/blog/qwen2.5-llm/

[5]https://qwenlm.github.io/blog/qwen2.5-coder/

[6]https://qwenlm.github.io/blog/qwen2.5-math/

This article comes from the WeChat public account "量子位" (QbitAI), author Bai Xiaojiao, and is republished with authorization by 36氪.
