Large models fought one-on-one for 750,000 rounds: GPT-4 took the championship, and Llama 3 ranked fifth


New test results for Llama 3 are out:

The large-model evaluation community LMSYS has released its latest leaderboard, on which Llama 3 ranks fifth overall and ties GPT-4 for first place in English.

Unlike other benchmarks, this leaderboard is built from one-on-one model battles, with the questions posed and the votes cast by evaluators across the internet.

In the end, Llama 3 took fifth place overall, behind three different versions of GPT-4 and Claude 3's top-tier model, Opus.

On the English-only ranking, Llama 3 surpassed Claude and tied with GPT-4.

Meta's chief AI scientist Yann LeCun was pleased with the result, retweeting the announcement with a "Nice".

Soumith Chintala, the creator of PyTorch, was also excited, calling the achievement incredible and saying he is proud of Meta:

The 400B version of Llama 3 hasn't even come out yet, and it took fifth place with just 70B parameters… I still remember when GPT-4 was released last March, matching its performance seemed almost impossible. … The popularity of AI now is incredible, and I'm very proud of my colleagues at Meta AI for achieving this success.

So, what specific results does this list show?

Nearly 90 models battled 750,000 rounds

As of the latest release, LMSYS has collected nearly 750,000 one-on-one battle results, involving 89 models.

Among them, Llama 3 has fought about 12,700 battles; GPT-4 appears in multiple versions, the most active of which has fought about 68,000.

The figure below shows the number of battles and the win rates of some popular models; neither indicator counts draws.

The leaderboard itself is divided into an overall ranking and multiple sub-rankings. GPT-4-Turbo ranks first overall, tied with the earlier 1106 version and Claude 3's top-tier model, Opus.

Another GPT-4 version (0125) came in second, followed by Llama 3.

Interestingly, the newer 0125 version performs worse than the older 1106.

On the English-only ranking, Llama 3's score is tied with the two leading GPT-4 versions and even surpasses the 0125 version.

First place on the Chinese-language ranking is shared by Claude 3 Opus and GPT-4-1106, while Llama 3 ranks outside the top 20.

Beyond language ability, the leaderboard also has long-text and coding rankings, where Llama 3 again places near the top.

So what exactly are LMSYS's "rules of the game"?

Large model evaluation that everyone can participate in

This is a large-model test that anyone can take part in; both the questions and the judging criteria are decided by the participants themselves.

The "competition" process itself comes in two modes: battle and side-by-side.

In battle mode, after the tester enters a question, the system randomly calls two models from its library. The tester does not know which models were drawn; the interface only shows "Model A" and "Model B".

After both models output their answers, the evaluator chooses which one is better, or declares a tie. If neither answer meets expectations, there is an option for that as well.

Only after a selection is made is the model's identity revealed.

In side-by-side mode, the user picks the specific models to compare, and the rest of the process is the same as battle mode.

However, only votes cast in the anonymous battle mode are counted, and if a model accidentally reveals its own identity during the conversation, the result is invalidated.
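For intuition, the flow could be sketched roughly as follows in Python. This is illustrative only, not LMSYS's implementation; get_reply and get_vote are hypothetical callbacks standing in for the model API and the human evaluator's choice.

```python
# Toy sketch of the anonymous battle flow described above (illustrative only,
# not the LMSYS implementation). get_reply and get_vote are hypothetical
# stand-ins for the model API call and the human evaluator's vote.
import random

def run_battle(models, prompt, get_reply, get_vote):
    model_a, model_b = random.sample(models, 2)   # drawn at random, hidden from the voter
    reply_a = get_reply(model_a, prompt)
    reply_b = get_reply(model_b, prompt)
    vote = get_vote(reply_a, reply_b)             # "A", "B", "tie", or "both bad"
    # A vote only counts if neither reply leaked the model's own identity
    leaked = any(name.lower() in (reply_a + reply_b).lower()
                 for name in (model_a, model_b))
    return {"model_a": model_a, "model_b": model_b,
            "vote": vote, "valid": not leaked}
```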

Based on each model's win rate against every other model, the following chart can be drawn:

The final ranking is obtained by converting the win-rate data into scores using the Elo rating system.

The Elo rating system is a method for calculating the relative skill levels of players, devised by the Hungarian-American physics professor Arpad Elo.

In LMSYS's case, every model starts with a rating (R) of 1,000, and the expected win rate (E) is then calculated according to the following formula.
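In the standard Elo formulation, which matches this description, the expected win rate of model A against model B is:

E_A = 1 / (1 + 10^((R_B - R_A) / 400))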

As testing proceeds, the rating is updated based on the actual result (S), which takes the value 1 for a win, 0 for a loss, and 0.5 for a draw.

The update rule is given by the following formula, where K is a coefficient the tester tunes to the situation.
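In the standard form, each rating is nudged toward the actual result after every battle:

R'_A = R_A + K * (S_A - E_A)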

Once all valid battles have been folded into the calculation, each model's Elo score is obtained.

In practice, however, the LMSYS team found this algorithm was not stable enough, so they applied statistical corrections.

They used bootstrap resampling to obtain more stable scores and to estimate confidence intervals.
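As a rough end-to-end sketch of this pipeline, the following Python code runs a plain Elo pass over a battle log and then bootstraps it. The input format, the K value of 4, and the 100 resampling rounds are assumptions for illustration, not LMSYS's actual settings.

```python
# Minimal sketch of Elo scoring plus bootstrap correction (illustrative only).
# Assumed input: battles is a list of (model_a, model_b, score_a) tuples,
# where score_a is 1 for an A win, 0 for a loss, and 0.5 for a draw.
import random
import statistics
from collections import defaultdict

def compute_elo(battles, k=4, init=1000):
    ratings = defaultdict(lambda: init)           # every model starts at 1,000
    for model_a, model_b, score_a in battles:
        r_a, r_b = ratings[model_a], ratings[model_b]
        # Expected win rate of A against B (standard Elo formula)
        expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        # Move both ratings toward the actual outcome
        ratings[model_a] = r_a + k * (score_a - expected_a)
        ratings[model_b] = r_b + k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)

def bootstrap_elo(battles, rounds=100):
    # Resample the battle log with replacement and recompute Elo each round,
    # then report the median rating and a 95% interval per model.
    samples = [compute_elo(random.choices(battles, k=len(battles)))
               for _ in range(rounds)]
    models = {m for a, b, _ in battles for m in (a, b)}
    summary = {}
    for m in models:
        scores = sorted(s.get(m, 1000) for s in samples)
        summary[m] = (statistics.median(scores),
                      scores[int(0.025 * rounds)],
                      scores[int(0.975 * rounds) - 1])
    return summary
```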

The final revised Elo score becomes the basis for ranking in the list.

One More Thing

Llama 3 can already run on the large model inference platform Groq (not Musk's Grok).

The platform's biggest selling point is speed: it previously ran the Mixtral model at nearly 500 tokens per second.

Llama 3 is also quite fast. The 70B version can run at about 300 tokens per second, and the 8B version is close to 800.

Reference Links:

[1] https://lmsys.org/blog/2023-05-03-arena/

[2] https://chat.lmsys.org/?leaderboard

[3] https://twitter.com/lmsysorg/status/1782483699449332144

This article comes from the WeChat official account "Quantum Bit" (ID: QbitAI), written by Cressey, and is published by 36Kr with authorization.
