The new "king of open-source AI" that claimed to beat GPT-4o is accused of fraud. Don't put blind faith in large-model leaderboards


Have you ever wondered: how do AI models get ranked against one another?

Just as humans have the college entrance exam, AI models have their own exams: benchmark tests.

The college entrance exam has only a handful of subjects, though, while benchmarks come in many varieties. Some test general knowledge; others focus on a specific ability such as mathematics, coding, or reading comprehension.

Benchmark rankings when Google released Gemini

The appeal of benchmarks is that they are intuitive: pull up a leaderboard and the scores are clear at a glance, which attracts users far more effectively than long paragraphs of text.

But how accurate benchmarks really are is another question, and a recent suspected fraud incident has dented their credibility further.

The new king of open-source models, debunked in the blink of an eye

When Reflection 70B appeared on September 6, it looked like a miracle. It came from HyperWrite, a little-known New York startup, yet claimed to be the "world's top open-source model."

How did developer Matt Shumer prove this? With data.

On multiple benchmarks, with only 70B parameters, it reportedly beat heavyweights like GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 405B, while being far more cost-effective than the top closed-source models. People were instantly amazed.

Reflection 70B did not spring out of nowhere. It was said to be built on Meta's Llama 3.1 70B, trained in three weeks, and to use a new technique called Reflection-Tuning, which lets the model detect errors in its own reasoning and correct them before answering.

By analogy with human thinking, this is a bit like the shift from System 1 to System 2 in "Thinking, Fast and Slow": rather than blurting out the first thing that comes to mind, the model slows down its reasoning, reduces hallucinations, and gives more considered answers.
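The exact training recipe was never published, but the prompting pattern the name suggests can be sketched. Below is a minimal, hypothetical illustration of a "reflect, then revise" loop; the `ask` helper is a placeholder for any chat-completion API, not HyperWrite's actual implementation.

```python
# Hypothetical sketch of a "reflect, then revise" answering loop.
# `ask()` stands in for a call to any LLM API; it is not a real library function.

def ask(prompt: str) -> str:
    """Placeholder for a chat-completion request to some model."""
    raise NotImplementedError

def answer_with_reflection(question: str) -> str:
    # Step 1: produce an initial draft answer.
    draft = ask(f"Question: {question}\nThink step by step, then answer.")

    # Step 2: have the model critique its own draft.
    critique = ask(
        f"Question: {question}\nDraft answer: {draft}\n"
        "List any reasoning errors or unsupported claims in the draft."
    )

    # Step 3: revise the draft in light of the critique before replying.
    return ask(
        f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
        "Write a corrected final answer."
    )
```

The idea is simply to spend extra inference steps on self-checking before committing to an answer, which is where the System 1 to System 2 analogy comes from.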

However, doubts soon arose.

On September 8, the third-party evaluation firm Artificial Analysis said it was unable to reproduce the claimed benchmark results.

On MMLU, for example, Reflection 70B scored the same as Llama 3 70B, and significantly lower than Llama 3.1 70B, let alone GPT-4o.

Matt Shumer responded that the third-party results were worse because the Reflection 70B weights had been corrupted when uploaded to Hugging Face, so the public model underperformed the internal API version.

The explanation sounded a bit thin, and the two sides went back and forth. Artificial Analysis later said it had been given access to the private API and that performance there was indeed better, but still short of the officially announced figures.

Soon after, users on X and Reddit joined the debunking effort, suspecting that Reflection 70B was really Llama 3 with a LoRA trained directly on benchmark test sets, able to score high on leaderboards without being genuinely capable.

Some even accused Reflection 70B of being a wrapper around Claude and a scam from start to finish.

On September 11, responding to the uproar, Matt Shumer's team issued a statement denying that the model was a Claude wrapper, while admitting it was still unclear why the benchmark scores could not be reproduced.

The inflated scores, they said, might stem from mistakes made at the outset, data contamination, or configuration errors, and they asked for more time.

The incident has not reached a final conclusion, but it illustrates one thing: AI leaderboards deserve scrutiny. Gaming high scores for self-promotion easily misleads a public with no way to know better.

The many exams for large models, and our ranking anxiety

Let's go back to the most basic question: how do you evaluate a large model's performance?

A crude but simple way is to look at parameter count. Llama 3.1, for example, comes in several sizes: 8B suits development and deployment on consumer-grade GPUs, while 70B suits large-scale AI-native applications.

If parameter count is the "factory spec" that hints at a model's ceiling, benchmarks are the "exams" that measure its actual performance on specific tasks. There are at least dozens of them, each with its own focus, and their scores are not comparable across tests.

MMLU (Massive Multitask Language Understanding), released in 2020, is currently the most mainstream English evaluation dataset.

It contains roughly 16,000 multiple-choice questions across 57 subjects, including mathematics, physics, history, law, and medicine, with difficulty ranging from high school to expert level. It works as a general knowledge test: the more questions a model answers correctly, the stronger it is judged to be.

Last December, Google said Gemini Ultra scored 90.0% on MMLU, higher than GPT-4.

To be fair, Google did not hide the fact that the two were evaluated with different methods: Gemini with CoT (chain-of-thought, step-by-step reasoning) and GPT-4 with 5-shot prompting, so the comparison may not be entirely objective.
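To see why the evaluation method matters, here is a rough sketch of how the two prompting styles differ. The example questions and variable names are invented for illustration and are not taken from MMLU or any real evaluation harness.

```python
# Illustrative only: two ways of prompting the same multiple-choice question.
# The questions below are made up, not drawn from MMLU itself.

solved_examples = [
    "Q: 2 + 2 = ?\nA. 3  B. 4  C. 5  D. 6\nAnswer: B",
    # ...a real 5-shot setup would prepend five such worked examples.
]

question = "Which planet has the most moons?\nA. Earth  B. Mars  C. Saturn  D. Venus"

# Few-shot prompting: show solved examples, then ask for the letter directly.
few_shot_prompt = "\n\n".join(solved_examples) + f"\n\n{question}\nAnswer:"

# Chain-of-thought prompting: ask the model to reason step by step first.
cot_prompt = f"{question}\nLet's think step by step, then end with 'Answer: <letter>'."

print(few_shot_prompt)
print(cot_prompt)
```

Scoring the same model with the two prompts can yield noticeably different accuracy, which is why the prompting method should always be reported alongside the score.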

There are also benchmarks targeting specific sub-capabilities of large models, too many to list.

GSM8K focuses on grade-school math; MATH also tests mathematics but at competition level, covering algebra, geometry, calculus, and more; HumanEval tests Python programming.

Beyond math and science, AI also sits "reading comprehension" exams: DROP has the model read paragraphs and perform complex reasoning over the information in them, while HellaSwag focuses on commonsense reasoning grounded in everyday scenarios.

Test questions for the HellaSwag benchmark

Although most benchmarks are in English, Chinese large models have their own, such as C-Eval, built jointly by Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh, with nearly 14,000 questions across 52 subjects, calculus included.

Chinese benchmark SuperCLUE tests logic and reasoning

So who grades these exams? There are roughly three kinds of "markers": automated programs, as in coding benchmarks where the generated code is executed against tests to check correctness; stronger models such as GPT-4 acting as judges; and human evaluators.
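As a rough sketch of the first kind of grading, here is how a coding benchmark can check a generated solution by running it against unit tests. The task and its tests are invented for illustration; real harnesses such as HumanEval's also sandbox execution and handle timeouts.

```python
# Minimal sketch of automated grading for a coding benchmark.
# The task and tests are made up; real harnesses also sandbox execution.

def run_candidate(candidate_code: str, test_code: str) -> bool:
    """Execute the model's code plus the unit tests; pass = no exception raised."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run assertions against it
        return True
    except Exception:
        return False

# Example: suppose the model was asked to implement `add(a, b)`.
candidate = "def add(a, b):\n    return a + b"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0"

print(run_candidate(candidate, tests))  # True if all assertions pass
```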

Taken together, this battery of exams is far more comprehensive than the Four Books, Five Classics, and Six Arts. But benchmarking has serious hidden pitfalls: the companies being graded are often "both referee and athlete", and, much as a teacher would fear, the students have plenty of ways to cheat.

One pitfall is that the questions leak easily, letting the model "copy the answers".

If a benchmark's test set is public, the model may have "seen" the questions or answers during training. Its performance then looks unrealistically good, because it may not be reasoning its way to answers at all, just reciting memorized ones.

This is the problem of data contamination and overfitting, and it leads to overestimating a model's capabilities.
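One common way to probe for contamination, sketched loosely below, is to check whether long word sequences from a test question already appear verbatim in the training corpus. The n-gram length, threshold, and sample data here are placeholders, not any lab's actual pipeline.

```python
# Rough sketch of an n-gram contamination check between a training document
# and a benchmark question. The data and n-gram length are illustrative only.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(train_doc: str, test_question: str, n: int = 8) -> bool:
    """Flag the question if any long n-gram from it also appears in the training doc."""
    return bool(ngrams(test_question, n) & ngrams(train_doc, n))

train_doc = "... the capital of France is Paris, a fact every model memorizes ..."
question = "Trivia: the capital of France is Paris, a fact every model memorizes."
print(looks_contaminated(train_doc, question))  # True: likely overlap
```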

Research from Renmin University of China and other institutions has found that data related to evaluation sets does sometimes end up in model training.

The other pitfall is that there are many ways to cheat, leaving plenty of room for human manipulation.

While Reflection 70B was being hotly debated on X, Jim Fan, a senior research scientist at NVIDIA, posted that manipulating benchmarks is not difficult.

For example, you can start from the "question bank": train the model on rewritten examples from the test set. Rephrasing test questions in different formats, wordings, and languages is enough to let a 13B model beat GPT-4 on benchmarks such as MMLU, GSM8K, and HumanEval.

You can also change "how the exam is taken" by spending more compute at inference time: with self-reflection, tree-of-thought, and similar techniques, the model reasons more slowly and over multiple passes, which raises accuracy.
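A simple version of "reasoning multiple times" is self-consistency: sample several independent reasoning passes and keep the majority answer. A minimal sketch follows, again with `ask` as a placeholder for any LLM call rather than a specific API.

```python
from collections import Counter

# Minimal self-consistency sketch: sample several answers, take the majority.
# `ask()` is a placeholder for an LLM call that returns a short final answer.

def ask(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a real chat-completion request

def self_consistent_answer(question: str, samples: int = 5) -> str:
    answers = [
        ask(f"{question}\nThink step by step, then give only the final answer.")
        for _ in range(samples)
    ]
    # The most frequent final answer wins.
    return Counter(answers).most_common(1)[0][0]
```

Spending five times the compute this way is perfectly legitimate for a product, but it muddies comparisons when one model's score is single-pass and another's is not.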

Jim Fan’s attitude is clear:

It's amazing that people are still excited about MMLU or HumanEval scores in September 2024. These benchmarks are so broken that manipulating them can be an assignment for undergraduates.

On top of that, benchmark difficulty does not necessarily keep pace with AI progress: the tests are usually static and fixed, while the models keep racing ahead.

Dan Hendrycks, the AI safety researcher who helped create MMLU, told The New York Times in April that MMLU may have only a year or two of shelf life left before it is replaced by different, harder tests.

In the "battle of a hundred models", human society's ranking anxiety has been passed on to AI. With all the behind-the-scenes maneuvering, leaderboards have become marketing tools: a mixed bag, and not all that credible.

Let users vote on which AI model is best

Still, things are often easier to deal with when there are data and standards.

A benchmark is a structured scoring framework: it can inform users' choice of model and also help improve the models themselves. The team behind the Chinese benchmark C-Eval has even said: "Our most important goal is to assist model development."

Benchmarks have their value; the question is how to make them more authoritative and credible.

We already know that training on the test set lets a model "cheat" on benchmarks. Some third-party evaluations start precisely from that gap.

SEAL, the research lab of data-annotation company Scale AI, emphasizes that its datasets are kept private. The logic is easy to follow: only a "closed-book exam" reveals the truth.

Currently SEAL tests coding, instruction following, mathematics, and multilingual ability, with more assessment dimensions to be added.

SEAL's coding ability ranking in August this year

Besides the solve-and-score mode, there is also a more down-to-earth kind of benchmark: the arena.

A representative example is Chatbot Arena, launched by LMSYS, a non-profit organization of researchers from Carnegie Mellon University, UC Berkeley, and elsewhere.

It pits randomly paired, anonymized AI models against each other, lets users vote for the better answer, and ranks the models with the Elo rating system familiar from competitive games such as chess.

Concretely, you ask a question online and two randomly selected, anonymous models A and B answer it. You then vote: A is better, B is better, it's a tie, or both are bad. Only after voting do you see which models A and B actually were.
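For a sense of how pairwise votes become a ranking, here is a bare-bones Elo update for a single vote. This is illustrative only; the arena's production leaderboard fits ratings over all votes at once and is more involved, and the K-factor below is an arbitrary choice.

```python
# Bare-bones Elo update for a single vote, illustrative only.
# K controls how much one vote moves the ratings; real leaderboards fit
# ratings over the full vote history rather than updating one by one.

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: both models start at 1000 and a user prefers model A.
print(elo_update(1000.0, 1000.0, 1.0))  # A gains 16 points, B loses 16
```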

I asked "Which is bigger, 9.9 or 9.11?", a question that has tripped up plenty of AIs before. Both models got it wrong. I voted "both are bad" and then discovered the two contestants were GPT-4o and France's Mixtral.

Chatbot Arena's strengths are obvious. Questions posed by huge numbers of users are far more varied and flexible than lab-built test sets, anyone can see and use the system, and the resulting rankings sit closer to real-world needs.

That contrasts with benchmarks testing advanced mathematics or output safety, which serve research more than the needs of most users.

Chatbot Arena has now collected more than a million votes, and even Musk's xAI has cited its rankings as an endorsement.

But some disagree, arguing that Chatbot Arena is swayed by the biases of the users who happen to vote. Tastes differ: some prefer longer answers, others value concision. If there is no single "best" way to write, how do you compare?

Chatbot Arena therefore recently adjusted its methodology to separate "style" from "content": content is what is said, style is how it is said. Once the influence of answer length and formatting is controlled for, the rankings shift.

In short, however you measure, benchmarks cannot be guaranteed accurate and should not be trusted blindly. They are a reference point, just as the college entrance exam reflects only part of a student's abilities.

The least acceptable behavior, of course, is deliberately gaming benchmarks to endorse yourself, chasing flashy rankings for their own sake.

Back to the original point: we all want AI to solve real problems, whether building products, writing code, generating images, or getting a bit of emotional value from a counseling chat... and no benchmark can tell you which AI converses best for you.

What is fake can never become real, and voting with your feet is the simplest truth of all. The more subjective, more personal impressions and experiences can only be earned through our own hands-on use.

This article comes from the WeChat public account "APPSO" (author: APPSO) and is published by 36Kr with authorization.
