Llama 3.1 magnet link leaks ahead of schedule; the open-source throne changes hands overnight as GPT-4o is surpassed

36kr
07-23

History repeats itself: Llama 3.1 405B has leaked ahead of schedule!

Now, benchmarks and magnet links are flying everywhere.

In addition to the largest 405B model, Meta also upgraded the 8B and 70B models released this April, and extended the context length to 128K.

With this, the model series officially moves up from Llama 3 to Llama 3.1.

According to the metadata in the magnet link, the new model weighs in at 763.48 GiB (about 820 GB).
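(As a quick unit-conversion check of those two figures, using nothing beyond the numbers reported above:)

```python
# 763.48 GiB expressed in decimal gigabytes (1 GiB = 2**30 bytes).
print(f"{763.48 * 2**30 / 1e9:.1f} GB")  # -> 819.8 GB, i.e. "about 820 GB"
```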

The leaked "benchmarks" suggest that even the small 8B model is very strong, while the 70B model surpasses GPT-4o on many benchmarks.

Developers were also shocked when they saw the test results. Topology CEO Aidan McLau exclaimed:

If the Llama 3-405B benchmarks are true, it will:

- Be the best model in the world

- Be tunable by everyone

- Be cheaper than GPT-4o!

HyperWriteAI CEO Matt Shumer predicted that it will become the SOTA among open-source models. (Even the 70B can compete with GPT-4o, and that is before instruction fine-tuning.)

Imagine a GPT-4o-level model running at 330 tokens per second, at one-tenth the cost. That's mind-blowing.

Tomorrow is going to be a wild day!

Zuckerberg's own words also hinted at the 405B's arrival: a quiet moment before a big week.

Many netizens asked OpenAI online: When will the new model be released?

Llama 3.1 family, coming tomorrow

According to the leaked model card, Llama 3.1 will be released on July 23.

The license is a custom commercial license, the Llama 3.1 Community License.

Leaked Model Card: https://pastebin.com/9jGkYbXY

Specifically, the multilingual Llama 3.1 series is a collection of pretrained and instruction-fine-tuned generative models in three parameter sizes: 8B, 70B, and 405B.

The instruction-fine-tuned, text-only Llama 3.1 models (8B, 70B, 405B) are optimized for multilingual dialogue use cases.

In addition to English, they support seven other languages: German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Reportedly, Llama 3.1's new capabilities include a longer context window, support for multilingual input and output, and developer integration with third-party tools.

Benchmarks

A benchmark chart posted on GitHub (now a 404) shows Llama 3.1's strong benchmark performance.

Specifically, in the evaluation of the pretrained base models, Llama 3.1 405B sets new records on general tasks, knowledge reasoning, and reading comprehension.

The improvement is most pronounced on the MMLU and SQuAD sub-benchmarks.

Meanwhile, the 8B and 70B versions of Llama 3.1 improve slightly over Llama 3, although on some metrics the 70B Llama 3.1 actually lags the previous generation.

Among the instruction-fine-tuned models, Llama 3.1 405B is again stronger than its pretrained counterpart, and it clearly outclasses the fine-tuned 8B and 70B versions on reasoning, coding, math, tool-use, and multilingual benchmarks.

The fine-tuned Llama 3.1 8B and 70B models also show significant gains across multiple capability tasks.

Some netizens also compiled the benchmark scores of other leading models. By that comparison, Claude 3.5 Sonnet remains the king across the board.

The fine-tuned Llama 3.1 405B comes out on top only on the MMLU Pro benchmark, beating all the large models with a score of 73.3%.

In addition, the 405B is on par with GPT-4o on the GPQA (graduate-level expertise and reasoning), math, DROP (reading comprehension), MGSM (multilingual math), HumanEval (coding), and BBH (BIG-Bench Hard) benchmarks.

Moreover, the 405B is significantly ahead of the latest GPT-4o mini model.

Llama 3.1 is an autoregressive language model built on an optimized Transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for safety.

For the Llama 3.1 series, the stated token counts refer to pre-training data only.

All versions of the models use Grouped Query Attention (GQA) to improve the scalability of inference.
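For readers who haven't seen GQA before, here is a minimal sketch of the idea in PyTorch: several query heads share each key/value head, which shrinks the KV cache during inference. The head counts, shapes, and omissions (causal masking, RoPE, output projection) are illustrative simplifications, not Llama 3.1's actual configuration.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """x: (batch, seq, dim). Each KV head serves a group of
    n_q_heads // n_kv_heads query heads."""
    b, s, d = x.shape
    head_dim = d // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per shared KV head

    q = (x @ wq).view(b, s, n_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(b, s, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, s, n_kv_heads, head_dim).transpose(1, 2)

    # Broadcast each KV head across its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(b, s, d)

# Toy example: 8 query heads share 2 KV heads (groups of 4),
# so the KV cache is 4x smaller than in full multi-head attention.
d, n_q, n_kv = 64, 8, 2
x = torch.randn(1, 10, d)
wq = torch.randn(d, d)
wk = torch.randn(d, (d // n_q) * n_kv)
wv = torch.randn(d, (d // n_q) * n_kv)
print(grouped_query_attention(x, wq, wk, wv, n_q, n_kv).shape)  # (1, 10, 64)
```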

15T token training data

Like Llama 3, Llama 3.1 is pre-trained on approximately 15 trillion tokens from publicly available sources.

The fine-tuning data includes publicly available instruction datasets as well as more than 25 million synthetically generated samples; the pre-training data has a cutoff of December 2023.

Commercial and research use permitted

Llama 3.1 supports commercial and research use in multiple languages.

The instruction-fine-tuned, text-only models are intended for assistant-style chat, while the pretrained models can be adapted to a variety of natural language generation tasks. The Llama 3.1 collection also allows its model outputs to be used to improve other models, including for synthetic data generation and model distillation.

Out of scope are any uses that violate applicable laws and regulations, uses prohibited by the acceptable use policy and the Llama 3.1 Community License, and use in languages beyond the supported ones.

The team also emphasized that Llama 3.1 was trained on a broader set of languages than the eight officially supported ones. Developers may fine-tune it for other languages, provided they comply with the Community License and related policies and ensure safe, responsible use.

39.3 million GPU hours of training

For pre-training, Meta used custom training libraries, its custom-built GPU cluster, and production infrastructure; fine-tuning, annotation, and evaluation were also carried out on production infrastructure.

Training consumed a cumulative 39.3 million GPU-hours of compute on H100-80GB hardware (700 W TDP).

Training time is the total GPU time required to train each model, and power consumption is the peak power capacity of each GPU device, adjusted for power usage effectiveness.
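As a rough sanity check of the figures in this section (not Meta's own accounting, which folds in data-center overheads such as PUE), the headline numbers imply roughly the following energy use and emissions intensity:

```python
# Back-of-envelope check; treat as order-of-magnitude only.
gpu_hours = 39.3e6      # total H100-80GB GPU-hours for the Llama 3.1 family
gpu_power_kw = 0.700    # 700 W peak power per GPU
energy_gwh = gpu_hours * gpu_power_kw / 1e6
co2_tonnes = 11_390     # location-based emissions estimate from the model card
print(f"~{energy_gwh:.1f} GWh")                                # ~27.5 GWh
print(f"~{co2_tonnes / (energy_gwh * 1e3):.2f} kg CO2eq/kWh")  # ~0.41 kg CO2eq/kWh
```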

Total location-based greenhouse gas emissions from training were estimated at 11,390 tonnes of CO2 equivalent (CO2eq).

Meta emphasizes that it has maintained net-zero greenhouse gas emissions since 2020 and matches 100% of its electricity use with renewable energy, so total market-based emissions are 0 tonnes CO2eq.

Significant risks

Meta also conducted tests on major risks.

These include helpfulness for CBRNE (chemical, biological, radiological, nuclear, and explosive materials) threats, child safety, and cyber attacks.

In the cyber attack space, the team investigated whether LLMs could improve human capabilities in hacking tasks, including skill level and speed.

The research focuses on evaluating LLMs' ability to act as autonomous agents in cyber-attack operations, particularly in ransomware attacks.

The main goal is to assess whether these models can effectively carry out complex cyber attacks as independent agents, without human intervention.

Netizens went wild, witnessing history once again

Once the magnet link went out, impatient netizens started downloading immediately, though it may be a long wait.

Some netizens are waiting for the release of Llama 3.1 405B tomorrow to witness history once again!

The gap between the open source model and the closed source model has narrowed again.

Someone also tested the classic trap question "Which is bigger, 9.11 or 9.9?", and Llama 3.1-405B actually answered it correctly.

For the "GPU poor", 820GB is too much to run on a laptop.

References:

https://x.com/bindureddy/status/1815443198459990098

https://x.com/kimmonismus/status/1815314833236984274

https://x.com/mattshumer_/status/1815453195717742838

https://x.com/swishfever/status/1815512729286815756

This article comes from the WeChat public account "Xinzhiyuan" , author: Xinzhiyuan, published by 36Kr with authorization.
