In the space of a single night, China's large models have flexed their muscles hard on the international stage.
Recently, domestic large-model developer DeepSeek announced that the first version of DeepSeek-V3 has been launched and open-sourced at the same time.
Multiple benchmark test results show that DeepSeek-V3 surpasses other open-source models such as Qwen2.5-72B and Llama-3.1-405B, and is on par with GPT-4o and Claude-3.5-Sonnet in performance.
The technical report notes that the model's pre-training stage took only about two months on 2,048 GPUs, and the cost came to just $5.576 million.
High value created at low cost: DeepSeek-V3, which can fairly be called a source of national pride, immediately won endorsements from a large number of overseas AI professionals.
Wake up one morning, and DeepSeek is well and truly on fire.
Try it here: chat.deepseek.com
Hands-on with DeepSeek-V3: this time it really is different
First, let's take a look at the impressive scorecard provided by DeepSeek:
- Encyclopedic knowledge: On knowledge-oriented tasks (MMLU, MMLU-Pro, GPQA, SimpleQA), DeepSeek-V3 improves markedly over the previous-generation DeepSeek-V2.5 and approaches the current best-performing model, Claude-3.5-Sonnet-1022.
- Long text: On long-text evaluations, DeepSeek-V3 outperforms other models on average across DROP, FRAMES and LongBench v2.
- Code: DeepSeek-V3 is far ahead of all non-o1-class models on algorithmic coding (Codeforces) and approaches Claude-3.5-Sonnet-1022 on engineering code (SWE-Bench Verified).
- Mathematics: On the American math competitions (AIME 2024, MATH) and the Chinese national high school math competition (CNMO 2024), DeepSeek-V3 significantly outperforms all open-source and closed-source models.
- Chinese ability: DeepSeek-V3 performs on par with Qwen2.5-72B on education-oriented evaluations such as C-Eval and on pronoun disambiguation, but takes a clearer lead on the factual-knowledge benchmark C-SimpleQA.
After its release, DeepSeek-V3 immediately drew a huge response both at home and abroad.
Deedy, formerly of Google's search team, stated flatly that DeepSeek V3 is, without a doubt, the strongest open-source large model in the world.
DeepSeek-V3's efficiency also won praise from Andrej Karpathy, a founding member of OpenAI:
"Does this mean you don't need large GPU clusters to develop frontier-level LLMs? Not necessarily, but you do have to make sure you use your resources efficiently. This result is a nice demonstration that there is still plenty of room for optimization in data and algorithms."
Meta AI research scientist Tian Yuandong was excited enough to post about it twice:
"Went through the report. The breakthroughs they pulled off on H800, starting from scratch, are amazing 🤯
FP8 pre-training, MoE, strong performance on a very tight budget, bootstrapping via CoT distillation... Wow, this is truly remarkable work 👏👏 👍👍"
After playing around with it, X user Tom Dörr exclaimed that DeepSeek V3 is so smart it understands what he means without any explanation: "it feels like there's a ghost in the machine."
And hold on, the community has even wilder demos.
Some users stacked multiple M4 Mac minis together just to run DeepSeek-V3 locally, while other developers have already used DeepSeek-V3 to whip up small games.
Unlike ChatGPT, Claude and other overseas offerings, DeepSeek-V3 is free for everyone and usable in China right now, so we have already taken it for a spin on your behalf.
Honestly, DeepSeek-V3's response speed exceeded my expectations.
The previous V2.5 generated about 20 tokens per second (roughly 7-8 Chinese characters), while the new V3 jumps straight to 60 tokens per second, a threefold speedup.
To put it another way, V2.5 talks at the pace of normal conversation, while V3 sounds like a trained news anchor rattling through a bulletin.
For now, though, DeepSeek-V3 does not support multimodal input or output, so we will have to wait patiently on that front.
After some hands-on testing, questions like "which is bigger, 9.8 or 9.11?" and "how many r's are in strawberry?" are no longer a challenge for it.
Let's try something more challenging.
"I have 6 eggs, 2 were broken, 2 were fried, 2 were eaten, how many are left?"
DeepSeek-V3 answered quickly, but it still fell into the brain teaser's trap (answering 2 eggs), while GPT-4o got it right (4 eggs). This round goes to GPT-4o.
Emotional-intelligence tests have been all over X lately, so we tried those too.
It turns out that both GPT-4o and DeepSeek-V3 seem to be fond of the number "42".
Logical questions didn't confuse GPT-4o and DeepSeek-V3 either.
"If tomorrow is sunny, then I will go camping outdoors today. If I go camping outdoors today, does that mean tomorrow must be sunny?"
To see whether DeepSeek-V3 has any lopsided weaknesses across subjects, we also asked GPT-4o to set a math problem for it and Claude-3.5-Sonnet.
"Let the function f(x,y) = x^3 + 3xy^2 - 3x - y^3 + 2y. Find the gradient of the function at the point (1,1), and determine whether the point is an extremum point. If it is an extremum point, please judge whether it is a local maximum, local minimum, or saddle point."
After a moment, DeepSeek-V3 and Claude-3.5-Sonnet each gave their answers.
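For reference, here is our own working of that problem (independent of either model's transcript):

```latex
\nabla f(x,y) = \bigl(3x^{2} + 3y^{2} - 3,\; 6xy - 3y^{2} + 2\bigr),
\qquad \nabla f(1,1) = (3,\,5) \neq (0,0).
```

Since the gradient does not vanish at (1,1), that point is not even a critical point, so it is neither a local maximum, a local minimum, nor a saddle point.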

Who says AI can only burn money? What did DeepSeek-V3 get right?
Flipping through DeepSeek-V3's technical report, one word keeps jumping out: innovation.
DeepSeek-V3 is a self-developed MoE model with 671B total parameters, of which 37B are activated per token, pre-trained on 14.8T tokens.
The MoE architecture is not hard to understand: picture a company with experts from different departments (finance, technology, marketing, and so on); each expert is proficient in their own field, and none of them has to handle all of the work.
Likewise, each "expert" in an MoE model specializes in certain types of tasks, and when a task comes in, the model routes it to the most suitable experts, along the lines of the toy sketch below.
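The sketch below is a generic top-k MoE router for illustration only, not DeepSeek-V3's actual implementation; the expert count, top-k value, and hidden size are made-up numbers.

```python
import torch
import torch.nn.functional as F

# Toy MoE routing: a gate scores every expert for each token, only the top-k
# experts actually run, and their outputs are mixed with the normalized gate weights.
num_experts, top_k, d_model = 8, 2, 16          # illustrative sizes, not DeepSeek-V3's
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
gate = torch.nn.Linear(d_model, num_experts)

def moe_forward(x):                              # x: (batch, d_model)
    scores = gate(x)                             # affinity of each token to each expert
    weights, idx = scores.topk(top_k, dim=-1)    # keep only the k best experts per token
    weights = F.softmax(weights, dim=-1)         # normalize the selected gate weights
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(num_experts):
            mask = idx[:, slot] == e             # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * experts[e](x[mask])
    return out

print(moe_forward(torch.randn(4, d_model)).shape)  # torch.Size([4, 16])
```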
Building on the efficiency of its predecessor DeepSeek-V2, the model combines Multi-head Latent Attention (MLA) with the DeepSeekMoE architecture to achieve efficient inference and cost-effective training.

The report also highlights two key innovations that DeepSeek-V3 introduces.
One is an auxiliary-loss-free load balancing strategy, and the other is a Multi-Token Prediction (MTP) training objective; a rough sketch of the load-balancing idea follows.
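As we read the report, the auxiliary-loss-free idea is roughly this: each expert carries a bias that is added to its affinity score only when the top-k experts are selected, and the bias is nudged down for overloaded experts and up for underloaded ones, steering traffic without an extra loss term. The sketch below is our own rough interpretation; the update step and counting scheme are illustrative assumptions, not the paper's exact procedure.

```python
import torch

num_experts, top_k, gamma = 8, 2, 0.001          # gamma: assumed bias update step
bias = torch.zeros(num_experts)                  # one balancing bias per expert

def balanced_topk(scores):                       # scores: (tokens, num_experts)
    _, idx = (scores + bias).topk(top_k, dim=-1) # bias influences selection only;
    return idx                                   # gate weights would still use raw scores

def update_bias(chosen):
    load = torch.bincount(chosen.flatten(), minlength=num_experts).float()
    overloaded = load > load.mean()
    bias[overloaded] -= gamma                    # discourage busy experts
    bias[~overloaded] += gamma                   # encourage idle experts

chosen = balanced_topk(torch.randn(32, num_experts))
update_bias(chosen)
print(bias)
```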
Some two thousand GPUs and two months of time: DeepSeek has demonstrated the value of technical innovation in the most elegant way possible.
Specifically, the model was pre-trained on 14.8 trillion diverse, high-quality tokens and then further refined through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages.

The pre-training stage took less than two months on a cluster of 2,048 H800 GPUs, for a total of 2,664,000 GPU hours.
Through co-design of algorithms, frameworks, and hardware, the total training cost of DeepSeek-V3 came to $5.576 million, covering pre-training, context-length extension, and post-training.
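A quick back-of-the-envelope check of those numbers (our own arithmetic, using only the figures quoted above):

```latex
\frac{2{,}664{,}000 \text{ GPU hours}}{2{,}048 \text{ GPUs}} \approx 1{,}300 \text{ hours} \approx 54 \text{ days}.
```

That is consistent with the "less than two months" figure, and dividing the $5.576 million total by the overall GPU-hour count implies a rental rate on the order of $2 per H800 GPU hour.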

The contributor and acknowledgment lists in the technical report consist entirely of Chinese names.
For more details, please check the technical report: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
For developers, the DeepSeek-V3 API will be priced at 0.5 RMB per million input tokens on a cache hit (2 RMB on a cache miss) and 8 RMB per million output tokens.
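To make the pricing concrete, here is a tiny cost calculator using those listed rates; it is purely an arithmetic illustration of the figures above, not an official SDK or billing tool.

```python
# DeepSeek-V3 API prices quoted above, in RMB per million tokens:
# 0.5 for cache-hit input, 2 for cache-miss input, 8 for output.
def api_cost_rmb(hit_tokens: int, miss_tokens: int, output_tokens: int) -> float:
    return (hit_tokens * 0.5 + miss_tokens * 2 + output_tokens * 8) / 1_000_000

# Example: 1M cache-hit input + 1M cache-miss input + 1M output tokens.
print(api_cost_rmb(1_000_000, 1_000_000, 1_000_000))  # 10.5 (RMB)
```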

More importantly, DeepSeek, which pursues inclusive AGI, has taken the lead in releasing the native FP8-trained weights of DeepSeek-V3.
Thanks to support from the open-source community, SGLang and LMDeploy already provide native FP8 inference for the V3 model, while TensorRT-LLM and MindIE offer BF16 inference.
In addition, to make it easier for the community to adapt the model and broaden its applications, DeepSeek also provides a script that converts the FP8 weights to BF16.
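Such a conversion essentially dequantizes each FP8 weight with its stored scaling factor and casts the result to BF16. The snippet below is only a sketch of that idea (it is not DeepSeek's official script, and it uses a single per-tensor scale, whereas the released checkpoints store finer-grained scales); it assumes a PyTorch version with float8 support.

```python
import torch

def fp8_to_bf16(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Dequantize (value * scale) in float32, then cast down to BF16 for frameworks
    # that lack native FP8 kernels.
    return (w_fp8.to(torch.float32) * scale).to(torch.bfloat16)

w = torch.randn(4, 4).to(torch.float8_e4m3fn)  # stand-in for a stored FP8 weight
s = torch.tensor(0.02)                         # stand-in per-tensor scale
print(fp8_to_bf16(w, s).dtype)                 # torch.bfloat16
```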
For model weight download and more local deployment information, please refer to: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
You could say that this year's Santa Claus came from China's DeepSeek.
And this Christmas gift from China has let the world witness the speed of Chinese AI.
Overseas there is Meta; at home there are domestic players such as DeepSeek, Zhipu and Mianbi, and China's presence in the open-source community keeps rising.

As more companies pour their energy into the flourishing of China's open-source ecosystem, they are also calling for a return to that pure, altruistic open-source spirit.
If this morning's latest ChatGPT outage is a reminder of why diversity among AI models matters, then next time we will have another reliable option.
That is China's DeepSeek-V3.
One more thing
Recently, a tool that fabricates fake "ChatGPT o3" chat screenshots has been making the rounds, so we generated a chat screenshot of our own.
Since o3 itself "said so", we can only choose to believe it (doge).

Here is the experience link: https://chatgpt-meme-generator.vercel.app/
This article comes from the WeChat public account "APPSO" (author: Discovering Tomorrow's Products) and is republished by 36Kr with authorization.