Illustrated guide: How was DeepSeek R1 made?

Interpreting the training process of DeepSeek-R1 based on the technical report released by DeepSeek.

Author: Jiangxinling, Powering AI

Image source: Generated by Boundless AI

How does DeepSeek train its R1 reasoning model?

This article is mainly based on the technical report released by DeepSeek and interprets the training process of DeepSeek-R1, focusing on the four strategies for constructing and improving reasoning models.

The original article is by researcher Sebastian Raschka, published at:

https://magazine.sebastianraschka.com/p/understanding-reasoning-llms

This article will summarize the core training part of the R1 reasoning model.

First, based on the technical report released by DeepSeek, here is a diagram of the R1 training process.

Let's go through the process shown in the above image:

(1) DeepSeek-R1-Zero: This model is based on the DeepSeek-V3 base model released last December. It is trained using reinforcement learning (RL) with two types of reward mechanisms. This approach is referred to as "cold start" training because it does not include a supervised fine-tuning (SFT) step, which is usually part of reinforcement learning from human feedback (RLHF).

(2) DeepSeek-R1: This is DeepSeek's flagship reasoning model, built upon DeepSeek-R1-Zero. The team improved the "cold start" R1-Zero model through an additional supervised fine-tuning stage and further reinforcement learning training.

(3) DeepSeek-R1-Distill: The DeepSeek team used the supervised fine-tuning data generated in the previous steps to fine-tune Qwen and Llama models, enhancing their reasoning capabilities. Although this is not distillation in the traditional sense, the process uses outputs of the larger 671B DeepSeek-R1 model to train smaller models (Llama 8B and 70B, as well as Qwen 1.5B-32B).

The following will introduce the four main methods for constructing and improving the reasoning model.

1. Inference-time scaling

One way to improve the reasoning capabilities of LLMs (or any capabilities in general) is through inference-time scaling: increasing computational resources during inference to improve output quality.

Analogously, just as humans can often provide better answers when given more time to think about complex problems, we can employ techniques to encourage LLMs to "think" more deeply when generating answers.

A simple way to implement inference-time scaling is through clever prompt engineering. A classic example is Chain-of-Thought (CoT) prompting, where phrases like "think step by step" are added to the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, often leading to more accurate results on complex questions. (Note that for relatively simple knowledge-based questions like "What is the capital of France?", this strategy adds little, which is itself a practical heuristic for deciding whether a reasoning model is appropriate for a given input query.)
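As a concrete illustration, here is a minimal sketch of CoT-style prompting in Python. The `query_llm` helper and the exact prompt wording are my own assumptions for illustration, not something taken from the report.

```python
# Minimal sketch of Chain-of-Thought prompting.
# `query_llm` is a hypothetical placeholder for an actual LLM API call.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LLM client")

def answer_with_cot(question: str) -> str:
    # Appending a "think step by step" style instruction encourages the model
    # to emit intermediate reasoning tokens before committing to an answer.
    prompt = (
        f"{question}\n\n"
        "Think step by step, then give the final answer on the last line."
    )
    return query_llm(prompt)
```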

The aforementioned CoT method can be viewed as inference-time scaling, as it increases the reasoning cost by generating more output tokens.

Another approach to inference-time scaling is the use of voting and search strategies. A simple example is majority voting, where the LLM generates multiple answers and the final answer is selected by majority vote. Similarly, we can use beam search and other search algorithms to generate better responses.
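A minimal sketch of majority voting (also known as self-consistency) is shown below; the sampling helper and answer-extraction helper are assumed to be supplied by the caller and are purely illustrative.

```python
from collections import Counter
from typing import Callable, List

def majority_vote(
    question: str,
    sample_fn: Callable[[str], str],   # assumed: draws one completion at temperature > 0
    extract_fn: Callable[[str], str],  # assumed: pulls the short final answer from a response
    n_samples: int = 8,
) -> str:
    # Sample several independent answers, then keep the one that occurs most often.
    answers: List[str] = [extract_fn(sample_fn(question)) for _ in range(n_samples)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```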

I recommend the paper "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters".

Different search-based methods rely on process-reward models to score candidate answers and select the best one.
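For example, a simple best-of-N selection with a reward model might look like the sketch below; `score_fn` stands in for a process- or outcome-reward model and is an assumption for illustration.

```python
from typing import Callable, Sequence

def best_of_n(
    question: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],  # assumed reward model: (question, candidate) -> score
) -> str:
    # Score each sampled candidate and return the highest-scoring one.
    return max(candidates, key=lambda c: score_fn(question, c))
```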

The DeepSeek R1 technical report states that their model did not use inference-time scaling techniques. However, this technology is often implemented at the application layer on top of LLMs, so DeepSeek may have applied it in their applications.

I speculate that OpenAI's o1 and o3 models use inference-time scaling, which could explain their relatively higher usage costs compared to models like GPT-4. In addition to inference-time scaling, o1 and o3 are likely trained through a reinforcement learning process similar to that of DeepSeek R1.

2. Pure Reinforcement Learning

A particularly noteworthy point in the DeepSeek R1 paper is their finding that reasoning can emerge from pure reinforcement learning. Let's explore what this means.

As mentioned earlier, DeepSeek developed three R1 models. The first is DeepSeek-R1-Zero, which is built upon the DeepSeek-V3 base model. Unlike the typical pipeline, which includes a supervised fine-tuning (SFT) step before reinforcement learning, DeepSeek-R1-Zero is trained entirely through reinforcement learning, without an initial SFT stage, as shown in the image below.

Still, this reinforcement learning process is similar to the reinforcement learning from human feedback (RLHF) method commonly used for preference fine-tuning of LLMs. However, as mentioned earlier, the key difference in DeepSeek-R1-Zero is that they skip the supervised fine-tuning (SFT) stage usually used for instruction tuning. This is why they call it "pure" reinforcement learning.

In terms of rewards, they did not use a reward model trained on human preferences, but rather adopted two types of rewards: accuracy rewards and format rewards.

  • Accuracy rewards use a LeetCode compiler to verify programming answers and a deterministic system to evaluate math answers.

  • Format rewards rely on an LLM evaluator to ensure that answers follow the expected format, such as placing the reasoning steps within <think> tags. (A minimal sketch of both checks follows this list.)
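Here is a minimal sketch of what such rule-based checks could look like. The regular expression and the string comparison are simplifications I am assuming for illustration; the report describes these rewards only at a high level.

```python
import re

def format_reward(response: str) -> float:
    # 1.0 if the reasoning is wrapped in <think>...</think> tags, else 0.0.
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def math_accuracy_reward(response: str, reference_answer: str) -> float:
    # A deterministic check: does the last line of the response contain the reference answer?
    last_line = response.strip().splitlines()[-1] if response.strip() else ""
    return 1.0 if reference_answer.strip() and reference_answer.strip() in last_line else 0.0
```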

Surprisingly, this approach is sufficient to allow the LLM to evolve basic reasoning skills. Researchers observed an "aha moment" where the model started generating traces of reasoning in its answers, even though it was not explicitly trained for this, as shown in the image from the R1 technical report.

While R1-Zero is not a top-tier reasoning model, as shown in the image, it does exhibit reasoning capabilities by generating intermediate "thinking" steps. This demonstrates the feasibility of using pure reinforcement learning to develop reasoning models, and DeepSeek is the first team to showcase (or at least publish) this approach.

3. Supervised Fine-tuning and Reinforcement Learning (SFT + RL)

Next, let's look at the development process of DeepSeek's flagship reasoning model, DeepSeek-R1, which can be considered a textbook example of building a reasoning model.

This model builds upon DeepSeek-R1-Zero, incorporating additional supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance its reasoning performance.

It's worth noting that including a supervised fine-tuning stage before reinforcement learning is common in the standard reinforcement learning from human feedback (RLHF) process. OpenAI's o1 model was likely developed using a similar approach.

As shown in the image, the DeepSeek team used DeepSeek-R1-Zero to generate what they call "cold start" supervised fine-tuning (SFT) data. The term "cold start" means that this data is generated by the DeepSeek-R1-Zero model, which has not been trained on any supervised fine-tuning data.

Using this cold start SFT data, DeepSeek first trained the model through instruction fine-tuning, followed by another reinforcement learning (RL) stage. This RL stage retained the accuracy rewards and format rewards used in the DeepSeek-R1-Zero RL process. However, they added a language consistency reward to prevent the model from switching between multiple languages within a single response.
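The report does not give a formula for the language consistency reward; one plausible sketch is the fraction of tokens that appear to be in the target language, as below. The ASCII-based heuristic for "English" is purely an assumption for illustration.

```python
def language_consistency_reward(response: str, target_is_english: bool = True) -> float:
    # Fraction of whitespace-separated tokens that look like the target language.
    tokens = response.split()
    if not tokens:
        return 0.0
    ascii_tokens = sum(1 for t in tokens if t.isascii())
    in_target = ascii_tokens if target_is_english else len(tokens) - ascii_tokens
    return in_target / len(tokens)
```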

After the RL stage, they entered another round of SFT data collection. In this stage, they generated 600,000 Chain-of-Thought (CoT) SFT examples using the latest model checkpoint, and an additional 200,000 knowledge-based SFT examples using the DeepSeek-V3 base model.


Then, these 600,000 + 200,000 SFT samples were used to perform instruction fine-tuning on the DeepSeek-V3 base model, followed by a final round of RL. At this stage, for math and programming problems, they again used rule-based methods to determine accuracy rewards, while for other types of problems they used human preference labels. In summary, this is very similar to conventional RLHF, except that the SFT data contains (more) chain-of-thought examples, and the RL stage includes verifiable rewards in addition to human preference-based rewards.
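A rough sketch of how rewards could be routed in this final RL stage, under the assumption that a rule-based checker and a learned preference model are available as callables (both are illustrative stand-ins, not DeepSeek's actual implementation):

```python
from typing import Callable, Optional

def final_stage_reward(
    prompt: str,
    response: str,
    task_type: str,                                  # e.g. "math", "code", or "general"
    reference: Optional[str],                        # ground-truth answer, if one exists
    rule_check: Callable[[str, str], float],         # assumed rule-based accuracy check
    preference_score: Callable[[str, str], float],   # assumed human-preference reward model
) -> float:
    # Math and coding prompts get a verifiable, rule-based reward;
    # other prompts fall back to a learned preference reward.
    if task_type in ("math", "code") and reference is not None:
        return rule_check(response, reference)
    return preference_score(prompt, response)
```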

The final model, DeepSeek-R1, shows a significant performance improvement over DeepSeek-R1-Zero thanks to the additional SFT and RL stages, as shown in the table below.

4. Supervised Fine-tuning (SFT) and Distillation

So far, we have introduced three key methods for building and improving reasoning models:

1/ Inference-time scaling, a technique that can enhance reasoning capabilities without the need to train or otherwise modify the underlying model.

2/ Pure RL, as employed in DeepSeek-R1-Zero, which demonstrates that reasoning can emerge as a learned behavior without the need for supervised fine-tuning.

3/ Supervised fine-tuning (SFT) + reinforcement learning (RL), which resulted in DeepSeek's flagship reasoning model, DeepSeek-R1.

There is one more: model distillation. DeepSeek has also released smaller models trained through what they call a distillation process. In the context of LLMs, distillation does not necessarily follow the classic knowledge distillation method used in deep learning, where a smaller "student" model is trained on the logit outputs of a larger "teacher" model as well as the target dataset.

Here, however, distillation refers to performing instruction fine-tuning on smaller LLMs, such as the Llama 8B and 70B models and the Qwen 2.5 models (1.5B-32B), using the supervised fine-tuning (SFT) dataset generated by larger LLMs, namely the DeepSeek-V3 and DeepSeek-R1 checkpoints. In fact, the SFT data used for this distillation process is the same dataset described in the previous section for training DeepSeek-R1.
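To make the point that this "distillation" is really just SFT on teacher outputs, here is a sketch of how a single training record might be assembled. The field names are an assumption for illustration; the exact data format DeepSeek used is not public. The key point is that no teacher logits are involved.

```python
from typing import Dict, List, Tuple

def build_distillation_record(prompt: str, teacher_response: str) -> Dict[str, str]:
    # The student is simply trained to reproduce the teacher's full response,
    # including the intermediate reasoning -- no teacher logits are involved.
    return {"prompt": prompt, "completion": teacher_response}

def build_distillation_dataset(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    # `pairs` holds (prompt, teacher_response) tuples collected from the larger
    # DeepSeek checkpoints, as described above.
    return [build_distillation_record(p, r) for p, r in pairs]
```

Any standard supervised fine-tuning trainer can then be run on such records against the smaller Llama or Qwen checkpoints.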

To illustrate this process, I have highlighted the distillation part in the diagram below.

Why did they develop these distilled models? There are two key reasons:

1/ Smaller models are more efficient. This means they have lower running costs and can also run on lower-end hardware, which is particularly appealing to many researchers and hobbyists.

2/ As a case study of pure supervised finetuning (SFT). These distilled models are an interesting benchmark, showcasing what level of performance can be achieved through pure supervised finetuning without reinforcement learning.

The table below compares the performance of these distilled models with other popular models, as well as with DeepSeek-R1-Zero and DeepSeek-R1.

As we can see, while the distilled models are orders of magnitude smaller than DeepSeek-R1, they are significantly more powerful than DeepSeek-R1-Zero, though still weaker than DeepSeek-R1. It is also interesting to note that these models perform reasonably well compared to o1-mini (o1-mini itself may be a similar distilled version of o1).

There is another interesting comparison worth mentioning. The DeepSeek team tested whether the emergent reasoning behavior observed in DeepSeek-R1-Zero could also appear in smaller models. To investigate this, they applied the same pure reinforcement learning method from DeepSeek-R1-Zero directly to Qwen 32B.

The table below summarizes the results of this experiment, where QwQ-32B-Preview is a reference reasoning model based on Qwen 2.5 32B and developed by the Qwen team. This comparison provides some additional insight into whether pure reinforcement learning alone can induce reasoning capabilities in models much smaller than DeepSeek-R1-Zero.

Interestingly, the results suggest that for smaller models, distillation is much more effective than pure reinforcement learning. This aligns with the notion that relying solely on reinforcement learning may not be sufficient to induce strong reasoning capabilities in models of this scale, and that supervised finetuning on high-quality reasoning data may be a more effective strategy when dealing with small models.

Conclusion

We have discussed four different strategies for building and improving reasoning models:

  1. Inference-time scaling: No additional training required, but it increases inference cost, so large-scale deployment becomes more expensive as the user count or query volume grows. However, this remains a simple and effective method for enhancing the performance of already powerful models. I strongly suspect that o1 employs inference-time scaling, which also explains why its cost per generated token is higher than that of DeepSeek-R1.

  2. Pure RL: Interesting from a research perspective, as it lets us study the emergence of reasoning as a learned behavior. However, in practical model development, the combination of reinforcement learning and supervised fine-tuning (RL + SFT) is a better choice, as it can build stronger reasoning models. I also strongly suspect that o1 was trained using RL + SFT. More specifically, I believe o1 started from a weaker, smaller base model than DeepSeek-R1, but made up for the gap through RL + SFT and inference-time scaling.

  3. RL + SFT: As mentioned above, this is the key method for building high-performance reasoning models. DeepSeek-R1 has shown us an excellent blueprint for achieving this goal.

  4. Distillation: An attractive approach, particularly for creating smaller and more efficient models. However, its limitation is that it does not drive innovation or produce the next generation of reasoning models, since distillation always relies on an existing, stronger model to generate the supervised fine-tuning (SFT) data.

Looking ahead, an interesting direction I anticipate is the combination of RL + SFT (method 3) with inference-time scaling (method 1). This is likely what OpenAI's o1 is doing, except that o1 may be based on a weaker base model than DeepSeek-R1, which explains why DeepSeek-R1 achieves outstanding reasoning performance at a relatively lower cost.
