DeepSeek R1 is a large language model that uses additional test-time compute to generate a response. Unlike earlier decoder-only models, which simply continue the given text (and may be fine-tuned for conversation), R1 emits reasoning tokens before producing its final answer. According to the researchers, its performance is on par with OpenAI’s o1 model.

Terminology

First, I will briefly describe some terminology related to training techniques:

  • Supervised fine-tuning (SFT) is a process that uses input/output pairs to fine-tune a model directly. In a reinforcement learning setting, SFT can help mitigate cold-start issues by providing initial policy behavior prior to RL training. The downside of SFT is that the input/output pairs can be expensive to acquire.

  • Preference fine-tuning involves training a model using preferences between different outputs for the same input. These preferences are usually human-generated: relative quality judgments are easier for humans to make than absolute scores, yet they still provide a valuable training signal.

  • Rejection sampling fine-tuning (RFT) is a method for generating synthetic training data. First, a model produces k diverse candidate completions for a prompt using temperature sampling. Next, a reward model ranks or selects the best completions for training data.

    • In the simplest version, this technique generates k (reasoning, answer) pairs for each prompt at a nonzero temperature (e.g. 0.7) and filters out pairs whose answers are incorrect; see the sketch after this list.
  • Large-scale reinforcement learning (RL), as DeepSeek refers to it, is a training method that assigns synthetic labels (rewards) to a response based on easily verified outcomes of reasoning problems, such as correctness and formatting.

    • This is in contrast to a supervised learning approach, which would typically enforce strict token-level matching.
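
To make the rejection sampling step concrete, here is a minimal sketch of the data-generation loop described above. The `generate` and `is_correct` helpers are hypothetical stand-ins for the model's sampling call and an answer verifier (e.g. exact match against a known result); they are my assumptions, not names from the papers.

```python
# Minimal sketch of rejection-sampling data generation (RFT-style).
# `generate` and `is_correct` are hypothetical stand-ins for the model's
# sampling call and an answer verifier.
def collect_rft_examples(prompts, generate, is_correct, k=8, temperature=0.7):
    dataset = []
    for prompt in prompts:
        for _ in range(k):
            # Sample a (reasoning, answer) pair at a nonzero temperature.
            reasoning, answer = generate(prompt, temperature=temperature)
            # Keep only completions whose final answer verifies as correct.
            if is_correct(prompt, answer):
                dataset.append({"prompt": prompt, "reasoning": reasoning, "answer": answer})
    return dataset
```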

Here are a few terms related to model architecture:

  • Mixture of Experts (MoE) is an architecture which uses a “router” to select from a pool of candidate “expert” networks.

    • A typical decoder-only transformer such as GPT-2 is simply a tall stack of decoder blocks, where 100% of the weights are activated for every token. In this paradigm, making the model bigger means either making the blocks bigger or stacking more of them, both of which increase inference costs.
    • However, replacing simple decoder blocks with MoE blocks allows models to include more parameters without significantly increasing inference costs, since only a fraction of the experts is active at any one time (see the sketch after this list).
  • Multi-head Latent Attention (MLA) is an alternative to traditional multihead attention which improves inference efficiency by using a low-rank approximation for the KV cache.
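
Here is a toy MoE block in PyTorch, just to make the routing idea concrete. The expert sizes, top-k value, and the naive per-expert loop are illustrative choices of mine, not DeepSeek's implementation (which adds load-balancing machinery and fused kernels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: a linear router picks the top-k experts
    per token, and only those experts' weights are used in the forward pass."""

    def __init__(self, d_model, d_hidden, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                       # x: (num_tokens, d_model)
        gate_logits = self.router(x)            # (num_tokens, num_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):              # naive loop; real kernels batch tokens by expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```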

The release of R1 follows other papers from DeepSeek that go into extensive depth on many of the primitives discussed above. They’re worth a read:

  • DeepSeek-V2 introduces Multi-head Latent Attention (MLA) and several Mixture of Experts (MoE) optimizations to improve load balancing at the device level and expert level.

  • DeepSeekMath discusses Rejection Sampling Fine-Tuning (RFT) and compares Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO).

  • DeepSeek-V3 describes a multi-token prediction training objective, improvements to MoE load balancing, SFT and RL for fine-tuning, and framework optimizations related to training efficiency.

  • DeepSeek-R1 discusses training techniques (large-scale RL, SFT, and RFT) for training R1-Zero and R1.

How DeepSeek R1-Zero was trained

The researchers applied RL directly to the base model (DeepSeek-V3) without any cold-start data to produce DeepSeek R1-Zero. However, they note poor formatting/readability and mixing of multiple languages in the reasoning outputs.

In the training process, the researchers reused the Group Relative Policy Optimization (GRPO) algorithm introduced in their DeepSeekMath paper. GRPO has lower computational overhead because it does not require a critic model, and it is stable because it measures each output’s performance relative to the rest of its sampled group. Responses received rewards based on accuracy and correct formatting.
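
To make “relative to the group” concrete, here is a minimal sketch of the group-relative advantage computation. The full GRPO objective also includes a clipped policy ratio and a KL penalty, which are omitted here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize each response's reward against the mean and std of its own
    sampled group, so no learned critic/value model is needed.
    `rewards` has shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Example: one prompt with a group of four sampled responses, scored 1.0 if the
# answer was correct and well formatted, 0.0 otherwise.
print(grpo_advantages(torch.tensor([[1.0, 0.0, 1.0, 0.0]])))
```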

R1-Zero was required to adhere to the format <think></think><answer></answer> when responding.
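
A rule-based format reward in this spirit might look like the following sketch; the exact regex and scoring are my assumptions, not the paper’s implementation.

```python
import re

# Hypothetical format check in the spirit of R1-Zero's format reward:
# the response must wrap its reasoning in <think> tags and its answer in <answer> tags.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def format_reward(response: str) -> float:
    return 1.0 if FORMAT_RE.match(response) else 0.0

format_reward("<think>2 + 2 = 4</think><answer>4</answer>")  # -> 1.0
```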

How DeepSeek R1 was trained

To address the usability issues in R1-Zero, the researchers expanded on it with a four-step process:

  1. Supervised fine-tuning with a small number of R1-Zero outputs, cleaned by human annotators
  2. Large-scale RL, in the same way as for R1-Zero. The language-mixing issue returned, but the researchers mitigated it with a language consistency reward during training (sketched after this list). However, this reward caused a slight drop in performance.
  3. Supervised fine-tuning on data collected via rejection sampling. The prompts cover both reasoning and non-reasoning domains.
    • For reasoning domains (600k examples), only correct responses were retained.
    • For non-reasoning domains (200k examples) such as writing, factual QA, and translation, the researchers reused techniques from DeepSeek-V3: synthetic responses cleaned up by human annotators.
  4. Further RL refinement for “helpfulness and harmlessness”:
    • For reasoning tasks, the researchers used rule-based rewards, as in R1-Zero.
    • For non-reasoning tasks, the researchers used reward models to capture preferences.
    • To optimize helpfulness, only the final summary was evaluated.
    • To optimize harmlessness, both the reasoning tokens and final summary were evaluated.
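
As a rough illustration of the language consistency reward mentioned in step 2, here is a sketch that scores a chain of thought by the fraction of its words in the target language. `detect_language` is a hypothetical word-level language identifier, and the paper’s exact formulation may differ.

```python
# `detect_language` is a hypothetical word-level language identifier
# (e.g. a wrapper around an off-the-shelf language-ID library).
def language_consistency_reward(cot_text: str, target_lang: str, detect_language) -> float:
    words = cot_text.split()
    if not words:
        return 0.0
    matches = sum(1 for word in words if detect_language(word) == target_lang)
    return matches / len(words)
```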

Further Distillation

Using the 800k inputs from step 3 above, the researchers generated outputs from R1. They then performed supervised fine-tuning on Qwen (1.5B, 7B, 14B, 32B) and Llama (8B and 70B).
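
In practice this distillation amounts to standard SFT on teacher-generated data: the student is trained with next-token cross-entropy on (prompt, R1 response) pairs. The sketch below assumes a Hugging Face-style student model whose forward pass returns `.logits`; masking the prompt tokens is a common convention I am assuming here, not a detail from the paper.

```python
import torch.nn.functional as F

# input_ids: (batch, seq_len) holding prompt tokens followed by the R1-generated response.
def sft_loss(student, input_ids, prompt_len):
    logits = student(input_ids[:, :-1]).logits            # predict token t+1 from tokens <= t
    labels = input_ids[:, 1:].clone()
    labels[:, : prompt_len - 1] = -100                     # mask prompt positions (assumption)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), ignore_index=-100
    )
```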