Notes on DeepSeek R1
DeepSeek R1 is a large language model which employs test-time compute to generate a response. Unlike many past decoder-based models that simply continue the given text (and may be fine-tuned for conversation), R1 generates reasoning tokens before producing a final answer. DeepSeek researchers report that its reasoning performance is on par with OpenAI’s O1 model.
Terminology
First, I will briefly describe some terminology related to training techniques:
- Supervised fine-tuning (SFT) is a process that uses input/output pairs to directly fine-tune a model. In a reinforcement learning setting, SFT can help mitigate cold-start issues by providing initial policy behavior prior to RL training. The downside of SFT is that the input/output pairs can be expensive to acquire.
- Preference fine-tuning involves training a model using preferences between different outputs for the same input. These preferences are usually human-generated; relative quality assessments are easier for humans to make than absolute ratings and still provide a valuable training signal.
- Rejection sampling fine-tuning (RFT) is a method for generating synthetic training data. First, a model produces k diverse candidate completions for a prompt using temperature sampling. Next, a reward model ranks or selects the best completions to use as training data.
- In the simplest version, this technique generates k (reasoning, answer) pairs for each prompt using a nonzero temperature (e.g. 0.7) and filters out incorrect answers (paper); a minimal sketch of this step appears below.
- Large-scale reinforcement learning (RL), as DeepSeek refers to it, is a training method that assigns synthetic rewards to a response based on easily verified outcomes of reasoning problems, such as correctness and formatting.
- This is in contrast to a supervised learning approach, which would typically enforce strict token-level matching.
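To make the RFT idea concrete, here is a minimal sketch of the generate-then-filter loop, assuming the simplest correctness-based filtering described above. The sample_completion and extract_answer functions are hypothetical stand-ins for a real model call and answer parser, not part of any DeepSeek API.

```python
import random

def sample_completion(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for a model call that returns reasoning followed by an answer."""
    return f"reasoning about {prompt} ... answer={random.randint(0, 9)}"

def extract_answer(completion: str) -> str:
    """Placeholder parser that pulls the final answer out of a completion."""
    return completion.rsplit("answer=", 1)[-1].strip()

def rft_dataset(prompts_with_refs, k: int = 8):
    """Sample k completions per prompt and keep only those with a correct answer."""
    kept = []
    for prompt, reference in prompts_with_refs:
        for _ in range(k):
            completion = sample_completion(prompt, temperature=0.7)
            if extract_answer(completion) == reference:  # simple exact-match filter
                kept.append({"prompt": prompt, "completion": completion})
    return kept

print(rft_dataset([("What is 2 + 3?", "5")], k=8))
```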
Here are a few terms related to model architecture:
- Mixture of Experts (MoE) is an architecture that uses a “router” to select from a pool of candidate “expert” networks.
- A typical decoder transformer, such as GPT-2, consists of a tall stack of decoder blocks in which 100% of the weights are activated for each token. In this paradigm, making the model bigger means either making the blocks bigger or stacking more of them, both of which increase inference cost.
- However, replacing standard decoder blocks with MoE blocks allows models to incorporate more parameters without significantly increasing inference costs, since only a fraction of the experts are active at any one time (a minimal routing sketch appears after this list).
- Multi-head Latent Attention (MLA) is an alternative to traditional multi-head attention that improves inference efficiency by using a low-rank approximation for the KV cache (a small sketch of the compression idea also appears below).
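To illustrate the MoE bullet above, here is a minimal sketch of top-k expert routing. It is illustrative only: the dimensions and module names are my own placeholders, and DeepSeek-V2/V3 add refinements such as shared experts and load-balancing objectives on top of this basic idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """A toy MoE layer: route each token to its top-k experts."""
    def __init__(self, d_model=64, n_experts=8, k=2, d_hidden=128):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)  # (tokens, n_experts)
        topw, topi = weights.topk(self.k, dim=-1)    # keep only the k best experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topw[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(5, 64)
print(TopKMoE()(x).shape)  # torch.Size([5, 64]); only 2 of 8 experts run per token
```

The point is that parameter count grows with the number of experts while per-token compute stays roughly constant, since only k experts execute for any given token.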
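The MLA bullet can be sketched the same way. The snippet below shows only the low-rank KV-cache intuition: cache a small latent per token and up-project it into keys and values when attention needs them. Dimensions are arbitrary, and the real MLA design also handles rotary position embeddings separately; none of this is DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

d_model, d_latent, d_head = 64, 16, 64
down = nn.Linear(d_model, d_latent, bias=False)  # compress the hidden state into a latent
up_k = nn.Linear(d_latent, d_head, bias=False)   # reconstruct keys from the latent
up_v = nn.Linear(d_latent, d_head, bias=False)   # reconstruct values from the latent

hidden = torch.randn(10, d_model)                # hidden states for 10 cached tokens
kv_cache = down(hidden)                          # cache (10, 16) latents instead of (10, 128) for full K and V
k, v = up_k(kv_cache), up_v(kv_cache)            # recovered on demand at attention time
print(kv_cache.shape, k.shape, v.shape)
```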
Paper links
The release of R1 follows other DeepSeek papers that explore many of the primitives discussed above in extensive depth. They’re worth a read:
- DeepSeek-V2 introduces Multi-head Latent Attention (MLA) and several Mixture of Experts (MoE) optimizations to improve load balancing at the device level and expert level.
- DeepSeekMath discusses Rejection Sampling Fine-Tuning (RFT) and compares Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO).
- DeepSeek-V3 describes a multi-token prediction training objective, improvements to MoE load balancing, SFT and RL for fine-tuning, and framework optimizations related to training efficiency.
- DeepSeek-R1 discusses the training techniques (large-scale RL, SFT, and RFT) used to train R1-Zero and R1.
How DeepSeek R1-Zero was trained
The researchers applied RL directly to the base model (DeepSeek-V3-Base) without any cold-start data to produce DeepSeek R1-Zero. However, they noted issues with poor formatting and readability, as well as mixing of multiple languages in the reasoning outputs.
In the training process, the researchers reused the Group Relative Policy Optimization (GRPO) algorithm introduced in their DeepSeekMath paper. GRPO requires less computational overhead than PPO since it does not need a critic model, and it is stable because it measures each output's quality relative to the rest of its sampled group. Responses received rewards based on accuracy and correct formatting.
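As a rough illustration of the group-relative idea, the sketch below scores each sampled response against the mean and standard deviation of its own group, which is why no learned critic is needed. The full GRPO objective also includes a clipped policy ratio and a KL penalty, which are omitted here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, samples_per_group) scalar rewards for each sampled response."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # responses better than their group get positive advantage

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # group 1: two of four responses were correct
                        [0.0, 0.0, 0.0, 1.0]])  # group 2: one of four responses was correct
print(group_relative_advantages(rewards))
```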
R1-Zero was required to adhere to the format <think></think><answer></answer> when generating responses.
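A minimal sketch of what such rule-based rewards could look like is below: one check for the required <think></think><answer></answer> format and one for answer accuracy. The reward values, parsing rules, and function names are my own illustrative assumptions, not the paper's implementation.

```python
import re

FORMAT_RE = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>\s*$", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning and answer in the required tags."""
    return 1.0 if FORMAT_RE.match(response.strip()) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the text inside <answer> matches the reference answer exactly."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

resp = "<think>2 + 3 = 5</think><answer>5</answer>"
print(format_reward(resp), accuracy_reward(resp, "5"))  # 1.0 1.0
```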
How DeepSeek R1 was trained
To address the usability issues in R1-Zero, the researchers expanded on it with a four-step process:
- Supervised fine-tuning using a small set of R1-Zero outputs, cleaned by human annotators.
- Large-scale RL, in the same way as for R1-Zero. The language-mixing issue returned, but the researchers mitigated it with a language consistency reward during training (sketched after this list). This reward did result in a slight drop in performance, however.
- Supervised fine-tuning data collected via rejection sampling. These prompts cover both reasoning and non-reasoning domains.
- For reasoning domains (600k examples), only correct responses were retained.
- For non-reasoning domains (200k examples), such as writing, factual QA, and translation, the researchers reused techniques from DeepSeek-V3: synthetic responses cleaned up by human annotators.
- Further RL refinement for “helpfulness and harmlessness”:
- For reasoning tasks, the researchers applied rule-based rewards, as in R1-Zero.
- For non-reasoning tasks, the researchers used reward models to capture preferences.
- To optimize helpfulness, only the final summary was evaluated, whereas to optimize harmlessness, both the reasoning tokens and the final summary were assessed.
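The paper describes the language consistency reward from step 2 as the proportion of target-language words in the chain of thought. Below is a rough sketch under that description; the word-level "is this English?" test is a crude placeholder, not how the researchers actually detect language.

```python
def language_consistency_reward(reasoning_text: str) -> float:
    """Fraction of chain-of-thought words that look like the target language (English here)."""
    words = reasoning_text.split()
    if not words:
        return 0.0
    target = [w for w in words if all(ord(ch) < 128 for ch in w)]  # crude ASCII-only "English" check
    return len(target) / len(words)

print(language_consistency_reward("First compute 2 + 3 然后 we get 5"))  # mixed-language reasoning scores < 1.0
```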
Further Distillation
Using the 800k inputs from step 3 above, the researchers generated outputs from R1. They then applied supervised fine-tuning to Qwen (1.5B, 7B, 14B, 32B) and Llama (8B and 70B).
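Mechanically, this distillation is just supervised fine-tuning on (prompt, R1 output) pairs. The sketch below shows the key detail of masking prompt tokens out of the next-token loss; the toy model and random token IDs are placeholders, not Qwen or Llama, and a real setup would use a proper tokenizer and pretrained weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab))  # toy stand-in for a causal LM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def sft_step(prompt_ids: torch.Tensor, response_ids: torch.Tensor) -> float:
    """One SFT step: predict the next token, but only score predictions of response tokens."""
    ids = torch.cat([prompt_ids, response_ids])       # teacher (R1) output appended to the prompt
    logits = model(ids[:-1].unsqueeze(0)).squeeze(0)  # next-token logits at every position
    targets = ids[1:].clone()
    targets[: len(prompt_ids) - 1] = -100             # ignore positions whose target is a prompt token
    loss = F.cross_entropy(logits, targets, ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(sft_step(torch.randint(0, vocab, (8,)), torch.randint(0, vocab, (12,))))
```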
Discussion
- In their experiments, the researchers performed large-scale RL training on Qwen-32B-Base (an RL-only, R1-Zero-style setup) and compared the result with a 32B model distilled from DeepSeek R1. The distilled model performed significantly better.
- This demonstrates that distillation from a more powerful model outperforms large-scale RL training from scratch, and it suggests that RL alone on a smaller base model struggles to match distillation from a more capable teacher.
- The researchers also experimented with a Process Reward Model (PRM). However, identifying fine-grained reasoning steps proved challenging in practice. Additionally, the PRM itself was prone to reward hacking and complicated the overall training pipeline.
- The researchers do note that PRM may be useful for reranking top-N responses from the model or assisting in guided search.
- Intuitively, Monte Carlo Tree Search (MCTS) seemed like a promising route for improving reasoning. MCTS uses a learned value model to guide the search of a tree of possibilities, such as a game tree in chess or Go.
- Unlike turn-based games, which have a relatively well-defined set of possible moves, token-by-token reasoning presents an exponentially larger search space.
- Additionally, because MCTS relies on an accurate value model for tree exploration, a poor value model will degrade training performance. However, MCTS may still be useful at inference time.