DeepSeek-R1: Advancing Reasoning Capabilities Through Pure Reinforcement Learning
DeepSeek recently released their DeepSeek-R1 model, achieving reasoning capabilities on par with OpenAI’s o1 models through pure reinforcement learning. Let’s explore how they did it and what Hugging Face is doing with Open-R1.
What is DeepSeek-R1?
If you’ve ever struggled with a tough math problem, you know how useful it is to think longer and work through it carefully. OpenAI’s o1 model showed that when LLMs are trained to do the same—by using more compute during inference—they get significantly better at solving reasoning tasks like mathematics, coding, and logic.
However, the recipe behind OpenAI’s reasoning models has been a well-kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).
Besides performing as well as or better than o1, the DeepSeek-R1 release was accompanied by a detailed tech report outlining their training recipe. This recipe involved several innovations, most notably the application of pure reinforcement learning to teach a base language model how to reason without any supervised fine-tuning.
The Training Process
DeepSeek-R1 is built on the foundation of DeepSeek-V3, a 671B-parameter Mixture of Experts (MoE) model that performs on par with models like Sonnet 3.5 and GPT-4o. What’s especially impressive is how cost-efficient it was to train (just $5.5M), thanks to architectural optimizations.
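To make the MoE point concrete, here is a minimal sketch of a Mixture of Experts layer with top-k routing. The layer sizes, expert count, and top_k value are illustrative and not DeepSeek-V3’s actual configuration; the key idea is that each token only activates a few experts, so compute per token is a fraction of what a dense model with the same parameter count would need.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy MoE feed-forward layer: a router picks top_k experts per token."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, expert_idx = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out  # only top_k of num_experts experts run for each token
```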
The training process involved two key models:
DeepSeek-R1-Zero: This model skipped supervised fine-tuning entirely and relied on pure reinforcement learning using Group Relative Policy Optimization (GRPO). A simple rule-based reward guided the model based on answer accuracy and output structure. While it developed strong reasoning skills, its outputs often suffered from poor readability and language mixing.
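The GRPO idea can be seen in a few lines: sample a group of completions for each prompt, score them with a simple rule-based reward, and normalize each reward against the rest of the group instead of training a separate value model. The sketch below is a hedged illustration; the exact reward terms, tag format, and weights are assumptions, not DeepSeek’s published code.

```python
import re
import statistics

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Illustrative reward: +1.0 if the final answer matches, +0.1 for using
    the expected <think>/<answer> structure (assumed tag format)."""
    r = 0.0
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if answer and answer.group(1).strip() == reference_answer.strip():
        r += 1.0  # accuracy reward
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        r += 0.1  # format/structure reward
    return r

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: each completion is scored relative to the other
    completions sampled for the same prompt, so no value network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: a group of 3 completions sampled for one math prompt
group = [
    "<think>6 * 7 = 42</think><answer>42</answer>",
    "<answer>42</answer>",
    "<think>6 * 7 = 41</think><answer>41</answer>",
]
rewards = [rule_based_reward(c, "42") for c in group]
print(rewards)                               # [1.1, 1.0, 0.1]
print(group_relative_advantages(rewards))    # best completion gets the largest advantage
```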
DeepSeek-R1: This model started with a “cold start” phase, fine-tuning on a small set of carefully crafted examples to improve clarity. It then went through multiple rounds of RL and refinement, including rejection sampling that filtered out low-quality outputs using both human-preference and verifiable rewards.
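The rejection step can be pictured as a simple filter over sampled generations: keep a completion for the next fine-tuning round only if it passes a verifiable check (e.g. the final answer is correct) and a quality score standing in for a preference or reward model. The helper names, tag format, and threshold below are illustrative assumptions rather than DeepSeek’s actual pipeline.

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer from an <answer>...</answer> span (assumed format)."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else ""

def filter_for_next_round(samples, quality_score, min_score=0.5):
    """samples: iterable of (prompt, completion, reference_answer) tuples;
    quality_score: any callable standing in for a preference/reward model."""
    kept = []
    for prompt, completion, reference in samples:
        if extract_answer(completion) != reference:
            continue  # verifiable reward: reject wrong answers outright
        if quality_score(prompt, completion) < min_score:
            continue  # preference-style filter: reject unclear or messy outputs
        kept.append({"prompt": prompt, "completion": completion})
    return kept

# Usage with a trivial stand-in scorer
samples = [("What is 6 * 7?", "<think>6 * 7 = 42</think><answer>42</answer>", "42"),
           ("What is 6 * 7?", "<answer>41</answer>", "42")]
print(filter_for_next_round(samples, quality_score=lambda p, c: 1.0))  # keeps only the first
```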
The Open-R1 Project
While DeepSeek released their model weights, the datasets and training code remain closed. This prompted Hugging Face to launch the Open-R1 project, which aims to:
- Replicate R1-Distill models by distilling reasoning datasets from DeepSeek-R1 (see the sketch after this list)
- Recreate the pure RL pipeline used for R1-Zero
- Demonstrate the complete training pipeline from base model → SFT → RL
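For the first item, distillation mostly means sampling long reasoning traces from DeepSeek-R1 and saving them as supervised fine-tuning data for smaller models. Below is a minimal sketch of that data-collection step; it assumes an OpenAI-compatible endpoint, and the base_url, model name, and prompts are placeholders rather than the Open-R1 project’s actual code.

```python
import json
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model identifier; substitute your own.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

prompts = ["Prove that the sum of two even integers is even."]  # replace with a real prompt set

records = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="deepseek-reasoner",  # assumed model name for DeepSeek-R1
        messages=[{"role": "user", "content": prompt}],
    )
    records.append({
        "prompt": prompt,
        "completion": response.choices[0].message.content,  # includes the reasoning trace
    })

# Write the distilled traces as a JSONL file ready for supervised fine-tuning
with open("distilled_sft_data.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```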
The project will focus on:
- Creating synthetic datasets for fine-tuning LLMs into reasoning models
- Developing training recipes for building similar models from scratch
- Exploring applications beyond math into areas like code and medicine
Key Innovations and Results
Some notable achievements of DeepSeek-R1 include:
- 79.8% Pass@1 on AIME 2024, surpassing OpenAI-o1-1217
- 97.3% score on MATH-500
- 2,029 Elo rating on Codeforces (outperforming 96.3% of human participants)
- Strong performance on knowledge benchmarks like MMLU (90.8%) and MMLU-Pro (84.0%)
Looking Forward
The release of DeepSeek-R1 represents a significant step forward in open-source AI development. By demonstrating that pure reinforcement learning can create powerful reasoning models, it opens new possibilities for advancing AI capabilities without relying on extensive human supervision.
The Open-R1 project aims to make these advances even more accessible to the research community, potentially accelerating progress in areas like mathematical reasoning, coding, and scientific problem-solving.