[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Channel: Yannic Kilcher

Subscribers: 297,000

Published on January 26, 2025 2:03:48 PM ● Video Link: https://www.youtube.com/watch?v=bAWV_yrqx4w

155,685 views

4,304

#deepseek #llm #grpo

GRPO is one of the core advancements used in DeepSeek-R1, but it was already introduced last year in this paper, which combines new RL techniques with iterative data collection to achieve remarkable performance on mathematics benchmarks with just a 7B model.

Paper: https://arxiv.org/abs/2402.03300

Abstract:
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
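
To make the "group relative" idea from the abstract concrete, here is a minimal sketch of how a GRPO-style advantage can be computed under outcome supervision: several answers are sampled for the same question, and each answer's reward is normalized against the mean and standard deviation of its own group, so no separate value model is needed as a baseline. The function name, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per sampled answer to the SAME question.
    `eps` is an assumed stabilizer so a group with identical rewards
    does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers (1.0 = correct, 0.0 = wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers end up with a positive advantage and wrong ones with a negative advantage, relative to how the group as a whole performed. If every answer in a group receives the same reward, all advantages are zero and that group contributes no learning signal.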

Authors: Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

Links:
Homepage: https://ykilcher.com/
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Other Videos By Yannic Kilcher

2025-07-23	Context Rot: How Increasing Input Tokens Impacts LLM Performance (Paper Analysis)
2025-07-19	Energy-Based Transformers are Scalable Learners and Thinkers (Paper Review)
2025-05-03	On the Biology of a Large Language Model (Part 2)
2025-04-05	On the Biology of a Large Language Model (Part 1)
2025-01-26	[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
2024-12-26	Traditional Holiday Live Stream
2024-12-24	Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)
2024-12-10	Safety Alignment Should be Made More Than Just a Few Tokens Deep (Paper Explained)
2024-11-23	TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (Paper Explained)
2024-10-19	GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
2024-10-12	Were RNNs All We Needed? (Paper Explained)
2024-10-05	Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Paper)
2024-08-04	Privacy Backdoors: Stealing Data with Corrupted Pretrained Models (Paper Explained)
2024-07-08	Scalable MatMul-free Language Modeling (Paper Explained)
2024-06-26	Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (Paper Explained)
2024-06-01	xLSTM: Extended Long Short-Term Memory
2024-05-21	[ML News] OpenAI is in hot waters (GPT-4o, Ilya Leaving, Scarlett Johansson legal action)
2024-05-01	ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)
2024-04-30	[ML News] Chips, Robots, and Models
2024-04-28	TransformerFAM: Feedback attention is working memory
2024-04-27	[ML News] Devin exposed | NeurIPS track for high school students
