GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Channel:

Yannic Kilcher

Subscribers:

292,000

Published on October 19, 2024 3:59:43 PM ● Video Link: https://www.youtube.com/watch?v=Bs6eyNQjGpo

20,418 views

649

This paper (by Apple) questions the mathematical reasoning abilities of current LLMs and designs a synthetic, template-based dataset distribution to investigate various aspects of LLM performance on grade-school-level math questions.

Paper: https://arxiv.org/abs/2410.05229

Abstract:
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.

Authors: Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar
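
To make the template idea from the abstract concrete, here is a minimal Python sketch of how one symbolic template could yield many instantiations of the same question, optionally with a clause that looks relevant but does not affect the answer. The template text, the name list, the value ranges, and the `instantiate` helper are all assumptions made up for illustration; they are not taken from the paper or its dataset.

```python
import random

# Hypothetical template, loosely in the style of a grade-school word problem.
# The placeholders are the "symbols" that vary between instantiations.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} gives away {z} apples. How many apples does {name} have left?"
)

# Hypothetical pool of names to sample from.
NAMES = ["Sophie", "Liam", "Ava", "Noah"]

# Hypothetical distractor: it sounds relevant but does not change the answer.
NOOP_CLAUSE = "Five of the apples are a bit smaller than average. "


def instantiate(rng, add_noop=False):
    """Sample one question/answer pair from the template."""
    name = rng.choice(NAMES)
    x = rng.randint(5, 50)
    y = rng.randint(5, 50)
    # Constraint on the sampled values: nobody gives away more than they have.
    z = rng.randint(1, x + y)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    if add_noop:
        # Insert the distractor just before the final question sentence.
        head, sep, tail = question.rpartition("How many")
        question = head + NOOP_CLAUSE + sep + tail
    # The ground-truth answer is computed from the same sampled values.
    answer = x + y - z
    return question, answer


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        q, a = instantiate(rng, add_noop=True)
        print(q, "->", a)
```

Because the answer is derived from the sampled values, every instantiation comes with a known ground truth, so a model's accuracy can be compared across many variants of what is logically the same problem.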

Links:
Homepage: https://ykilcher.com/
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Other Videos By Yannic Kilcher

2 days ago	Context Rot: How Increasing Input Tokens Impacts LLM Performance (Paper Analysis)
6 days ago	Energy-Based Transformers are Scalable Learners and Thinkers (Paper Review)
2025-05-03	On the Biology of a Large Language Model (Part 2)
2025-04-05	On the Biology of a Large Language Model (Part 1)
2025-01-26	[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
2024-12-26	Traditional Holiday Live Stream
2024-12-24	Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)
2024-12-10	Safety Alignment Should be Made More Than Just a Few Tokens Deep (Paper Explained)
2024-11-23	TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (Paper Explained)
2024-10-19	GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
2024-10-12	Were RNNs All We Needed? (Paper Explained)
2024-10-05	Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Paper)
2024-08-04	Privacy Backdoors: Stealing Data with Corrupted Pretrained Models (Paper Explained)
2024-07-08	Scalable MatMul-free Language Modeling (Paper Explained)
2024-06-26	Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (Paper Explained)
2024-06-01	xLSTM: Extended Long Short-Term Memory
2024-05-21	[ML News] OpenAI is in hot waters (GPT-4o, Ilya Leaving, Scarlett Johansson legal action)
2024-05-01	ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)
2024-04-30	[ML News] Chips, Robots, and Models
2024-04-28	TransformerFAM: Feedback attention is working memory
2024-04-27	[ML News] Devin exposed | NeurIPS track for high school students
