Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Paper)

Video Link: https://www.youtube.com/watch?v=AfAmwIP2ntY





How can one best use extra FLOPs at test time?

Paper: https://arxiv.org/abs/2408.03314

Abstract:
Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
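To make the abstract's baseline concrete, here is a minimal sketch of best-of-N sampling against a verifier, plus a toy difficulty-proportional budget split in the spirit of the "compute-optimal" allocation. The sampler, verifier, and allocation rule are illustrative stand-ins of my own, not the authors' implementation.

```python
import random

def sample_response(prompt, rng):
    """Stand-in for one stochastic LLM completion (hypothetical)."""
    return f"{prompt}-candidate-{rng.randint(0, 9999)}"

def verifier_score(response):
    """Stand-in for a verifier reward model scoring a full response.
    A real system would use a learned (e.g. process-based) reward model."""
    return (hash(response) % 1000) / 1000.0

def best_of_n(prompt, n, seed=0):
    """Best-of-N: draw N samples, return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates = [sample_response(prompt, rng) for _ in range(n)]
    return max(candidates, key=verifier_score)

def allocate_budget(difficulties, total_samples):
    """Toy compute-optimal idea: under a fixed total sample budget, give
    harder prompts more samples (at least 1 each) instead of a uniform N."""
    total_d = sum(difficulties)
    return [max(1, round(total_samples * d / total_d)) for d in difficulties]
```

For example, `allocate_budget([1.0, 2.0, 5.0], 16)` gives the hardest prompt five times the samples of the easiest one, whereas a plain best-of-N baseline would spend the same N on every prompt regardless of difficulty.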

Authors: Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar

Links:
Homepage: https://ykilcher.com/
Merch:
YouTube:
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n




Other Videos By Yannic Kilcher


2025-04-05 On the Biology of a Large Language Model (Part 1)
2025-01-26 [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
2024-12-26 Traditional Holiday Live Stream
2024-12-24 Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)
2024-12-10 Safety Alignment Should be Made More Than Just a Few Tokens Deep (Paper Explained)
2024-11-23 TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (Paper Explained)
2024-10-19 GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
2024-10-12 Were RNNs All We Needed? (Paper Explained)
2024-10-05 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Paper)
2024-08-04 Privacy Backdoors: Stealing Data with Corrupted Pretrained Models (Paper Explained)
2024-07-08 Scalable MatMul-free Language Modeling (Paper Explained)
2024-06-26 Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (Paper Explained)
2024-06-01 xLSTM: Extended Long Short-Term Memory
2024-05-21 [ML News] OpenAI is in hot waters (GPT-4o, Ilya Leaving, Scarlett Johansson legal action)
2024-05-01 ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)
2024-04-30 [ML News] Chips, Robots, and Models
2024-04-28 TransformerFAM: Feedback attention is working memory
2024-04-27 [ML News] Devin exposed | NeurIPS track for high school students
2024-04-24 Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
2024-04-23 [ML News] Llama 3 changes the game
2024-04-17 Hugging Face got hacked