
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell (UC Berkeley; [email protected]; work done during an internship at Google DeepMind), Jaehoon Lee (Google DeepMind), Kelvin Xu (Google DeepMind, equal advising), Aviral Kumar (Google DeepMind, equal advising) (2024)

Paper Information

  • arXiv ID: 2408.03314
  • Venue: arXiv.org
  • Domain: Natural Language Processing

Abstract

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and how one should trade off inference-time and pretraining compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms for scaling test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute varies critically with the difficulty of the prompt. This observation motivates a "compute-optimal" scaling strategy, which allocates test-time compute adaptively per prompt as effectively as possible. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4× compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14× larger model.
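
The first mechanism, search against a process-based verifier, can be pictured as beam search over partial solutions scored step by step. The sketch below is a minimal illustration under stated assumptions, not the paper's released code (none is public): `sample_next_step`, `prm_score`, and `is_complete` are hypothetical interfaces to the base LLM and the process reward model.

```python
from typing import Callable, List

def prm_beam_search(prompt: str,
                    sample_next_step: Callable[[str, List[str]], str],
                    prm_score: Callable[[str, List[str]], float],
                    is_complete: Callable[[List[str]], bool],
                    beam_width: int = 4,
                    expansions_per_beam: int = 4,
                    max_steps: int = 10) -> List[str]:
    """Beam search over step-by-step solutions, ranked by a process verifier.

    A beam is a list of reasoning steps. Each round, every unfinished beam
    is extended with several sampled next steps, and the `beam_width`
    highest-scoring prefixes under the process reward model are kept.
    """
    beams: List[List[str]] = [[]]  # start from the empty solution prefix
    for _ in range(max_steps):
        candidates: List[List[str]] = []
        for beam in beams:
            if is_complete(beam):
                candidates.append(beam)  # carry finished solutions forward
                continue
            for _ in range(expansions_per_beam):
                step = sample_next_step(prompt, beam)
                candidates.append(beam + [step])
        # Keep the highest-scoring prefixes under the process reward model.
        beams = sorted(candidates,
                       key=lambda b: prm_score(prompt, b),
                       reverse=True)[:beam_width]
        if all(is_complete(b) for b in beams):
            break
    # Return the single best full solution found.
    return max(beams, key=lambda b: prm_score(prompt, b))
```

Lookahead search, also studied in the paper, extends this idea by rolling each candidate step forward several steps before scoring it with the verifier.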

Summary

This paper investigates optimally scaling test-time computation for large language models (LLMs) as an alternative to simply increasing model parameters. The authors analyze methods for improving performance on challenging prompts under a fixed inference-time compute budget, focusing on two main strategies: refining the proposal distribution through iterative revisions and searching against process-based verifiers. The study shows that an adaptive 'compute-optimal' strategy significantly improves test-time compute efficiency, scaling more than 4× more efficiently than a best-of-N baseline. The authors also examine the implications of their findings for the trade-off between pretraining and inference compute, suggesting that in certain scenarios leveraging test-time compute can outperform increasing model size.
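
The adaptive 'compute-optimal' strategy reduces to a routing rule: estimate a prompt's difficulty, then spend the fixed budget on whichever method performed best for that difficulty bin in an offline sweep. Here is a minimal sketch of that idea; `estimate_difficulty_bin`, the strategy table, and the bin names are assumptions made for illustration, not the paper's exact procedure.

```python
from typing import Callable, Dict

# Illustrative bin-to-strategy table: assumed to come from an offline sweep
# on a validation set, not taken from the paper verbatim.
BEST_STRATEGY_PER_BIN: Dict[str, str] = {
    "easy": "sequential_revisions",
    "medium": "revisions_plus_search",
    "hard": "parallel_search",
}

def compute_optimal_answer(prompt: str,
                           budget: int,
                           estimate_difficulty_bin: Callable[[str], str],
                           strategies: Dict[str, Callable[[str, int], str]]) -> str:
    """Route a fixed per-prompt generation budget to the best-known strategy.

    `estimate_difficulty_bin` maps a prompt to a difficulty bin (e.g. from a
    model-predicted success rate); `strategies` maps strategy names to
    callables taking (prompt, budget) and returning a final answer.
    """
    bin_name = estimate_difficulty_bin(prompt)
    strategy = strategies[BEST_STRATEGY_PER_BIN[bin_name]]
    return strategy(prompt, budget)
```

In the paper, difficulty is estimated from the base model's success rate on a question (or a model-predicted proxy when that oracle is unavailable), with easier questions tending to favor sequential revisions and harder ones benefiting more from parallel sampling and search.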

Methods

This paper employs the following methods:

  • Best-of-N sampling (see the sketch after this list)
  • Beam search
  • Lookahead search
  • Sequential revisions
  • Adaptive scaling
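
To make the best-of-N and sequential-revision methods concrete, the sketch below contrasts parallel and sequential scaling under a shared verifier. It is an illustrative sketch, not the authors' implementation; `generate`, `revise`, and `score` are assumed interfaces.

```python
from typing import Callable, List

# Assumed interfaces (the paper's code is not public): `generate` samples one
# full answer; `revise` samples an answer conditioned on earlier attempts;
# `score` is a verifier's scalar reward for a (prompt, answer) pair.

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Parallel scaling: draw n independent samples, return the best-scoring."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

def sequential_revisions(prompt: str,
                         generate: Callable[[str], str],
                         revise: Callable[[str, List[str]], str],
                         score: Callable[[str, str], float],
                         n: int = 16) -> str:
    """Sequential scaling: each new sample conditions on previous attempts."""
    attempts: List[str] = [generate(prompt)]
    for _ in range(n - 1):
        attempts.append(revise(prompt, attempts))
    return max(attempts, key=lambda a: score(prompt, a))
```

The paper's central observation is that these two axes of test-time scaling trade off differently depending on question difficulty, which is precisely what the compute-optimal routing sketched earlier exploits.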

Models Used

  • PaLM 2-S*

Datasets

The following datasets were used in this research:

  • MATH

Evaluation Metrics

  • Accuracy

Results

  • Improved the efficiency of test-time compute scaling by more than 4× over a best-of-N baseline
  • Showed in a FLOPs-matched evaluation that, on easy-to-intermediate questions, test-time compute can substitute for model parameters, outperforming a 14× larger model
  • Demonstrated the effectiveness of allocating test-time compute optimally based on question difficulty

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

Large Language Models, Test-Time Computation, Scaling Laws, Prompt Adaptation, Verifier Models, Revisions

Papers Using Similar Methods

External Resources