Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe (OpenAI, 2023)

Paper Information
arXiv ID: 2305.20050
Venue: International Conference on Learning Representations
Domain: Artificial intelligence
SOTA Claim: Yes
Reproducibility: 7/10

Abstract

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the two methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.

Summary

The paper compares two methods for supervising reward models for large language models: outcome supervision, which trains outcome-supervised reward models (ORMs), and process supervision, which trains process-supervised reward models (PRMs). It notes that while current language models have much-improved multi-step reasoning, they still regularly make logical mistakes. The authors find that process supervision significantly outperforms outcome supervision on the MATH dataset: their best PRM solves 78.2% of problems from a representative subset of the MATH test set. They also release PRM800K, a dataset of 800,000 step-level human feedback labels used to train their best reward model. Active learning further improves the data efficiency of process supervision by steering human labeling toward the most informative samples. The paper concludes that process supervision not only improves model reliability but also encourages safer, more interpretable reasoning in AI systems.
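As a concrete illustration of how a trained PRM is used at test time, here is a minimal sketch of best-of-N reranking, assuming (as in the paper) that a solution's score is the product of the PRM's per-step correctness probabilities. The candidate data below is invented for the example:

```python
import math

def prm_solution_score(step_probs):
    """Score a full solution as the product of the PRM's
    per-step correctness probabilities."""
    return math.prod(step_probs)

def best_of_n(candidates):
    """Rerank N candidate solutions and return the index of the
    highest-scoring one. Each candidate is a list of per-step
    correctness probabilities from the PRM."""
    scores = [prm_solution_score(c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Toy example: three candidate solutions with hypothetical step scores.
candidates = [
    [0.9, 0.8, 0.95],    # one shaky step
    [0.99, 0.97, 0.98],  # consistently confident
    [0.6, 0.9, 0.9],
]
print(best_of_n(candidates))  # → 1
```

The product rule means a single low-confidence step sharply penalizes an otherwise strong solution, which is the intended behavior: one bad step invalidates the chain.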

Methods

This paper employs the following methods:

  • Outcome Supervision
  • Process Supervision
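The difference between the two methods comes down to how training labels are constructed: outcome supervision yields one automatically checked label per solution, while process supervision yields one human label per reasoning step. A hedged sketch of this contrast (the helper names, and treating "neutral" ratings as non-negative, are assumptions, not the paper's exact pipeline):

```python
def orm_label(final_answer, gold_answer):
    """Outcome supervision: a single binary label per solution,
    obtained by automatically grading the final answer."""
    return int(final_answer == gold_answer)

def prm_step_labels(step_ratings):
    """Process supervision: one label per reasoning step.
    PRM800K ratings are 'positive', 'negative', or 'neutral';
    here non-negative ratings count as correct-so-far."""
    return [int(r != "negative") for r in step_ratings]

print(orm_label("42", "42"))                                 # → 1
print(prm_step_labels(["positive", "neutral", "negative"]))  # → [1, 1, 0]
```

Step-level labels are far more expensive to collect (hence the human-feedback cost discussed in the abstract), but they tell the reward model *where* a solution went wrong rather than only *whether* it did.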

Models Used

  • GPT-4

Datasets

The following datasets were used in this research:

  • MATH
  • PRM800K
  • MathMix

Evaluation Metrics

  • Accuracy

Results

  • Process supervision achieves 78.2% problem-solving accuracy on a representative subset of the MATH test set.
  • Active learning improves data efficiency by 2.6x.
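The active-learning gain above comes from biasing which model samples receive human labels: the paper surfaces "convincing wrong-answer" solutions, i.e. samples that the current PRM scores highly but whose final answer is wrong. A minimal sketch of that selection step (the data layout and function name are assumptions for illustration):

```python
def select_for_labeling(samples, k):
    """Pick the k most 'convincing wrong-answer' samples:
    wrong final answer, but highly rated by the current PRM.
    samples: list of (prm_score, final_answer_is_correct) tuples."""
    wrong = [(score, i) for i, (score, ok) in enumerate(samples) if not ok]
    wrong.sort(reverse=True)  # most convincing (highest-scored) first
    return [i for _, i in wrong[:k]]

# Toy pool of sampled solutions: (PRM score, final answer correct?)
pool = [(0.95, True), (0.90, False), (0.40, False), (0.85, False)]
print(select_for_labeling(pool, 2))  # → [1, 3]
```

Labeling these samples is informative precisely because they expose the reward model's current blind spots.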

Limitations

The authors identified the following limitations:

  • Potential test set contamination due to overlap with pre-training data.
  • Possible issues in manual labeling processes.

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

large language models, multi-step reasoning, reward models, process supervision, outcome supervision, active learning, MATH dataset
