Small language models (SLMs) struggle to learn complex reasoning behaviors, especially when high-quality traces are scarce or difficult to learn from. The standard training approach combines a supervised fine-tuning (SFT) stage, often used to distill the capabilities of a larger model, with a subsequent reinforcement learning (RL) stage such as Group Relative Policy Optimization (GRPO). In this paper, we investigate the fundamental limitations of this SFT + RL paradigm and propose methods to overcome them. Under a suitable theoretical model, we demonstrate that the SFT + RL strategy can fail completely when (1) the expert's traces are too difficult for the small model to express, or (2) the small model's initialization has exponentially small likelihood of success. To address these failure modes, we introduce BREAD: a GRPO variant that unifies the SFT and RL stages via partial expert guidance and branched rollouts. When self-generated traces fail, BREAD adaptively inserts short expert prefixes/hints, allowing the small model to complete the rest of the reasoning path and ensuring that each update includes at least one successful trace. This mechanism both densifies the reward signal and induces a natural learning curriculum. BREAD requires fewer than 40% of ground-truth traces, consistently outperforming standard GRPO while speeding up training by about 3×. Importantly, we demonstrate that BREAD helps the model solve problems that are otherwise unsolvable by the SFT + RL strategy, highlighting how branched rollouts and expert guidance can substantially boost SLM reasoning.
This paper introduces BREAD (Branched Rollouts from Expert Anchors), a novel algorithm that integrates supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance the reasoning capabilities of small language models (SLMs). The authors discuss the limitations of traditional SFT + RL methods, particularly when expert traces are too complex for SLMs to learn from. BREAD improves upon these methods by using partial expert guidance and branched rollouts, allowing SLMs to learn effectively from fewer than 40% of ground-truth traces while achieving competitive accuracy and significant reductions in training compute time. The approach is validated through theoretical insights and empirical results demonstrating BREAD's superiority in solving challenging reasoning problems compared to standard GRPO and SFT+RL methods. The findings also suggest that BREAD can facilitate curriculum learning and significantly enhance sample efficiency during model training.
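To make the branching mechanism concrete, below is a minimal Python sketch of BREAD-style branched rollouts combined with GRPO-style group advantages. This is an illustration under stated assumptions, not the paper's implementation: `policy_generate`, `is_correct`, the prefix-fraction schedule, and the group size are hypothetical placeholders standing in for the SLM policy, the answer verifier, and the paper's actual adaptive guidance schedule.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins: in practice these wrap the SLM policy and an answer verifier.
def policy_generate(prompt: str, prefix: str) -> str:
    """Placeholder rollout: return a completion for `prompt` continuing from `prefix`."""
    return prefix + " ...model completion..."

def is_correct(prompt: str, trace: str) -> bool:
    """Placeholder verifier: check the final answer in `trace` (random here)."""
    return random.random() < 0.3

@dataclass
class Rollout:
    trace: str
    reward: float

def bread_branched_rollouts(prompt: str, expert_trace: str,
                            group_size: int = 8,
                            prefix_fractions=(0.0, 0.25, 0.5, 0.75)) -> list[Rollout]:
    """Sketch of BREAD-style branched rollouts.

    Start with no expert guidance; if every self-generated trace in the group fails,
    branch from a progressively longer prefix of the expert trace until at least one
    rollout succeeds, so the GRPO-style update always sees a non-zero reward.
    """
    expert_tokens = expert_trace.split()
    for frac in prefix_fractions:
        prefix = " ".join(expert_tokens[: int(frac * len(expert_tokens))])
        group = [
            Rollout(trace, 1.0 if is_correct(prompt, trace) else 0.0)
            for trace in (policy_generate(prompt, prefix) for _ in range(group_size))
        ]
        if any(r.reward > 0 for r in group):
            return group  # at least one successful trace: reward signal is dense enough
    # Fall back to the full expert trace as the single successful anchor.
    return [Rollout(expert_trace, 1.0)] + group[: group_size - 1]

def grpo_advantages(group: list[Rollout]) -> list[float]:
    """Group-relative advantages: reward minus the group mean, std-normalised."""
    rewards = [r.reward for r in group]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(x - mean) / std for x in rewards]

if __name__ == "__main__":
    group = bread_branched_rollouts("Solve: 12 * 13 = ?",
                                    "Compute 12 * 13 step by step ... the answer is 156")
    print(grpo_advantages(group))
```

The design point the sketch illustrates is that whenever a group would otherwise carry all-zero rewards, a longer expert prefix is tried, so every policy update contains at least one successful trace and the implied prefix schedule forms a natural curriculum.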
The following models were used in this research:
- Qwen-2.5-3B-Instruct
- Qwen-2.5-1.5B-Instruct
- DeepSeek-R1
The following datasets were used in this research:
- MATH
- NuminaMath-CoT
- GPQA
- S1K
- S1K-1.1
The paper reports the following findings:
- BREAD consistently outperforms standard GRPO
- BREAD achieves higher accuracy with fewer trace tokens
- BREAD reduces training compute by about 75%
The authors identified the following limitations:
- Assumes availability of strong expert models for effective training
- May fail to provide reward signals from partial traces
- Expert traces must be digestible for SLMs
- Number of GPUs: 8
- GPU Type: L40S 40GB
- Compute Requirements: All experiments were run on 8 L40S 40GB GPUs, except for training that starts from Qwen2.5-3B-Instruct as the base model, which requires 8 H100 80GB GPUs.