Small language models (SLMs) struggle to learn complex reasoning behaviors, especially when high-quality traces are scarce or difficult to learn from. The standard training approach combines a supervised fine-tuning (SFT) stage, often used to distill the capabilities of a larger model, with a subsequent reinforcement learning (RL) stage such as Group Relative Policy Optimization (GRPO). In this paper, we investigate the fundamental limitations of this SFT + RL paradigm and propose methods to overcome them. Under a suitable theoretical model, we demonstrate that the SFT + RL strategy can fail completely when (1) the expert's traces are too difficult for the small model to express, or (2) the small model's initialization has exponentially small likelihood of success. To address these failure modes, we introduce BREAD: a GRPO variant that unifies the SFT and RL stages via partial expert guidance and branched rollouts. When self-generated traces fail, BREAD adaptively inserts short expert prefixes/hints, allowing the small model to complete the rest of the reasoning path and ensuring that each update includes at least one successful trace. This mechanism both densifies the reward signal and induces a natural learning curriculum. BREAD requires fewer than 40% of ground-truth traces, consistently outperforming standard GRPO while speeding up training by about 3×. Importantly, we demonstrate that BREAD helps the model solve problems that are otherwise unsolvable by the SFT + RL strategy, highlighting how branched rollouts and expert guidance can substantially boost SLM reasoning.
This paper introduces BREAD (Branched Rollouts from Expert Anchors), a novel algorithm that integrates supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance the reasoning capabilities of small language models (SLMs). The authors discuss the limitations of traditional SFT + RL methods, particularly when expert traces are too complex for SLMs to learn from. BREAD improves upon these methods by using partial expert guidance and branched rollouts, allowing SLMs to learn effectively from fewer than 40% of ground-truth traces while achieving competitive accuracy and significant reductions in training compute time. The approach is validated through theoretical insights and empirical results demonstrating BREAD's superiority in solving challenging reasoning problems compared to standard GRPO and SFT+RL methods. The findings also suggest that BREAD can facilitate curriculum learning and significantly enhance sample efficiency during model training.
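To make the branching mechanism concrete, below is a minimal Python sketch of BREAD-style branched rollouts combined with GRPO-style group advantages. This is an illustration under stated assumptions, not the paper's implementation: `policy_generate`, `is_correct`, the prefix-fraction schedule, and the group size are hypothetical placeholders standing in for the SLM policy, the answer verifier, and the paper's actual adaptive guidance schedule.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins: in practice these wrap the SLM policy and an answer verifier.
def policy_generate(prompt: str, prefix: str) -> str:
    """Placeholder rollout: return a completion for `prompt` continuing from `prefix`."""
    return prefix + " ...model completion..."

def is_correct(prompt: str, trace: str) -> bool:
    """Placeholder verifier: check the final answer in `trace` (random here)."""
    return random.random() < 0.3

@dataclass
class Rollout:
    trace: str
    reward: float

def bread_branched_rollouts(prompt: str, expert_trace: str,
                            group_size: int = 8,
                            prefix_fractions=(0.0, 0.25, 0.5, 0.75)) -> list[Rollout]:
    """Sketch of BREAD-style branched rollouts.

    Start with no expert guidance; if every self-generated trace in the group fails,
    branch from a progressively longer prefix of the expert trace until at least one
    rollout succeeds, so the GRPO-style update always sees a non-zero reward.
    """
    expert_tokens = expert_trace.split()
    for frac in prefix_fractions:
        prefix = " ".join(expert_tokens[: int(frac * len(expert_tokens))])
        group = [
            Rollout(trace, 1.0 if is_correct(prompt, trace) else 0.0)
            for trace in (policy_generate(prompt, prefix) for _ in range(group_size))
        ]
        if any(r.reward > 0 for r in group):
            return group  # at least one successful trace: reward signal is dense enough
    # Fall back to the full expert trace as the single successful anchor.
    return [Rollout(expert_trace, 1.0)] + group[: group_size - 1]

def grpo_advantages(group: list[Rollout]) -> list[float]:
    """Group-relative advantages: reward minus the group mean, std-normalised."""
    rewards = [r.reward for r in group]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(x - mean) / std for x in rewards]

if __name__ == "__main__":
    group = bread_branched_rollouts("Solve: 12 * 13 = ?",
                                    "Compute 12 * 13 step by step ... the answer is 156")
    print(grpo_advantages(group))
```

The design point the sketch illustrates is that whenever a group would otherwise carry all-zero rewards, a longer expert prefix is tried, so every policy update contains at least one successful trace and the implied prefix schedule forms a natural curriculum.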
The following models were used in this research:
- Qwen-2.5-3B-Instruct
- Qwen-2.5-1.5B-Instruct
- DeepSeek-R1
The following datasets were used in this research:
- MATH
- NuminaMath-CoT
- GPQA
- S1K
- S1K-1.1
The paper reports the following findings:
- BREAD consistently outperforms standard GRPO
- BREAD achieves higher accuracy with fewer trace tokens
- BREAD reduces training compute by about 75%
The authors identified the following limitations:
- Assumes availability of strong expert models for effective training
- May fail to provide reward signals from partial traces
- Expert traces must be digestible for SLMs
- Number of GPUs: 8
- GPU Type: L40S 40GB
- Compute Requirements: All experiments were run on 8 L40S 40GB GPUs, except for training that starts from Qwen2.5-3B-Instruct as the base model, which requires 8 H100 80GB GPUs.