
TEACHING LARGE LANGUAGE MODELS TO SELF-DEBUG

Xinyun Chen (Google DeepMind), Maxwell Lin (UC Berkeley), Nathanael Schärli (Google DeepMind), Denny Zhou (Google DeepMind), 2023

Paper Information

  • arXiv ID: 2304.05128
  • Venue: International Conference on Learning Representations
  • Domain: Not specified

Abstract

Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, so some prior works have designed program repair approaches to improve code generation performance. In this work, we propose SELF-DEBUGGING, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that SELF-DEBUGGING can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language. SELF-DEBUGGING achieves state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark, where there are no unit tests to verify the correctness of predictions, SELF-DEBUGGING with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest level by 9%. On TransCoder and MBPP, where unit tests are available, SELF-DEBUGGING improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, SELF-DEBUGGING notably improves sample efficiency, and can match or outperform baseline models that generate more than 10× candidate programs.

Prior works suggest that such large language models are not yet capable of correcting code when lacking external feedback, such as unit tests or human instructions (Chen et al., 2023a). In this work, we propose SELF-DEBUGGING, where we teach the large language model to debug its own predicted code via few-shot prompting. Without any additional model training, SELF-DEBUGGING instructs the model to execute the code, then generate a feedback message based on the code and its execution result. Different from prior works on utilizing human feedback for code repair, where the feedback message explains the code errors and how to fix them (Chen et al., 2023a; Austin et al., 2021), SELF-DEBUGGING teaches the model to identify the implementation errors by investigating the execution results and explaining the code by itself. This debugging process is reminiscent of rubber duck debugging for human programmers, where explaining the code line-by-line in natural language to a rubber duck significantly boosts debugging efficiency without expert guidance (Hunt & Thomas, 2000). Figure 1 illustrates the full procedure of SELF-DEBUGGING. We evaluate SELF-DEBUGGING on a variety of models, including code-davinci-002 (Chen et al., 2021a), gpt-3.5-turbo, and gpt-4 (OpenAI, 2023) in the GPT model family, as well as StarCoder (Li et al., 2023b), a strong open-source LLM for code generation. SELF-DEBUGGING achieves state-of-the-art performance on different types of code generation tasks, including text-to-SQL generation, code translation, and text-to-Python generation.
On the Spider benchmark (Yu et al., 2018) for text-to-SQL generation, where there are no unit tests in the problem description, SELF-DEBUGGING with code explanation consistently improves the baseline by 2-3% with different numbers of initial programs, and improves the prediction accuracy on the most complicated SQL queries by 9%. On both TransCoder for code translation (Roziere et al., 2020) and MBPP for text-to-Python generation (Austin et al., 2021), utilizing unit tests along with code explanation boosts the accuracy by up to 12%, and code explanation alone without debugging also consistently improves the code translation performance by 2-3%. Meanwhile, SELF-DEBUGGING improves sample efficiency, and can match or outperform baseline models that sample more than 10× predictions. Our work indicates that besides improving their ability to generate code from scratch, teaching large language models to perform SELF-DEBUGGING without human guidance is another promising path to enhance coding capability and reduce the sampling cost required to accomplish challenging tasks.
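
The procedure described above reduces to a short loop: generate a program, execute it, have the model explain the code and its execution result, and regenerate. Below is a minimal sketch of that loop for the setting where unit tests are available (as on TransCoder and MBPP); the `llm` callable, prompt wording, and debugging budget are illustrative assumptions, not the paper's exact implementation. In the setting without unit tests (Spider), the feedback instead comes from executing the predicted query and asking the model to explain its own code.

```python
from typing import Callable, List


def run_unit_tests(code: str, tests: List[str]) -> List[str]:
    """Execute the candidate program against each unit test and collect failure messages."""
    failures = []
    for test in tests:
        namespace: dict = {}
        try:
            exec(code, namespace)   # define the candidate function(s)
            exec(test, namespace)   # e.g. "assert add(2, 3) == 5"
        except Exception as exc:    # assertion failures, syntax or runtime errors
            failures.append(f"{test!r} failed with: {exc!r}")
    return failures


def self_debug(llm: Callable[[str], str], task: str, tests: List[str],
               max_turns: int = 3) -> str:
    """Generate a program, then iteratively explain and repair it using execution feedback."""
    code = llm(f"Write a Python program for the task:\n{task}")
    for _ in range(max_turns):
        failures = run_unit_tests(code, tests)
        if not failures:            # all unit tests pass: accept the program
            return code
        # Rubber-duck-style feedback: ask the model to explain its own code,
        # inspect the failed tests, and produce a corrected program.
        feedback_prompt = (
            f"Task: {task}\n\nCurrent program:\n{code}\n\n"
            "Failed unit tests:\n" + "\n".join(failures) + "\n\n"
            "Explain the program line by line, identify the mistake, "
            "and return a corrected program."
        )
        code = llm(feedback_prompt)
    return code                     # best effort once the debugging budget is spent
```

A real setup would back the `llm` callable with a few-shot-prompted model such as gpt-3.5-turbo and sandbox the `exec` calls before running untrusted generated code.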

Summary

This paper introduces SELF-DEBUGGING, a few-shot prompting technique that teaches large language models (LLMs) to debug the code they generate. The authors demonstrate that, given few-shot demonstrations, the models can identify and correct their programming mistakes without human intervention. The approach mimics rubber duck debugging: the model explains its generated code and inspects execution results, thereby improving accuracy, especially on complex tasks. The results indicate that SELF-DEBUGGING outperforms existing methods on benchmarks such as Spider, TransCoder, and MBPP, improving sample efficiency and reducing the sampling cost of code generation.

Methods

This paper employs the following methods:

  • SELF-DEBUGGING
  • rubber duck debugging (see the prompt sketch after this list)

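As a concrete illustration of the rubber duck debugging step in the setting without unit tests (the Spider text-to-SQL case), the sketch below assembles a feedback prompt from the question, the predicted query, and its execution result. The function name and prompt wording are illustrative assumptions, not the paper's released few-shot template.

```python
def rubber_duck_prompt(question: str, sql: str, execution_result: str) -> str:
    """Build a feedback prompt asking the model to explain and, if needed, fix its own SQL."""
    return (
        f"Question: {question}\n"
        f"Predicted SQL:\n{sql}\n"
        f"Execution result (first rows):\n{execution_result}\n\n"
        "Explain the SQL query step by step in natural language. "
        "Then state whether it correctly answers the question; "
        "if not, write a corrected SQL query."
    )


# Toy usage with made-up inputs:
print(rubber_duck_prompt(
    question="How many singers are from France?",
    sql="SELECT COUNT(*) FROM singer WHERE country = 'France'",
    execution_result="[(3,)]",
))
```
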
Models Used

  • code-davinci-002
  • gpt-3.5-turbo
  • gpt-4
  • StarCoder

Datasets

The following datasets were used in this research:

  • Spider
  • TransCoder
  • MBPP

Evaluation Metrics

  • Accuracy

Results

  • Achieves state-of-the-art performance on the Spider, TransCoder, and MBPP benchmarks
  • Improves prediction accuracy on the hardest Spider problems by 9%
  • Improves baseline accuracy by up to 12% on benchmarks with unit tests (TransCoder, MBPP)

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

External Resources