Venue
Neural Information Processing Systems
Domain
Natural Language Processing / Software Engineering
Program synthesis has been long studied, with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus, a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HUMANEVAL benchmark by 80× to build HUMANEVAL+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HUMANEVAL+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HUMANEVAL+, while neither of them could on HUMANEVAL. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets, as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.
The paper presents EvalPlus, a framework designed to rigorously evaluate the functional correctness of code generated by Large Language Models (LLMs). The authors highlight the limitations of current programming benchmarks, particularly manually written test cases that are insufficient in quantity and quality. They extend the test suite of the existing HUMANEVAL benchmark by 80×, creating HUMANEVAL+, whose additional test inputs are produced automatically through a combination of LLM-based and mutation-based strategies. An extensive evaluation of 26 LLMs (e.g., GPT-4, ChatGPT) reveals that the augmented benchmark catches significant amounts of previously undetected incorrect code. The results indicate that previous benchmarks, including HUMANEVAL, may misrepresent LLM performance due to test insufficiency, prompting a reevaluation of code generation assessment methods. The paper's contributions include automatic test generation techniques, a systematic evaluation of LLMs, and corrections to ground-truth solutions in HUMANEVAL.
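The test-generation idea described above combines LLM-produced seed inputs with type-aware mutation: a seed input is repeatedly perturbed in ways that preserve its Python type, so the generated corpus stays well-formed while covering many more corner cases. Below is a minimal sketch of that idea; the mutation rules and function names are illustrative assumptions, not the authors' implementation.

```python
import copy
import random

def mutate(value):
    """Return a type-preserving mutant of `value` (illustrative rules only)."""
    if isinstance(value, bool):              # check bool before int: bool is an int subclass
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1])
    if isinstance(value, float):
        return value + random.uniform(-1.0, 1.0)
    if isinstance(value, str):
        if not value:
            return random.choice("abc")
        i = random.randrange(len(value))
        return value[:i] + random.choice("abcxyz") + value[i + 1:]
    if isinstance(value, tuple) and value:   # argument tuples: mutate one position
        items = list(value)
        i = random.randrange(len(items))
        items[i] = mutate(items[i])
        return tuple(items)
    if isinstance(value, list):
        mutant = copy.deepcopy(value)
        if mutant and random.random() < 0.5:
            i = random.randrange(len(mutant))
            mutant[i] = mutate(mutant[i])    # mutate one element in place
        else:
            mutant.append(mutate(mutant[-1]) if mutant else 0)  # grow the list
        return mutant
    if isinstance(value, dict):
        mutant = copy.deepcopy(value)
        if mutant:
            key = random.choice(list(mutant))
            mutant[key] = mutate(mutant[key])
        return mutant
    return value                             # unsupported types are left unchanged

def generate_inputs(seed_inputs, budget=1000):
    """Expand a small set of seed argument tuples into a large corpus by repeated mutation."""
    corpus = list(seed_inputs)
    while len(corpus) < budget:
        corpus.append(mutate(random.choice(corpus)))
    return corpus
```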
This paper employs the following methods:
- EvalPlus
- differential testing (see the sketch after the model list below)
- type-aware mutation
The following LLMs were evaluated:
- GPT-4
- ChatGPT
- WizardCoder-CodeLlama
- Phind-CodeLlama
- CodeGen
- INCODER
- StarCoder
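Differential testing, listed among the methods above, means executing both the ground-truth solution and each LLM-synthesized candidate on the same generated inputs and flagging any divergence in behavior. The sketch below illustrates this idea under simplifying assumptions (no sandboxing, per-test timeouts, or tolerant comparison of floats and unordered outputs), so it is not the EvalPlus harness itself.

```python
def differential_test(ground_truth, candidate, inputs):
    """
    Run the reference solution and an LLM-synthesized candidate on the same
    argument tuples and collect any behavioral divergence.
    """
    failures = []
    for args in inputs:
        expected = ground_truth(*args)
        try:
            actual = candidate(*args)
        except Exception as exc:             # candidate crashes also count as failures
            failures.append((args, expected, f"raised {exc!r}"))
            continue
        if actual != expected:
            failures.append((args, expected, actual))
    return failures

# Example: a subtly wrong candidate is caught only by an edge-case input.
ref = lambda xs: sorted(set(xs))
bad = lambda xs: sorted(xs)                  # forgets to deduplicate
print(differential_test(ref, bad, [([1, 2, 3],), ([2, 2, 1],)]))
```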
The following datasets were used in this research:
- HUMANEVAL
- HUMANEVAL+
The following evaluation metrics were reported:
- pass@1
- pass@10
- pass@100
- pass@1*
The paper reports the following results (a sketch of the pass@k estimator behind these metrics follows this list):
- HUMANEVAL+ catches enough previously undetected wrong code to reduce pass@k by up to 19.3-28.9% relative to HUMANEVAL
- WizardCoder-CodeLlama and Phind-CodeLlama outperform ChatGPT on HUMANEVAL+ but not on HUMANEVAL
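The pass@1/pass@10/pass@100 figures above are conventionally computed with the unbiased pass@k estimator of Chen et al. (2021): given n generated samples per problem of which c pass all tests, pass@k = 1 - C(n-c, k)/C(n, k). The snippet below is a generic sketch of that metric in its numerically stable product form; the example counts are hypothetical, not figures from the paper.

```python
import numpy as np

def pass_at_k(n, c, k):
    """
    Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of which are correct) passes all tests.
    """
    if n - c < k:
        return 1.0                       # every size-k draw contains a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 200 samples per problem, 37 pass the augmented tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))    # ~0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))
```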
The authors identified the following limitations:
- Current programming benchmarks use insufficient test cases
- Manual tests often fail to capture corner cases
- Ground-truth solutions can be incorrect
- Number of GPUs: None specified
- GPU Type: None specified
Keywords
Large Language Models
Code Generation
Automated Testing
Benchmarking
EvalPlus