Venue
Neural Information Processing Systems
Domain
Natural Language Processing / Software Engineering
Program synthesis has been long studied, with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus, a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HUMANEVAL benchmark by 80× to build HUMANEVAL+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HUMANEVAL+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HUMANEVAL+, while neither of them could on HUMANEVAL. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets, as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.
The paper presents EvalPlus, a framework designed to rigorously evaluate the functional correctness of code generated by Large Language Models (LLMs). The authors highlight the limitations of current programming benchmarks, particularly manually written test cases that are insufficient in quantity and quality. They extend the test suite of the existing HUMANEVAL benchmark by 80×, creating HUMANEVAL+, whose additional test inputs are produced automatically through a combination of LLM-based and mutation-based strategies. An extensive evaluation of 26 LLMs (e.g., GPT-4, ChatGPT) reveals that the augmented benchmark catches significant amounts of previously undetected incorrect code. The results indicate that previous benchmarks, including HUMANEVAL, may misrepresent LLM performance due to test insufficiency, prompting a reevaluation of code generation assessment methods. The paper's contributions include automatic test generation techniques, a systematic evaluation of LLMs, and corrections to ground-truth solutions in HUMANEVAL.
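The test-generation idea described above combines LLM-produced seed inputs with type-aware mutation: a seed input is repeatedly perturbed in ways that preserve its Python type, so the generated corpus stays well-formed while covering many more corner cases. Below is a minimal sketch of that idea; the mutation rules and function names are illustrative assumptions, not the authors' implementation.

```python
import copy
import random

def mutate(value):
    """Return a type-preserving mutant of `value` (illustrative rules only)."""
    if isinstance(value, bool):              # check bool before int: bool is an int subclass
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1])
    if isinstance(value, float):
        return value + random.uniform(-1.0, 1.0)
    if isinstance(value, str):
        if not value:
            return random.choice("abc")
        i = random.randrange(len(value))
        return value[:i] + random.choice("abcxyz") + value[i + 1:]
    if isinstance(value, tuple) and value:   # argument tuples: mutate one position
        items = list(value)
        i = random.randrange(len(items))
        items[i] = mutate(items[i])
        return tuple(items)
    if isinstance(value, list):
        mutant = copy.deepcopy(value)
        if mutant and random.random() < 0.5:
            i = random.randrange(len(mutant))
            mutant[i] = mutate(mutant[i])    # mutate one element in place
        else:
            mutant.append(mutate(mutant[-1]) if mutant else 0)  # grow the list
        return mutant
    if isinstance(value, dict):
        mutant = copy.deepcopy(value)
        if mutant:
            key = random.choice(list(mutant))
            mutant[key] = mutate(mutant[key])
        return mutant
    return value                             # unsupported types are left unchanged

def generate_inputs(seed_inputs, budget=1000):
    """Expand a small set of seed argument tuples into a large corpus by repeated mutation."""
    corpus = list(seed_inputs)
    while len(corpus) < budget:
        corpus.append(mutate(random.choice(corpus)))
    return corpus
```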
This paper employs the following methods:
- EvalPlus
- differential testing (see the sketch after the model list below)
- type-aware mutation
The following LLMs were evaluated:
- GPT-4
- ChatGPT
- WizardCoder-CodeLlama
- Phind-CodeLlama
- CodeGen
- INCODER
- StarCoder
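Differential testing, listed among the methods above, means executing both the ground-truth solution and each LLM-synthesized candidate on the same generated inputs and flagging any divergence in behavior. The sketch below illustrates this idea under simplifying assumptions (no sandboxing, per-test timeouts, or tolerant comparison of floats and unordered outputs), so it is not the EvalPlus harness itself.

```python
def differential_test(ground_truth, candidate, inputs):
    """
    Run the reference solution and an LLM-synthesized candidate on the same
    argument tuples and collect any behavioral divergence.
    """
    failures = []
    for args in inputs:
        expected = ground_truth(*args)
        try:
            actual = candidate(*args)
        except Exception as exc:             # candidate crashes also count as failures
            failures.append((args, expected, f"raised {exc!r}"))
            continue
        if actual != expected:
            failures.append((args, expected, actual))
    return failures

# Example: a subtly wrong candidate is caught only by an edge-case input.
ref = lambda xs: sorted(set(xs))
bad = lambda xs: sorted(xs)                  # forgets to deduplicate
print(differential_test(ref, bad, [([1, 2, 3],), ([2, 2, 1],)]))
```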
The following datasets were used in this research:
- HUMANEVAL
- HUMANEVAL+
The following evaluation metrics were reported:
- pass@1
- pass@10
- pass@100
- pass@1*
The paper reports the following results (a sketch of the pass@k estimator behind these metrics follows this list):
- HUMANEVAL+ catches enough previously undetected wrong code to reduce pass@k by up to 19.3-28.9% relative to HUMANEVAL
- WizardCoder-CodeLlama and Phind-CodeLlama outperform ChatGPT on HUMANEVAL+ but not on HUMANEVAL
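The pass@1/pass@10/pass@100 figures above are conventionally computed with the unbiased pass@k estimator of Chen et al. (2021): given n generated samples per problem of which c pass all tests, pass@k = 1 - C(n-c, k)/C(n, k). The snippet below is a generic sketch of that metric in its numerically stable product form; the example counts are hypothetical, not figures from the paper.

```python
import numpy as np

def pass_at_k(n, c, k):
    """
    Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of which are correct) passes all tests.
    """
    if n - c < k:
        return 1.0                       # every size-k draw contains a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 200 samples per problem, 37 pass the augmented tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))    # ~0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))
```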
The authors identified the following limitations:
- Current programming benchmarks use insufficient test cases
- Manual tests often fail to capture corner cases
- Ground-truth solutions can be incorrect
- Number of GPUs: None specified
- GPU Type: None specified
Keywords
Large Language Models
Code Generation
Automated Testing
Benchmarking
EvalPlus