This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code". It is used to measure functional correctness for synthesizing programs from docstrings. The dataset consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
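Functional correctness is reported as pass@k: for each problem, n completions are sampled, c of them pass the unit tests, and an unbiased estimator of the probability that at least one of k samples is correct is averaged over problems. The sketch below follows the numpy formulation of this estimator described in the paper; it is illustrative and not necessarily the harness's exact implementation.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total completions sampled for the problem
    c: number of completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct completion.
        return 1.0
    # 1 - probability that a random size-k subset contains no correct completion
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 of which pass the tests
print(pass_at_k(n=200, c=37, k=1))    # 0.185 (pass@1 equals c/n)
print(pass_at_k(n=200, c=37, k=100))  # close to 1.0
```

The dataset-level score is the mean of this per-problem estimate.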
Source: Evaluating Large Language Models Trained on Code
Variants: STS Benchmark, HumanEval, HumanEval!, humaneval (0-shots), MTEB Benchmark, AllNLI Triplet
This dataset is used in 1 benchmark: Code Generation.

Recent papers with results on this dataset:
Task | Model | Paper | Date |
---|---|---|---|
Code Generation | EG-CFG (DeepSeek-V3-0324) | Execution Guided Line-by-Line Code Generation | 2025-06-12 |
Code Generation | QualityFlow (Sonnet-3.5) | QualityFlow: An Agentic Workflow for … | 2025-01-20 |
Code Generation | Phi-2 | Planning-Driven Programming: A Large Language … | 2024-11-21 |
Code Generation | DeepSeek-R1 (MGDebugger) | From Code to Correctness: Closing … | 2024-10-02 |
Code Generation | Mistral 7B | MapCoder: Multi-Agent Code Generation for … | 2024-05-18 |
Code Generation | LLaMA 3 | Debug like a Human: A … | 2024-02-25 |
Code Generation | L2MAC (GPT-4) | L2MAC: Large Language Model Automatic … | 2023-10-02 |