HumanEval

Dataset Information
Modalities
Texts
Languages
English
Introduced
2021
License
Unknown
Homepage

Overview

This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code". It is used to measure the functional correctness of programs synthesized from docstrings. The dataset consists of 164 hand-written programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.

Source: Evaluating Large Language Models Trained on Code
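The paper reports functional correctness as pass@k, estimated without bias from n generated samples per problem, of which c pass the unit tests. A minimal Python sketch of that estimator (the function name and the numpy-based product form are illustrative choices here, not the official harness code):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator from "Evaluating Large Language Models
    # Trained on Code": n = samples generated per problem, c = samples
    # that pass all unit tests, k = evaluation budget.
    if n - c < k:
        # Fewer than k failing samples: any size-k draw contains a pass.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

The official human-eval harness released with the paper executes the generated completions against each problem's unit tests and aggregates the results into pass@k; the sketch above covers only that final aggregation step.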

Variants: HumanEval, HumanEval!, humaneval (0-shots)

Associated Benchmarks

This dataset is used in 1 benchmark:

Code Generation on HumanEval

Recent Benchmark Submissions

Task | Model | Paper | Date
Code Generation | EG-CFG (DeepSeek-V3-0324) | Execution Guided Line-by-Line Code Generation | 2025-06-12
Code Generation | QualityFlow (Sonnet-3.5) | QualityFlow: An Agentic Workflow for … | 2025-01-20
Code Generation | Phi-2 | Planning-Driven Programming: A Large Language … | 2024-11-21
Code Generation | DeepSeek-R1 (MGDebugger) | From Code to Correctness: Closing … | 2024-10-02
Code Generation | Mistral 7B | MapCoder: Multi-Agent Code Generation for … | 2024-05-18
Code Generation | LLaMA 3 | Debug like a Human: A … | 2024-02-25
Code Generation | L2MAC (GPT-4) | L2MAC: Large Language Model Automatic … | 2023-10-02

Research Papers

Recent papers with results on this dataset: