MBPP

Code Generation Benchmark

Showing 95 results | Metric: Accuracy

Top Performing Models

| Rank | Model | Paper | Accuracy (%) | Date | Code |
|---|---|---|---|---|---|
| 1 | EG-CFG (DeepSeek-V3-0324) | Execution Guided Line-by-Line Code Generation | 96.60 | 2025-06-12 | boazlavon/eg_cfg |
| 2 | QualityFlow (Sonnet-3.5) | QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks | 94.20 | 2025-01-20 | - |
| 3 | o1-mini + MapCoder (Hamming.ai) | MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | 93.20 | 2024-05-18 | md-ashraful-pramanik/mapcoder, Luoji-zju/Agents4PLC_release |
| 4 | MGDebugger (DeepSeek-V3-0324) | From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | 92.40 | 2024-10-02 | YerbaPage/MGDebugger |
| 5 | GPT-4 + AgentCoder | AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | 91.80 | 2023-12-20 | huangd1999/AgentCoder |
| 6 | CodeSim (GPT4o) | CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging | 90.70 | 2025-02-08 | kagnlp/CodeGenerator |
| 7 | GPT-3.5 Turbo (ChatGPT) + AgentCoder | AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation | 89.90 | 2023-12-20 | huangd1999/AgentCoder |
| 8 | MapCoder (GPT-4o) | MapCoder: Multi-Agent Code Generation for Competitive Problem Solving | 89.70 | 2024-05-18 | md-ashraful-pramanik/mapcoder, Luoji-zju/Agents4PLC_release |
| 9 | GPT-4 (ChatGPT Plus) | How Does Naming Affect LLMs on Code Analysis Tasks? | 87.50 | 2023-07-24 | - |
| 10 | LPW (GPT-4o) | Planning-Driven Programming: A Large Language Model Programming Workflow | 84.80 | 2024-11-21 | you68681/lpw |

All Papers (95)