RES-Q

RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale

Dataset Information
Modalities: Texts
Languages: English
Introduced: 2024
License: MIT

Overview

RES-Q is a natural-language, instruction-based benchmark for evaluating Repository Editing Systems. It consists of 100 handcrafted repository-editing tasks derived from real GitHub commits. Given an edit instruction and a code repository, RES-Q evaluates an LLM system's ability to interpret the instruction, gather relevant information from the repository, and construct an appropriate edit.
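To make the evaluation flow concrete, the sketch below shows what a harness around such a system might look like: each task pairs an edit instruction with a repository snapshot, the system under test produces an edited repository, and the edit is judged by a per-task success check. This is a minimal illustration only; the names Task, evaluate, apply_edit, and check_edit are assumptions for exposition, not the benchmark's actual interface.

```python
"""Minimal sketch of a RES-Q-style evaluation loop.

All names here (Task, evaluate, apply_edit, check_edit) are illustrative
assumptions, not the benchmark's actual API.
"""
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable


@dataclass
class Task:
    instruction: str                      # natural-language edit instruction
    repo_path: Path                       # repository snapshot before the edit
    check_edit: Callable[[Path], bool]    # success check on the edited repository


def evaluate(apply_edit: Callable[[Path, str], Path],
             tasks: Iterable[Task]) -> float:
    """Return the fraction of tasks whose edited repository passes its check."""
    results = []
    for task in tasks:
        # The system under test reads the instruction, gathers whatever
        # context it needs from the repository, and returns a path to its
        # edited copy of the repository.
        edited_repo = apply_edit(task.repo_path, task.instruction)
        results.append(task.check_edit(edited_repo))
    return sum(results) / len(results)
```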

Variants: RES-Q

Associated Benchmarks

This dataset is used in 1 benchmark: Code Generation on RES-Q.

Recent Benchmark Submissions

Task | Model | Paper | Date
Code Generation | QurrentOS-coder + Claude 3.5 Sonnet | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24
Code Generation | QurrentOS-coder + GPT-4o | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24
Code Generation | QurrentOS-coder + GPT-4 Turbo | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24
Code Generation | QurrentOS-coder + Claude 3 Opus | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24
Code Generation | QurrentOS-coder + GPT-4 | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24
Code Generation | QurrentOS-coder + Gemini 1.5 Pro | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24
Code Generation | QurrentOS-coder + DeepSeek-Coder-V2 | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24
Code Generation | QurrentOS-coder + Llama 3 70b | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24
Code Generation | QurrentOS-coder + Qwen-72B-Instruct | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24

Research Papers

Recent papers with results on this dataset: