RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale
RES-Q is a natural-language-instruction benchmark for evaluating **R**epository **E**diting **S**ystems, consisting of 100 handcrafted repository editing tasks derived from real GitHub commits. Given an edit instruction and a code repository, RES-Q evaluates an LLM system's ability to interpret the instruction, gather relevant information from the repository, and construct an appropriate edit.
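The snippet below is a rough Python sketch of the kind of evaluation loop this setup implies, not the official RES-Q harness: each task is assumed to provide an instruction, a repository path, and a test command (the field names and the `edit_system` callable here are hypothetical).

```python
# Hypothetical sketch of scoring one RES-Q-style task.
# Task fields ("instruction", "repo_path", "test_command") and the
# edit_system callable are illustrative assumptions, not the official API.
import subprocess
from pathlib import Path
from typing import Callable


def run_task(task: dict, edit_system: Callable[[str, Path], str]) -> bool:
    """Hand the instruction and repository to an editing system, apply the
    patch it returns, and report whether the task's checks pass afterward."""
    repo = Path(task["repo_path"])

    # The system under test interprets the instruction, gathers context
    # from the repository, and returns a unified diff with its proposed edit.
    patch = edit_system(task["instruction"], repo)

    # Apply the proposed edit to a working copy of the repository.
    apply = subprocess.run(["git", "apply", "-"], input=patch, text=True, cwd=repo)
    if apply.returncode != 0:
        return False  # a malformed or non-applying patch counts as a failure

    # The task passes when the repository's test command succeeds post-edit.
    tests = subprocess.run(task["test_command"], shell=True, cwd=repo)
    return tests.returncode == 0
```

A harness along these lines would run every task in a fresh repository checkout and report the fraction of tasks whose checks pass, which is how the systems in the table below can be compared.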
Variants: RES-Q
This dataset is used in 1 benchmark:
| Task | Model | Paper | Date |
|---|---|---|---|
| Code Generation | QurrentOS-coder + Claude 3.5 Sonnet | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24 |
| Code Generation | QurrentOS-coder + GPT-4o | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24 |
| Code Generation | QurrentOS-coder + GPT-4 Turbo | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24 |
| Code Generation | QurrentOS-coder + Claude 3 Opus | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24 |
| Code Generation | QurrentOS-coder + GPT-4 | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24 |
| Code Generation | QurrentOS-coder + Gemini 1.5 Pro | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24 |
| Code Generation | QurrentOS-coder + DeepSeek-Coder-V2 | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24 |
| Code Generation | QurrentOS-coder + Llama 3 70b | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24 |
| Code Generation | QurrentOS-coder + Qwen-72B-Instruct | RES-Q: Evaluating Code-Editing Large Language … | 2024-06-24 |