
Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems

(2025)

Paper Information
arXiv ID: 2506.17208

Abstract

The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (68 entries) and Verified (79 entries) leaderboards, analyzing 67 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5/3.7), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.

Summary

This paper presents a comprehensive analysis of the SWE-Bench leaderboards, focusing on submissions to both SWE-Bench Lite and SWE-Bench Verified leaderboards. The findings reveal that a significant number of entries utilize proprietary large language models (LLMs), particularly Claude 3.5/3.7, and feature diverse architectures ranging from agentic to non-agentic design patterns. The study highlights the dominance of industry submissions, with key contributions from small and large tech companies, as well as individual developers. The paper also discusses the end-to-end software maintenance pipeline and evaluates the submissions across various performance metrics, including precision and resolved rates. The analysis emphasizes the increasing acceptance of SWE-Bench as a standard benchmark for AI-driven software engineering solutions and identifies future avenues for enhancement.

Methods

This paper employs the following methods:

  • Automated Program Repair (APR)
  • Data Collection
  • Content Analysis
  • Statistical Testing
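
The paper reports statistical testing over leaderboard metadata (for example, comparing submitter types across the two leaderboards) but does not publish its analysis code. The snippet below is a minimal sketch using SciPy's Fisher's exact test on a 2x2 table of industry vs. non-industry submissions; the counts and the choice of test are illustrative assumptions, not figures taken from the paper.

```python
# Sketch: is the industry share of submissions different between the two
# leaderboards? The counts below are placeholders, NOT the paper's numbers.
from scipy.stats import fisher_exact

# Rows: SWE-Bench Verified, SWE-Bench Lite
# Columns: industry submissions, non-industry submissions
contingency = [
    [60, 19],  # hypothetical split of the 79 Verified entries
    [40, 28],  # hypothetical split of the 68 Lite entries
]

odds_ratio, p_value = fisher_exact(contingency, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Industry share differs significantly between the leaderboards.")
else:
    print("No significant difference in industry share at the 0.05 level.")
```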

Models Used

  • Claude 3.5
  • Claude 3.7
  • GPT-4
  • OpenAI GPT-4o
  • LLaMA 3

Datasets

The following datasets were used in this research:

  • SWE-Bench Lite
  • SWE-Bench Verified
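
Both benchmark splits are distributed as public datasets on the Hugging Face Hub. The snippet below is a minimal loading sketch, assuming the datasets remain published under the princeton-nlp organization IDs shown; it is not part of the paper's own tooling.

```python
# Sketch: loading the two SWE-Bench leaderboard datasets studied in the paper.
# Assumes the Hugging Face dataset IDs below are still current.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(f"SWE-Bench Lite instances:     {len(lite)}")
print(f"SWE-Bench Verified instances: {len(verified)}")

# Each instance pairs a real GitHub issue with the gold patch and test patch
# from the corresponding pull request.
example = lite[0]
print(example["repo"], example["instance_id"])
```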

Evaluation Metrics

  • % Resolved
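
The leaderboards rank systems by % Resolved: the share of benchmark issues for which a submission's patch makes the issue's failing tests pass without breaking existing ones. The sketch below shows that calculation over a hypothetical per-instance outcome file; the file name and schema are illustrative assumptions, not the real SWE-Bench harness output.

```python
# Sketch: computing % Resolved from per-instance evaluation outcomes.
# The input is a hypothetical {"instance_id": bool} mapping; the actual
# SWE-Bench evaluation harness produces richer reports.
import json

def percent_resolved(report_path: str, total_instances: int) -> float:
    """Share of benchmark instances whose patch passed evaluation."""
    with open(report_path) as f:
        outcomes = json.load(f)  # e.g. {"django__django-11999": true, ...}
    resolved = sum(1 for passed in outcomes.values() if passed)
    return 100.0 * resolved / total_instances

# Example: a Lite submission resolving 150 of the 300 instances -> 50.0
# print(percent_resolved("report.json", total_instances=300))
```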

Results

  • The majority of high-performing submissions rely on proprietary LLMs
  • The share of industry submissions is significantly higher on SWE-Bench Verified than on SWE-Bench Lite
  • Architecturally diverse systems, both agentic and non-agentic, achieve competitive performance

Limitations

The authors identified the following limitations:

  • Many submissions lack detailed documentation
  • Variations in performance results due to overfitting
  • Potential biases in test suite evaluations

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified
  • Compute Requirements: None specified
