- 📅 2026.01: Paper update! We add 34 papers from 2025.10 to 2025.12. Now, we have included 160 papers in this survey.
- 📅 2025.12: We release the first survey on agentic software issue resolution!
- 📅 2025.10: We summarize 126 papers about issue resolution, from 2023.10 to 2025.10!
We organize this survey into three main parts: Benchmarks, Technologies, and Empirical Studies.
As of 2026-01-06, automated issue-solving technologies can be surveyed from 2 main perspectives: Scaffold Design and Learning Strategy.
For Benchmarks, we group the existing benchmarks into 3 categories according to their tasks:
@End-To-End
@Reproduction Test Generation
@Localization
| Literature | Name | Scope | Journal/Conference | Time | Link |
|---|---|---|---|---|---|
| SWE-bench: Can Language Models Resolve Real-World GitHub Issues? | SWE-bench | End-To-End | ICLR'24 | 2023-10 | Paper Code |
| SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents | SWT-Bench | Reproduction Test Generation | NeurIPS'24 | 2024-06 | Paper Code |
| SWE-bench-java: A GitHub Issue Resolving Benchmark for Java | Multi-SWE-bench | End-To-End | ARXIV | 2024-08 | Paper Code |
| SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? | SWE-bench Multimodal | End-To-End | ICLR'25 | 2024-10 | Paper Code |
| SWE-Bench+: Enhanced Coding Benchmark for LLMs | SWE-Bench+ | End-To-End | ARXIV | 2024-10 | Paper |
| TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark | TestGenEval | Reproduction Test Generation | ICLR'25 | 2024-10 | Paper Code |
| A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models | FAUN-Eval | End-To-End | ARXIV | 2024-11 | Paper |
| TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved? | TDD-Bench | Reproduction Test Generation | ARXIV | 2024-11 | Paper Code |
| CodeV: Issue Resolving with Visual Data | Visual SWE-bench | End-To-End | ACL Findings'25 | 2024-12 | Paper Code |
| Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving | Multi-SWE-bench | End-To-End | ARXIV | 2025-04 | Paper Code |
| LiveSWEBench | LiveSWEBench | End-To-End | BLOG | 2025-04 | link Code |
| LocAgent: Graph-Guided LLM Agents for Code Localization | LocBench | Localization | ARXIV | 2025-03 | Paper Code |
| Automated Benchmark Generation for Repository-Level Coding Tasks | SWEE-Bench/SWA-Bench | End-To-End | ARXIV | 2025-03 | Paper |
| FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation | FEA-Bench | End-To-End | ACL'25 | 2025-03 | Paper Code |
| OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution | OmniGIRL | End-To-End | ISSTA'25 | 2025-05 | Paper Code |
| - | SWE-bench Multilingual | End-To-End | BLOG | 2025-05 | link Code |
| SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents | SWE-PolyBench | End-To-End | ARXIV | 2025-04 | Paper Code |
| SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents | SWE-rebench | End-To-End | ARXIV | 2025-05 | Paper Code |
| GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents | GSO | End-To-End | NeurIPS'25 | 2025-05 | Paper Code |
| SWE-bench Goes Live! | SWE-bench-Live | End-To-End | ARXIV | 2025-05 | Paper Code |
| UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench | UTBoost | - | ARXIV | 2025-06 | Paper Code |
| SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks | SWE-Factory | End-To-End | ARXIV | 2025-06 | Paper Code |
| SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving | Swing-Arena | End-To-End | ARXIV | 2025-06 | Paper Code |
| SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation | SPICE | - | ASE'25 | 2025-07 | Paper |
| SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks | SWE-MERA | End-To-End | ARXIV | 2025-07 | Paper |
| SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? | SWE-Perf | End-To-End | ARXIV | 2025-07 | Paper Code |
| NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition | NoCode-bench | End-To-End | ARXIV | 2025-08 | Paper Code |
| SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? | SWE-Bench Pro | End-To-End | ARXIV | 2025-09 | Paper Code |
| SWE-QA: Can Language Models Answer Repository-level Code Questions? | SWE-QA-Bench | QA | ARXIV | 2025-09 | Paper Code |
| A Benchmark for Localizing Code and Non-Code Issues in Software Projects | MULocBench | Localization | ARXIV | 2025-10 | Paper Code |
| SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models | SWE-Compass | End-To-End | ARXIV | 2025-11 | Paper Code |
| SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads? | SWE-fficiency | End-To-End | ARXIV | 2025-11 | Paper Code |
| SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories | SWE-Bench++ | End-To-End | ARXIV | 2025-12 | Paper Code |
| SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios | SWE-EVO | End-To-End | ARXIV | 2025-12 | Paper Code |
From the perspective of Design Paradigms, we classify the methods into 2 categories, following the benchmark tasks:
@End-To-End
@Single-Phased
End-To-End methods can be further classified into 2 categories:
@Agent-Based Method
@Pipeline-Based Method
| Literature | Name | Journal/Conference | Time | Label | URL |
|---|---|---|---|---|---|
| SWE-bench: Can Language Models Resolve Real-World GitHub Issues? | BM25 RAG | ICLR 2024 | 2023-10 | @Pipeline | Paper Code |
| SWE-agent: Agent-computer interfaces enable automated software engineering | SWE-Agent | NeurIPS 2024 | 2024-05 | @Agent | Paper Code |
| Autocoderover: Autonomous program improvement | AutoCodeRover | ISSTA 2024 | 2024-04 | @Agent | Paper Code |
| CodeR: Issue Resolving with Multi-Agent and Task Graphs | CodeR | Arxiv | 2024-06 | @Agent | Paper Code |
| Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration | LingmaAgent/RepoUnderstander | FSE Companion 2025 | 2024-06 | @Agent | Paper Code |
| MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution | MAGIS | NeurIPS 2024 | 2024-03 | @Agent | Paper Code |
| MASAI: Modular Architecture for Software-engineering AI Agents | MASAI | Arxiv | 2024-06 | @Agent | Paper |
| OpenDevin: An Open Platform for AI Software Developers as Generalist Agents | OpenDevin(AllHands) | Arxiv | 2024-06 | @Agent | Paper |
| Agentless: Demystifying llm-based software engineering agents | Agentless | FSE 2025 | 2024-07 | @Pipeline | Paper Code |
| OpenHands: An Open Platform for AI Software Developers as Generalist Agents | OpenHands | ICLR 2025 | 2024-07 | @Agent | Paper Code |
| Specrover: Code intent extraction via llms | SpecRover (AutoCodeRover-v2) | ICSE 2025 | 2024-08 | @Agent | Paper Code |
| CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases | CodexGraph | Arxiv | 2024-08 | @Agent | Paper Code |
| SuperCoder2.0: Technical Report on Exploring the feasibility of LLMs as Autonomous Programmer | SuperCoder | Arxiv | 2024-09 | @Agent | Paper |
| Hyperagent: Generalist software engineering agents to solve coding tasks at scale | HyperAgent | Arxiv | 2024-09 | @Agent | Paper |
| RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph | RepoGraph | ICLR 2025 | 2024-10 | @Pipeline | Paper Code |
| SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement | SWE-Search | ICLR 2025 | 2024-10 | @Agent | Paper Code |
| OpenHands: An Open Platform for AI Software Developers as Generalist Agents | OpenHands CodeAct | ICLR 2025 | 2024-10 | @Agent | Paper Code |
| - | Composio SWE-Kit | Blog | 2024-10 | @Pipeline | Link Code |
| Infant Agent: A Tool-Integrated, Logic-Driven Agent with Cost-Effective API Usage | Infant Agent | Arxiv | 2024-11 | @Agent | Paper |
| MarsCode Agent: AI-native Automated Bug Fixing | MarsCode Agent | Arxiv | 2024-11 | @Agent | Paper |
| Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement | SWESynInfer | FSE 2025 Industry | 2024-11 | @Pipeline | Paper Code |
| - | Nebius AI | Blog | 2024-11 | @Agent | Paper |
| CodeV: Issue Resolving with Visual Data | CodeV | Arxiv | 2024-12 | @Pipeline | Paper Code |
| - | Aide | Blog | 2024-12 | @Agent | Link |
| Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments | Learn-By-Interact | Arxiv | 2025-01 | @Agent | Paper |
| PatchPilot: A Stable and Cost-Efficient Agentic Patching Framework | PatchPilot | ICML 2025 | 2025-02 | @Pipeline | Paper Code |
| CodeMonkeys: Scaling Test-Time Compute for Software Engineering | CodeMonkeys | Arxiv | 2025-02 | @Pipeline | Paper Code |
| SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution | Agentless Mini | ARXIV | 2025-02 | @Pipeline | Paper Code |
| - | Agentless Lite | Blog | 2025-02 | @Pipeline | Code |
| - | Syntheo | Blog | 2025-02 | @Agent | Link |
| SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution | SWE-Fixer | ACL Findings 2025 | 2025-02 | @Pipeline | Paper |
| - | AgentScope | Blog | 2025-03 | @Agent | Link |
| DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal | DARS | Arxiv | 2025-03 | @Agent | Paper Code |
| Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs | KGCompass | Arxiv | 2025-03 | @Pipeline | Paper |
| - | Augment Agent v0 | Blog | 2025-03 | @Agent | Link Code |
| - | CORTEXA | Blog | 2025-03 | @Pipeline | Link |
| - | Refact.ai | Blog | 2025-03 | @Agent | Link Code |
| - | Lingxi | Blog | 2025-04 | @Agent | Link Code |
| - | Trae IDE | Blog | 2025-05 | @Agent | Link |
| - | devlo | Blog | 2025-05 | @Agent | Link |
| Putting It All into Context: Simplifying Agents with LCLMs | LCLM | Arxiv | 2025-05 | @Pipeline | Paper |
| Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks | CGM-SWE-PY | NeurIPS'25 | 2025-05 | @Pipeline | Paper |
| InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction | InfantAgent-Next | Arxiv | 2025-05 | @Agent | Paper Code |
| Coding Agents with Multimodal Browsing are Generalist Problem Solvers | OpenHands-Versa | Arxiv | 2025-06 | @Agent | Paper Code |
| EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair | EXPEREPAIR | Arxiv | 2025-06 | @Agent | Paper |
| Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing | GUIRepair | ASE'25 | 2025-06 | @Pipeline | Paper |
| SemAgent: A Semantics Aware Program Repair Agent | SemAgent | Arxiv | 2025-06 | @Pipeline | Paper |
| Nemotron-Cortexa: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity | Nemotron-Cortexa | ICML'25 | 2025-06 | @Pipeline | Paper Code |
| Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving | Agent KB | Arxiv | 2025-07 | @Agent | Paper Code |
| Prometheus: Unified Knowledge Graphs for Issue Resolution in Multilingual Codebases | Prometheus | Arxiv | 2025-07 | @Agent | Paper Code |
| SWE-Exp: Experience-Driven Software Issue Resolution | SWE-Exp | Arxiv | 2025-07 | @Agent | Paper Code |
| SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution | SWE-Debate | Arxiv | 2025-07 | @Agent | Paper Code |
| Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling | Trae Agent | Arxiv | 2025-07 | @Agent | Paper Code |
| SynFix: Dependency-Aware Program Repair via RelationGraph Analysis | SynFix | ACL Findings'25 | 2025-07 | @Pipeline | Paper |
| SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents | SE-Agent | NeurIPS'25 | 2025-08 | @Agent | Paper Code |
| CoreThink: A Symbolic Reasoning Layer to reason over Long Horizon Tasks with LLMs | CoreThink | Arxiv | 2025-09 | @Agent | Paper |
| Improving the Efficiency of LLM Agent Systems through Trajectory Reduction | AgentDiet | FSE'26 | 2025-09 | @Agent | Paper |
| Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs | Lita | Arxiv | 2025-10 | @Agent | Paper |
| Lingxi: Repository-Level Issue Resolution Framework Enhanced by Procedural Knowledge Guided Scaling | Lingxi | Arxiv | 2025-10 | @Agent | Paper Code |
| SIADAFIX: issue description response for adaptive program repair | SIADAFIX | Arxiv | 2025-10 | @Agent | Paper Code |
| TOM-SWE: User Mental Modeling For Software Engineering Agents | TOM-SWE | Arxiv | 2025-10 | @Agent | Paper Code |
| TDFlow: Agentic Workflows for Test Driven Software Engineering | TDFlow | Arxiv | 2025-10 | @Pipeline | Paper |
| Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly? | Live-SWE-agent | Arxiv | 2025-11 | @Agent | Paper Code |
| InfCode: Adversarial Iterative Refinement of Tests and Patches for Reliable Software Issue Resolution | InfCode | Arxiv | 2025-11 | @Agent | Paper |
| Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale | CCA | Arxiv | 2025-12 | @Agent | Paper |
Single-Phased methods are discussed in 3 separate categories:
@Localization
@Reproduction @Regression
where @Reproduction denotes reproduction test generation and @Regression denotes regression test selection.
| Literature | Name | Journal/Conference | Time | URL |
|---|---|---|---|---|
| BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning | BLAZE | Arxiv | 2024-08 | Paper Code |
| OrcaLoca: An LLM Agent Framework for Software Issue Localization | OrcaLoca | ICML 2025 | 2025-02 | Paper Code |
| Bridging Bug Localization and Issue Fixing: A Hierarchical Localization Framework Leveraging Large Language Models | BugCerberus | Arxiv | 2025-02 | Paper |
| LocAgent: Graph-Guided LLM Agents for Code Localization | LocAgent | ACL 2025 | 2025-03 | Paper Code |
| CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching | CoSIL | ASE 2025 | 2025-03 | Paper Code |
| CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking | CoRNStack | ICLR 2025 | 2025-03 | Paper Code |
| SweRank: Software Issue Localization with Code Ranking | SweRank | Arxiv | 2025-05 | Paper Code |
| CoRet: Improved Retriever for Code Editing | CoRet | Arxiv | 2025-06 | Paper |
| SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization | SACL | Arxiv | 2025-07 | Paper |
| Meta-RAG on Large Codebases Using Code Summarization | Meta-RAG | Arxiv | 2025-08 | Paper |
| Tool-integrated Reinforcement Learning for Repo Deep Search | RepoSearcher | Arxiv | 2025-08 | Paper |
| Improving Code Localization with Repository Memory | RepoMem | Arxiv | 2025-10 | Paper |
| Hierarchical Reward Modeling for Fault Localization in Large Code Repositories | HiLoRM | EMNLP Findings 2026 | 2025-11 | Paper Code |
| SweRank+: Multilingual, Multi-Turn Code Ranking for Software Issue Localization | SweRank+ | Arxiv | 2025-12 | Paper Code |
| One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents | RepoNavigator | Arxiv | 2025-12 | Paper |
| GraphLocator: Graph-guided Causal Reasoning for Issue Localization | GraphLocator | FSE 2026 | 2025-12 | Paper |
@Reproduction
| Literature | Name | Journal/Conference | Time | URL |
|---|---|---|---|---|
| AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions | AEGIS | FSE 2025 Industry | 2024-11 | Paper |
| LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues | EvoCoder | ARXIV | 2024-11 | Paper |
| Agentic Bug Reproduction for Effective Automated Program Repair at Google | BRT Agent | Arxiv | 2025-02 | Paper |
| Otter: Generating Tests from Issues to Validate SWE Patches | Otter | ICML 2025 | 2025-02 | Paper |
| Issue2Test: Generating Reproducing Test Cases from Issue Reports | Issue2Test | Arxiv | 2025-03 | Paper |
| AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests | AssertFlip | Arxiv | 2025-07 | Paper |
| Execution-Feedback Driven Test Generation from SWE Issues | Otter++ | Arxiv | 2025-08 | Paper |
| Automated Generation of Issue-Reproducing Tests by Combining LLMs and Search-Based Testing | BLAST | Arxiv | 2025-09 | Paper Code |
@Regression
| Literature | Name | Journal/Conference | Time | URL |
|---|---|---|---|---|
| When Old Meets New: Evaluating the Impact of Regression Tests on SWE Issue Resolution | TestPrune | Arxiv | 2025-10 | Paper |
From the perspective of Learning Strategy, we discuss the methods in 2 aspects:
@Data
@Training
| Literature | Name | Journal/Conference | Time | URL |
|---|---|---|---|---|
| R2E: Turning any GitHub Repository into a Programming Agent Environment | R2E | ICML 2024 | 2024-07 | Paper Code |
| Training Software Engineering Agents and Verifiers with SWE-Gym | SWE-Gym | ICML 2025 | 2024-12 | Paper Code |
| R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents | R2E-Gym | ARXIV | 2025-04 | Paper Code |
| SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs | SWE-Synth | ARXIV | 2025-04 | Paper Code |
| SWE-smith: Scaling Data for Software Engineering Agents | SWE-smith | NeurIPS 2025 | 2025-04 | Paper Code |
| SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks | SWE-Factory | ARXIV | 2025-06 | Paper Code |
| SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling | SWE-Dev | ACL Findings 2025 | 2025-06 | Paper Code |
| SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development | SWE-Dev | ARXIV | 2025-06 | Paper Code |
| Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs | Skywork-SWE | ARXIV | 2025-06 | Paper |
| SWE-Mirror: Scaling Issue-Resolving Datasets by Mirroring Issues Across Repositories | SWE-Mirror | ARXIV | 2025-09 | Paper |
| SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving | SWE-Lego | ARXIV | 2026-01 | Paper |
Training-based methods can be further classified into 2 categories:
@SFT-Based Method
@RL-Based Method
We only display @RL if the method uses both SFT and RL techniques.
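The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the function name `training_label` and the technique strings `"SFT"` and `"RL"` are assumptions made for this sketch, not part of any tooling used to build this list.

```python
from __future__ import annotations


def training_label(techniques: set[str]) -> str:
    """Return the single label shown in the table for a training-based method.

    Hypothetical helper: `techniques` is assumed to contain the strings
    "SFT" and/or "RL".
    """
    # RL takes precedence: a method that uses both SFT and RL is shown as @RL.
    if "RL" in techniques:
        return "@RL"
    if "SFT" in techniques:
        return "@SFT"
    raise ValueError("expected at least one of 'SFT' or 'RL'")


if __name__ == "__main__":
    print(training_label({"SFT", "RL"}))
    print(training_label({"SFT"}))
```

Under this reading, a row labeled @SFT uses supervised fine-tuning only, while a row labeled @RL may or may not also include an SFT stage.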
| Literature | Name | Evaluation Method | Journal/Conference | Time | Label | URL |
|---|---|---|---|---|---|---|
| Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement | Lingma SWE-GPT | SWESynInfer | FSE 2025 Industry | 2024-11 | @SFT | Paper |
| Repository Structure-Aware Training Makes SLMs Better Issue Resolver | ReSAT | Agentless | ARXIV | 2024-12 | @SFT | Paper |
| SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution | SWE-Fixer | SWE-Fixer | ARXIV | 2025-02 | @SFT | Paper |
| SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution | SWE-RL | Agentless Mini | ARXIV | 2025-02 | @RL | Paper Code |
| SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning | SoRFT | Agentless | ACL 2025 | 2025-02 | @RL | Paper |
| SEAlign: Alignment Training for Software Engineering Agent | SEAlign | OpenHands | ARXIV | 2025-03 | @SFT | Paper |
| Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute | SWE-Reasoner | SWE-SynInfer+ | ARXIV | 2025-04 | @RL | Paper Code |
| Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models | Co-PatcheR | PatchPilot | ARXIV | 2025-05 | @SFT | Paper Code |
| Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering | EvoScale | Satori-SWE | ARXIV | 2025-05 | @RL | Paper Code |
| Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards | Agent-RLVR | Agentless | ARXIV | 2025-06 | @RL | Paper |
| MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution | MCTS-Refined | Agentless-1.0 | ASE 2025 | 2025-06 | @SFT | Paper |
| - | DeepSWE | - | Blog | 2025-07 | @RL | Link |
| - | SWE-Swiss | - | Blog | 2025-08 | @RL | Link Code |
| RepoForge: Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale | RepoForge | OpenHands | ARXIV | 2025-08 | @RL | Paper |
| Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning | - | - | ARXIV | 2025-08 | @RL | Paper |
| Devstral: Fine-tuning Language Models for Coding Agent Applications | Devstral-Small | OpenHands | ARXIV | 2025-08 | @RL | Paper |
| When Agents go Astray: Course-Correcting SWE Agents with PRMs | SWE-PRM | - | ARXIV | 2025-09 | @RL | Paper |
| Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents | Kimi-Dev | Kimi-Dev | ARXIV | 2025-09 | @RL | Paper |
| CWM: An Open-Weights LLM for Research on Code Generation with World Models | CWM | CWM | ARXIV | 2025-09 | @RL | Paper |
| Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization | EntroPO | R2E | ARXIV | 2025-09 | @SFT | Paper Code |
| KAT-Coder Technical Report | KAT-Coder | Claude Code | ARXIV | 2025-09 | @RL | Paper |
| BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills | BugPilot | R2E | ARXIV | 2025-10 | @RL | Paper |
| Think-Search-Patch: A Retrieval-Augmented Reasoning Framework for Repository-Level Code Repair | TSP | TSP | EMNLP 2025 | 2025-11 | @SFT | Paper Code |
| Training Versatile Coding Agents in Synthetic Environments | SWE-Playground | OpenHands | ARXIV | 2025-12 | @SFT | Paper Code |
| Toward Training Superintelligent Software Agents through Self-Play SWE-RL | Self-Play SWE-RL | bash+editor | ARXIV | 2025-12 | @RL | Paper |
| Context as a Tool: Context Management for Long-Horizon SWE-Agents | CAT/SWE-Compressor | OpenHands | Arxiv | 2025-12 | @SFT | Paper |
| SWE-RM: Execution-free Feedback For Software Engineering Agents | SWE-RM | OpenHands | Arxiv | 2025-12 | @RL | Paper |
For Empirical Studies, we list the related papers below:
| Literature | Journal/Conference | Time | URL |
|---|---|---|---|
| Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents | ICLR 2025 | 2024-08 | Paper Code |
| Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios | SANER | 2024-10 | Paper |
| An Empirical Study on LLM-based Agents for Automated Bug Fixing | ARXIV | 2024-11 | Paper |
| Large Language Model Critics for Execution-Free Evaluation of Code Changes | ARXIV | 2025-01 | Paper |
| Interactive Agents to Overcome Ambiguity in Software Engineering | ARXIV | 2025-02 | Paper |
| Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution | ARXIV | 2025-03 | Paper |
| Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study | ICSE 2026 | 2025-03 | Paper |
| SWE-Bench-CL: Continual Learning for Coding Agents | ARXIV | 2025-06 | Paper |
| The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason | ARXIV | 2025-06 | Paper |
| Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems | ARXIV | 2025-06 | Paper |
| PAGENT: Learning to Patch Software Engineering Agents | ARXIV | 2025-06 | Paper |
| Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories | ARXIV | 2025-06 | Paper |
| Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench | ARXIV | 2025-06 | Paper |
| SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints | ARXIV | 2025-09 | Paper |
| An Empirical Study on Failures in Automated Issue Solving | ARXIV | 2025-09 | Paper |
| Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation | ARXIV | 2025-10 | Paper |
| More with Less: An Empirical Study of Turn-Control Strategies for Efficient Coding Agents | ICSE 2026 | 2025-10 | Paper |
| Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories | ARXIV | 2025-10 | Paper |
| SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents | ARXIV | 2025-11 | Paper |
| Process-Centric Analysis of Agentic Software Systems | ARXIV | 2025-12 | Paper |
| SWEnergy: An Empirical Study on Energy Efficiency in Agentic Issue Resolution Frameworks with SLMs | ARXIV | 2025-12 | Paper |
| Does SWE-Bench-Verified Test Agent Ability or Model Memory? | ARXIV | 2025-12 | Paper |