Evaluating Agent-based Program Repair at Google

This study evaluates how effectively agent-based program repair can automatically fix complex bugs using modern large language models (LLMs). It centers on a dataset of 178 bugs drawn from Google's internal issue tracking system, covering both human-reported and machine-reported issues. The authors built an agent, Passerine, that operates inside Google's development environment; it produced plausible patches (patches that pass the bug's tests) for 73% of machine-reported bugs and 25.6% of human-reported ones. A manual review further found that a substantial portion of these patches were semantically equivalent to the ground-truth fixes. The work establishes a performance baseline for agent-based repair in an industrial setting and highlights how these bugs differ in character from those in the SWE-Bench dataset.
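The paper does not publish Passerine's implementation, so the following is only a minimal sketch of the generic observe-act loop that agent-based repair systems of this kind run. The helper names (`llm_propose_action`, `execute_tool`) are hypothetical stand-ins for the Google-internal model endpoint and tooling:

```python
# Minimal sketch of an agentic repair loop; NOT Passerine's actual code.
# All helper names below are hypothetical stand-ins for tooling the
# paper describes only at a high level.
from dataclasses import dataclass, field


@dataclass
class RepairState:
    bug_report: str                 # issue text pulled from the tracker
    history: list = field(default_factory=list)  # (action, observation) pairs


def llm_propose_action(state: RepairState) -> dict:
    """Hypothetical: ask an LLM for the next step given the trajectory so far."""
    raise NotImplementedError("wire up a model endpoint here")


def execute_tool(action: dict) -> str:
    """Hypothetical: run a tool (search/read files, edit code, run tests)."""
    raise NotImplementedError("wire up codebase and test tooling here")


def repair(bug_report: str, max_steps: int = 30) -> bool:
    """Loop propose -> execute -> observe until tests pass or budget runs out."""
    state = RepairState(bug_report=bug_report)
    for _ in range(max_steps):
        action = llm_propose_action(state)
        observation = execute_tool(action)
        state.history.append((action, observation))
        # A passing patch is only "plausible"; semantic correctness
        # still requires human review, as in the study's manual analysis.
        if action.get("kind") == "run_tests" and "PASS" in observation:
            return True
    return False  # step budget exhausted without a passing patch
```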
What is the main focus of the research?
The research evaluates how effectively LLM-driven, agent-based program repair can fix bugs in Google's enterprise development environment, establishing a performance baseline for that setting.
What were the success rates for bug repairs?
Passerine produced plausible patches, i.e., patches that pass the tests associated with the bug, for 73% of machine-reported bugs and 25.6% of human-reported bugs.
How does this study compare to the SWE-Bench dataset?
The bugs in the Google dataset differ from those in SWE-Bench in language diversity (SWE-Bench is Python-only, while the Google dataset spans multiple languages), in the size of the changes required, and in the kinds of fixes involved, so results on one benchmark do not directly transfer to the other.