
Evaluating Agent-Based Program Repair Using Modern Language Models

Evaluating Agent-based Program Repair at Google

Agent-based program repair offers to automatically resolve complex bugs end-to-end by combining the planning, tool use, and code generation abilities of modern LLMs. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench, a collection of bugs from highly rated GitHub Python projects. In addition, various agentic approaches such as SWE-Agent have been proposed to solve bugs in this benchmark. This paper explores the viability of using an agentic approach to address bugs in an enterprise context. To investigate this, we curate an evaluation set of 178 bugs drawn from Google's issue tracking system. This dataset spans both human-reported (78) and machine-reported (100) bugs. To establish a repair performance baseline on this benchmark, we implement Passerine, an agent similar in spirit to SWE-Agent that can work within Google's development environment. We show that with 20 trajectory samples and Gemini 1.5 Pro, Passerine can produce a patch that passes bug tests (i.e., is plausible) for 73% of machine-reported and 25.6% of human-reported bugs in our evaluation set. After manual examination, we found that 43% of machine-reported bugs and 17.9% of human-reported bugs have at least one patch that is semantically equivalent to the ground-truth patch. These results establish a baseline on an industrially relevant benchmark, which, as we show, contains bugs drawn from a different distribution -- in terms of language diversity, size, and spread of changes -- compared to those in the popular SWE-Bench dataset.

This study evaluates how effectively agent-based program repair, powered by modern large language models (LLMs), can automatically fix complex bugs. It uses a dataset of 178 bugs from Google's issue tracking system, covering both human-reported and machine-reported issues. An agent called Passerine, built to work within Google's development environment, produced plausible (test-passing) patches for 73% of machine-reported bugs and 25.6% of human-reported ones. Manual review found that 43% of machine-reported and 17.9% of human-reported bugs had at least one patch semantically equivalent to the ground-truth fix. The work establishes a performance baseline for agent-based repair methods in an industrial context and shows that these bugs differ in character from those in the SWE-Bench dataset.

What is the main focus of the research?

The research focuses on evaluating the effectiveness of agent-based program repair methods in fixing bugs within Google's enterprise environment.

What were the success rates for bug repairs?

With 20 trajectory samples and Gemini 1.5 Pro, Passerine produced a plausible (test-passing) patch for 73% of machine-reported bugs and 25.6% of human-reported bugs; after manual review, 43% and 17.9% of those bugs, respectively, had at least one patch semantically equivalent to the ground-truth fix.
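
As a quick sanity check, these percentages can be converted to approximate absolute counts using the evaluation-set sizes reported in the abstract (100 machine-reported and 78 human-reported bugs); the figures below are a back-of-envelope reading, not counts taken directly from the paper.

```latex
% Approximate counts implied by the reported rates, assuming the
% 100 machine-reported / 78 human-reported split given in the abstract.
\[
0.73 \times 100 = 73 \quad \text{machine-reported bugs with a plausible patch}
\]
\[
0.256 \times 78 \approx 20 \quad \text{human-reported bugs with a plausible patch}
\]
\[
0.43 \times 100 = 43, \qquad 0.179 \times 78 \approx 14 \quad \text{bugs with a semantically equivalent patch}
\]
```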

How does this study compare to the SWE-Bench dataset?

This study highlights that the bugs in the Google dataset are drawn from a different distribution than those in SWE-Bench, differing in language diversity, size, and the spread of changes required for fixes.
