Measuring AI Ability to Complete Long Software Tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan

March 2025

Abstract

We introduce a metric, the 50%-task-completion time horizon, that measures the ability of an AI software engineering agent relative to humans by the difficulty of tasks it can complete. We measure difficulty in terms of how long a task takes human professionals, and define the time horizon as the task duration at which the AI agent is expected to complete 50% of tasks. We create a benchmark of real-world, professional tasks and measure the time horizon of several frontier models. We find that, as of February 2025, the longest time horizon is approximately 50 minutes, achieved by Claude 3.7 Sonnet. We also estimate that the frontier AI time horizon has been doubling approximately every seven months since 2019. Our analysis indicates the primary drivers of improvement have been enhanced reliability, error adaptation, logical reasoning, and tool use.
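As a rough illustration of the definition above, the sketch below estimates a 50% time horizon from per-task outcomes and then projects it forward under the reported seven-month doubling time. It is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify a fitting procedure, so the logistic fit on log2 of human task duration, the function names `fit_time_horizon` and `extrapolate_horizon`, and the toy outcome data are all hypothetical. The projection also assumes that the trend simply continues.

```python
import math


def fit_time_horizon(durations_min, successes, lr=0.1, steps=5000):
    """Estimate the task duration at which predicted success is 50%.

    Assumption (not stated in the abstract): success probability is modeled
    as sigmoid(a + b * (log2(duration) - mean_log2_duration)), fitted by
    plain gradient descent on the mean log-loss.
    """
    xs = [math.log2(t) for t in durations_min]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]  # centering keeps the fit well conditioned
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (p - y) / n
            grad_b += (p - y) * x / n
        a -= lr * grad_a
        b -= lr * grad_b
    if b >= 0:
        raise ValueError("success does not decrease with task length; no horizon")
    # p = 0.5 where a + b * (x - mean_x) = 0, i.e. x = mean_x - a / b
    return 2.0 ** (mean_x - a / b)


def extrapolate_horizon(horizon_min, months_ahead, doubling_months=7.0):
    """Project a horizon forward, assuming a constant doubling time."""
    return horizon_min * 2.0 ** (months_ahead / doubling_months)


# Hypothetical toy data: human completion time in minutes, and whether the
# agent succeeded (1) or failed (0) on each task.
toy_durations = [2, 4, 8, 16, 16, 32, 32, 64, 64, 128, 256, 512]
toy_successes = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

horizon = fit_time_horizon(toy_durations, toy_successes)
projected = extrapolate_horizon(50.0, 14.0)  # two doublings of a 50-minute horizon
```

On this toy data the fitted horizon lands near 32 minutes, because the outcomes were constructed to be symmetric around that duration. The projection is plain arithmetic: 14 months at a seven-month doubling time is two doublings, so 50 minutes becomes 200 minutes.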

Type

Conference paper

Publication

NeurIPS 2025

AI Evaluations
