Measuring AI Ability to Complete Long Software Tasks

Abstract

We introduce a metric, the 50%-task-completion time horizon, that measures the ability of an AI software engineering agent relative to humans by the difficulty of tasks it can complete. We measure difficulty in terms of how long a task takes human professionals, and define the time horizon as the task duration at which the AI agent is expected to complete 50% of tasks. We create a benchmark of real-world, professional tasks and measure the time horizon of several frontier models. We find that, as of February 2025, the longest time horizon is approximately 50 minutes, achieved by Claude 3.7 Sonnet. We also estimate that the frontier AI time horizon has been doubling approximately every seven months since 2019. Our analysis indicates the primary drivers of improvement have been enhanced reliability, error adaptation, logical reasoning, and tool use.
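As a rough illustration of the definition above, the sketch below assumes that success probability follows a logistic curve in the base-2 logarithm of human task duration. The abstract does not state this functional form, and the names `success_probability`, `time_horizon_50`, and `extrapolate_horizon`, along with the coefficient values, are hypothetical. Under that assumption, the 50% time horizon is the duration at which the fitted curve crosses 0.5. The doubling-time arithmetic is included as well.

```python
import math


def success_probability(task_minutes, intercept, slope):
    """Assumed model: logistic curve in log2 of human task duration (minutes)."""
    z = intercept + slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon_50(intercept, slope):
    """Duration (minutes) at which the assumed curve predicts 50% success.

    The curve crosses 0.5 where intercept + slope * log2(t) == 0.
    """
    if slope >= 0:
        raise ValueError("slope must be negative: longer tasks should be harder")
    return 2.0 ** (-intercept / slope)


def extrapolate_horizon(horizon_minutes, months_elapsed, doubling_months=7.0):
    """Project a horizon forward, assuming a constant doubling period."""
    return horizon_minutes * 2.0 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # Hypothetical fitted coefficients, chosen only for illustration.
    horizon = time_horizon_50(intercept=5.0, slope=-1.0)
    print(horizon)                                   # 32.0
    print(success_probability(horizon, 5.0, -1.0))   # 0.5
    # A 50-minute horizon doubling every 7 months, projected 14 months ahead.
    print(extrapolate_horizon(50.0, 14.0))           # 200.0
```

The projection is arithmetic only: it shows what a seven-month doubling period implies if the trend were to continue, which the abstract does not claim.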

Publication
NeurIPS 2025
Lawrence Chan
PhD Candidate

I do AI Alignment research.