AI Evaluations

Measuring AI Ability to Complete Long Software Tasks
We introduce a task-completion time horizon metric to benchmark frontier AI on software engineering tasks, finding that AI time horizons have been doubling approximately every seven months since 2019.
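As a rough illustration of what a seven-month doubling time implies, the sketch below projects a time horizon forward under a constant doubling time. The function names and the 60-minute starting horizon are hypothetical and chosen only for the arithmetic; they are not taken from the paper, and the model assumes the trend continues unchanged.

```python
import math

# Approximate doubling time quoted in the summary above (months).
DOUBLING_MONTHS = 7.0


def projected_horizon(h0_minutes: float, months_elapsed: float,
                      doubling_months: float = DOUBLING_MONTHS) -> float:
    """Horizon after `months_elapsed`, assuming a constant doubling time."""
    return h0_minutes * 2.0 ** (months_elapsed / doubling_months)


def months_to_reach(h0_minutes: float, target_minutes: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed to grow from `h0_minutes` to `target_minutes`."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


if __name__ == "__main__":
    start = 60.0  # hypothetical starting horizon in minutes, for illustration only
    print(projected_horizon(start, 14))      # two doublings: 4x the start
    print(months_to_reach(start, 8 * start)) # three doublings
    print(2.0 ** (12 / DOUBLING_MONTHS))     # implied growth factor per year
```

Under these assumptions, a seven-month doubling time corresponds to a growth factor of roughly 3.3 per year.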
HCAST: Human-Calibrated Autonomy Software Tasks
We present HCAST, a benchmark of 189 tasks with over 1,500 hours of human baselines, finding that current AI agents succeed 70-80% of the time on tasks that take humans less than one hour but less than 20% of the time on tasks that take more than four hours.