AI Evaluations

Measuring AI Ability to Complete Long Software Tasks

We introduce a task-completion time horizon metric to benchmark frontier AI on software engineering tasks, finding that AI time horizons have been doubling approximately every seven months since 2019.

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan

HCAST: Human-Calibrated Autonomy Software Tasks

We present HCAST, a benchmark of 189 tasks with over 1500 hours of human baselines, finding that current AI agents succeed 70-80% of the time on tasks taking humans less than one hour but less than 20% of the time on tasks taking more than 4 hours.

David Rein, Joel Becker, Amy Deng, Seraphina Nix, Chris Canal, Daniel O'Connel, Pip Arnott, Ryan Bloom, Thomas Broadley, Katharyn Garcia, Brian Goodrich, Max Hasin, Sami Jawhar, Megan Kinniment, Thomas Kwa, Aron Lajko, Nate Rush, Lucas Jun Koba Sato, Sydney Von Arx, Ben West, Lawrence Chan, Elizabeth Barnes