HCAST: Human-Calibrated Autonomy Software Tasks
We present HCAST, a benchmark of 189 tasks with over 1,500 hours of human baselines. We find that current AI agents succeed 70-80% of the time on tasks that take humans less than 1 hour, but less than 20% of the time on tasks that take humans more than 4 hours.
Benefits of assistance over reward learning
We illustrate the benefits of agents that try to assist humans over agents that learn a reward during training and then maximize that reward after deployment.