Measuring AI Ability to Complete Long Software Tasks

We introduce a task-completion time horizon metric to benchmark frontier AI on software engineering tasks, finding that AI time horizons have been doubling approximately every seven months since 2019.

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan

Mathematical Models of Computation in Superposition

We formally study what sparse circuits neural networks can represent and compute in superposition – that is, using sparse features represented near-orthogonally in a high-dimensional space. We show that not only can ReLU MLPs perform more computation at a fixed width by placing features in superposition, but also that they can maintain superposition through multiple layers.

Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan

Compact Proofs of Model Performance via Mechanistic Interpretability

We use mechanistic interpretability to generate compact formal guarantees on model performance on a toy Max-of-K task. We find that shorter proofs seem to require more mechanistic understanding, and that more faithful mechanistic understanding leads to tighter performance bounds. We identify compounding structureless noise as a key challenge for the proofs approach.

Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

We reverse engineer small transformers trained on group composition and use our understanding to explore the universality hypothesis.

Bilal Chughtai, Lawrence Chan, Neel Nanda

Progress measures for grokking via mechanistic interpretability

We reverse engineer 32 small transformers trained on modular addition and use this understanding to study grokking.

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

The alignment problem from a deep learning perspective

We argue that AGIs trained in similar ways as today’s most capable models could learn to act deceptively to receive higher reward; learn internally-represented goals which generalize beyond their training distributions; and pursue those goals using power-seeking strategies.

Richard Ngo, Lawrence Chan, Sören Mindermann

Adversarial Training for High-Stakes Reliability

We study the limits of adversarial training on an injury classification task using a variety of attacks, including a new saliency map–based snippet rewrite tool to assist our human adversaries. While adversarial training was not able to eliminate all in-distribution failures, it did increase robustness to the adversarial attacks that we trained on.

Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas

Optimal Cost Design for Model Predictive Control

Many robotics algorithms use model predictive control (MPC) for planning, which optimizes a cost function over a finite time horizon to …

Avik Jain, Lawrence Chan, Daniel S Brown, Anca D Dragan

The assistive multi-armed bandit

We study the problem of learning to assist a human who is themselves learning about their preferences.

Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan