We argue that AGIs trained in ways similar to today's most capable models could learn to act deceptively to receive higher reward; learn internally-represented goals which generalize beyond their training distributions; and pursue those goals using power-seeking strategies.
We introduce a more principled approach for evaluating the quality of mechanistic interpretations via behavior-preserving resampling ablations: a hypothesis is converted into classes of activations inside a neural network that can be resampled without affecting the network's behavior.
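As a rough illustration of the core idea, the sketch below patches one layer's activation with an activation recorded on a different input and checks whether the model's output is preserved; if the hypothesis claims the two inputs are equivalent with respect to that activation, a small output gap supports it. The model, input shapes, and names (`ToyMLP`, `resample_ablation`) are hypothetical, not from any released codebase.

```python
import torch
import torch.nn as nn

class ToyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(8, 16)
        self.out = nn.Linear(16, 2)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

def resample_ablation(model, x, x_resample, layer):
    """Run `model` on `x`, but overwrite `layer`'s activation with the
    activation produced by `x_resample` (an input the hypothesis claims
    is equivalent). If the hypothesis is right, the outputs match."""
    cache = {}
    def save(_, __, output):
        cache["act"] = output
    def patch(_, __, output):
        return cache["act"]

    handle = layer.register_forward_hook(save)
    model(x_resample)          # record the replacement activation
    handle.remove()

    handle = layer.register_forward_hook(patch)
    patched_out = model(x)     # run on x with the resampled activation
    handle.remove()
    return patched_out

model = ToyMLP()
x, x_resample = torch.randn(1, 8), torch.randn(1, 8)
original = model(x)
patched = resample_ablation(model, x, x_resample, model.hidden)
# Behavior-preservation check: a small gap is evidence for the hypothesis.
print((original - patched).abs().max())
```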
We study the limits of adversarial training on an injury classification task using a variety of attacks, including a new saliency map–based snippet rewrite tool that assists our human adversaries. While adversarial training did not eliminate all in-distribution failures, it did increase robustness to the adversarial attacks we trained on.
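The sketch below shows the general shape of such an adversarial training loop, substituting a gradient-based FGSM attack on continuous features for the human- and tool-generated text attacks used in our work; the model, data, and hyperparameters are all illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fgsm(x, y, eps=0.1):
    """Perturb x in the direction that most increases the current loss."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

for step in range(100):
    x = torch.randn(64, 10)
    y = (x.sum(dim=1) > 0).long()   # toy labeling rule
    x_adv = fgsm(x, y)              # attack the current model
    # Train on clean and adversarial examples together.
    loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```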
We study the effect of human irrationality on reward inference. We find that irrationality can either help or hurt reward inference depending on the type of irrationality, and that its effect is not monotonic in the degree of irrationality.
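As a toy illustration of how the degree of (ir)rationality interacts with reward inference, the sketch below infers a reward in a two-armed bandit from a Boltzmann-rational demonstrator whose rationality is controlled by `beta`; a near-random demonstrator (low `beta`) yields nearly uninformative demonstrations. This setup is purely illustrative and is not the model studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([1.0, 0.0])      # arm 0 is better

def demo_policy(reward, beta):
    """Boltzmann-rational action distribution; higher beta = more rational."""
    logits = beta * reward
    p = np.exp(logits - logits.max())
    return p / p.sum()

def infer_reward(actions, beta, candidates):
    """Return the candidate reward vector maximizing the likelihood of the
    demonstrations, assuming a known rationality level beta."""
    def loglik(r):
        p = demo_policy(r, beta)
        return np.log(p[actions]).sum()
    return max(candidates, key=loglik)

candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for beta in [0.1, 1.0, 10.0]:
    actions = rng.choice(2, size=50, p=demo_policy(true_reward, beta))
    print(beta, infer_reward(actions, beta, candidates))
```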