We formally study which sparse circuits neural networks can represent and compute in superposition – that is, using sparse features represented near-orthogonally in a high-dimensional space. We show not only that ReLU MLPs can perform more computation at a fixed width by placing features in superposition, but also that they can maintain superposition through multiple layers.
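As a small numerical illustration of the representational setup (our own sketch, with arbitrary choices of width, feature count, and sparsity, not the constructions from the paper): random unit vectors in a high-dimensional space are nearly orthogonal, so a sparse set of active features can be summed into one hidden vector and read back out with only small interference.

```python
# Illustrative sketch: m sparse features stored in d < m dimensions via random
# near-orthogonal directions, read out with a dot product. The values of d, m,
# and k below are arbitrary choices, not settings from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 256, 2048, 4          # hidden width, number of features, active features

# Random unit vectors in R^d are nearly orthogonal when d is large.
W = rng.standard_normal((m, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: k features active with value 1, superposed into one vector.
active = rng.choice(m, size=k, replace=False)
x = W[active].sum(axis=0)       # superposed representation in R^d

# Linear readout of every feature: active features stand well above the
# interference contributed by the other active features.
readout = W @ x
print("active feature readouts:", np.sort(readout[active]))
print("max inactive readout:   ", readout[np.setdiff1d(np.arange(m), active)].max())
```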
We use mechanistic interpretability to generate compact formal guarantees on model performance on a toy Max-of-K task. We find that shorter proofs seem to require more mechanistic understanding, and that more faithful mechanistic understanding leads to tighter performance bounds. We identify compounding structureless noise as a key challenge for the proofs approach.
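To give a feel for why structureless noise is a problem, here is a minimal sketch (our own illustration with made-up per-layer error and Lipschitz values, not the paper's bounds): if the only thing a proof knows about a layer's error is a worst-case magnitude, that error must be pushed through each subsequent layer's Lipschitz constant, so the certified bound compounds with depth.

```python
# Illustrative sketch, not from the paper: when per-layer error is treated as
# structureless, a worst-case bound is propagated through each layer's
# Lipschitz constant and compounds multiplicatively with depth.
def compounded_error_bound(eps_per_layer, lipschitz, n_layers):
    """Worst-case output error when each layer adds eps and scales prior error."""
    bound = 0.0
    for _ in range(n_layers):
        bound = lipschitz * bound + eps_per_layer
    return bound

for n_layers in (1, 2, 4, 8):
    print(n_layers, "layers ->",
          compounded_error_bound(eps_per_layer=0.01, lipschitz=2.0, n_layers=n_layers))
```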
We study the limits of adversarial training on an injury classification task using a variety of attacks, including a new saliency map–based snippet rewrite tool that assists our human adversaries. While adversarial training did not eliminate all in-distribution failures, it did increase robustness to the adversarial attacks we trained against.
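For intuition about the saliency component, below is a hypothetical sketch of gradient-based token saliency on a stand-in classifier; the tiny per-token scorer, the made-up token ids, and the pooling are our own illustrative choices, not the actual rewrite tool.

```python
# Hypothetical sketch of gradient-based token saliency for flagging which
# snippet tokens a human adversary might rewrite; the model and data here are
# stand-ins, not the tool described above.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim = 100, 16

embed = nn.Embedding(vocab_size, embed_dim)
scorer = nn.Sequential(nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, 1))

tokens = torch.tensor([[3, 17, 42, 7, 99]])   # one tokenized snippet (made up)
emb = embed(tokens)                           # (1, seq, dim)
emb.retain_grad()                             # keep gradients on this non-leaf tensor

logit = scorer(emb).mean(dim=1)               # per-token scores pooled into an "injury" logit
logit.sum().backward()

# Saliency: gradient magnitude of the injury logit w.r.t. each token's embedding;
# high-saliency positions are candidates for the human to rewrite.
saliency = emb.grad.norm(dim=-1).squeeze(0)
print("per-token saliency:", [round(s, 4) for s in saliency.tolist()])
print("most influential position:", int(saliency.argmax()))
```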