As of January 2023, I’m working at the Alignment Research Center doing evaluations of large language models. Previously, I was at Redwood Research, where I worked on adversarial training and neural network interpretability.
I’m also doing a PhD at UC Berkeley advised by Anca Dragan and Stuart Russell. Before that, I received a BAS in Computer Science and Logic and a BS in Economics from the University of Pennsylvania’s M&T Program, where I was fortunate to work with Philip Tetlock on using ML for forecasting.
My main research interests are mechanistic interpretability and scalable oversight. In the past, I’ve also done conceptual work on learning human values.
I also sometimes blog about AI alignment and other topics on LessWrong/the AI Alignment Forum.
We reverse engineer small transformers trained on group composition and use our understanding to explore the universality hypothesis.
We fully reverse engineer 32 small transformers trained on modular addition and use this understanding to study grokking.
We compare humans to small language models on next-token prediction tasks, and find that even relatively small language models consistently outperform humans.
We argue that AGIs trained in ways similar to today’s most capable models could learn to act deceptively to receive higher reward; learn internally represented goals that generalize beyond their training distributions; and pursue those goals using power-seeking strategies.
We introduce a more principled approach for evaluating the quality of mechanistic interpretations via behavior-preserving resampling ablations—converting hypotheses into claims about which classes of activations inside a neural network can be resampled without affecting behavior.