Lawrence Chan
Publications
Progress measures for grokking via mechanistic interpretability
We fully reverse engineer 32 small transformers trained on modular addition and use this understanding to study grokking.
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
PDF
Cite
arXiv
Adversarial Training for High-Stakes Reliability
We study the limits of adversarial training on an injury classification task using a variety of attacks, including a new saliency map–based snippet rewrite tool to assist our human adversaries. While adversarial training was not able to eliminate all in-distribution failures, it did increase robustness to the adversarial attacks that we trained on.
Daniel M Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, and others
PDF
Cite
Slides
arXiv
Optimal Cost Design for Model Predictive Control
Avik Jain, Lawrence Chan, Daniel S Brown, Anca D Dragan
Cite
The assistive multi-armed bandit
We study the problem of learning to assist a human who is themselves learning about their preferences.
Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan
PDF
Cite
Code
arXiv