Lawrence Chan
Publications
Mechanistic Interpretability
Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration
We present the first case study in rigorously compressing nonlinear feature-maps, discovering that the MLP in modular addition models can be understood as evaluating a quadrature scheme.
Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross
arXiv
Causal Scrubbing: a method for rigorously testing interpretability hypotheses
We introduce a more principled approach for evaluating the quality of mechanistic interpretations via behavior-preserving resampling ablations: converting hypotheses into classes of activations inside a neural network that can be resampled without affecting behavior.
Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas
Alignment Forum