Lawrence Chan
Mechanistic Interpretability
Causal Scrubbing: a method for rigorously testing interpretability hypotheses
We introduce a more principled approach for evaluating the quality of mechanistic interpretations via behavior-preserving resampling ablations: a hypothesis is converted into classes of activations inside a neural network that can be resampled without affecting behavior.
Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas
Alignment Forum
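The resampling idea in the summary above can be illustrated with a toy sketch. This is not the authors' implementation and it leaves out most of the full method; it only shows the core check on a hand-written two-activation "model". Every name here (`hidden`, `output`, `scrubbed_output`, `mean_abs_error`, the `good` and `bad` hypotheses) is a hypothetical stand-in introduced for illustration. A hypothesis is assumed to assign each activation an equivalence class of inputs; an activation may then be replaced by its value on any other input in the same class, and a correct hypothesis should leave the output unchanged.

```python
import random

def hidden(x):
    """Toy 'network' internals (assumed for illustration): two activations."""
    a, b = x
    return {"h1": a * 2, "h2": b % 3}

def output(acts):
    """Toy readout that combines the two activations."""
    return acts["h1"] + acts["h2"]

def scrubbed_output(x, dataset, equiv, rng):
    """Recompute the output with every activation resampled from a donor input.

    `equiv` maps an activation name to a key function. Two inputs with the
    same key are claimed by the hypothesis to be interchangeable for that
    activation.
    """
    acts = {}
    for name, key in equiv.items():
        # Donors are all inputs the hypothesis treats as equivalent to x.
        candidates = [d for d in dataset if key(d) == key(x)]
        donor = rng.choice(candidates)
        acts[name] = hidden(donor)[name]
    return output(acts)

def mean_abs_error(dataset, equiv, seed=0):
    """Average change in output caused by the resampling ablation."""
    rng = random.Random(seed)
    total = 0.0
    for x in dataset:
        total += abs(output(hidden(x)) - scrubbed_output(x, dataset, equiv, rng))
    return total / len(dataset)

dataset = [(a, b) for a in range(4) for b in range(6)]

# Hypothesis that matches the toy model: h1 depends on a, h2 on b mod 3.
good = {"h1": lambda x: x[0], "h2": lambda x: x[1] % 3}
# Overly strong hypothesis: claims h2 carries no information about the input.
bad = {"h1": lambda x: x[0], "h2": lambda x: 0}

good_err = mean_abs_error(dataset, good)
bad_err = mean_abs_error(dataset, bad)
print(good_err, bad_err)
```

Under these assumptions the `good` hypothesis preserves behavior exactly, because each donor agrees with the original input on everything the activation depends on, while the `bad` hypothesis allows resampling `h2` across inputs where it differs and so changes the output.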