Mechanistic Interpretability

Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

We present the first case study in rigorously compressing nonlinear feature-maps, discovering that the MLP in modular addition models can be understood as evaluating a quadrature scheme.

Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross

Causal Scrubbing: a method for rigorously testing interpretability hypotheses

We introduce a more principled approach for evaluating the quality of mechanistic interpretations via behavior-preserving resampling ablations: hypotheses are converted into classes of activations inside a neural network that can be resampled without affecting behavior.

Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas