Causal Scrubbing: a method for rigorously testing interpretability hypotheses


We introduce causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key insight behind this work is that a mechanistic interpretability hypothesis can be thought of as specifying which activations inside a neural network can be resampled without affecting behavior. Accordingly, causal scrubbing tests interpretability hypotheses via behavior-preserving resampling ablations: it converts a hypothesis into a distribution over activations that should preserve behavior, and checks whether behavior is actually preserved. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies whether a sequence of parentheses is balanced.
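To make the idea concrete, here is a minimal toy sketch (not the authors' implementation) of a resampling ablation. It assumes a hypothetical "model" whose output depends on an intermediate activation, and a hypothesis that inputs with the same parity of ones are interchangeable at that activation. Causal scrubbing then resamples the activation from a different input the hypothesis deems equivalent and measures how often the model's behavior is preserved:

```python
import random

# Hypothetical toy model: the intermediate "activation" is the parity of
# the input bits, and the output depends only on that activation.
def intermediate(x):
    return sum(x) % 2

def downstream(act):
    return "even" if act == 0 else "odd"

def model(x):
    return downstream(intermediate(x))

# Resampling-ablation check: replace the activation on input x with the
# activation from another input y that the hypothesis claims is
# equivalent, then see whether the model's output is unchanged.
def scrubbed_agreement(inputs, trials=200, seed=0):
    rng = random.Random(seed)
    preserved = 0
    for _ in range(trials):
        x = rng.choice(inputs)
        # Inputs the hypothesis treats as interchangeable with x.
        equivalents = [y for y in inputs if intermediate(y) == intermediate(x)]
        y = rng.choice(equivalents)
        # Patch in the resampled activation and run the rest of the model.
        scrubbed_output = downstream(intermediate(y))
        preserved += scrubbed_output == model(x)
    return preserved / trials

inputs = [[random.Random(i).randint(0, 1) for _ in range(6)] for i in range(50)]
print(scrubbed_agreement(inputs))  # 1.0: the hypothesis is faithful here
```

Because the toy hypothesis is exactly true (parity fully determines the output), the scrubbed model agrees with the original on every trial; a faithful-but-incomplete hypothesis about a real network would instead recover only part of the behavior.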

AI Alignment Forum
Lawrence Chan