As of January 2023, I’m working at the Alignment Research Center doing evaluations of large language models. Previously, I was at Redwood Research, where I worked on adversarial training and neural network interpretability.
I’m also doing a PhD at UC Berkeley advised by Anca Dragan and Stuart Russell. Before that, I received a BAS in Computer Science and Logic and a BS in Economics from the University of Pennsylvania’s M&T Program, where I was fortunate to work with Philip Tetlock on using ML for forecasting.
My main research interests are mechanistic interpretability and scalable oversight. In the past, I’ve also done conceptual work on learning human values.
I also sometimes blog about AI alignment and other topics on LessWrong/the AI Alignment Forum.
We reverse engineer small transformers trained on group composition and use our understanding to explore the universality hypothesis.
We fully reverse engineer 32 small transformers trained on modular addition and use this understanding to study grokking.
We compare humans to small language models on next-token prediction tasks, and find that even relatively small language models consistently outperform humans.
We argue that AGIs trained in ways similar to today’s most capable models could learn to act deceptively to receive higher reward; learn internally represented goals that generalize beyond their training distributions; and pursue those goals using power-seeking strategies.
We introduce a more principled approach for evaluating the quality of mechanistic interpretations via behavior-preserving resampling ablations—converting hypotheses into claims about which classes of activations inside a neural network can be resampled without affecting behavior.