As of January 2023, I’m working at the Alignment Research Center, doing evaluations of large language models. Previously, I was at Redwood Research, where I worked on adversarial training and neural network interpretability.
I’m also doing a PhD at UC Berkeley, advised by Anca Dragan and Stuart Russell. Before that, I received a BAS in Computer Science and Logic and a BS in Economics from the University of Pennsylvania’s M&T Program, where I was fortunate to work with Philip Tetlock on using ML for forecasting.
My main research interests are mechanistic interpretability and scalable oversight. In the past, I’ve also done conceptual work on learning human values.
I also sometimes blog about AI alignment and other topics on LessWrong and the AI Alignment Forum.
We reverse engineer small transformers trained on group composition and use our understanding to explore the universality hypothesis.
We fully reverse engineer 32 small transformers trained on modular addition and use this understanding to study grokking.