As of January 2023, I’m working at the Alignment Research Center, doing evaluations of large language models. Previously, I was at Redwood Research, where I worked on adversarial training and neural network interpretability.
I’m also doing a PhD at UC Berkeley advised by Anca Dragan and Stuart Russell. Before that, I received a BAS in Computer Science and Logic and a BS in Economics from the University of Pennsylvania’s M&T Program, where I was fortunate to work with Philip Tetlock on using ML for forecasting.
We create four agents based on Claude and GPT-4 to investigate the ability of frontier language models to perform autonomous replication and adaptation.
We reverse engineer small transformers trained on group composition and use our understanding to explore the universality hypothesis.
We fully reverse engineer 32 small transformers trained on modular addition and use this understanding to study grokking.
We compare humans to small language models on next-token prediction tasks and find that even relatively small language models consistently outperform humans.
We argue that AGIs trained in ways similar to today’s most capable models could learn to act deceptively to receive higher reward; learn internally represented goals that generalize beyond their training distributions; and pursue those goals using power-seeking strategies.