Lawrence Chan
Publications
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
We reverse engineer small transformers trained on group composition and use our understanding to explore the universality hypothesis.
Bilal Chughtai, Lawrence Chan, Neel Nanda
Language models are better than humans at next-token prediction
We compare humans to small language models on next-token prediction tasks, and find that even relatively small language models consistently outperform humans.
Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean
Human irrationality: both bad and good for reward inference
We study how human irrationality affects reward inference. Depending on its type, irrationality can either help or hurt reward inference, and its effect is not monotonic in the degree of irrationality.
Lawrence Chan, Andrew Critch, Anca Dragan