I'm currently conducting research in mechanistic interpretability.
Core contributor to an open-source replication of Anthropic's Sleeper Agents paper (repo here).
Past papers and posts that I have contributed to include:
- Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition (repo here, completed as a part of MATS 4.0)
- Structured World Representations in Maze-Solving Transformers (accepted to the NeurIPS UniReps '23 Workshop)
- Polysemantic Attention Head in a 4-Layer Transformer
- A Configurable Library for Generating and Manipulating Maze Datasets