Author: Dan Wilhelm, dan@danwilhelm.com
We investigate how a one-layer attention + feed-forward transformer computes the cumulative sum. The investigation is written conversationally, with numerous examples and insights along the way.
Of particular interest, we:
- design a 38-weight attention-only circuit with lower loss than the provided model;
- manually remove the MLP and rewire a trained circuit, retaining 100% accuracy;
- prove that an equally-attended attention block is equivalent to a single linear projection (of the prior-input mean!), as sketched in code just after this list; and
- provide an independent transformer implementation to make it easier to modify the internals.
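To make the equal-attention claim concrete before diving in, here is a minimal NumPy sketch (my own illustration, not the challenge's model code; the dimensions and the random `W_V`/`W_O` matrices are made up). With a causal mask and uniform attention scores, softmax gives position t a weight of 1/(t+1) on each of the first t+1 tokens, so the head's output is exactly one fixed linear map, `W_V W_O`, applied to the expanding mean of its inputs:

```python
# Minimal sketch (illustrative only): uniform causal attention is the same as
# applying a single linear projection to the expanding mean of the inputs.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 24, 8          # made-up sizes for illustration

x = rng.normal(size=(T, d_model))      # residual-stream vectors, one per token
W_V = rng.normal(size=(d_model, d_head))   # value projection (random stand-in)
W_O = rng.normal(size=(d_head, d_model))   # output projection (random stand-in)

# Uniform causal attention: position t attends equally to positions 0..t,
# so each row of the attention matrix is 1/(t+1) on the first t+1 entries.
attn = np.tril(np.ones((T, T)))
attn /= attn.sum(axis=1, keepdims=True)

attn_out = attn @ (x @ W_V) @ W_O

# The same thing written as one linear map of the expanding (running) mean.
expanding_mean = np.cumsum(x, axis=0) / np.arange(1, T + 1)[:, None]
linear_out = expanding_mean @ (W_V @ W_O)

assert np.allclose(attn_out, linear_out)
print("uniform causal attention == (W_V W_O) applied to the expanding mean")
```

The assertion holds because the row-normalized lower-triangular attention matrix applied to `x` is literally the expanding mean, and matrix multiplication is associative; the later sections work out why the trained model's attention pattern is (approximately) uniform in the first place.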
This is my proposed solution to a monthly puzzle authored by Callum McDougall! You may find more information about the challenge and the monthly problem series here. The write-up is organized as follows:
- Introduction
- All 24 embedding channels directly encode token sign and magnitude
- Attention softmax equally attends to each token
- Equally-divided attention computes the expanding mean
- Feed-forward network "cleans up" the signal
- What about the zero unembed pattern?
- Surgically removing the MLP, retaining 100% accuracy
- Designing a 38-weight attention-only cumsum circuit
- Appendix A. Rewriting two linear transforms as one
- Appendix B. Designing a 38-weight circuit with skip connections