This is a minimal implementation of a GPT-style transformer using only numpy.
Because numpy runs only on the CPU, training is limited to relatively small models. Even so, I was able to run a miniaturized version of the grokking experiment from Nanda et al. (2023) on a one-layer toy model.
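To give a flavor of what a numpy-only transformer looks like, here is a minimal sketch of single-head causal self-attention, the core operation of a GPT block. The function name, weight shapes, and variable names are illustrative assumptions, not the repo's actual API.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    # x: (T, d) token embeddings; Wq/Wk/Wv: (d, d_head) projection matrices.
    # Names and shapes are hypothetical, for illustration only.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)              # (T, T) attention logits
    mask = np.triu(np.ones_like(scores), k=1)       # 1s strictly above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)   # forbid attending to future tokens
    scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                              # (T, d_head)

rng = np.random.default_rng(0)
T, d, d_head = 5, 8, 4
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Because of the causal mask, the first output row can only attend to the first token, so it reduces to that token's value projection; the rest of a GPT block (multiple heads, an MLP, layer norm, residual connections) wraps around this same computation.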