An easy-to-understand Transformer model implementation. This repo is designed so that it is easy for you to train it on your own custom datasets.
As mentioned in the talk (Cybersecurity and AI), I adapted the standard transformer decoder architecture, except that I used relative positioning as described in ALiBi (a minimal sketch of the bias is shown below). I also moved the batch norm layers in front of the attention and MLP layers. For weight initialization I used nn.init.xavier_normal_, as described in this paper. To reduce the model size, I used weight tying for the input embedding and the output projection. I called this model "alilama" because it uses the ALiBi positional encoding and some tricks from the Llama 2 architecture. To test the implementation, I used the TinyStories dataset. Here is a test story from the "alilama_2M" model I trained (the yellow text is written by me, the rest is filled in by the model):
I also implemented a fast tokenizer in C++, based on minbpe by Andrej Karpathy, and connected it to Python using ctypes.
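To make the ALiBi relative positioning mentioned above concrete, here is a minimal, self-contained sketch of the per-head linear bias. This is illustrative only and not the exact code from this repo:

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Per-head slopes from the ALiBi paper: a geometric sequence 2^(-8/num_heads), 2^(-16/num_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # Signed distance of each key position j from each query position i: j - i.
    rel = (pos[None, :] - pos[:, None]).float()            # (seq_len, seq_len)
    # This bias is added to the attention logits before softmax: past tokens get a
    # penalty proportional to their distance; future tokens are hidden by the causal mask.
    return slopes[:, None, None] * rel[None, :, :]         # (num_heads, seq_len, seq_len)
```

Because the bias depends only on relative distance, no learned positional embeddings are needed, which is what allows ALiBi to extrapolate to longer sequences.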
To try out one of my models, open the inference.ipynb file in Jupyter and just execute the cells. In the first cell, you can specify the model you want to run. I already trained "alilama_2M.pth" (d_model=128, blocks=8, max_seq_len=128, num_heads=8, hidden_dim=4*d_model) with a learning rate of 1e-4. It took around 10 hours to train on my NVIDIA GeForce RTX 3050. I trained for around 4 epochs, so the model saw the complete dataset 4 times. In the second cell, you can specify the "prompt" for the model. I am not completely satisfied with the performance of the model, even though it seems to perform as well as a model of the same size from the TinyStories paper. There is definitely room for improvement. For example, using warmup steps and a cosine learning rate schedule, as used in the training of Llama 2, would be a good starting point; I will probably include this in the future. Also, training larger models would obviously give much better results, but unfortunately I don't have the compute for that.
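For reference, the warmup plus cosine decay schedule mentioned above could look roughly like the sketch below. The step counts and learning rates are illustrative defaults, not values from this repo:

```python
import math

def get_lr(step, max_steps, base_lr=1e-4, warmup_steps=1000, min_lr=1e-5):
    # Linear warmup from 0 to base_lr, then cosine decay from base_lr down to min_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The returned value would be written into the optimizer's param_group["lr"] before each optimizer step.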
You can put any data you want into the `data_raw` folder. Make sure to name it `train_data.txt`. This file can contain basically any text you want; the model will later learn from it and try to generate new data. Here are a few examples of how this could be useful:
- Language Translation: For example, you could build up a dataset that looks like this: "ENGLISH: A man goes to the bar and drinks a beer GERMAN: Ein Mann geht in die Bar und trinkt ein Bier ENGLISH: a dog barks at a man, who walks down the street GERMAN: Ein Hund ..." (see the sketch after this list)
- Code Generation: Feed in the entire Linux kernel and see what the model spits out
- Music Generation: Encode music into a stream of tokens and let the model generate some beautiful tracks (maybe)
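For the translation example above, a tiny helper script could assemble such a `train_data.txt` file. This is a hypothetical sketch, not part of the repo; the second German sentence simply completes the "..." from the example for illustration:

```python
# Hypothetical helper for the translation example: writes English/German pairs
# into data/data_raw/train_data.txt in the "ENGLISH: ... GERMAN: ..." format.
pairs = [
    ("A man goes to the bar and drinks a beer",
     "Ein Mann geht in die Bar und trinkt ein Bier"),
    ("a dog barks at a man, who walks down the street",
     "Ein Hund bellt einen Mann an, der die Straße entlanggeht"),
]

with open("data/data_raw/train_data.txt", "w", encoding="utf-8") as f:
    for english, german in pairs:
        f.write(f"ENGLISH: {english} GERMAN: {german} ")
```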
To kickstart, use the following command to download the TinyStories dataset:

`python data/data_raw/downloadTinyStories.py`

Then tokenize the data by executing:

`python data/tokenizeData.py`

This can take some time (up to 10 minutes, depending on the dataset).
To train on your custom dataset, use:

`python data/tokenizeData.py --data_path=path_to_your_file.txt`

Then execute:

`python train.py`

This will train the model. You can adjust the model and training parameters in TRAINCONFIG.py.
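The exact variable names in TRAINCONFIG.py depend on the repo; purely as a hypothetical illustration, the settings for the "alilama_2M" run described above would correspond to something like:

```python
# Hypothetical field names; check TRAINCONFIG.py for the actual ones.
d_model       = 128
blocks        = 8            # number of decoder blocks
num_heads     = 8
max_seq_len   = 128
hidden_dim    = 4 * d_model  # MLP width
learning_rate = 1e-4
```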
To test your trained model, see the inference notebook. Specify the start string and execute the model on it. You can see the already trained models in the models folder.
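Under the hood, generation boils down to an autoregressive sampling loop. Here is a minimal sketch, assuming the model returns next-token logits of shape (batch, time, vocab) and the tokenizer exposes encode/decode; the notebook's actual interfaces may differ:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, start_string, max_new_tokens=200, temperature=0.8, max_seq_len=128):
    # Encode the prompt, then repeatedly sample the next token from the model's distribution.
    ids = torch.tensor([tokenizer.encode(start_string)], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -max_seq_len:])              # crop to the context window
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample instead of argmax for variety
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```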
The idea for this repo is inspired by an implementation of the Llama model from Andrej Karpathy; however, we implemented our own model that comes close to the performance of the Llama model while being much easier to understand and adapt. To understand the model architecture and gradient descent better, I can recommend these two videos (in this order):
Understanding automatic differentiation (by Andrej Karpathy)
Understanding the basic transformer architecture (by Andrej Karpathy)
- Llama 2 paper
  - Title: Llama 2: Open Foundation and Fine-Tuned Chat Models
  - Authors: Hugo Touvron, Louis Martin, Kevin Stone, et al.
- GPT-2 paper
  - Title: Language Models are Unsupervised Multitask Learners
  - Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever (OpenAI)
- Main Transformer architecture
  - Title: Attention Is All You Need
  - Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
  - Link: arXiv
- ALiBi paper
  - Title: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
  - Authors: Ofir Press, Noah A. Smith, Mike Lewis
  - Link: arXiv
- Rotary Positional Embedding
  - Title: RoFormer: Enhanced Transformer with Rotary Position Embedding
  - Authors: Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu
- Weight Tying Understanding
  - Title: Creating a Transformer From Scratch, part 2
  - Author: Benjamin Warner
- Weight Initialization Scheme
  - Title: Improving Transformer Optimization Through Better Initialization
  - Authors: Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs
- ResNet
  - Title: Deep Residual Learning for Image Recognition
  - Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- Neural Turing Machines
  - Title: Neural Turing Machines
  - Authors: Alex Graves, Greg Wayne, Ivo Danihelka
- Flash Attention
  - Title: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  - Author: Tri Dao
- TinyStories
  - Title: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
  - Authors: Ronen Eldan, Yuanzhi Li
- ALiBi GitHub
  - Title: Alibi GitHub Repository
  - Author: Jake Tae
  - Link: Alibi GitHub
- Llama 2 Implementation by Andrej Karpathy (llama2.c)
  - Link: GitHub
- Tokenizer by Andrej Karpathy (minbpe)
  - Link: GitHub
- Andrej Karpathy: Understanding Automatic Differentiation
  - Title: The spelled-out intro to neural networks and backpropagation: building micrograd
  - Link: YouTube
- Andrej Karpathy: Transformer
  - Title: Let's build GPT: from scratch, in code, spelled out
  - Link: YouTube
- Origin of the Attention Mechanism
  - Title: Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy
  - Link: YouTube
- ALiBi Explained
  - Title: Alibi Explained
  - Link: YouTube
- Strengths of the Transformer
  - Link: Twitter