An easy-to-understand Transformer model implementation. This repo is designed so that it is easy for you to train it on your own custom datasets.
As mentioned in the talk (Cybersecurity and AI), I adapted the standard transformer decoder architecture, except that I used relative positioning as described in ALiBi (a minimal sketch of the bias is shown below). I also moved the batch norm layers in front of the attention and MLP layers. For weight initialization I used nn.init.xavier_normal_, as described in this paper. To reduce the model size, I used weight tying for the input embedding and the output projection. I called this model "alilama" because it uses the ALiBi positional encoding and some tricks from the Llama 2 architecture. To test the implementation, I used the TinyStories dataset. Here is a test story from the "alilama_2M" model I trained (the yellow text is written by me, the rest is filled in by the model):
I also implemented a fast tokenizer in C++, based on minbpe by Andrej Karpathy, and connected it to Python using ctypes.
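To make the ALiBi relative positioning mentioned above concrete, here is a minimal, self-contained sketch of the per-head linear bias. This is illustrative only and not the exact code from this repo:

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Per-head slopes from the ALiBi paper: a geometric sequence 2^(-8/num_heads), 2^(-16/num_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # Signed distance of each key position j from each query position i: j - i.
    rel = (pos[None, :] - pos[:, None]).float()            # (seq_len, seq_len)
    # This bias is added to the attention logits before softmax: past tokens get a
    # penalty proportional to their distance; future tokens are hidden by the causal mask.
    return slopes[:, None, None] * rel[None, :, :]         # (num_heads, seq_len, seq_len)
```

Because the bias depends only on relative distance, no learned positional embeddings are needed, which is what allows ALiBi to extrapolate to longer sequences.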
To try out one of my models, open the inference.ipynb file in Jupyter and just execute the cells. In the first cell, you can specify the model you want to run. I already trained "alilama_2M.pth" (d_model=128, blocks=8, max_seq_len=128, num_heads=8, hidden_dim=4*d_model) with a learning rate of 1e-4. It took around 10 hours to train on my NVIDIA GeForce RTX 3050. I trained for around 4 epochs, so the model saw the complete dataset 4 times. In the second cell, you can specify the "prompt" for the model. I am not completely satisfied with the performance of the model, even though it seems to perform as well as a model of the same size from the TinyStories paper. There is definitely room for improvement. For example, using warmup steps and a cosine learning rate schedule, as used in the training of Llama 2, would be a good starting point; I will probably include this in the future. Also, training larger models would obviously give much better results, but unfortunately I don't have the compute for that.
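For reference, the warmup plus cosine decay schedule mentioned above could look roughly like the sketch below. The step counts and learning rates are illustrative defaults, not values from this repo:

```python
import math

def get_lr(step, max_steps, base_lr=1e-4, warmup_steps=1000, min_lr=1e-5):
    # Linear warmup from 0 to base_lr, then cosine decay from base_lr down to min_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The returned value would be written into the optimizer's param_group["lr"] before each optimizer step.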
You can put any data you want into the `data_raw` folder. Make sure to name it `train_data.txt`. This file can contain basically any text you want; the model will later learn from it and try to generate new data. Here are a few examples of how this could be useful:
- Language Translation: For example, you could build up a dataset that looks like this: "ENGLISH: A man goes to the bar and drinks a beer GERMAN: Ein Mann geht in die Bar und trinkt ein Bier ENGLISH: a dog barks at a man, who walks down the street GERMAN: Ein Hund ..." (see the sketch after this list)
- Code Generation: Feed in the entire Linux kernel and see what the model spits out
- Music Generation: Encode music into a stream of tokens and let the model generate some beautiful tracks (maybe)
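For the translation example above, a tiny helper script could assemble such a `train_data.txt` file. This is a hypothetical sketch, not part of the repo; the second German sentence simply completes the "..." from the example for illustration:

```python
# Hypothetical helper for the translation example: writes English/German pairs
# into data/data_raw/train_data.txt in the "ENGLISH: ... GERMAN: ..." format.
pairs = [
    ("A man goes to the bar and drinks a beer",
     "Ein Mann geht in die Bar und trinkt ein Bier"),
    ("a dog barks at a man, who walks down the street",
     "Ein Hund bellt einen Mann an, der die Straße entlanggeht"),
]

with open("data/data_raw/train_data.txt", "w", encoding="utf-8") as f:
    for english, german in pairs:
        f.write(f"ENGLISH: {english} GERMAN: {german} ")
```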
To kickstart, use the following command to download the TinyStories dataset:

`python data/data_raw/downloadTinyStories.py`

Then tokenize the data by executing:

`python data/tokenizeData.py`

This can take some time (up to 10 minutes, depending on the dataset).
To train on your custom dataset, use:

`python data/tokenizeData.py --data_path=path_to_your_file.txt`

Then execute:

`python train.py`

This will train the model. You can adjust the model and training parameters in TRAINCONFIG.py.
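The exact variable names in TRAINCONFIG.py depend on the repo; purely as a hypothetical illustration, the settings for the "alilama_2M" run described above would correspond to something like:

```python
# Hypothetical field names; check TRAINCONFIG.py for the actual ones.
d_model       = 128
blocks        = 8            # number of decoder blocks
num_heads     = 8
max_seq_len   = 128
hidden_dim    = 4 * d_model  # MLP width
learning_rate = 1e-4
```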
To test your trained model, see the inference notebook. Specify the start string and execute the model on it. You can see the already trained models in the models folder.
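Under the hood, generation boils down to an autoregressive sampling loop. Here is a minimal sketch, assuming the model returns next-token logits of shape (batch, time, vocab) and the tokenizer exposes encode/decode; the notebook's actual interfaces may differ:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, start_string, max_new_tokens=200, temperature=0.8, max_seq_len=128):
    # Encode the prompt, then repeatedly sample the next token from the model's distribution.
    ids = torch.tensor([tokenizer.encode(start_string)], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -max_seq_len:])              # crop to the context window
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample instead of argmax for variety
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```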
The idea for this repo is inspired by an implementation of the Llama model from Andrej Karpathy; however, we implemented our own model that comes close to the performance of the Llama model while being much easier to understand and adapt. To understand the model architecture and gradient descent better, I can recommend these two videos (in this order):
Understanding automatic differentiation (by Andrej Karpathy)
Understanding the basic transformer architecture (by Andrej Karpathy)
- Llama 2 paper
  - Title: Llama 2: Open Foundation and Fine-Tuned Chat Models
  - Authors: Hugo Touvron, Louis Martin, Kevin Stone, et al.
- GPT-2 paper
  - Title: Language Models are Unsupervised Multitask Learners
  - Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever (OpenAI)
- Main Transformer architecture
  - Title: Attention Is All You Need
  - Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
  - Link: arXiv
- ALiBi paper
  - Title: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
  - Authors: Ofir Press, Noah A. Smith, Mike Lewis
  - Link: arXiv
- Rotary Positional Embedding
  - Title: RoFormer: Enhanced Transformer with Rotary Position Embedding
  - Authors: Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu
- Weight Tying Understanding
  - Title: Creating a Transformer From Scratch, part 2
  - Author: Benjamin Warner
- Weight Initialization Scheme
  - Title: Improving Transformer Optimization Through Better Initialization
  - Authors: Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs
- ResNet
  - Title: Deep Residual Learning for Image Recognition
  - Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- Neural Turing Machines
  - Title: Neural Turing Machines
  - Authors: Alex Graves, Greg Wayne, Ivo Danihelka
- Flash Attention
  - Title: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  - Author: Tri Dao
- TinyStories
  - Title: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
  - Authors: Ronen Eldan, Yuanzhi Li
- ALiBi GitHub
  - Title: Alibi GitHub Repository
  - Author: Jake Tae
  - Link: Alibi GitHub
- Llama 2 Implementation by Andrej Karpathy (llama2.c)
  - Link: GitHub
- Tokenizer by Andrej Karpathy (minbpe)
  - Link: GitHub
- Andrej Karpathy: Understanding Automatic Differentiation
  - Title: The spelled-out intro to neural networks and backpropagation: building micrograd
  - Link: YouTube
- Andrej Karpathy: Transformer
  - Title: Let's build GPT: from scratch, in code, spelled out
  - Link: YouTube
- Origin of the Attention Mechanism
  - Title: Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy
  - Link: YouTube
- ALiBi Explained
  - Title: Alibi Explained
  - Link: YouTube
- Strengths of the Transformer
  - Link: Twitter