Alilama (Transformer with ALiBi positional encoding and some tricks from Llama2)

An easy-to-understand Transformer model implementation. This repo is designed to make it easy to train on your own custom datasets.

Model architecture

As mentioned in the talk (Cybersecurity and AI), I adapted the standard Transformer decoder architecture, except that I used relative positioning as described in ALiBi. I also moved the batch norm layers in front of the attention and MLP layers. For weight initialization I used nn.init.xavier_normal_, as mentioned in this paper. To reduce the model size, I used weight tying for the input embedding and output projection. I called this model "alilama" because it uses the ALiBi positional encoding and some tricks from the Llama2 architecture. To test the implementation, I used the TinyStories dataset. Here is a test story from the "alilama_2M" model I trained (the yellow text is written by me, the rest is filled in by the model):

$\textcolor{yellow}{\text{Once}}$ $\textcolor{yellow}{\text{upon}}$ $\textcolor{yellow}{\text{a}}$ $\textcolor{yellow}{\text{time,}}$ $\textcolor{yellow}{\text{Tom}}$ $\textcolor{yellow}{\text{and}}$ $\textcolor{yellow}{\text{Lisa}}$ $\textcolor{yellow}{\text{found}}$ $\textcolor{yellow}{\text{a}}$ $\textcolor{yellow}{\text{shiny}}$ $\textcolor{yellow}{\text{red}}$ $\textcolor{yellow}{\text{apple.}}$ $\textcolor{yellow}{\text{They}}$ $\textcolor{yellow}{\text{both}}$ $\textcolor{yellow}{\text{wanted}}$ $\textcolor{yellow}{\text{it.}}$ $\textcolor{yellow}{\text{Tom}}$ $\textcolor{yellow}{\text{said,}}$ $\textcolor{yellow}{\text{"I}}$ $\textcolor{yellow}{\text{found}}$ $\textcolor{yellow}{\text{it}}$ $\textcolor{yellow}{\text{first!"}}$ $\textcolor{yellow}{\text{Lisa}}$ $\textcolor{yellow}{\text{said,}}$ $\textcolor{yellow}{\text{"I}}$ $\textcolor{yellow}{\text{want}}$ $\textcolor{yellow}{\text{it}}$ $\textcolor{yellow}{\text{too!"}}$ Tom said, "No, it's mine!" They both pulled and pulled, but the apple did not move. Tom and Lisa were sad. They could not play with the apple anymore. They had to find a way to make the apple disappear. They could not agree on what to do. Then, Tom had an idea. He said, "Let's share the apple!" They both agreed. They shared the apple and played together. They were happy and became friends
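
To make the ALiBi part of the architecture concrete, here is a minimal sketch of how the head-specific attention biases can be computed. The function name and shapes below are my own illustration, not necessarily the code used in this repo:

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Build the (num_heads, seq_len, seq_len) bias that ALiBi adds to attention scores."""
    # Head-specific slopes: 2^(-8/num_heads), 2^(-16/num_heads), ...
    # (the scheme from the ALiBi paper for power-of-two head counts)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # distance[i, j] = j - i: zero on the diagonal, negative for earlier key positions,
    # so the bias penalizes keys proportionally to how far back they are.
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]
    bias = slopes[:, None, None] * distance[None, :, :]    # (num_heads, seq_len, seq_len)
    # Causal mask: future positions get -inf so softmax ignores them.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    return bias.masked_fill(future, float("-inf"))

# Usage inside attention (q, k with shape (batch, heads, seq, head_dim)):
#     scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim) + alibi_bias(num_heads, seq_len)
```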

I also implemented a fast tokenizer in C++, based on minbpe by Andrej Karpathy, and connected it to Python using ctypes.
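
As a rough illustration of how such a ctypes bridge works (the library name and the encode signature below are hypothetical; the real interface is defined by the C++ code in this repo):

```python
import ctypes

# Load the compiled tokenizer; "./tokenizer.so" is a placeholder name.
lib = ctypes.CDLL("./tokenizer.so")

# Hypothetical C signature: int encode(const char* text, int* out_ids, int max_ids);
lib.encode.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_int), ctypes.c_int]
lib.encode.restype = ctypes.c_int

def encode(text: str, max_ids: int = 4096) -> list[int]:
    out = (ctypes.c_int * max_ids)()  # buffer the C++ side writes token ids into
    n = lib.encode(text.encode("utf-8"), ctypes.cast(out, ctypes.POINTER(ctypes.c_int)), max_ids)
    return list(out[:n])
```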

Inference with my pretrained models

To try out one of my models, open the inference.ipynb file in Jupyter and execute the cells. In the first cell, you can specify the model you want to run. I already trained "alilama_2M.pth" (d_model=128, blocks=8, max_seq_len=128, num_heads=8, hidden_dim=4*d_model) with a learning rate of 1e-4. It took around 10 hours to train on my NVIDIA GeForce RTX 3050, for around 4 epochs, so the model saw the complete dataset 4 times. In the second cell, you can specify the "prompt" for the model. I am not completely satisfied with the performance of the model, even though it seems to perform as well as a model of the same size from the TinyStories paper. There is definitely room for improvement. For example, using warmup steps and a cosine learning rate schedule, as used in the training of Llama2, would be a good starting point; I will probably include this in the future. Also, training larger models would obviously give much better results, but unfortunately I don't have enough compute for that.
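
For reference, a warmup-plus-cosine-decay schedule like the one used to train Llama2 could be sketched as follows. This is an illustration only; the step counts and the way train.py sets the learning rate are assumptions, not the repo's actual code:

```python
import math

def get_lr(step: int, max_lr: float = 1e-4, min_lr: float = 1e-5,
           warmup_steps: int = 1000, total_steps: int = 100_000) -> float:
    """Linear warmup followed by a cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# In the training loop, before each optimizer step:
#     for group in optimizer.param_groups:
#         group["lr"] = get_lr(step)
```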

Train on your own data

Step 1: Build your dataset

You can put any data you want into the data_raw folder. Make sure to name it train_data.txt. This file can contain basically any text you want; the model will later learn from it and try to generate new data. Here are a few examples of how this could be useful.

  1. Language Translation: For example, you could build a dataset that looks like this (see also the sketch after this list):

    "ENGLISH: A man goes to the bar and drinks a beer GERMAN: Ein Mann geht in die Bar und trinkt ein Bier ENGLISH: a dog barks at a man, who walks down the street GERMAN: Ein Hund ... "

  2. Code Generation: Feed in the entire Linux kernel source and see what the model spits out.

  3. Music Generation: Encode music into a stream of tokens and let the model generate some beautiful tracks (maybe).
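
As a concrete sketch of the translation example above, the training file could be written like this (the sentence pairs are placeholders):

```python
# Sketch: write interleaved translation pairs into the training file.
pairs = [
    ("A man goes to the bar and drinks a beer",
     "Ein Mann geht in die Bar und trinkt ein Bier"),
    ("A dog barks at a man, who walks down the street",
     "Ein Hund bellt einen Mann an, der die Straße entlanggeht"),
]

with open("data/data_raw/train_data.txt", "w", encoding="utf-8") as f:
    for english, german in pairs:
        f.write(f"ENGLISH: {english} GERMAN: {german} ")
```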

To get started quickly, use the following command to download the TinyStories dataset:

python data/data_raw/downloadTinyStories.py

Step 2: Train the tokenizer

Execute:

python data/tokenizeData.py

This can take some time (up to 10 minutes, depending on the dataset).

To train on your own dataset, use:

python data/tokenizeData.py --data_path=path_to_your_file.txt

Step 3: Train the model

Execute:

python train.py

This will train the model. You can adjust the model and training parameters in TRAINCONFIG.py.
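
For orientation, the hyperparameters mentioned above would show up in TRAINCONFIG.py roughly like this (the variable names are illustrative guesses, not necessarily the ones used in the file):

```python
# Illustrative values only; check TRAINCONFIG.py for the actual names and defaults.
d_model = 128             # embedding / hidden size
blocks = 8                # number of transformer blocks
num_heads = 8             # attention heads per block
max_seq_len = 128         # context length in tokens
hidden_dim = 4 * d_model  # MLP hidden size
learning_rate = 1e-4
```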

Inference with your model

To test your trained model, see the inference notebook. Specify the start string and execute the model on it. You can find the already trained models in the models folder.
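
Under the hood, generation from a start string is a standard autoregressive sampling loop. Here is a minimal sketch; the model and tokenizer interfaces below are assumptions, not necessarily what the notebook uses:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 200,
             temperature: float = 0.8, max_seq_len: int = 128) -> str:
    """Sample one token at a time and feed it back into the model."""
    ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)
    for _ in range(max_new_tokens):
        context = ids[:, -max_seq_len:]          # crop to the model's context window
        logits = model(context)[:, -1, :]        # assumes output shape (batch, seq, vocab)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```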

Main sources

This repo is inspired by Andrej Karpathy's implementation of the Llama model; however, I implemented my own model that comes close to the performance of the Llama model while being much easier to understand and adapt. To understand the model architecture and gradient descent better, I can recommend these two videos (in this order):

Understanding automatic differentiation (by Andrej Karpathy)

Understanding the basic transformer architecture (by Andrej Karpathy)

All sources

Papers

  1. Llama2 paper

  2. GPT-2 paper

  3. Main Transformer architecture

    • Title: Attention Is All You Need
    • Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
    • Link: arXiv
  4. ALiBi paper

  5. Rotary Positional Embedding

  6. Weight Tying Understanding

  7. Weight Initialization Scheme

  8. ResNet

  9. Neural Turing Machines

  10. Flash Attention

  11. TinyStories

GitHub

  1. ALiBi GitHub

    • Title: ALiBi GitHub Repository
    • Authors: Jake Tae
    • Link: ALiBi GitHub
  2. Llama2 implementation by Andrej Karpathy (llama2.c)

  3. Tokenizer by Andrej Karpathy (minbpe)

Videos

  1. Andrej Karpathy: Understanding automatic differentiation

    • Title: The spelled-out intro to neural networks and backpropagation: building micrograd
    • Link: YouTube
  2. Andrej Karpathy Transformer

    • Title: Let's build GPT: from scratch, in code, spelled out
    • Link: YouTube
  3. Origin of the Attention Mechanism

    • Title: Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy
    • Link: YouTube
  4. ALiBi Explained

    • Title: Alibi Explained
    • Link: YouTube

Other

  1. Strengths of the Transformer
