- The models
- The Roadmap
- Low Rank Approximation
- We support LLaMa for BabyGPT
- Quantization
- Our result
- Performance Benchmark
- Files
- Run
- Auto Mixed Precision
- Train and Generate
- Results
- Data
- Running the notebooks
- Acknowledgements
- Licenses
Building on the intuition of Karpathy's ng-video-lectures and minGPT, BabyGPT provides a working GPT at a much smaller scale (256 as well as 16 out channels, 5-layer GPT, fine-tuned). BabyGPT was built from a toyGPT that was made to understand transformers from scratch, and it has been scaled down, as you will see below. Visit the notebooks: we work our way up from simple language models and attention mechanisms to transformers and finally BabyGPT. While toyGPT separates out every layer of a transformer individually, in BabyGPT the attention mechanism is implemented manually. The purpose of building smaller GPTs is to understand transformer functions at a much more granular level.
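For orientation, the core of that manually implemented attention is a scaled dot-product step, which looks roughly like this (a hedged sketch; the function name, shapes and example values are illustrative, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, head_dim)
    d_k = q.size(-1)
    # similarity scores, scaled to keep the softmax well-behaved
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# toy usage
q = k = v = torch.randn(1, 8, 16)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16])
```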
To train the small models we are using TinyStories. You can download the weights from Hugging Face. We set max_iters to 5000 on a Tesla T4. For the OG model, we use 256 out channels.
model | context length | n_layers | n_head | n_embd | train loss | val loss | parameters | data |
---|---|---|---|---|---|---|---|---|
15M | 16 | 4 | 4 | 16 | 2.4633 | 2.4558 | 13k | stories15M.bin |
42M | 32 | 8 | 8 | 32 | 2.3772 | 2.3821 | 1.01M | stories42M.bin |
BabyGPT Original | 64 | 8 | 8 | 256 | 1.3954 | 1.5959 | 6.37M | data |
Note: the 110M model is being omitted for now. The RAM blew up..!!
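For reference, the hyperparameters in the table above map onto a config roughly like the following (the variable names are ours; the repo's trainer may organise them differently):

```python
import torch

# Illustrative hyperparameters for the "BabyGPT Original" row above.
config = dict(
    block_size=64,   # context length
    n_layer=8,
    n_head=8,
    n_embd=256,      # out channels
    max_iters=5000,  # training iterations on a Tesla T4
    device="cuda" if torch.cuda.is_available() else "cpu",
)
```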
If you wish to understand the nitty-gritty of how transformers work from scratch, this roadmap will guide you. We start by implementing simple language models (bigram and ngram) and from there work our way up to building transformers, a GPT from scratch, and finally BabyGPT.
We also apply a low-rank approximation as well as lit-llama to BabyGPT. Finally, we train the models and generate tokens.
Low-rank approximation is a compression technique that improves parameter efficiency. A LoRa_model.py has been added (on 256 out channels), giving a parameter reduction of about 2x. All we need to do is pick a rank parameter and compute the attention accordingly. In the LoRa notebook, an estimation of FLOPs has been done according to the Chinchilla paper.
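A minimal sketch of the idea: replace a full weight matrix with two rank-r factors and count the parameters (illustrative code, not LoRa_model.py itself):

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Approximate a d_in x d_out weight with two rank-r factors."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)   # d_in -> r
        self.B = nn.Linear(rank, d_out, bias=False)  # r    -> d_out

    def forward(self, x):
        return self.B(self.A(x))

full = nn.Linear(256, 256, bias=False)
low = LowRankLinear(256, 256, rank=64)
n_full = sum(p.numel() for p in full.parameters())
n_low = sum(p.numel() for p in low.parameters())
print(n_full, n_low, n_full / n_low)  # 65536 32768 2.0 -> roughly a 2x reduction
```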
Quantization has also been performed on the LoRA model, and a calculation of FLOPs has been added as well: for the BabyGPT model with 256 out channels we get 0.407 peta-FLOPs. The quantization is included in the LoRa notebook; in terms of size, we are getting a reduction of about a factor of 1.3 for now.
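The Chinchilla-style estimate behind such FLOPs calculations is roughly C ≈ 6·N·D for N parameters and D training tokens; a hedged sketch with example numbers (not the notebook's exact figures):

```python
# Chinchilla-style rule of thumb: training compute C ~ 6 * N * D,
# with N = parameters and D = training tokens. The values below are
# examples, not the exact numbers used in the LoRa notebook.
def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

print(f"{training_flops(6.37e6, 1e7):.3e} FLOPs")
```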
An implementation of the lit-llama model has been ported to BabyGPT (based on LLaMA version 1). You can find the notebook here -> llama_implementation. MFU has also been added. Run the model with:
llama\python llama_model_v1.py
Training and generating tokens is covered below.
Note: we have ported build_rope_cache(), apply_rope() and RMSNorm() from version 1. We are not using the version 1 weights or checkpoints (those are for far larger models: 7B, 13B, 65B, etc.). You can download the weights and port llama to your own version.
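RMSNorm, for instance, normalises by the root mean square without mean subtraction or bias; a sketch of its common LLaMA-style form (not necessarily the exact ported code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalisation: scale each feature vector by its root mean square."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # reciprocal of the RMS over the last dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms
```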
We have ported llama2 by Meta into BabyGPT. You can find the implementation at llama\python llama2.py. We have also provided a calculation of FLOPs along with the model.
The FLOPs to compute the k, v cache are 2 * 2 * num_layers * (embed_dim)^2. Find more information on how to estimate memory and compute in kipply's blog. MFU has also been added to llama2.
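A quick sanity check of that formula (the layer count and width below are placeholders, not llama2.py's defaults):

```python
# FLOPs to fill the k, v cache per token, following the formula above:
# 2 (k and v) * 2 (multiply-add) * num_layers * embed_dim ** 2.
def kv_cache_flops_per_token(num_layers, embed_dim):
    return 2 * 2 * num_layers * embed_dim ** 2

print(f"{kv_cache_flops_per_token(num_layers=32, embed_dim=4096):.3e}")
```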
Note: we are not using the original llama weights by Meta, and we use arbitrary values for the 70B configuration. You can port it to your own model using your own weights.
Tokenization using sentencepiece has been done. Unlike the tokenizer in the quant folder, we export a tokenizer.bin. Run it with llama\python tokenizer.py (meta-pieces added). The .bin file can then be used for further inference.
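For reference, loading the sentencepiece model and round-tripping some text looks roughly like this (the .bin export itself is handled by the repo's tokenizer.py; this snippet only shows plain sentencepiece usage, and the sample text is arbitrary):

```python
import sentencepiece as spm

# Load the trained sentencepiece model and round-trip some text.
sp = spm.SentencePieceProcessor(model_file="llama/tokenizer.model")
ids = sp.encode("My tea's gone cold, I'm wondering why", out_type=int)
print(ids)
print(sp.decode(ids))
```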
LLMs need efficient memory usage. Hardware accelerators are characterised with Hardware FLOPs Utilization, an estimate of the ratio of FLOPs observed on a given device to its theoretical peak FLOPs. MFU (Model FLOPs Utilization) is the ratio of the observed throughput (tokens-per-second) to the theoretical maximum throughput of a system operating at peak FLOPs. The theoretical peak matmul throughput of a Tesla T4 is around 8.1 TFLOPS. Hence, we calculate the MFU of the LLaMA trainer model; see the trainer notebook under LLaMA-trainer. We get an MFU of 0.0527723427% on 3.22M parameters. This would of course increase as the number of parameters increases; for a 530B-parameter model, MFU is around 30% on A100 GPUs. We use Section B from the PaLM paper for reference.
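A rough illustration of the calculation (the throughput figure below is a placeholder, not a value measured in the trainer notebook):

```python
# MFU roughly in the spirit of the PaLM paper:
# observed FLOPs/s (about 6 * params * tokens/s, ignoring attention)
# divided by the device's peak FLOPs/s.
def mfu(n_params, tokens_per_sec, peak_flops=8.1e12):  # Tesla T4 peak ~8.1 TFLOPS
    flops_per_token = 6 * n_params
    return flops_per_token * tokens_per_sec / peak_flops

print(f"{mfu(3.22e6, tokens_per_sec=2000) * 100:.4f}%")
```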
LLMs require many GPUs to run, so we need ways to reduce these requirements while preserving the model's performance. Various techniques have been developed to shrink model size; you may have heard of quantization and distillation. It turns out that instead of 4-byte FP32 precision, we can get an almost identical inference outcome with 2-byte BF16/FP16 half-precision, which halves the model size.
To remediate that, 8-bit quantization was introduced. This method uses quarter precision, and thus only 1/4th of the FP32 model size, but it is not done by simply dropping another half of the bits; there is a lot more to this topic. Look at the Hugging Face quantization docs.
See quant.md for how to perform llama quantization, and the quantization notebook for a beginner's introduction to quantization. Different benchmarks have been run; for example, on a GPU the 7B-parameter model in bfloat16 takes about 15GB, while BabyGPT takes only a few kilobytes..!!!
quantization.py has been obtained from the lit-llama repo. A tokenizer using sentencepiece has been added as well; different kinds of weight operations can be performed from the tokenizer.model.
For post-training quantization, we are able to reduce the model size by a factor of almost 4.
import torch

# fp32 baseline model
model_fp = BabyGPTmodel(config)
model_fp.eval()

# dynamically quantize all nn.Linear layers to int8
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp,            # the original model
    {torch.nn.Linear},   # a set of layers to dynamically quantize
    dtype=torch.qint8)
number of parameters: 3222637
12.9688 MB  (fp32)
3.4603 MB   (int8)
Note: just for quantization we are using a bigger model with about 3.22M parameters.
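The MB figures above can be obtained by serialising each model's state dict and reading the file size; a minimal sketch (the repo's notebook may measure it differently):

```python
import os
import torch

def print_size_of_model(model, label=""):
    # Serialise the state dict to disk and read the file size.
    torch.save(model.state_dict(), "temp.p")
    size_mb = os.path.getsize("temp.p") / 1e6
    os.remove("temp.p")
    print(f"{label}: {size_mb:.4f} MB")

print_size_of_model(model_fp, "fp32 model")
print_size_of_model(model_int8, "int8 model")
```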
Performance benchmarking has been done on the BabyGPT model and the quantized model; the results have been added to the quantization notebook in the quant folder. A rough sketch of how such a timing comparison can be done is shown below.
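This is only an illustrative timing loop, not the exact procedure used in the quantization notebook; the input shape and vocabulary size are placeholders.

```python
import time
import torch

@torch.no_grad()
def avg_latency(model, x, n_runs=50):
    # Average forward-pass latency over n_runs.
    model.eval()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    return (time.perf_counter() - start) / n_runs

# x should be a batch of token ids shaped like the model's training input,
# e.g. (1, block_size); the vocab size and shape below are illustrative.
x = torch.randint(0, 65, (1, 64))
print(f"fp32: {avg_latency(model_fp, x) * 1e3:.2f} ms")
print(f"int8: {avg_latency(model_int8, x) * 1e3:.2f} ms")
```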
BabyGPT
├── bigram_lm.py
├── ngram_lm.py
├── model.py
├── Lora_model.py
├── Llama_model.py
├── Attention
│   ├── dot product attention.py
│   ├── multi headed attention.py
│   ├── cross attention.py
│   └── spatial attention.py
├── Notebook
│   ├── Dot product attention
│   ├── multiheaded attention
│   ├── gpt from scratch
│   ├── spatial transformer
│   ├── babyGPT
│   ├── LoRa
│   ├── llama_implementation
│   └── mixed precision
├── Train
│   ├── babygpt_trainer.py
│   └── llama_trainer.py
├── transformers
│   ├── transformer_model.py
│   └── babyGPT.py
├── Quant
│   ├── quantization.py
│   ├── quantization notebook
│   ├── tokenizer.model
│   ├── tokenizer.vocab
│   ├── tokenizer.py
│   ├── model.pth
│   └── quant.md
├── llama
│   ├── llama2.py
│   ├── llama_model_v1.py
│   ├── tokenizer.py
│   ├── tokenizer.vocab
│   ├── tokenizer.bin
│   └── tokenizer.model
├── text.txt
├── trainer.ipynb
└── requirements.txt
Clone the repo and run the following:
! git clone https://github.com/soumyadip1995/BabyGPT.git
To run the bigram and ngram language models:
python bigram_lm.py
python ngram_lm.py
To run BabyGPT (from the transformers folder):
transformers\python babygpt.py
To run a simple transformer model:
python transformer_model.py
To run a low-rank approximation model:
python LoRa_model.py
To run the llama model:
llama\python llama_model_v1.py
To run llama2:
llama\python llama2.py
Run the different attention mechanisms from the Attention folder.
A very preliminary automatic mixed precision (FP16/FP32) has been added. It requires a CUDA-enabled GPU. A combination of PyTorch's autocast() and GradScaler() is used for mixed precision; see the PyTorch AMP tutorial for more. Unfortunately the GPU blew up during training, and CPU autocast for now only supports bfloat16, which takes a hell of a long time to train. If anyone can improve upon it, that would be awesome. Check the Mixed Precision notebook.
If you wish to get started with BabyGPT and llama, but don't want to go through the hassle of learning all about transformer models, you can simply start by running the code in the train folder.
To train and generate text from both the BabyGPT model and the LLaMA model, run
train\python babygpt_trainer.py
and train\python llama_trainer.py
from the train folder.
Both have been trained on Tesla T4 GPUs. You can increase or decrease max_iters as you wish; training takes a few minutes.
You can see the results from both models in the trainer notebook.
Number of params = 3.22 M
``` number of parameters: 3222381 ```
step 0: train loss 4.6894, val loss 4.6895
step 500: train loss 2.1731, val loss 2.1832
step 1000: train loss 1.7580, val loss 1.8032
step 1500: train loss 1.5790, val loss 1.6645
step 2000: train loss 1.4482, val loss 1.5992
step 2500: train loss 1.3538, val loss 1.5874
step 3000: train loss 1.2574, val loss 1.5971
.
.
.
step 9000: train loss 0.5236, val loss 2.4614
step 9500: train loss 0.4916, val loss 2.5494
step 10000: train loss 0.4680, val loss 2.6631
step 10500: train loss 0.4448, val loss 2.6970
step 10999: train loss 0.4341, val loss 2.7462
Detroit, revior myself 'til I confused to get the big clead Mastles
Slaughterhouse on the blue, that's when he pine I'm hop with the cowprinton
robaly I want to a lox on my tempt
But now we can't never find a gift killed broke
Big before anyone could ever hear the first as I was cooped chill
But i this o for a big star
I said get chased up!
(Hello darkness, my old friend)[Eminem:]
If my legacy I acged buving in the tub (might what?)
I would know one [*Barrns, worried :]
Yeah, so kon bitch, it's
Seems like the model starts overfitting fairly early: the val loss bottoms out while the train loss keeps dropping. Maybe that will need more modification. Spitting some Eminem yo.. :smile:
The data folder contains the text document which has the lyrics to all of Eminem's songs.
Notebook | Description |
---|---|
Dot product attention | colab |
Multi headed attention | colab |
GPT from scratch | colab (approx. 860k parameters) |
Spatial Transformers | colab |
BabyGPT | colab (16, 256 out channels) |
LoRa | colab (256 out channels) |
lit-llama for BabyGPT | colab (16 out channels for lit-llama) |
trainer for BabyGPT and llama | colab (16 out channels for BabyGPT, 256 out channels for llama) |
text.txt is based on Eminem's Stan.
- Pytorch GPT tutorial
- Karpathy's youtube videos and tutorials
- Karpathy's Mingpt
- O'Reilly notes for NLP
- lit-llama repository.
- llama from facebook
- Chinchilla paper
- Karpathy's nn zero to hero
- Lightning AI for Low rank Approximation.
- IST-DASLab for GPTQ
Licenses have been updated to include facebookresearch/llama, lit-llama and IST-DASLab. You can use the code under the GNU, Apache and MIT licenses.
- Look into Triton
- Fix the readme.md
- Inference using libtorch, or each separately
- Look into sentencepiece and the tokenizer
- Look into TinyStories