Link: **https://rahulschand.github.io/gpu_poor/**

<img width="643" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/29577394-0efd-42fb-aaf4-282e9a45d5db">

---

#### 2. Calculate ~token/s you can get ⏱️

<img width="647" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/77627c9b-5fdd-44cf-8b7d-452ff0563a8a">

---

#### 3. Approximate time for finetuning (ms per iteration) ⌛️

<img width="764" alt="image" src="https://github.com/RahulSChand/gpu_poor/assets/16897807/e5fd08a1-abb9-4e00-ad45-ba9bb15ec546">


---

For memory, the output is the total vRAM needed & a breakdown of where it goes.
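The exact keys & numbers depend on your model & settings; the breakdown below is illustrative (not the site's literal output), but it shows the categories the total is split into:

```
{
  "Total": 5000,
  "Model size": 3500,
  "KV cache": 500,
  "Activation memory": 200,
  "Grad & optimizer memory": 0,
  "Quantization + other overhead": 800
}
```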


For token/s, the output is the token/s estimate along with some additional info.
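The exact fields depend on the model & hardware; an illustrative example of the kind of info shown (again, not the site's literal output):

```
{
  "Token/s": 30,
  "ms per token": 33
}
```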

For training, the output is the time per iteration (forward + backward) in ms.

```
{
  "ms per iteration (forward + backward)": 100,
  "memory or compute bound?": "Memory"
}
```

---


### Purpose

I made this to check if you can run a particular LLM on your GPU. Useful to figure out the following:

1. What quantization will fit on my GPU?
2. Max context length & batch-size my GPU can handle?
3. Which finetuning? Full? LoRA? QLoRA?
4. What is consuming my GPU memory? What to change to fit the LLM on GPU?
5. How much token/s can I get?
6. How much total time to finetune?

---

### How to use

#### Model Name/ID/Size

1. You can enter the model id of a huggingface model (e.g. meta-llama/Llama-2-7b). Currently I have hardcoded & saved the model configs of the top 3k most downloaded LLMs on huggingface.
2. If you have a custom model or your huggingface id isn't available, you can either upload a json config ([example](https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json)) or just enter your model size (e.g. 7 billion for llama-2-7b). A trimmed-down example of such a config is shown below.
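The linked config has many more fields; the snippet below is a trimmed-down, illustrative example showing the kind of shape hyper-parameters such a config contains (the values shown are llama-2-7b's):

```
{
  "hidden_size": 4096,
  "intermediate_size": 11008,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "vocab_size": 32000
}
```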
## Additional info + FAQ

#### Options
1. **Inference**: Find vRAM for inference using either the HuggingFace implementation, vLLM, or GGML.
2. **Training**: Find vRAM for either full model finetuning, finetuning using LoRA (r=8 is currently hardcoded for the LoRA config), or QLoRA.

#### Quantization
1. Currently supported: bitsandbytes (bnb) int8/int4 & GGML (QK_8, QK_6, QK_5, QK_4, QK_2). The GGML quants are only for inference, while bnb int8/int4 can be used for both training & inference.

#### Context Len/Sequence Length
1. This is the length of your prompt plus the maximum number of new tokens generated. For training, it is the sequence length of your training data. Batch size is 1 for inference & can be specified for training; the option to specify batch size for inference still needs to be added. A rough formula for how this drives KV-cache memory is given below.
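As a rough rule of thumb (standard transformer KV-cache accounting, not taken from the site's code), the KV cache grows linearly with both sequence length & batch size:

$$\text{KV cache (bytes)} \approx 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{seq len} \times \text{batch size} \times \text{bytes per element}$$

For llama-2-7b in fp16 (32 layers, 32 KV heads, head dim 128, 2 bytes per element) this comes to roughly 0.5GB at sequence length 1000 and batch size 1, in line with the 500MB-1GB figures quoted in the next section.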
### Can't we just look at the model size & figure this out?

Finding which LLMs your GPU can handle isn't as easy as looking at the model size, because during inference the KV cache takes a substantial amount of memory. For example, with a sequence length of 1000 on llama-2-7b it takes 1GB of extra memory (using huggingface LlamaForCausalLM; with exLlama & vLLM this is 500MB). During training, the KV cache, activations & quantization overhead all take a lot of memory. For example, llama-7b with bnb int8 quantization is ~7.5GB in size, but it isn't possible to finetune it using LoRA on data with 1000 context length even with an RTX 4090 (24GB). That means an additional 16GB of memory goes into quantization overhead, activations & gradient memory.
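As a back-of-the-envelope check (my own arithmetic, not the site's exact accounting), the weights alone are roughly:

$$7\text{B} \times 2\ \text{bytes (fp16)} \approx 14\ \text{GB},\quad 7\text{B} \times 1\ \text{byte (int8)} \approx 7\ \text{GB},\quad 7\text{B} \times 0.5\ \text{bytes (int4)} \approx 3.5\ \text{GB}$$

On top of that, inference adds the KV cache (roughly 0.5-1GB at sequence length 1000, see the formula above) plus activation & framework overhead, and training further adds gradients, optimizer state & much larger activations, which is why a ~7.5GB int8 model can still fail to finetune on a 24GB card.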

---

### How reliable are the numbers?
The results can vary depending on your model, input data, CUDA version & which quant you are using, so it is impossible to predict exact values. I have tried to take these into account & make sure the results are within 500MB. I cross-checked the 3b, 7b & 13b model memories given by the website against what I get on my RTX 4090 & 2060 GPUs; all values are within 500MB.
### Why are the results wrong?
Sometimes the answers might be very wrong; in that case please open an issue here & I will try to fix it.

---


#### Additions

1. Added autocomplete for ease of use

![new_autocomplete_3](https://github.com/RahulSChand/gpu_poor/assets/16897807/01a3ff57-c354-4e76-afb0-be0192a0ba6f)


2. Updated config list with new Huggingface trending models (Llava/Mistral/Trismegistus etc.)

3. Fixed the bitsandbytes quantization overhead calculation (previously it scaled linearly with context length; it is now more accurate)

4. **Added token/s**

---

### TODO
1. Add support for exLlama
2. Add support for vLLM for token/s
3. ~Add QLoRA~
4. ~Add a way to measure the approximate tokens/s you can get for a particular GPU~
5. ~Improve logic to get hyper-params from size~ (since hidden layer/intermediate size/number of layers can vary for a particular size) ✅
