# AutoAWQ

[![GitHub Repo](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/casper-hansen/AutoAWQ)
[![ArXiv](https://img.shields.io/badge/arXiv-%230170FE.svg?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2306.00978)

[AutoAWQ](https://github.com/casper-hansen/AutoAWQ) is a polished implementation of the original [llm-awq](https://github.com/mit-han-lab/llm-awq) work from MIT. AWQ (Activation-aware Weight Quantization) is a quantization method that supports 4-bit quantization. It substantially increases inference throughput while decreasing the memory requirement of the model. (For example, according to this [reference](https://huggingface.co/TheBloke/Llama-2-70B-Chat-AWQ), Llama2 70B normally requires 2 x 80 GB GPUs, but with AutoAWQ it can run on 1 x 48 GB GPU.) You can learn more about AWQ in the research paper and the GitHub implementations.
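
To make this concrete, here is a minimal Python sketch of loading an AWQ-quantized checkpoint with AutoAWQ and generating text. It follows the AutoAWQ examples rather than our benchmark script, and the model id and generation settings below are assumptions:

```python
# Minimal sketch (not the benchmark code): load an AWQ-quantized Llama2
# checkpoint and generate a few tokens. Requires a CUDA GPU.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/Llama-2-7B-AWQ"  # assumed checkpoint; see the notes below

# Load the 4-bit AWQ weights; fuse_layers enables AutoAWQ's fused kernels.
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "Explain activation-aware weight quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# The benchmark times repeated generate() calls like this one.
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```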

### 🚀 Running the AutoAWQ Benchmark.

You can run the AutoAWQ benchmark using the following command:

```bash
./bench_autoawq/bench.sh \
  --prompt <value> \            # Prompt string to benchmark with
  --max_tokens <value> \        # Maximum number of tokens to generate
  --repetitions <value> \       # Number of repetitions to run for the prompt
  --log_file <file_path> \      # Path to the .log file where results are written
  --device <cpu/cuda/metal> \   # Device to benchmark on
  --models_dir <path_to_models> # Directory containing the AWQ model weights
```

To get started quickly, you can simply run:

```bash
./bench_autoawq/bench.sh -d cuda
```

This will use all the default values (see the [bench.sh](/bench_autoawq/bench.sh) file) and run the benchmarks. You can find all the benchmark results for AutoAWQ [here](/docs/llama2.md).

### 👀 Some points to note:

1. AutoAWQ does not support devices other than GPU (it only works when CUDA is available).
2. We are benchmarking AutoAWQ on its own (i.e. the actual AWQ quantization method), not in combinations like AutoAWQ + vLLM or AutoAWQ + TensorRT.
3. The default model chosen for this benchmark was [Llama2-AutoAWQ by The Bloke](https://huggingface.co/TheBloke/Llama-2-7B-AWQ).
4. AutoAWQ does not properly support INT8 quantization yet. See [this issue](https://github.com/casper-hansen/AutoAWQ/issues/45).
# ExLlamaV2

[![GitHub Repo](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/turboderp/exllamav2)

[ExLlamaV2](https://github.com/turboderp/exllamav2) uses custom kernels to speed up LLM inference under different quantizations. ExLlamaV2 supports a new "EXL2" format. EXL2 is based on the same optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization. For this benchmark implementation, we use the 4-bit and 8-bit quantized versions of Llama2.
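
For context, this is roughly what loading and running a quantized model looks like with the exllamav2 Python API. It is a sketch adapted from the upstream examples, not our benchmark code, and the model directory below is an assumption:

```python
# Minimal sketch (not the benchmark code): load a quantized model directory
# with ExLlamaV2 and generate text. Requires a CUDA GPU.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "./models/llama2-7b-exl2-4bit"  # assumed path to quantized weights
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # split layers across available GPU memory

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()  # default sampling settings

# The benchmark times repeated generations like this one.
print(generator.generate_simple("The capital of France is", settings, 64))
```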

### 🚀 Running the ExLlamaV2 Benchmark.

You can run the ExLlamaV2 benchmark using the following command:

```bash
./bench_exllamav2/bench.sh \
  --prompt <value> \            # Prompt string to benchmark with
  --max_tokens <value> \        # Maximum number of tokens to generate
  --repetitions <value> \       # Number of repetitions to run for the prompt
  --log_file <file_path> \      # Path to the .log file where results are written
  --device <cpu/cuda/metal> \   # Device to benchmark on
  --models_dir <path_to_models> # Directory containing the model weights
```

To get started quickly, you can simply run:

```bash
./bench_exllamav2/bench.sh -d cuda
```

This will use all the default values (see the [bench.sh](/bench_exllamav2/bench.sh) file) and run the benchmarks. You can find all the benchmark results for ExLlamaV2 [here](/docs/llama2.md).

### 👀 Some points to note:

1. ExLlamaV2 only supports quantized LLMs, so Float32/16 is not supported here.
2. ExLlamaV2 currently [does not have support](https://github.com/turboderp/exllamav2/issues/184) for Mac/Metal.
3. Although CPU offloading is supported, it is too slow to be practical, so we did not include it in our benchmarks.