Commit 5de9a4f: Support transformers-like api for woq quantization (#1987)

Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Wang, Chang <chang1.wang@intel.com>

1 parent 9c39b42. 32 changed files with 73,062 additions and 67 deletions.
`...nguage-modeling/quantization/transformers/weight_only/text-generation/README.md`: 168 additions, 0 deletions
# Step-by-Step
We provide a Transformers-like API for model compression using `WeightOnlyQuant` with the `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms. In addition, Intel Extension for PyTorch (IPEX) can be used to accelerate the model.
We provide the inference benchmarking script `run_generation.py` for large language models; the default search algorithm is beam search with `num_beams = 4`. [Here](./llm_quantization_recipes.md) are some validated models with well-optimized accuracy and performance; more models are in progress.
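As a quick start, the detailed commands in the sections below boil down to invocations like the following minimal sketch; the model name and output directory here are illustrative, and the full option sets are documented in the CPU and GPU sections that follow.

```bash
# Quantize a Hugging Face model with the default RTN algorithm and benchmark it on CPU
# (model name and output directory are illustrative placeholders).
python run_generate_cpu_woq.py \
  --model meta-llama/Llama-2-7b-hf \
  --woq \
  --output_dir ./saved_results \
  --benchmark
```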
# Quantization for CPU device

## Prerequisite
### Create Environment
Python 3.9 or higher is required due to the [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation. The dependent packages are listed in the requirements file; we recommend creating the environment as follows.
```bash
pip install -r requirements_cpu_woq.txt
```
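If you prefer to start from a fresh environment, one possible setup is the following sketch; the conda environment name and Python version are illustrative, and any Python >= 3.9 works.

```bash
# Create and activate an isolated environment, then install the CPU WOQ requirements
# (environment name and Python version are illustrative).
conda create -n llm-woq python=3.10 -y
conda activate llm-woq
pip install -r requirements_cpu_woq.txt
```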
### Run
#### Performance
```shell
# fp32
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--batch_size 1 \
--benchmark

# quantize and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--woq \
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "Awq", "Teq", "GPTQ", and "AutoRound" are also available.
--output_dir <WOQ_MODEL_SAVE_PATH> \ # Default is "./saved_results"
--batch_size 1 \
--benchmark

# load a WOQ quantized model and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <WOQ_MODEL_SAVE_PATH> \
--benchmark

# load a WOQ model from Hugging Face and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--benchmark
```
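To fill in the `<physical cores num>`, `<node N>`, and `<cpu list>` placeholders above, you can inspect the machine topology first. A possible way to do so is sketched below; the values in the final command assume a single socket with 56 physical cores on NUMA node 0.

```bash
# Inspect core counts and NUMA layout to choose the numactl/OMP settings.
lscpu | grep -Ei "socket|core|numa"
numactl --hardware

# Example invocation assuming 56 physical cores on NUMA node 0 (illustrative values).
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_generate_cpu_woq.py \
  --model <MODEL_NAME_OR_PATH> \
  --batch_size 1 \
  --benchmark
```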
#### Accuracy
The accuracy validation is based on [lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.3/lm_eval/__main__.py).
```shell
# fp32
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--device cpu \
--batch_size 56

# quantize and evaluate accuracy.
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--woq \
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "Awq", "Teq", "GPTQ", and "AutoRound" are also available.
--output_dir <WOQ_MODEL_SAVE_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--batch_size 56

# load a WOQ model quantized by ITREX and evaluate accuracy.
python run_generate_cpu_woq.py \
--model <WOQ_MODEL_SAVE_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--batch_size 56

# load a WOQ model quantized by ITREX and evaluate accuracy with Neural Speed.
# only models quantized with the "Awq", "GPTQ", or "AutoRound" algorithms are supported.
python run_generate_cpu_woq.py \
--model <WOQ_MODEL_SAVE_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--device cpu \
--batch_size 56

# load a WOQ model from Hugging Face and evaluate accuracy.
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--device cpu \
--batch_size 56

# load a WOQ model from Hugging Face and evaluate accuracy with Neural Speed.
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--device cpu \
--batch_size 56
```
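For instance, a concrete end-to-end accuracy run with GPTQ might look like the following; the model name and save path are illustrative.

```bash
# Quantize with GPTQ and evaluate on lambada_openai (illustrative model and paths).
python run_generate_cpu_woq.py \
  --model meta-llama/Llama-2-7b-hf \
  --woq \
  --woq_algo GPTQ \
  --output_dir ./saved_results \
  --accuracy \
  --tasks lambada_openai \
  --batch_size 56

# Re-evaluate the saved WOQ model later without re-quantizing.
python run_generate_cpu_woq.py \
  --model ./saved_results \
  --accuracy \
  --tasks lambada_openai \
  --batch_size 56
```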
# Quantization for GPU device
>**Note**:
> 1. The default search algorithm is beam search with `num_beams = 1`.
> 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) supports optimized inference for the "gptj," "mistral," "qwen," and "llama" model types to achieve high performance and accuracy. Accurate inference is still ensured for other model types as well.
> 3. We provide the `WeightOnlyQuant` compression technology with the `Rtn/GPTQ/AutoRound` algorithms; `load_in_4bit` and `load_in_8bit` also work on Intel GPU devices.
## Prerequisite
### Dependencies
The Intel-extension-for-pytorch dependencies are provided by the oneAPI package, so oneAPI must be installed before Intel-extension-for-pytorch. Please refer to the [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.10%2Bxpu) to install oneAPI to the "/opt/intel" folder.
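After installation, you can verify that oneAPI is set up and the GPU is visible. A minimal check, assuming the standard "/opt/intel/oneapi" location and that the `sycl-ls` utility shipped with oneAPI is on the PATH after sourcing:

```bash
# Load the oneAPI environment and list the SYCL devices the runtime can see.
source /opt/intel/oneapi/setvars.sh
sycl-ls
```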
### Create Environment
PyTorch and Intel-extension-for-pytorch versions greater than 2.1 for Intel GPU are required, and Python 3.9 or higher is required due to the [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation. The dependent packages are listed in requirements_GPU.txt; we recommend creating the environment as follows. For now, Intel-extension-for-pytorch must be installed from source; weight-only quantization will be added to Intel-extension-for-pytorch in the next release.
>**Note**: please install transformers==4.40.2.
```bash
pip install -r requirements_GPU.txt
pip install transformers==4.38.1 # llama uses 4.38.1
source /opt/intel/oneapi/setvars.sh
git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
cd ipex-gpu
git submodule update --init --recursive
export USE_AOT_DEVLIST='pvc,ats-m150'
export BUILD_WITH_CPU=OFF

python setup.py install
```
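Once the build finishes, a quick sanity check can confirm the installation; this sketch assumes the XPU backend is exposed through `torch.xpu`, as in recent IPEX XPU releases.

```bash
# Confirm that PyTorch and IPEX import cleanly and that an XPU device is detected.
python -c "import torch, intel_extension_for_pytorch as ipex; \
print('torch', torch.__version__, 'ipex', ipex.__version__); \
print('xpu available:', torch.xpu.is_available())"
```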
## Run
The following commands show how to use the script.
### 1. Performance
```bash
# fp16
python run_generation_gpu_woq.py \
--model EleutherAI/gpt-j-6b \
--benchmark

# weightonlyquant
python run_generation_gpu_woq.py \
--model EleutherAI/gpt-j-6b \
--woq \
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "GPTQ" and "AutoRound" are also available.
--benchmark
```
> Note: If your device memory is not enough, quantize and save the model first, then rerun the example and load the saved model as shown below. If your device memory is enough, skip the steps below and simply quantize and run inference.
```bash
# First step: quantize and save the model
python run_generation_gpu_woq.py \
--model EleutherAI/gpt-j-6b \
--woq \ # default quantization method is Rtn
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "GPTQ" and "AutoRound" are also available.
--output_dir "saved_dir"

# Second step: load the model and run inference
python run_generation_gpu_woq.py \
--model "saved_dir" \
--benchmark
```
### 2. Accuracy
```bash
# quantized model by following the steps above
python run_generation_gpu_woq.py \
--model "saved_dir" \
--accuracy \
--tasks "lambada_openai"
```