Commit 5de9a4f: Support transformers-like api for woq quantization (#1987)

Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Wang, Chang <chang1.wang@intel.com>

1 parent 9c39b42. 32 changed files with 73,062 additions and 67 deletions.
`...nguage-modeling/quantization/transformers/weight_only/text-generation/README.md`: 168 additions, 0 deletions
# Step-by-Step
We provide a Transformers-like API for model compression using `WeightOnlyQuant` with the `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms. In addition, Intel Extension for PyTorch (IPEX) can be used to accelerate the model.
We provide the inference benchmarking script `run_generation.py` for large language models; the default search algorithm is beam search with `num_beams = 4`. [Here](./llm_quantization_recipes.md) are some validated models with well-optimized accuracy and performance; more models are in progress.
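As a quick start, the detailed commands in the sections below boil down to invocations like the following minimal sketch; the model name and output directory here are illustrative, and the full option sets are documented in the CPU and GPU sections that follow.

```bash
# Quantize a Hugging Face model with the default RTN algorithm and benchmark it on CPU
# (model name and output directory are illustrative placeholders).
python run_generate_cpu_woq.py \
  --model meta-llama/Llama-2-7b-hf \
  --woq \
  --output_dir ./saved_results \
  --benchmark
```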
# Quantization for CPU device

## Prerequisite
### Create Environment
Python 3.9 or higher is required due to the [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation. The dependent packages are listed in the requirements file; we recommend creating the environment as follows.
```bash
pip install -r requirements_cpu_woq.txt
```
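If you prefer to start from a fresh environment, one possible setup is the following sketch; the conda environment name and Python version are illustrative, and any Python >= 3.9 works.

```bash
# Create and activate an isolated environment, then install the CPU WOQ requirements
# (environment name and Python version are illustrative).
conda create -n llm-woq python=3.10 -y
conda activate llm-woq
pip install -r requirements_cpu_woq.txt
```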
### Run
#### Performance
```shell
# fp32
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--batch_size 1 \
--benchmark

# quantize and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--woq \
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "Awq", "Teq", "GPTQ", and "AutoRound" are also available.
--output_dir <WOQ_MODEL_SAVE_PATH> \ # Default is "./saved_results"
--batch_size 1 \
--benchmark

# load a WOQ quantized model and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <WOQ_MODEL_SAVE_PATH> \
--benchmark

# load a WOQ model from Hugging Face and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--benchmark
```
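To fill in the `<physical cores num>`, `<node N>`, and `<cpu list>` placeholders above, you can inspect the machine topology first. A possible way to do so is sketched below; the values in the final command assume a single socket with 56 physical cores on NUMA node 0.

```bash
# Inspect core counts and NUMA layout to choose the numactl/OMP settings.
lscpu | grep -Ei "socket|core|numa"
numactl --hardware

# Example invocation assuming 56 physical cores on NUMA node 0 (illustrative values).
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_generate_cpu_woq.py \
  --model <MODEL_NAME_OR_PATH> \
  --batch_size 1 \
  --benchmark
```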
#### Accuracy
The accuracy validation is based on [lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.3/lm_eval/__main__.py).
```shell
# fp32
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--device cpu \
--batch_size 56

# quantize and evaluate accuracy.
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--woq \
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "Awq", "Teq", "GPTQ", and "AutoRound" are also available.
--output_dir <WOQ_MODEL_SAVE_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--batch_size 56

# load a WOQ model quantized by ITREX and evaluate accuracy.
python run_generate_cpu_woq.py \
--model <WOQ_MODEL_SAVE_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--batch_size 56

# load a WOQ model quantized by ITREX and evaluate accuracy with Neural Speed.
# only models quantized with the "Awq", "GPTQ", or "AutoRound" algorithms are supported.
python run_generate_cpu_woq.py \
--model <WOQ_MODEL_SAVE_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--device cpu \
--batch_size 56

# load a WOQ model from Hugging Face and evaluate accuracy.
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--device cpu \
--batch_size 56

# load a WOQ model from Hugging Face and evaluate accuracy with Neural Speed.
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space between tasks.
--device cpu \
--batch_size 56
```
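For instance, a concrete end-to-end accuracy run with GPTQ might look like the following; the model name and save path are illustrative.

```bash
# Quantize with GPTQ and evaluate on lambada_openai (illustrative model and paths).
python run_generate_cpu_woq.py \
  --model meta-llama/Llama-2-7b-hf \
  --woq \
  --woq_algo GPTQ \
  --output_dir ./saved_results \
  --accuracy \
  --tasks lambada_openai \
  --batch_size 56

# Re-evaluate the saved WOQ model later without re-quantizing.
python run_generate_cpu_woq.py \
  --model ./saved_results \
  --accuracy \
  --tasks lambada_openai \
  --batch_size 56
```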
# Quantization for GPU device
>**Note**:
> 1. The default search algorithm is beam search with `num_beams = 1`.
> 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) supports optimized inference for the "gptj," "mistral," "qwen," and "llama" model types to achieve high performance and accuracy. Accurate inference is still ensured for other model types as well.
> 3. We provide the `WeightOnlyQuant` compression technology with the `Rtn/GPTQ/AutoRound` algorithms; `load_in_4bit` and `load_in_8bit` also work on Intel GPU devices.
## Prerequisite
### Dependencies
The Intel-extension-for-pytorch dependencies are provided by the oneAPI package, so oneAPI must be installed before Intel-extension-for-pytorch. Please refer to the [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.10%2Bxpu) to install oneAPI to the "/opt/intel" folder.
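After installation, you can verify that oneAPI is set up and the GPU is visible. A minimal check, assuming the standard "/opt/intel/oneapi" location and that the `sycl-ls` utility shipped with oneAPI is on the PATH after sourcing:

```bash
# Load the oneAPI environment and list the SYCL devices the runtime can see.
source /opt/intel/oneapi/setvars.sh
sycl-ls
```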
### Create Environment
PyTorch and Intel-extension-for-pytorch versions greater than 2.1 for Intel GPU are required, and Python 3.9 or higher is required due to the [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation. The dependent packages are listed in requirements_GPU.txt; we recommend creating the environment as follows. For now, Intel-extension-for-pytorch must be installed from source; weight-only quantization will be added to Intel-extension-for-pytorch in the next release.
>**Note**: please install transformers==4.40.2.
```bash
pip install -r requirements_GPU.txt
pip install transformers==4.38.1 # llama uses 4.38.1
source /opt/intel/oneapi/setvars.sh
git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
cd ipex-gpu
git submodule update --init --recursive
export USE_AOT_DEVLIST='pvc,ats-m150'
export BUILD_WITH_CPU=OFF

python setup.py install
```
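Once the build finishes, a quick sanity check can confirm the installation; this sketch assumes the XPU backend is exposed through `torch.xpu`, as in recent IPEX XPU releases.

```bash
# Confirm that PyTorch and IPEX import cleanly and that an XPU device is detected.
python -c "import torch, intel_extension_for_pytorch as ipex; \
print('torch', torch.__version__, 'ipex', ipex.__version__); \
print('xpu available:', torch.xpu.is_available())"
```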
## Run
The following commands show how to use the script.
### 1. Performance
```bash
# fp16
python run_generation_gpu_woq.py \
--model EleutherAI/gpt-j-6b \
--benchmark

# weightonlyquant
python run_generation_gpu_woq.py \
--model EleutherAI/gpt-j-6b \
--woq \
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "GPTQ" and "AutoRound" are also available.
--benchmark
```
> Note: If your device memory is not enough, quantize and save the model first, then rerun the example and load the saved model as shown below. If your device memory is enough, skip the steps below and simply quantize and run inference.
```bash
# First step: quantize and save the model
python run_generation_gpu_woq.py \
--model EleutherAI/gpt-j-6b \
--woq \ # default quantization method is Rtn
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "GPTQ" and "AutoRound" are also available.
--output_dir "saved_dir"

# Second step: load the model and run inference
python run_generation_gpu_woq.py \
--model "saved_dir" \
--benchmark
```
### 2. Accuracy
```bash
# quantized model by following the steps above
python run_generation_gpu_woq.py \
--model "saved_dir" \
--accuracy \
--tasks "lambada_openai"
```