vllm-project · robertgshaw2-neuralmagic · Aug 14, 2024 · Aug 11, 2024 · Aug 11, 2024 · Aug 11, 2024
diff --git a/README.md b/README.md
@@ -4,6 +4,7 @@
 * Comprehensive set of quantization algorithms including weight-only and activation quantization
 * Seamless integration Hugging Face models and repositories
 * `safetensors`-based file format compatible with `vllm`
+* Large model support via `accelerate`
 
 <p align="center">
    <img alt="LLM Compressor Flow" src="docs/images/architecture.png" width="75%" />
@@ -25,7 +26,7 @@
 
 ## Installation
 
-`llm-compressor` can be installed from the source code via a git clone and local pip install.
+`llmcompressor` can be installed from the source code via a git clone and local pip install.
 
 ### From PyPI
 ```bash
@@ -74,12 +75,17 @@ model = LLM("llama-compressed-quickstart")
 output = model.generate("I love 4 bit models because")
 ```
 
-## End-to-End Examples
-The `llm-compressor` library provides a rich feature-set for model compression. Below are examples
-and documentation of a few key flows:
+## Examples
+
+See below for end-to-end examples applying quantization with `llmcompressor`:
+* [`Meta-Llama-3-8B-Instruct` W8A8-INT8 With GPTQ and SmoothQuant](examples/quantization_w8a8_int8)
+* [`Meta-Llama-3-8B-Instruct` W8A8-FP8 With PTQ](examples/quantization_w8a8_fp8)
 * [`Meta-Llama-3-8B-Instruct` W4A16 With GPTQ](examples/quantization_w4a16)
-* [`Meta-Llama-3-8B-Instruct` W8A8-Int8 With GPTQ and SmoothQuant](examples/quantization_w8a8_int8)
-* [`Meta-Llama-3-8B-Instruct` W8A8-Fp8 With PTQ](examples/quantization_w8a8_fp8)
+
+## User Guides
+See below for in-depth user guides on key topics related to using `llmcompressor`:
+* [Quantizing large models with the help of `accelerate`](examples/big_models_with_accelerate)
+
 
 If you have any questions or requests open an [issue](https://github.com/vllm-project/llm-compressor/issues) and we will add an example or documentation.
 

diff --git a/examples/big_model_offloading/big_model_fp8.py b/examples/big_model_offloading/big_model_fp8.py
diff --git a/examples/big_model_offloading/big_model_w8a8_calibrate.py b/examples/big_model_offloading/big_model_w8a8_calibrate.py
diff --git a/examples/big_models_with_accelerate/README.md b/examples/big_models_with_accelerate/README.md
@@ -0,0 +1,80 @@
+# Quantizing Big Models with HF Accelerate
+
+`llmcompressor` integrates with `accelerate` to support quantizing large models such as Llama 70B and 405B.
+
+## Overview
+
+[`accelerate`](https://huggingface.co/docs/accelerate/en/index) is a library in the Hugging Face ecosystem that supports working with large models, including:
+- Offloading parameters to CPU
+- Sharding models across multiple GPUs with pipeline-parallelism
+
+
+### Using `device_map`
+
+To enable `accelerate` features with `llmcompressor`, simply pass `device_map` to `from_pretrained` when loading the model.
+
+```python
+from llmcompressor.transformers import SparseAutoModelForCausalLM
+MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
+
+# device_map="auto" triggers usage of accelerate
+# if > 1 GPU, the model will be sharded across the GPUs
+# if not enough GPU memory to fit the model, parameters are offloaded to the CPU
+model = SparseAutoModelForCausalLM.from_pretrained(
+    MODEL_ID, device_map="auto", torch_dtype="auto")
+```
+
+`llmcompressor` is designed to respect the `device_map`, so calls to `oneshot` will work properly.
+
+### Practical Advice
+
+When working with `accelerate`, keep in mind that CPU offloading and naive pipeline-parallelism slow down forward passes through the model. Quantization methods that require many forward passes through the model are slowed down the most, so the method should be chosen to fit the offloading scheme.
+
+General rules of thumb:
+- CPU offloading should only be used with data-free quantization methods (e.g. PTQ with `FP8_DYNAMIC`)
+- Multi-GPU is fast enough to be used with calibration data-based methods with `sequential_update=False`
+- It is possible to use Multi-GPU with `sequential_update=True`, but the runtime will be slower
+
+## Examples
+
+We will show working examples for each use case:
+- **CPU Offloading**: Quantize `Llama-70B` to `FP8` using `PTQ` with a single GPU
+- **Multi-GPU**: Quantize `Llama-70B` to `INT8` using `GPTQ` and `SmoothQuant` with 8 GPUs
+
+### Installation
+
+Install `llmcompressor`:
+
+```bash
+pip install llmcompressor==0.1.0
+```
+
+### CPU Offloading: `FP8` Quantization with `PTQ`
+
+CPU offloading is slow. As a result, we recommend using this feature only with data-free quantization methods. For example, when quantizing a model to `fp8`, we typically use simple `PTQ` to statically quantize the weights and use dynamic quantization for the activations. These methods do not require calibration data.
+
+- `cpu_offloading_fp8.py` demonstrates quantizing the weights and activations of `Llama-70B` to `fp8` on a single GPU:
+
+```bash
+export CUDA_VISIBLE_DEVICES=0
+python cpu_offloading_fp8.py
+```
+
+The resulting model `./Meta-Llama-3-70B-Instruct-FP8-Dynamic` is ready to run with `vllm`!
+
+### Multi-GPU: `INT8` Quantization with `GPTQ`
+
+For quantization methods that require calibration data (e.g. `GPTQ`), CPU offloading is too slow. For these methods, `llmcompressor` can use `accelerate` multi-GPU to quantize models that are larger than a single GPU. For example, when quantizing a model to `int8`, we typically use `GPTQ` to statically quantize the weights, which requires calibration data.
+
+- `multi_gpu_int8.py` demonstrates quantizing the weights and activations of `Llama-70B` to `int8` on 8 A100s:
+
+```bash
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+python multi_gpu_int8.py
+```
+
+The resulting model `./Meta-Llama-3-70B-Instruct-W8A8-Dynamic` is quantized and ready to run with `vllm`!
+
+## Questions or Feature Requests?
+
+Please open an issue on `vllm-project/llm-compressor`.
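
The rules of thumb in the README above come down to comparing the size of the model's weights with the GPU memory available. A minimal sketch of that arithmetic follows. The helper names `weight_footprint_gb` and `suggest_setup` are hypothetical and not part of `llmcompressor` or `accelerate`, the 80 GB per-GPU figure is an assumption, and the estimate covers weights only (activations and any buffers used by the quantization algorithm are ignored).

```python
# Hypothetical helpers for illustration; not part of llmcompressor or accelerate.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_footprint_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def suggest_setup(
    num_params: float,
    gpu_mem_gb: float,
    num_gpus: int,
    needs_calibration: bool,
    dtype: str = "bfloat16",
) -> str:
    """Apply the README's rules of thumb to a rough, weights-only estimate."""
    weights_gb = weight_footprint_gb(num_params, dtype)
    if weights_gb < gpu_mem_gb:
        return "single GPU"
    if weights_gb < gpu_mem_gb * num_gpus:
        return "multi-GPU sharding"
    # Weights exceed total GPU memory, so device_map="auto" would offload to CPU.
    if needs_calibration:
        return "CPU offloading (slow with calibration data; add GPUs if possible)"
    return "CPU offloading"


# A 70B-parameter model in bfloat16, assuming 80 GB per GPU.
print(weight_footprint_gb(70e9))
print(suggest_setup(70e9, gpu_mem_gb=80, num_gpus=1, needs_calibration=False))
print(suggest_setup(70e9, gpu_mem_gb=80, num_gpus=8, needs_calibration=True))
```

Under these assumptions the 70B weights alone are about 140 GB, which is why the single-GPU example relies on CPU offloading with a data-free method while the calibration-based example spreads the model over several GPUs.
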
diff --git a/examples/big_models_with_accelerate/cpu_offloading_fp8.py b/examples/big_models_with_accelerate/cpu_offloading_fp8.py
@@ -0,0 +1,26 @@
+from transformers import AutoTokenizer
+
+from llmcompressor.modifiers.quantization import QuantizationModifier
+from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+
+MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
+OUTPUT_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+
+# Load model
+# Note: device_map="auto" will offload to CPU if not enough space on GPU.
+model = SparseAutoModelForCausalLM.from_pretrained(
+    MODEL_ID, device_map="auto", torch_dtype="auto"
+)
+
+# Configure the quantization scheme and algorithm (PTQ + FP8_DYNAMIC).
+recipe = QuantizationModifier(
+    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
+)
+
+# Apply quantization and save in `compressed-tensors` format.
+oneshot(
+    model=model,
+    recipe=recipe,
+    tokenizer=AutoTokenizer.from_pretrained(MODEL_ID),
+    output_dir=OUTPUT_DIR,
+)
diff --git a/examples/big_models_with_accelerate/multi_gpu_int8.py b/examples/big_models_with_accelerate/multi_gpu_int8.py
@@ -0,0 +1,74 @@
+from datasets import load_dataset
+from transformers import AutoTokenizer
+
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+
+MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
+SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic"
+
+# 1) Load model (device_map="auto" will shard the model over multiple GPUs!).
+model = SparseAutoModelForCausalLM.from_pretrained(
+    MODEL_ID,
+    device_map="auto",
+    torch_dtype="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+# 2) Prepare calibration dataset (in this case, we use ultrachat).
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
+
+# Select number of samples. 512 samples is a good place to start.
+# Increasing the number of samples can improve accuracy.
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 1024
+
+# Load dataset and preprocess.
+ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+
+def preprocess(example):
+    return {
+        "text": tokenizer.apply_chat_template(
+            example["messages"],
+            tokenize=False,
+        )
+    }
+
+
+ds = ds.map(preprocess)
+
+
+# Tokenize inputs.
+def tokenize(sample):
+    return tokenizer(
+        sample["text"],
+        padding=False,
+        max_length=MAX_SEQUENCE_LENGTH,
+        truncation=True,
+        add_special_tokens=False,
+    )
+
+
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+# 3) Configure algorithms: SmoothQuant, then GPTQ for int8 weights (static per
+#    channel); activations are quantized to int8 dynamically (per token).
+recipe = [
+    SmoothQuantModifier(smoothing_strength=0.8),  # assumed strength; tune as needed
+    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
+]
+
+# 4) Apply algorithms and save in `compressed-tensors` format.
+oneshot(
+    model=model,
+    tokenizer=tokenizer,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    output_dir=SAVE_DIR,
+)
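
The comments in `multi_gpu_int8.py` describe the `W8A8` scheme as static per-channel weight quantization plus dynamic per-token activation quantization. The sketch below illustrates only the scale granularity, using plain symmetric round-to-nearest in pure Python. It is not GPTQ, which additionally uses calibration data to compensate for rounding error, and it is not how `llmcompressor` implements the scheme. The helper names `quantize_rows` and `dequantize_rows` are hypothetical.

```python
from typing import List, Tuple

INT8_MAX = 127  # this sketch uses the symmetric range [-127, 127]


def quantize_rows(rows: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric round-to-nearest int8 quantization with one scale per row."""
    q_rows, scales = [], []
    for row in rows:
        absmax = max(abs(v) for v in row)
        # Guard against all-zero rows, where any scale works.
        scale = absmax / INT8_MAX if absmax > 0 else 1.0
        q_rows.append(
            [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in row]
        )
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Map int8 values back to floats using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Weights: one row per output channel. Scales are computed once and stored
# with the checkpoint ("static per channel").
weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]
w_q, w_scales = quantize_rows(weights)

# Activations: one row per token. Scales are recomputed from each input at
# runtime ("dynamic per token"), so they need no calibration data.
activations = [[3.0, -0.2, 1.5], [0.01, 0.02, -0.03]]
a_q, a_scales = quantize_rows(activations)

print(w_scales, a_scales)
```

Because each row gets its own scale, a channel or token with small values is not forced onto the coarse grid needed by a row with large values.
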