Fix typos in example for w8a8 quant (#38)
eldarkurtic authored Jul 26, 2024

1 parent d5be854 commit 780256c
Showing 1 changed file with 4 additions and 4 deletions.
examples/quantization_w8a8_int8/README.md
@@ -22,7 +22,7 @@ The example includes an end-to-end script for applying the quantization algorith
python3 llama3_example.py
```

-The resulting model `Meta-Llama-3-8B-Instruct-W8A8-INT8` is ready to be loaded into vLLM.
+The resulting model `Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token` is ready to be loaded into vLLM.

## Code Walkthrough

@@ -49,7 +49,7 @@ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

### 2) Prepare Calibration Data

-Prepare the calibration data. When quantizing activations of a model to `fp8`, we need some sample data to estimate the activation scales. As a result, it is very useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea.
+Prepare the calibration data. When quantizing activations of a model to `int8`, we need some sample data to estimate the activation scales. As a result, it is very useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea.

In our case, we are quantizing an Instruction tuned generic model, so we will use the `ultrachat` dataset. Some best practices include:
* 512 samples is a good place to start (increase if accuracy drops)
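
As a rough sketch of what this preparation can look like with Hugging Face `datasets` (the dataset split, sample count, and chat-template preprocessing below are illustrative assumptions, not part of this diff):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model id
NUM_CALIBRATION_SAMPLES = 512  # the starting point suggested above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Draw a random sample of chats and render each one with the model's chat
# template, so the calibration data matches the format seen at deployment.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
```
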
@@ -114,15 +114,15 @@ model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

-We have successfully created an `fp8` model!
+We have successfully created a `w8a8` model with weights and activations quantized to 8-bit integers!
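
To sanity-check the result, the saved `config.json` can be inspected for quantization metadata (the `quantization_config` key is an assumption about the serialization format, not shown in this diff):

```python
import json

# Load the saved model's config and print its quantization metadata, if any.
with open("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token/config.json") as f:
    config = json.load(f)
print(config.get("quantization_config"))
```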

### 4) Evaluate Accuracy

With the model created, we can now load and run it in vLLM (after installing vLLM).

```python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-FP8")
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
```
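
A short generation can serve as a smoke test (the prompt and sampling settings here are arbitrary):

```python
from vllm import LLM, SamplingParams

model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")

# Generate a few tokens to verify the quantized model loads and runs.
outputs = model.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```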

We can evaluate accuracy with `lm_eval` (`pip install lm_eval==v0.4.3`):
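
An invocation along these lines is typical (the task, few-shot count, and `add_bos_token` flag are illustrative assumptions, not part of this diff):

```bash
lm_eval --model vllm \
  --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
```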
