This directory contains an example script for quantizing the Mixtral-8x7B-Instruct-v0.1
model using the static per-tensor FP8 quantization scheme.
To get started, install the necessary dependencies by executing the following commands:
```bash
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```
The provided example script demonstrates an end-to-end process for applying the quantization algorithm:
```bash
python3 mixtral_moe_w8a8_fp8.py
```
This example leverages `llm-compressor` and `compressed-tensors` to create an FP8-quantized Mixtral-8x7B-Instruct-v0.1 model. The model is calibrated on the `open_platypus` dataset; no fine-tuning is required.
You can follow the detailed steps below or simply run the example script with:
```bash
python mixtral_moe_w8a8_fp8.py
```
In this step, you'll choose a baseline model for quantization, a dataset for calibration, and a quantization recipe.
- Models: Can be referenced from a local directory or retrieved from the Hugging Face Hub.
- Datasets: Can also be from a local directory or the Hugging Face Hub.
- Recipes: These are YAML files or Python modifier objects that describe how a model should be optimized during or after training. In this example, we use a `QuantizationModifier` object with the scheme set to `FP8`.
```python
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(scheme="FP8", targets="Linear", ignore=["lm_head", "re:.*block_sparse_moe.gate"])
```
NOTE: The `.*block_sparse_moe.gate` layers do not quantize well, hence they are ignored!
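Before applying the recipe, the baseline model and calibration dataset need to be defined. The sketch below shows one minimal way to do this; the Hugging Face model stub and the use of `transformers.AutoModelForCausalLM` are illustrative assumptions, and a local model directory works just as well.

```python
# Minimal, illustrative setup of the baseline model and calibration dataset.
# The Hugging Face stub below is an assumption; a local directory works too.
from transformers import AutoModelForCausalLM

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)

# Name of the calibration dataset used in this example.
dataset = "open_platypus"
```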
The `oneshot` method applies the selected recipe to your model and dataset without requiring any fine-tuning. The model will be quantized and saved to `Mixtral-8x7B-Instruct-v0.1-FP8`.
```python
from llmcompressor.transformers import oneshot

output_dir = "Mixtral-8x7B-Instruct-v0.1-FP8"

oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    save_compressed=True,
    output_dir=output_dir,
    overwrite_output_dir=True,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```
NOTE: Only per-tensor quantization is supported in vLLM as of now (`vllm==0.6.1`).
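Once saved with `save_compressed=True`, the quantized checkpoint can be loaded directly by vLLM. The snippet below is a minimal sketch of offline inference, assuming vLLM is installed and the output directory from the step above:

```python
# Illustrative: run the FP8-quantized checkpoint with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(model="Mixtral-8x7B-Instruct-v0.1-FP8")
outputs = llm.generate(
    ["What is FP8 quantization?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```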
The repository supports multiple quantization techniques configured via a recipe. Supported strategies include `tensor`, `group`, and `channel` quantization.
In the above example, FP8 per-tensor quantization is used as specified by the `FP8` scheme. For other preset schemes, refer to the quantization schemes in the `compressed-tensors` library.

A custom scheme can also be specified using `config_groups`:
```python
# Example of defining a custom quantization scheme

from llmcompressor.modifiers.quantization.gptq import GPTQModifier

config_groups = {
    "group_0": {
        "targets": ["Linear"],
        "input_activations": None,
        "output_activations": None,
        "weights": {
            "num_bits": 8,
            "type": "int",
            "symmetric": True,
            "strategy": "group",
            "group_size": 128,
        },
    },
}

recipe = GPTQModifier(config_groups=config_groups)
```
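However the recipe is defined, it is applied through the same `oneshot` call shown earlier. The sketch below reuses the `model` and `dataset` from above; the output directory name is only an illustrative choice for this custom scheme:

```python
# Illustrative: apply the custom GPTQ recipe with the same oneshot flow.
from llmcompressor.transformers import oneshot

oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,  # GPTQModifier recipe defined above
    save_compressed=True,
    output_dir="Mixtral-8x7B-Instruct-v0.1-W8A16-G128",  # example name
    max_seq_length=2048,
    num_calibration_samples=512,
)
```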