This project focuses on model quantization techniques to reduce model size and potentially improve inference speed. Currently, it supports quantization in a packed format.
- Quantization: Implemented GPTQ and tested 4-bit quantization of a Llama 3.3 4B-parameter model, storing the weights in a packed format (see the sketch after this list).
- Inference: Inference with the quantized models has not been run yet.
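The project's actual quantization logic lives in `quantize.py` and the notebook. Purely as a conceptual sketch (the names below are hypothetical, not the repo's API), symmetric per-channel 4-bit round-to-nearest quantization looks like this; GPTQ improves on plain rounding by using second-order (Hessian) information to compensate quantization error, but it produces codes in the same 4-bit range:

```python
import torch

def quantize_4bit_per_channel(w: torch.Tensor):
    """Symmetric per-output-channel 4-bit round-to-nearest quantization.

    Illustrative sketch only. w has shape (out_features, in_features);
    returns integer codes in [-8, 7] plus one scale per output channel.
    """
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0  # map the largest weight to ±7
    scale = scale.clamp(min=1e-8)                    # guard against all-zero rows
    q = torch.round(w / scale).clamp(-8, 7)          # signed 4-bit range [-8, 7]
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(256, 256)
q, scale = quantize_4bit_per_channel(w)
print((w - dequantize(q, scale)).abs().mean())  # mean rounding error
```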
- `Run_quantization.ipynb`: Jupyter notebook demonstrating the quantization process.
- `quantize.py`: Core script for performing model quantization.
- `pack_quantized.py`: Script for packing the quantized model components.
- `helpers.py`: Utility functions supporting the quantization process.
- `load_datasets.py`: Script for loading datasets used during quantization.
- `requirements.txt`: Lists the necessary Python packages for this project.
- `hooks.py`: Contains hooks used during the quantization process.
- `test.py`: Contains tests for the project.
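`pack_quantized.py` implements the project's actual packed format. As a general illustration only (hypothetical helpers, not the script's API), two signed 4-bit codes can be stored per byte like this:

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack pairs of signed 4-bit codes (values in [-8, 7]) into one uint8 each."""
    assert q.shape[-1] % 2 == 0
    u = (q.to(torch.int16) & 0xF).to(torch.uint8)  # two's-complement nibbles
    return u[..., 0::2] | (u[..., 1::2] << 4)      # even index -> low nibble

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover signed 4-bit codes as int8."""
    low = (packed & 0xF).to(torch.int8)
    high = (packed >> 4).to(torch.int8)
    q = torch.stack((low, high), dim=-1).flatten(-2)
    return torch.where(q >= 8, q - 16, q)          # restore the sign bit

q = torch.randint(-8, 8, (4, 8), dtype=torch.int8)
assert torch.equal(unpack_int4(pack_int4(q)), q)   # round-trips exactly
```

This halves storage relative to int8 and cuts it to a quarter of fp16, which is the size reduction the packed format is after.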
- Clone the repository.
- Create a Python virtual environment (recommended):
```bash
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`
```
- Install the required packages:
```bash
pip install -r requirements.txt
```
The primary way to run quantization is the `Run_quantization.ipynb` notebook, which walks through the process step by step.
Alternatively, the scripts (`quantize.py`, `pack_quantized.py`) can be run directly, though the notebook is the recommended starting point.
- Run inference with the quantized models (a first validation sketch follows this list).
- Test and validate performance on a wider range of models and datasets.
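As a starting point for the inference and validation work, here is a hedged, self-contained sketch (hypothetical names; plain round-to-nearest rather than GPTQ) that quantizes one linear layer to 4 bits, dequantizes it, and measures how much its outputs drift. Real GPTQ weights should drift less at the same bit width:

```python
import torch

# Quantize a float linear layer to signed 4-bit codes (per-channel, symmetric),
# dequantize, and compare outputs against the original float layer.
lin = torch.nn.Linear(512, 512, bias=False)
w = lin.weight.data                               # shape (out, in)
scale = w.abs().amax(dim=1, keepdim=True) / 7.0
q = torch.round(w / scale).clamp(-8, 7)
w_hat = q * scale                                 # dequantized 4-bit weights

x = torch.randn(16, 512)
err = (x @ w.t() - x @ w_hat.t()).abs().mean()
print(f"mean |output drift| = {err.item():.4f}")
```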