GPTQModel

Production ready LLM model compression/quantization toolkit with accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.

Latest News

05/19/2025 4.0.0-dev main: Qwen 2.5 Omni model support.
05/05/2025 4.0.0-dev main: Python 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x Cpu Core scaling of packing stage.
04/29/2025 3.1.0-dev (Now 4.) main: Xiaomi Mimo model support. Qwen 3 and 3 MoE model support. New arg for quantize(..., calibration_dataset_min_length=10) to filter out bad calibration data that exists in public dataset (wikitext).
04/13/2025 3.0.0: 🎉 New ground-breaking GPTQ v2 quantization option for improved model quantization accuracy validated by GSM8K_PLATINUM benchmarks vs original gptq. New Phi4-MultiModal model support . New Nvidia Nemotron-Ultra model support. New Dream model support. New experimental multi-gpu quantization support. Reduced vram usage. Faster quantization.
04/2/2025 2.2.0: New Qwen 2.5 VL model support. New samples log column during quantization to track module activation in MoE models. Loss log column now color-coded to highlight modules that are friendly/resistant to quantization. Progress (per-step) stats during quantization now streamed to log file. Auto bfloat16 dtype loading for models based on model config. Fix kernel compile for Pytorch/ROCm. Slightly faster quantization and auto-resolve some low-level oom issues for smaller vram gpus.
03/12/2025 2.1.0: ✨ New QQQ quantization method and inference support! New Google Gemma 3 zero-day model support. New Alibaba Ovis 2 VL model support. New AMD Instella zero-day model model support. New GSM8K Platinum and MMLU-Pro benchmarking suppport. Peft Lora training with GPTQModel is now 30%+ faster on all gpu and IPEX devices. Auto detect MoE modules not activated during quantization due to insufficient calibration data. ROCm setup.py compat fixes. Optimum and Peft compat fixes. Fixed Peft bfloat16 training.
03/03/2025 2.0.0: 🎉 GPTQ quantization internals are now broken into multiple stages (processes) for feature expansion. Synced Marlin kernel inference quality fix from upstream. Added MARLIN_FP16, lower-quality but faster backend. ModelScope support added. Logging and cli progress bar output has been revamped with sticky bottom progress. Fixed generation_config.json save and load. Fixed Transformers v4.49.0 compat. Fixed compat of models without bos. Fixed group_size=-1 and bits=3 packing regression. Fixed Qwen 2.5 MoE regressions. Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes. Delegate loggin/progressbar to LogBar pkg. Fix ROCm version auto detection in setup install.

Archived News

What is GPTQModel?

GPTQModel is a production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF Transformers, vLLM, and SGLang.

Public and ModelCloud's internal tests have shown that GPTQ is on-par and/or exceeds other 4bit quantization methods in terms of both quality recovery and production-level inference speed for token latency and rps. GPTQ has the optimal blend of quality and inference speed you need in a real-world production deployment.

GPTQModel not only supports GPTQ but also QQQ, GPTQv2, Eora with more quantization methods and enhancements planned.

Quantization Support

GPTQModel is a modular design supporting multiple quantization methods and feature extensions.

Quantization Feature	GPTQModel	Transformers	vLLM	SGLang	Lora Training
GPTQ	✅	✅	✅	✅	✅
EoRA	✅	✅	✅	✅	x
GPTQ v2	✅	✅	✅	✅	✅
QQQ	✅	x	x	x	x
Rotation	✅	x	x	x	x

Multi-Modal

Native support support some of the most popular multi-modal models:

Multi-Modal
Qwen 2.5 Omni	✅
Qwen2 VL	✅
Ovis 1.6 + 2	✅
Phi-4 MultiModal	✅

GPTQ v2 quantization unlocks useful utral-low bit quantization

Features

✨ Native integration with HF Transformers, Optimum, and Peft (main)
🚀 vLLM and SGLang inference integration for quantized model with format = FORMAT.GPTQ
🚀 Extensive model support for: Ovis VL, Llama 1-3.3, Qwen2-VL, Olmo2, Hymba, GLM, IBM Granite, Llama 3.2 Vision, MiniCPM3, GRIN-Moe, Phi 1-4, EXAONE 3.0, InternLM 2.5, Gemma 2, DeepSeek-V2, DeepSeek-V2-Lite, ChatGLM, MiniCPM, Qwen2MoE, DBRX.
✨ Linux, MacOS, Windows platform quantization and accelerated inference support for CUDA (Nvidia), XPU (Intel), ROCm (AMD), MPS (Apple Silicon), CPU (Intel/AMD/Apple Silicon).
💯 100% CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
✨ Dynamic mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization all together.
🚀 Intel/IPEX hardware accelerated quantization/inference for CPU [avx, amx, xmx] and Intel GPU [Arc + Datacenter Max].
🚀 Microsoft/BITBLAS format + dynamically compiled inference.
✨ Intel/AutoRound alternative gptq-inference compatible quantization method.
✨ Asymmetric Sym=False support. Model weights sharding support with optional hash check of model weights on load.
✨ lm_head module quant inference support for further VRAM reduction.
🚀 45% faster packing stage in quantization (Llama 3.1 8B). 50% faster PPL calculations (OPT).

Quality: GPTQ 4bit (5.0 bpw) can match BF16:

🤗 ModelCloud quantized Vortex models on HF

Model Support

Model
Baichuan	✅	Falcon	✅	InternLM 1/2.5	✅	OPT	✅	TeleChat2	✅
Bloom	✅	Gemma 1/2/3	✅	Llama 1-3.3	✅	OLMo2	✅	Yi	✅
ChatGLM	✅	GPTBigCod	✅	Llama 3.2 VL	✅	Ovis 1.6/2	✅	XVERSE	✅
CodeGen	✅	GPTNeoX	✅	LongLLaMA	✅	Phi 1-4	✅
Cohere 1-2	✅	GPT-2	✅	MiniCPM3	✅	Qwen 1/2/3	✅
DBRX Converted	✅	GPT-J	✅	Mistral	✅	Qwen 2/3 MoE	✅
Deci	✅	Granite	✅	Mixtral	✅	Qwen 2/2.5 VL	✅
DeepSeek-V2/V3/R1	✅	GRIN-MoE	✅	MobileLLM	✅	Qwen 2.5 Omni	✅
DeepSeek-V2-Lite	✅	Hymba	✅	MOSS	✅	RefinedWeb	✅
Dream	✅	Instella	✅	MPT	✅	StableLM	✅
EXAONE 3.0	✅			Nemotron Ultra	✅	StarCoder2	✅

Platform and HW Support

GPTQModel is validated for Linux, MacOS, and Windows 11:

Platform	Device		Optimized Arch	Kernels
🐧 Linux	Nvidia GPU	✅	`Ampere+`	Marlin, Exllama V2, Exallma V1, Triton, Torch
🐧 Linux	Intel XPU	✅	`Arc`, `Datacenter Max`	IPEX, Torch
🐧 Linux	AMD GPU	✅	`7900XT+`, `ROCm 6.2+`	Exllama V2, Exallma V1, Torch
🐧 Linux	Intel/AMD CPU	✅	`avx`, `amx`, `xmx`	IPEX, Torch
🍎 MacOS	GPU (Metal) / CPU	✅	`Apple Silicon`, `M1+`	Torch, MLX via conversion
🪟 Windows	GPU (Nvidia) / CPU	✅	`Nvidia`	Torch

Install

PIP/UV

# You can install optional modules like autoround, ipex, vllm, sglang, bitblas, and ipex.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas,ipex,auto_round]
pip install -v gptqmodel --no-build-isolation 
uv pip install -v gptqmodel --no-build-isolation

Install from source

# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# pip: compile and install
# You can install optional modules like autoround, ipex, vllm, sglang, bitblas, and ipex.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas,ipex,auto_round]
pip install -v . --no-build-isolation

Inference

Three line api to use GPTQModel for gptq model inference:

from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output

To use models from ModelScope instead of HuggingFace Hub, set an environment variable:

export GPTQMODEL_USE_MODELSCOPE=True

from gptqmodel import GPTQModel
# load Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 from modelscope
model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output

OpenAI API compatible end-point

# load model using above inference guide first
model.serve(host="0.0.0.0",port="12345")

Quantization

Basic example of using GPTQModel to quantize a llm model:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
  ).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)

Quantization using GPTQ V2

Enable GPTQ v2 quantization by setting v2 = True for potentially higher post-quantization accuracy recovery.

# note v2 is currently experiemental and requires 2-4x more vram to execute
# if oom on 1 gpu, please set CUDA_VISIBLE_DEVICES=0,1 to 2 gpu and gptqmodel will auto use second gpu
quant_config = QuantizeConfig(bits=4, group_size=128, v2=True)

Llama 3.1 8B-Instruct quantized using test/models/test_llama3_2.py

Method	Bits/Group Size	ARC_CHALLENGE	GSM8K_Platinum_COT
GPTQ	4 / 128	49.15	48.30
GPTQ v2	4 / 128	49.74 👍 +1.20%	61.46 🔥 +27.25%
GPTQ	3 / 128	39.93	43.26
GPTQ v2	3 / 128	41.13 👍 +3.01%	50.54 🔥 +16.83%

Quantization Inference

# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output

Quantization + EoRA Accuracy Recovery

GPTQModel now support EoRA, a LoRA method that can further imporve the accuracy of the quantized model

# higher rank improves accuracy at the cost of vram usage
# suggestion: test rank 64 and 32 before 128 or 256 as latter may overfit while increasing memory usage
eora = Lora(
  # for eora generation, path is adapter save path; for load, it is loading path
  path=f"{quant_path}/eora_rank32", 
  rank=32,
)

# provide a previously gptq quantized model path
GPTQModel.adapter.generate(
  adapter=eora,
  model_id_or_path=model_id,
  quantized_model_id_or_path=quant_path,
  calibration_dataset=calibration_dataset,
  calibration_dataset_concat_size=0,
  auto_gc=False)

# post-eora inference
model = GPTQModel.load(
  model_id_or_path=quant_path,
  adapter=eora
)

tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)

print(f"Result: {result}")
# For more detail of EoRA please see GPTQModel/examples/eora
# Please use the benchmark tools in later part of this README to evaluate EoRA effectiveness

For more advanced features of model quantization, please reference to this script

How to Add Support for a New Model

Read the gptqmodel/models/llama.py code which explains in detail via comments how the model support is defined. Use it as guide to PR for to new models. Most models follow the same pattern.

Evaluation and Quality Benchmarks

GPTQModel inference is integrated into both lm-eval and evalplus
We highly recommend avoid using ppl and use lm-eval/evalplus to validate post-quantization model quality. ppl should only be used for regression tests and is not a good indicator of model output quality.

# gptqmodel is integrated into lm-eval >= v0.4.7
pip install lm-eval>=0.4.7

# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"

Below is a basic sample using GPTQModel.eval API

from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"

# Use `lm-eval` as framework to evaluate the model
lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.LM_EVAL.ARC_CHALLENGE], output_file='lm-eval_result.json')

# Use `evalplus` as framework to evaluate the model
evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL.EVALPLUS.HUMAN], output_file='evalplus_result.json')

Dynamic Quantization (Per Module QuantizeConfig Override)

QuantizeConfig.dynamic is dynamic control which allows specific matching modules to be skipped for quantization (negative matching) or have a unique [bits, group_size, sym, desc_act, mse, pack_dtype] property override per matching module vs base QuantizeConfig (postive match with override).

Sample QuantizerConfig.dynamic usage:

dynamic = { 
    # `.*\.` matches the layers_node prefix 
    # layer index start at 0 
    
    # positive match: layer 19, gate module 
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},  
    
    # positgive match: layer 20, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},  
    
    # negative match: skip layer 21, gate module
    r"-:.*\.20\..*gate.*": {}, 
    
    # negative match: skip all down modules for all layers
    r"-:.*down.*": {},  
 }

Experimental Features

GPTQ v2: set v2=True in quantization config.
Multi-GPU Quantization: set CUDA_VISIBLE_DEVICES=0,1 to two devices and GPTQModel will use second gpu for quantization.
Pass auto_gc = False to quantize() api to speed up quantization if gpu has plenty of vram and does not need to call slow gc.
Pass buffered_fwd = True to quantize() api to potentially speed up quantization if gpu has plenty of vram and can hold all fwd inputs in vram.

Attribution of Quantization Methods:

GPTQ (v1): IST-DASLab, main-author: Elias Frantar, arXiv:2210.17323
GPTQ (v2): Yale Intelligent Computing Lab, main-author: Yuhang Li, arXiv:2504.02692
QQQ: Meituan, main-author Ying Zhang, arXiv:2406.09904

Citation

# GPTQModel
@misc{qubitium2024gptqmodel,
  author = {ModelCloud.ai and qubitium@modelcloud.ai},
  title = {GPTQModel},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
  note = {Contact: qubitium@modelcloud.ai},
  year = {2024},
}

# GPTQ
@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers}, 
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
  
}

# GPTQ v2
@article{li2025gptqv2,
  title={GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration}, 
  author={Yuhang Li and Ruokai Yin and Donghyun Lee and Shiting Xiao and Priyadarshini Panda},
  journal={arXiv preprint arXiv:2504.02692},
  year={2025}
}

# EoRA
@article{liu2024eora,
  title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author={Liu, Shih-Yang and Yang, Huck and Wang, Chien-Yi and Fung, Nai Chit and Yin, Hongxu and Sakr, Charbel and Muralidharan, Saurav and Cheng, Kwang-Ting and Kautz, Jan and Wang, Yu-Chiang Frank and others},
  journal={arXiv preprint arXiv:2410.21271},
  year={2024}
}

# GPTQ Marlin Kernel
@article{frantar2024marlin,
  title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
  author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2408.11743},
  year={2024}
}

# QQQ 
@article{zhang2024qqq,
      title={QQQ: Quality Quattuor-Bit Quantization for Large Language Models}, 
      author={Ying Zhang and Peng Zhang and Mincong Huang and Jingyang Xiang and Yujie Wang and Chao Wang and Yineng Zhang and Lei Yu and Chuan Liu and Wei Lin},
      journal={arXiv preprint arXiv:2406.09904},
      year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 2,188 Commits
.github		.github
chat		chat
examples		examples
format		format
gptqmodel		gptqmodel
gptqmodel_ext		gptqmodel_ext
licenses		licenses
tests		tests
.gitignore		.gitignore
CREDITS.md		CREDITS.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py
upload_model.py		upload_model.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GPTQModel

Latest News

What is GPTQModel?

Quantization Support

Multi-Modal

GPTQ v2 quantization unlocks useful utral-low bit quantization

Features

Quality: GPTQ 4bit (5.0 bpw) can match BF16:

Model Support

Platform and HW Support

Install

PIP/UV

Install from source

Inference

OpenAI API compatible end-point

Quantization

Quantization using GPTQ V2

Quantization Inference

Quantization + EoRA Accuracy Recovery

How to Add Support for a New Model

Evaluation and Quality Benchmarks

Dynamic Quantization (Per Module QuantizeConfig Override)

Experimental Features

Attribution of Quantization Methods:

Citation

About

Uh oh!

Releases 47

Packages

Uh oh!

Contributors 74

Uh oh!

Languages

License

ModelCloud/GPTQModel

Folders and files

Latest commit

History

Repository files navigation

GPTQModel

Latest News

What is GPTQModel?

Quantization Support

Multi-Modal

GPTQ v2 quantization unlocks useful utral-low bit quantization

Features

Quality: GPTQ 4bit (5.0 bpw) can match BF16:

Model Support

Platform and HW Support

Install

PIP/UV

Install from source

Inference

OpenAI API compatible end-point

Quantization

Quantization using GPTQ V2

Quantization Inference

Quantization + EoRA Accuracy Recovery

How to Add Support for a New Model

Evaluation and Quality Benchmarks

Dynamic Quantization (Per Module QuantizeConfig Override)

Experimental Features

Attribution of Quantization Methods:

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 47

Packages 0

Uh oh!

Contributors 74

Uh oh!

Languages

Packages