GPTQModel v0.9.10
What's Changed
Ported the vllm/nm gptq_marlin inference kernel, with expanded bits (8-bit), group_size (64, 32), and desc_act support, for all GPTQ models with format = FORMAT.GPTQ. Auto-round nsamples/seqlen parameters are now auto-calculated from the calibration dataset. Fixed save_quantized() when called on pre-quantized models with unsupported backends. HF Transformers dependency updated to ensure the Llama 3.1 fixes are correctly applied at both the quantization and inference stages.
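A minimal loading sketch for the new kernel, assuming the `GPTQModel.from_quantized()` loader and `BACKEND` enum from the library's public API (the model id below is a placeholder; any FORMAT.GPTQ checkpoint applies):

```python
# Sketch: load a pre-quantized FORMAT.GPTQ checkpoint on the ported
# gptq_marlin kernel. BACKEND.MARLIN and from_quantized() are assumed
# from the existing public API; verify against your installed version.
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.from_quantized(
    "ModelCloud/example-gptq-model",  # placeholder id; 8-bit, group_size 64/32,
    backend=BACKEND.MARLIN,           # and desc_act checkpoints are now supported
)
```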
- [CORE] add marlin inference kernel by @ZX-ModelCloud in #310
- [CI] Increase timeout to 40m by @CSY-ModelCloud in #295, #299
- [FIX] save_quantized() by @ZX-ModelCloud in #296
- [FIX] auto-round nsamples/seqlen now match the actual size of calibration_dataset (see the sketch after this list) by @LRL-ModelCloud in #297, #298
- Update HF transformers to 4.43.3 by @Qubitium in #305
- [CI] remove test_marlin_hf_cache_serialization() by @ZX-ModelCloud in #314
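
As referenced above, a hedged sketch of the quantize-and-save flow touched by the nsamples/seqlen and save_quantized() fixes. The names `QuantizeConfig`, `from_pretrained()`, `quantize()`, and `save_quantized()` are assumed from the existing GPTQModel API, and the calibration rows and output path are placeholders:

```python
# Sketch of the quantize -> save flow affected by #296/#297/#298.
# API names are assumed from the existing GPTQModel interface;
# verify against your installed version.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=8, group_size=64, desc_act=True)
model = GPTQModel.from_pretrained("meta-llama/Meta-Llama-3.1-8B", quant_config)

# For auto-round, nsamples/seqlen are now derived from this dataset's
# actual size; no manual override is needed (placeholder rows below).
calibration_dataset = [
    "gptqmodel is an llm model quantization toolkit.",
    "marlin is a mixed-precision gptq inference kernel.",
]

model.quantize(calibration_dataset)
model.save_quantized("Llama-3.1-8B-gptq-8bit")  # path is a placeholder
```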
Full Changelog: v0.9.9...v0.9.10