GPTQModel v1.4.0
What's Changed
⚡ EvalPlus harness integration merged upstream. We now support both lm-eval and EvalPlus.
⚡ Added pure-torch Torch kernel.
⚡ Refactored the Cuda kernel into the DynamicCuda kernel.
⚡ Triton kernel is now auto-padded for max model support.
⚡ Dynamic quantization now supports both positive (`+:`, the default) and negative (`-:`) matching; negative matches allow the matched modules to be skipped entirely for quantization.
⚡ Added auto-kernel fallback for unsupported kernel/module pairs.
🐛 Fixed auto-Marlin kernel selection.
🗑 Deprecated saving of the Marlin weight format. The Marlin kernel auto-converts the gptq format to Marlin at runtime, and the gptq format retains maximum kernel flexibility, including Marlin kernel support.
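The positive/negative matching for Dynamic quantization can be sketched in plain Python. This is an illustrative sketch only: the rule keys, the `+:`/`-:` prefix handling, and the `resolve_dynamic` helper are assumptions about the matching behavior described above, not GPTQModel's actual implementation.

```python
import re

def resolve_dynamic(dynamic_rules, module_name):
    """Illustrative matcher: '+:'-prefixed (or unprefixed) regex keys
    apply per-module quantization overrides; '-:'-prefixed keys mark
    matching modules to be skipped entirely.
    Returns an override dict, or None if the module should be skipped."""
    for rule, overrides in dynamic_rules.items():
        if rule.startswith("-:"):
            if re.match(rule[2:], module_name):
                return None  # negative match: skip this module
        else:
            pattern = rule[2:] if rule.startswith("+:") else rule
            if re.match(pattern, module_name):
                return overrides  # positive match: apply overrides
    return {}  # no rule matched: fall back to the global quantize config

# Hypothetical rules: 8-bit for MLP modules, skip lm_head entirely.
rules = {
    r"+:.*\.mlp\..*": {"bits": 8},
    r"-:.*\.lm_head$": {},
}
```

Under these assumed rules, an MLP projection resolves to the 8-bit override, `lm_head` is skipped, and any other module falls through to the global config.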
Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merges.
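The Triton auto-padding called out above can be illustrated with a small sketch. The alignment constants (group_size, a 32-wide kernel block) and the zero-padding strategy are assumptions for illustration only, not the kernel's actual constraints.

```python
def pad_to_multiple(n, multiple):
    """Round n up to the nearest multiple (no-op if already aligned)."""
    return -(-n // multiple) * multiple

def padded_shape(in_features, out_features, group_size=128, block=32):
    # Assumption for illustration: in_features must align to group_size
    # and out_features to the kernel block size; the extra rows/columns
    # would be zero-padded so outputs are unchanged.
    return (pad_to_multiple(in_features, group_size),
            pad_to_multiple(out_features, block))
```

For example, a layer whose dimensions are already aligned passes through unchanged, while an odd-sized layer is rounded up to the next supported shape.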
- Remove Marlin old kernel and Marlin format saving. Marlin[new] is still supported via inference. by @CSY-ModelCloud in #714
- Remove marlin(old) kernel codes & do ruff by @CSY-ModelCloud in #719
- [FIX] gptq v2 load by @ZX-ModelCloud in #724
- Add hf_convert_gptq_v1_to_v2_format, hf_convert_gptq_v2_to_v1_format,… by @LRL-ModelCloud in #727
- If using the ipex quant linear, no need to convert by @LRL-ModelCloud in #730
- hf_select_quant_linear add device_map by @LRL-ModelCloud in #732
- Add TorchQuantLinear by @ZX-ModelCloud in #735
- Add QUANT_TYPE in qlinear by @jiqing-feng in #736
- Replace error with warning for Intel CPU check by @CSY-ModelCloud in #737
- Add BACKEND.AUTO_CPU by @LRL-ModelCloud in #739
- Fix ipex linear check by @jiqing-feng in #741
- Fix select quant linear by @jiqing-feng in #742
- Now meta.quantizer value can be an array by @ZX-ModelCloud in #744
- Receive checkpoint_format argument by @ZX-ModelCloud in #747
- Modify hf convert gptq v2 to v1 format by @ZX-ModelCloud in #749
- update score max negative delta by @CSY-ModelCloud in #748
- [CI] max parallel jobs 10 by @CSY-ModelCloud in #751
- hymba got high score by @CSY-ModelCloud in #752
- hf_select_quant_linear() always set pack=True by @ZX-ModelCloud in #754
- Refactor CudaQuantLinear to DynamicCudaQuantLinear by @ZX-ModelCloud in #759
- Remove filename prefix on qlinear dir by @ZX-ModelCloud in #760
- Replace Nvidia-smi with devicesmi by @CSY-ModelCloud in #761
- Fix XPU training by @jiqing-feng in #763
- Fix auto marlin kernel selection by @CSY-ModelCloud in #765
- Add BaseQuantLinear SUPPORTS_TRAINING declaration by @LRL-ModelCloud in #766
- Add Eval() api to support LM-Eval or EvalPlus benchmark harnesses by @CL-ModelCloud in #750
- Fix validate_device by @LRL-ModelCloud in #769
- Force BaseQuantLinear properties to be explicitly declared by all QuantLinears by @ZX-ModelCloud in #767
- Convert str backend to enum backend by @LRL-ModelCloud in #772
- Remove nested list in dict by @CSY-ModelCloud in #774
- Fix training qlinear by @LRL-ModelCloud in #777
- Check kernel by @CSY-ModelCloud in #764
- BACKEND.AUTO if backend is None by @LRL-ModelCloud in #781
- Fix lm_head quantize test by @CSY-ModelCloud in #784
- Fix exllama not supporting 8-bit by @CSY-ModelCloud in #790
- Use set() to avoid calling torch twice by @CSY-ModelCloud in #791
- Fix ipex cpu backend import error and excessive logging by @jiqing-feng in #793
- Eval API opt by @CL-ModelCloud in #794
- Fixed ipex linear param check and logging once by @jiqing-feng in #795
- Check device before sync by @LRL-ModelCloud in #796
- Only AUTO will try other quant linears by @CSY-ModelCloud in #797
- Add SUPPORTS_AUTO_PADDING property to QuantLinear by @LRL-ModelCloud in #799
- Dynamic now support skipping modules/layers by @CSY-ModelCloud in #804
- Fix modules being looped over even when skipped by @CSY-ModelCloud in #806
- Make Triton kernel auto-pad on features/group_size by @LRL-ModelCloud in #808
Full Changelog: v1.3.1...v1.4.0