Releases: ModelCloud/GPTQModel
GPTQModel v1.6.0
What's Changed
⚡ 25% faster quantization. 35% reduction in vram usage vs v1.5. 👀
🎉 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU.
💫 Auto-tokenizer loader via load() api. For most models you no longer need to manually init a tokenizer for both inference and quantization.
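A minimal sketch of what the auto-tokenizer change means for a quantization script. It assumes the `GPTQModel.load()`, `quantize()` and `save()` calls behave as in the project's README of the time, and that `quantize()` accepts raw calibration strings. The wrapper function and its parameter names are illustrative, not part of the library.

```python
from typing import Sequence


def quantize_with_auto_tokenizer(
    model_id: str,
    calibration_texts: Sequence[str],
    save_dir: str,
    batch_size: int = 2,
) -> None:
    """Quantize `model_id` to 4-bit GPTQ without creating a tokenizer by hand."""
    # Imported lazily so this module can be imported without gptqmodel installed.
    from gptqmodel import GPTQModel, QuantizeConfig

    quant_config = QuantizeConfig(bits=4, group_size=128)

    # No manual tokenizer init: load() is expected to resolve it for most models.
    model = GPTQModel.load(model_id, quant_config)

    # A larger batch_size can speed up quantization at the cost of more VRAM.
    model.quantize(list(calibration_texts), batch_size=batch_size)

    model.save(save_dir)
```

Raising `batch_size` until VRAM is nearly full is the tuning knob that the `batch_size` note in this release refers to.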
- note about batch_size to speed up quant by @Qubitium in #992
- Add ROCm support by @CSY-ModelCloud in #993
- Add bits test by @ZX-ModelCloud in #995
- note about rocm support by @Qubitium in #998
- [FIX] wrong variable name by @ZX-ModelCloud in #997
- update rocm version tag by @CSY-ModelCloud in #999
- Auto-tokenizer will be called within load() by @LRL-ModelCloud in #996
- update transformers by @Qubitium in #1001
- [FIX] torch qlinear forward by @ZX-ModelCloud in #1002
- cleanup marlin info by @Qubitium in #1004
- Use custom forward hook by @LRL-ModelCloud in #1003
- fix hooked linear init by @LRL-ModelCloud in #1011
- add HookedConv1D by @LRL-ModelCloud in #1012
- record fwd time by @LRL-ModelCloud in #1013
- add PYTORCH_CUDA_ALLOC_CONF for global & do ruff by @CSY-ModelCloud in #1015
- [FIX] quantize_config could not read from config.json by @ZX-ModelCloud in #1022
- Fix quant time by @LRL-ModelCloud in #1025
- fix forward hook by @LRL-ModelCloud in #1027
- Fix hooked conv2d by @LRL-ModelCloud in #1030
- clean cache by @CL-ModelCloud in #1032
Full Changelog: v1.5.1...v1.6.0
GPTQModel v1.5.1
What's Changed
🎉 2025!
⚡ Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on cpu by default and each layer is moved to QuantizeConfig.device during quantization to minimize vram usage.
💫 Improve QuantLinear selection from optimum.
🐛 Fix attn_implementation_autoset compat in latest transformers.
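A back-of-envelope sketch of why moving one layer at a time to QuantizeConfig.device lowers peak VRAM. The layer count and sizes are hypothetical, and the estimate covers weights only (no activations, calibration inputs, or quantization workspace), so read it as an illustration of the reasoning rather than a measurement.

```python
def peak_device_memory_gib(layer_sizes_gib, per_layer=True):
    """Estimate peak weight memory on the quantization device.

    per_layer=True: only one layer is resident at a time (peak = largest layer).
    per_layer=False: the whole model is resident (peak = sum of all layers).
    """
    if not layer_sizes_gib:
        return 0.0
    return max(layer_sizes_gib) if per_layer else sum(layer_sizes_gib)


# Hypothetical model: 32 decoder layers of 0.4 GiB each.
layers = [0.4] * 32
whole_model = peak_device_memory_gib(layers, per_layer=False)
one_layer = peak_device_memory_gib(layers, per_layer=True)

print(f"whole model on device: {whole_model:.1f} GiB")
print(f"one layer at a time:   {one_layer:.1f} GiB")
```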
- Add QuantizeConfig.device and use. by @Qubitium in #950
- fix hf_select_quant_linear by @LRL-ModelCloud in #966
- update vllm gptq_marlin code by @ZX-ModelCloud in #967
- fix cuda:0 not a enum device by @CSY-ModelCloud in #968
- fix marlin info for non-cuda device by @Qubitium in #972
- fix backend str bug by @CL-ModelCloud in #973
- hf select quant_linear with pack by @LRL-ModelCloud in #969
- remove auto select BACKEND.IPEX by @CSY-ModelCloud in #975
- fix autoround received a device_map by @CSY-ModelCloud in #976
- use enum instead of magic number by @CSY-ModelCloud in #979
- use new ci docker images by @CSY-ModelCloud in #980
- fix flash attntion was auto loaded on cpu for pretrained model by @CSY-ModelCloud in #981
- fix old transformer doesn't have _attn_implementation_autoset by @CSY-ModelCloud in #982
- fix gptbigcode test temporally by @CSY-ModelCloud in #983
- fix version parsing by @CSY-ModelCloud in #985
Full Changelog: v1.5.0...v1.5.1
GPTQModel v1.5.0
What's Changed
⚡ Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
🐛 Fixed Qwen 2-VL model quantization vram usage and post-quant file copy of relevant config files.
🐛 Fixed install/compilations in envs with wrong TORCH_CUDA_ARCH_LIST set (Nvidia docker images)
🐛 Warn about bad torch[cuda] install on Windows
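The TORCH_CUDA_ARCH_LIST fix above pairs with the "Filter torch cuda arch < 6.0" change listed below. The following is a sketch of that kind of filtering, not the project's actual build code: the function name, the handling of unrecognized tokens, and the sample value are assumptions.

```python
import re


def filter_cuda_arch_list(arch_list: str, minimum=(6, 0)) -> str:
    """Drop compute capabilities below `minimum` from a TORCH_CUDA_ARCH_LIST value."""
    kept = []
    # The variable is commonly space- or semicolon-separated.
    for entry in re.split(r"[;\s]+", arch_list.strip()):
        if not entry:
            continue
        match = re.fullmatch(r"(\d+)\.(\d+)(\+PTX)?", entry)
        if match is None:
            # Unknown token (for example a named architecture): keep it untouched.
            kept.append(entry)
            continue
        version = (int(match.group(1)), int(match.group(2)))
        if version >= minimum:
            kept.append(entry)
    return " ".join(kept)


# Illustrative value of the kind found in prebuilt CUDA container images.
print(filter_cuda_arch_list("5.2 6.0 6.1 7.0 7.5 8.0 8.6 9.0+PTX"))
```

Comparing `(major, minor)` tuples rather than floats keeps the ordering correct for two-digit major versions.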
- Fix backend not ipex by @CSY-ModelCloud in #930
- Fix broken ipex check by @Qubitium in #933
- Fix dynamic_cuda validation by @CSY-ModelCloud in #936
- Fix bdist_wheel does not exist on old setuptools by @CSY-ModelCloud in #939
- Add cuda warning on windows by @CSY-ModelCloud in #942
- Add torch inference benchmark by @CL-ModelCloud in #940
- Add modality to BaseModel by @ZX-ModelCloud in #937
- [FIX] qwen_vl_utils should be locally import by @ZX-ModelCloud in #946
- Filter torch cuda arch < 6.0 by @CSY-ModelCloud in #955
- [FIX] wrong filepath was used when model_id_or_path was hugging model id by @ZX-ModelCloud in #956
- Fix import error was not caught by @CSY-ModelCloud in #961
Full Changelog: v1.4.5...v1.5.0
GPTQModel v1.4.5
What's Changed
⚡ Windows 11 support added/validated with DynamicCuda and Torch kernels.
⚡ Ovis 1.6 VL model support with image data calibration.
⚡ Reduced quantization vram usage.
🐛 Fixed dynamic controlled layer loading logic
- Refractor by @Qubitium in #895
- Add platform check by @LRL-ModelCloud in #899
- Exclude marlin & exllama on windows by @CSY-ModelCloud in #898
- Remove unnecessary backslash in the expression & typehint by @CSY-ModelCloud in #903
- Add DEVICE.ALL by @LRL-ModelCloud in #901
- [FIX] the error of loading quantized model with dynamic by @ZX-ModelCloud in #907
- [FIX] gpt2 quantize error by @ZX-ModelCloud in #912
- Simplify checking generated str for vllm test & fix transformers version for cohere2 by @CSY-ModelCloud in #914
- [MODEL] add OVIS support by @ZX-ModelCloud in #685
- Fix IDE warning marlin not in all by @CSY-ModelCloud in #920
Full Changelog: v1.4.4...v1.4.5
GPTQModel v1.4.4 Patch
What's Changed
⚡ Reduced memory usage during quantization
⚡ Fix device_map={"":"auto"} compat
- Speed up unit tests by @Qubitium in #885
- [FIX] hf select quant linear parse device map by @ZX-ModelCloud in #887
- Avoid cloning on gpu by @Qubitium in #886
- Expose hf_quantize() by @ZX-ModelCloud in #888
- Update integration hf code by @ZX-ModelCloud in #891
- Add back fasterquant() for compat by @Qubitium in #892
Full Changelog: v1.4.2...v1.4.4
GPTQModel v1.4.2
What's Changed
⚡ MacOS gpu (MPS) + cpu inference and quantization support
⚡ Added Cohere 2 model support
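MPS support relies on PyTorch's CPU fallback for operators the Metal backend does not implement, which is what the PYTORCH_ENABLE_MPS_FALLBACK change listed below addresses. A sketch of setting it from Python, assuming it is applied before torch is imported; the helper name is illustrative and this is not the project's own code.

```python
import os
import platform


def enable_mps_fallback(system=None) -> bool:
    """Set PYTORCH_ENABLE_MPS_FALLBACK=1 on macOS; return True if applied."""
    system = system or platform.system()
    if system != "Darwin":
        return False
    # setdefault keeps any value the user already exported.
    os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
    return True


# Typically called before `import torch` so the setting is picked up.
enable_mps_fallback()
```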
- Build Changes by @Qubitium in #855
- Fix MacOS support by @Qubitium in #861
- check device_map on from_quantized() by @ZX-ModelCloud in #865
- call patch for TestTransformersIntegration by @CSY-ModelCloud in #867
- Add MacOS gpu acceleration via MPS by @Qubitium in #864
- [MODEL] add cohere2 support by @CL-ModelCloud in #869
- check device_map by @ZX-ModelCloud in #872
- set PYTORCH_ENABLE_MPS_FALLBACK for macos by @CSY-ModelCloud in #873
- check device_map int value by @ZX-ModelCloud in #876
- Simplify by @Qubitium in #877
- [FIX] device_map={"":None} by @ZX-ModelCloud in #878
- set torch_dtype to float16 for XPU by @CSY-ModelCloud in #875
- remove IPEX device check by @ZX-ModelCloud in #879
- [FIX] call normalize_device() by @ZX-ModelCloud in #881
- [FIX] get_best_device() wrong usage by @ZX-ModelCloud in #882
Full Changelog: v1.4.1...v1.4.2
GPTQModel v1.4.1
What's Changed
⚡ Added Qwen2-VL model support.
⚡ mse quantization control exposed in QuantizeConfig
⚡ New GPTQModel.patch_hf() and GPTQModel.patch_vllm() monkey patch api to allow Transformers/Optimum/Peft to use GPTQModel while upstream PRs are pending.
⚡ New GPTQModel.patch_vllm() monkey patch api to allow vLLM to correctly load dynamic/mixed gptq quantized models.
- Add warning for vllm/sglang when using dynamic feature by @CSY-ModelCloud in #810
- Update Eval() usage sample by @CL-ModelCloud in #819
- auto select best device by @CSY-ModelCloud in #822
- Fix error msg by @CSY-ModelCloud in #823
- allow pass meta_quantizer from save() by @CSY-ModelCloud in #824
- Quantconfig add mse field by @CL-ModelCloud in #825
- [MODEL] add qwen2_vl support by @LRL-ModelCloud in #826
- check cuda when there's only cuda device by @CSY-ModelCloud in #830
- Update lm-eval test by @CL-ModelCloud in #831
- add patch_vllm() by @ZX-ModelCloud in #829
- Monkey patch HF transformer/optimum/peft support by @CSY-ModelCloud in #818
- auto patch vllm by @CSY-ModelCloud in #837
- Fix lm-eval API BUG by @CL-ModelCloud in #838
- [FIX] dynamic get "desc_act" error by @ZX-ModelCloud in #841
- BaseModel add supports_desc_act by @ZX-ModelCloud in #842
- [FIX] should local import patch_vllm() by @ZX-ModelCloud in #844
- Mod vllm generate by @LRL-ModelCloud in #833
- fix patch_vllm by @LRL-ModelCloud in #850
Full Changelog: v1.4.0...v1.4.1
GPTQModel v1.4.0
What's Changed
⚡ EvalPlus harness integration merged upstream. We now support both lm-eval and EvalPlus.
⚡ Added pure torch Torch kernel.
⚡ Refactored Cuda kernel to be DynamicCuda kernel.
⚡ Triton kernel now auto-padded for max model support.
⚡ Dynamic quantization now supports both positive (+:, the default) and negative (-:) matching. Negative matching allows matched modules to be skipped entirely for quantization.
⚡ Added auto-kernel fallback for unsupported kernel/module pairs.
🐛 Fixed auto-Marlin kernel selection.
🗑 Deprecated the saving of Marlin weight format. Marlin allows auto conversion of gptq format to Marlin during runtime. gptq format allows max kernel flexibility including Marlin kernel support.
Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge.
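The positive/negative Dynamic matching described above can be sketched as a small rule resolver. This is an illustration under assumptions, not the library's implementation: `resolve_dynamic` is a hypothetical helper, rules are assumed to be regular expressions matched against the module name, and the first matching rule is assumed to win.

```python
import re
from typing import Optional

POSITIVE, NEGATIVE = "+:", "-:"


def resolve_dynamic(module_name: str, dynamic: dict, base: dict) -> Optional[dict]:
    """Return the effective quantization settings for a module, or None to skip it."""
    for pattern, overrides in dynamic.items():
        if pattern.startswith(NEGATIVE):
            if re.match(pattern[len(NEGATIVE):], module_name):
                return None  # negative match: leave the module unquantized
            continue
        # A missing prefix is treated as a positive match.
        regex = pattern[len(POSITIVE):] if pattern.startswith(POSITIVE) else pattern
        if re.match(regex, module_name):
            return {**base, **overrides}
    return base


base = {"bits": 4, "group_size": 128}
dynamic = {
    r"+:.*\.18\..*gate.*": {"bits": 8, "group_size": 64},  # override layer 18 gate modules
    r"-:.*\.19\..*": {},  # skip every module in layer 19
}

print(resolve_dynamic("model.layers.18.mlp.gate_proj", dynamic, base))
print(resolve_dynamic("model.layers.19.self_attn.q_proj", dynamic, base))
print(resolve_dynamic("model.layers.20.mlp.up_proj", dynamic, base))
```

In this sketch a `None` result stands for "skipped entirely", while modules that match no rule fall back to the base settings.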
- Remove Marlin old kernel and Marlin format saving. Marlin[new] is still supported via inference. by @CSY-ModelCloud in #714
- Remove marlin(old) kernel codes & do ruff by @CSY-ModelCloud in #719
- [FIX] gptq v2 load by @ZX-ModelCloud in #724
- Add hf_convert_gptq_v1_to_v2_format, hf_convert_gptq_v2_to_v1_format,… by @LRL-ModelCloud in #727
- if use the ipex quant linear, no need to convert by @LRL-ModelCloud in #730
- hf_select_quant_linear add device_map by @LRL-ModelCloud in #732
- Add TorchQuantLinear by @ZX-ModelCloud in #735
- Add QUANT_TYPE in qlinear by @jiqing-feng in #736
- Replace error with warning for Intel CPU check by @CSY-ModelCloud in #737
- Add BACKEND.AUTO_CPU by @LRL-ModelCloud in #739
- Fix ipex linear check by @jiqing-feng in #741
- fFx select quant linear by @jiqing-feng in #742
- Now meta.quantizer value can be an array by @ZX-ModelCloud in #744
- Receive checkpoint_format argument by @ZX-ModelCloud in #747
- Modify hf convert gptq v2 to v1 format by @ZX-ModelCloud in #749
- update score max negative delta by @CSY-ModelCloud in #748
- [CI] max parallel jobs 10 by @CSY-ModelCloud in #751
- hymba got high score by @CSY-ModelCloud in #752
- hf_select_quant_linear() always set pack=True by @ZX-ModelCloud in #754
- Refractor CudaQuantLinear to DynamicCudaQuantLinear by @ZX-ModelCloud in #759
- Remove filename prefix on qlinear dir by @ZX-ModelCloud in #760
- Replace Nvidia-smi with devicesmi by @CSY-ModelCloud in #761
- Fix XPU training by @jiqing-feng in #763
- Fix auto marlin kernel selection by @CSY-ModelCloud in #765
- Add BaseQuantLinear SUPPORTS_TRAINING declaration by @LRL-ModelCloud in #766
- Add Eval() api to support LM-Eval or EvalPlus benchmark harnesses by @CL-ModelCloud in #750
- Fix validate_device by @LRL-ModelCloud in #769
- Force BaseQuantLinear properties to be explicitly declared by all QuantLinears by @ZX-ModelCloud in #767
- Convert str backend to enum backend by @LRL-ModelCloud in #772
- Remove nested list in dict by @CSY-ModelCloud in #774
- Fix training qlinear by @LRL-ModelCloud in #777
- Check kernel by @CSY-ModelCloud in #764
- BACKEND.AUTO if backend is None by @LRL-ModelCloud in #781
- Fix lm_head quantize test by @CSY-ModelCloud in #784
- Fix exllama doesn't support 8 bit by @CSY-ModelCloud in #790
- Use set() to avoid calling torch twice by @CSY-ModelCloud in #791
- Fix ipex cpu backend import error and fix too much logs by @jiqing-feng in #793
- Eval API opt by @CL-ModelCloud in #794
- Fixed ipex linear param check and logging once by @jiqing-feng in #795
- Check device before sync by @LRL-ModelCloud in #796
- Only AUTO will try other quant linears by @CSY-ModelCloud in #797
- Add SUPPORTS_AUTO_PADDING property to QuantLinear by @LRL-ModelCloud in #799
- Dynamic now support skipping modules/layers by @CSY-ModelCloud in #804
- Fix module was skipped but still be looped by @CSY-ModelCloud in #806
- Make Triton kernel auto-pad on features/group_size by @LRL-ModelCloud in #808
Full Changelog: v1.3.1...v1.4.0
GPTQModel v1.3.1
What's Changed
⚡ Olmo2 model support.
⚡ Intel XPU acceleration via IPEX.
Sharding compat fix due to api deprecation in HF Transformers.
Removed triton dependency. Triton kernel now optionally dependent on triton pkg.
Fixed Hymba Test (Hymba requires desc_act=False)
- [FIX] use split_torch_state_dict_into_shards to replace shard_checkpoint by @LRL-ModelCloud in #682
- [Model] add olmo2 support by @LRL-ModelCloud in #678
- [FIX] Hymba currently only supports a batch size of 1 by @ZX-ModelCloud in #683
- [CI] fix extensions is not defined by @CSY-ModelCloud in #684
- Ipex XPU support by @jiqing-feng in #608
- [FIX] add require_pkgs_version and checks by @ZX-ModelCloud in #693
- fix ipex test by @Qubitium in #691
- [FIX] remove require_transformers_version and require_tokenizers_version by @ZX-ModelCloud in #695
- Remove use_safetensors argument by @ZX-ModelCloud in #696
- Revert exllamav1 by @CSY-ModelCloud in #692
- Make Triton optional by @CSY-ModelCloud in #697
- Unify backend use by @LRL-ModelCloud in #700
- [FIX] fix test_hymba by @ZX-ModelCloud in #704
- FIX IPEX XPU selection by @Qubitium in #705
- fix cpu/xpu backend selection by @jiqing-feng in #706
- Upgrade device-smi depend by @Qubitium in #708
- [FIX] hymba quant needs desc_act=False by @ZX-ModelCloud in #710
Full Changelog: v1.3.0...v1.3.1
GPTQModel v1.3.0
What's Changed
Zero-Day Hymba model support added. Removed tqdm and rouge dependencies.
- Move lm-eval to utils to make it optional, fixed #664 by @CSY-ModelCloud in #666
- Add ipex bench code by @LRL-ModelCloud in #660
- [MODEL] add hymba support by @LRL-ModelCloud in #651
- [FIX] HymbaConfig.conv_dim keys is converted from str to int by @ZX-ModelCloud in #674
- [FIX] progress first index starts from 1 instead of 0 by @ZX-ModelCloud in #673
Full Changelog: v1.2.3...v1.3.0