Support loading Quark quantized models in Transformers (#36372)

fxmarty-amd · SunMarc · BowenBao · web-flow · commit 1a374799cedc · 2025-03-20T15:40:51.000+01:00
* add quark quantizer

* add quark doc

* clean up doc

* fix tests

* make style

* more style fixes

* cleanup imports

* cleaning

* precise install

* Update docs/source/en/quantization/quark.md

Co-authored-by: Marc Sun &lt;57196510+SunMarc@users.noreply.github.com&gt;

* Update tests/quantization/quark_integration/test_quark.py

Co-authored-by: Marc Sun &lt;57196510+SunMarc@users.noreply.github.com&gt;

* Update src/transformers/utils/quantization_config.py

Co-authored-by: Marc Sun &lt;57196510+SunMarc@users.noreply.github.com&gt;

* remove import guard as suggested

* update copyright headers

* add quark to transformers-quantization-latest-gpu Dockerfile

* make tests pass on transformers main + quark==0.7

* add missing F8_E4M3 and F8_E5M2 keys from str_to_torch_dtype

---------

Co-authored-by: Marc Sun &lt;57196510+SunMarc@users.noreply.github.com&gt;
Co-authored-by: Bowen Bao &lt;bowenbao@amd.com&gt;
Co-authored-by: Mohamed Mekkouri &lt;93391238+MekkCyber@users.noreply.github.com&gt;
diff --git a/docker/transformers-quantization-latest-gpu/Dockerfile b/docker/transformers-quantization-latest-gpu/Dockerfile
@@ -79,6 +79,9 @@ RUN git clone https://github.com/NetEase-FuXi/EETQ.git && cd EETQ/ && git submod
 # Add compressed-tensors for quantization testing
 RUN python3 -m pip install --no-cache-dir compressed-tensors
 
+# Add AMD Quark for quantization testing
+RUN python3 -m pip install --no-cache-dir amd-quark
+
 # Add transformers in editable mode
 RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch]
 
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
@@ -187,6 +187,8 @@
     title: Optimum
   - local: quantization/quanto
     title: Quanto
+  - local: quantization/quark
+    title: Quark
   - local: quantization/torchao
     title: torchao
   - local: quantization/spqr
diff --git a/docs/source/en/main_classes/quantization.md b/docs/source/en/main_classes/quantization.md
@@ -88,3 +88,7 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
 ## FineGrainedFP8Config
 
 [[autodoc]] FineGrainedFP8Config
+
+## QuarkConfig
+
+[[autodoc]] QuarkConfig
diff --git a/docs/source/en/quantization/overview.md b/docs/source/en/quantization/overview.md
@@ -40,6 +40,7 @@ Use the Space below to help you pick a quantization method depending on your har
 | [VPTQ](./vptq)                             | 🔴                   | 🔴              |     🟢     | 🟡        | 🔴                                 | 🔴              | 🟢              | 1/8         | 🔴               | 🟢                          | 🟢                      | https://github.com/microsoft/VPTQ            |
 | [FINEGRAINED_FP8](./finegrained_fp8)                 | 🟢                   | 🔴              | 🟢        | 🔴        | 🔴                                 | 🔴              | 🔴              | 8             | 🔴               | 🟢                          | 🟢                      |        |
 | [SpQR](./spqr)                          | 🔴                       |  🔴   | 🟢        | 🔴              |    🔴    | 🔴         |         🟢              | 3              |              🔴                     | 🟢           | 🟢                      | https://github.com/Vahe1994/SpQR/       |
+| [Quark](./quark.md)                           | 🔴                       | 🟢 | 🟢      | 🟢      | 🟢                   | 🟢       | ?               | 2/4/6/8/9/16 | 🔴                | 🔴                               | 🟢                       | https://quark.docs.amd.com/latest/                      |
 
 ## Resources
 
@@ -55,4 +56,4 @@ If you are looking for a user-friendly quantization experience, you can use the
 * [Bitsandbytes Space](https://huggingface.co/spaces/bnb-community/bnb-my-repo)
 * [GGUF Space](https://huggingface.co/spaces/ggml-org/gguf-my-repo)
 * [MLX Space](https://huggingface.co/spaces/mlx-community/mlx-my-repo)
-* [AuoQuant Notebook](https://colab.research.google.com/drive/1b6nqC7UZVt8bx4MksX7s656GXPM-eWw4?usp=sharing#scrollTo=ZC9Nsr9u5WhN)
+* [AuoQuant Notebook](https://colab.research.google.com/drive/1b6nqC7UZVt8bx4MksX7s656GXPM-eWw4?usp=sharing#scrollTo=ZC9Nsr9u5WhN)
diff --git a/docs/source/en/quantization/quark.md b/docs/source/en/quantization/quark.md
@@ -0,0 +1,84 @@
+<!--Copyright 2025 Advanced Micro Devices, Inc. and The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Quark
+
+[Quark](https://quark.docs.amd.com/latest/) is a deep learning quantization toolkit designed to be agnostic to specific data types, algorithms, and hardware. Different pre-processing strategies, algorithms and data-types can be combined in Quark.
+
+The PyTorch support integrated through 🤗 Transformers primarily targets AMD CPUs and GPUs, and is primarily meant to be used for evaluation purposes. For example, it is possible to use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with 🤗 Transformers backend and evaluate a wide range of models quantized through Quark seamlessly.
+
+Users interested in Quark can refer to its [documentation](https://quark.docs.amd.com/latest/) to get started quantizing models and using them in supported open-source libraries!
+
+Although Quark has its own checkpoint / [configuration format](https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV-Quark-test/blob/main/config.json#L26), the library also supports producing models with a serialization layout compliant with other quantization/runtime implementations ([AutoAWQ](https://huggingface.co/docs/transformers/quantization/awq), [native fp8 in 🤗 Transformers](https://huggingface.co/docs/transformers/quantization/finegrained_fp8)).
+
+To be able to load Quark quantized models in Transformers, the library first needs to be installed:
+
+```bash
+pip install amd-quark
+```
+
+## Support matrix
+
+Models quantized through Quark support a large range of features, that can be combined together. All quantized models independently of their configuration can seamlessly be reloaded through `PretrainedModel.from_pretrained`.
+
+The table below shows a few features supported by Quark:
+
+| **Feature**                     | **Supported subset in Quark**                                                                             |   |
+|---------------------------------|-----------------------------------------------------------------------------------------------------------|---|
+| Data types                      | int8, int4, int2, bfloat16, float16, fp8_e5m2, fp8_e4m3, fp6_e3m2, fp6_e2m3, fp4, OCP MX, MX6, MX9, bfp16 |   |
+| Pre-quantization transformation | SmoothQuant, QuaRot, SpinQuant, AWQ                                                                       |   |
+| Quantization algorithm          | GPTQ                                                                                                      |   |
+| Supported operators             | ``nn.Linear``, ``nn.Conv2d``, ``nn.ConvTranspose2d``, ``nn.Embedding``, ``nn.EmbeddingBag``               |   |
+| Granularity                     | per-tensor, per-channel, per-block, per-layer, per-layer type                                             |   |
+| KV cache                        | fp8                                                                                                       |   |
+| Activation calibration          | MinMax / Percentile / MSE                                                                                 |   |
+| Quantization strategy           | weight-only, static, dynamic, with or without output quantization                                         |   |
+
+## Models on Hugging Face Hub
+
+Public models using Quark native serialization can be found at https://huggingface.co/models?other=quark.
+
+Although Quark also supports [models using `quant_method="fp8"`](https://huggingface.co/models?other=fp8) and [models using `quant_method="awq"`](https://huggingface.co/models?other=awq), Transformers loads these models rather through [AutoAWQ](https://huggingface.co/docs/transformers/quantization/awq) or uses the [native fp8 support in 🤗 Transformers](https://huggingface.co/docs/transformers/quantization/finegrained_fp8).
+
+## Using Quark models in Transformers
+
+Here is an example of how one can load a Quark model in Transformers:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
+model = AutoModelForCausalLM.from_pretrained(model_id)
+model = model.to("cuda")
+
+print(model.model.layers[0].self_attn.q_proj)
+# QParamsLinear(
+#   (weight_quantizer): ScaledRealQuantizer()
+#   (input_quantizer): ScaledRealQuantizer()
+#   (output_quantizer): ScaledRealQuantizer()
+# )
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+inp = tokenizer("Where is a good place to cycle around Tokyo?", return_tensors="pt")
+inp = inp.to("cuda")
+
+res = model.generate(**inp, min_new_tokens=50, max_new_tokens=100)
+
+print(tokenizer.batch_decode(res)[0])
+# <|begin_of_text|>Where is a good place to cycle around Tokyo? There are several places in Tokyo that are suitable for cycling, depending on your skill level and interests. Here are a few suggestions:
+# 1. Yoyogi Park: This park is a popular spot for cycling and has a wide, flat path that's perfect for beginners. You can also visit the Meiji Shrine, a famous Shinto shrine located in the park.
+# 2. Imperial Palace East Garden: This beautiful garden has a large, flat path that's perfect for cycling. You can also visit the
+```
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
@@ -1046,6 +1046,7 @@
         "HiggsConfig",
         "HqqConfig",
         "QuantoConfig",
+        "QuarkConfig",
         "SpQRConfig",
         "TorchAoConfig",
         "VptqConfig",
@@ -6287,6 +6288,7 @@
         HiggsConfig,
         HqqConfig,
         QuantoConfig,
+        QuarkConfig,
         SpQRConfig,
         TorchAoConfig,
         VptqConfig,
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
@@ -536,6 +536,10 @@ def load_sharded_checkpoint(model, folder, strict=True, prefer_safe=True):
     str_to_torch_dtype["U32"] = torch.uint32
     str_to_torch_dtype["U64"] = torch.uint64
 
+if is_torch_greater_or_equal("2.1.0"):
+    str_to_torch_dtype["F8_E4M3"] = torch.float8_e4m3fn
+    str_to_torch_dtype["F8_E5M2"] = torch.float8_e5m2
+
 
 def load_state_dict(
     checkpoint_file: Union[str, os.PathLike],
@@ -3675,6 +3679,10 @@ def to(self, *args, **kwargs):
 
         if getattr(self, "quantization_method", None) == QuantizationMethod.HQQ:
             raise ValueError("`.to` is not supported for HQQ-quantized models.")
+
+        if dtype_present_in_args and getattr(self, "quantization_method", None) == QuantizationMethod.QUARK:
+            raise ValueError("Casting a Quark quantized model to a new `dtype` is not supported.")
+
         # Checks if the model has been loaded in 4-bit or 8-bit with BNB
         if getattr(self, "quantization_method", None) == QuantizationMethod.BITS_AND_BYTES:
             if dtype_present_in_args:
diff --git a/src/transformers/quantizers/auto.py b/src/transformers/quantizers/auto.py
@@ -1,4 +1,5 @@
 # Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+# Modifications Copyright (C) 2025, Advanced Micro Devices, Inc. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -31,6 +32,7 @@
     QuantizationConfigMixin,
     QuantizationMethod,
     QuantoConfig,
+    QuarkConfig,
     SpQRConfig,
     TorchAoConfig,
     VptqConfig,
@@ -49,6 +51,7 @@
 from .quantizer_higgs import HiggsHfQuantizer
 from .quantizer_hqq import HqqHfQuantizer
 from .quantizer_quanto import QuantoHfQuantizer
+from .quantizer_quark import QuarkHfQuantizer
 from .quantizer_spqr import SpQRHfQuantizer
 from .quantizer_torchao import TorchAoHfQuantizer
 from .quantizer_vptq import VptqHfQuantizer
@@ -61,6 +64,7 @@
     "gptq": GptqHfQuantizer,
     "aqlm": AqlmHfQuantizer,
     "quanto": QuantoHfQuantizer,
+    "quark": QuarkHfQuantizer,
     "eetq": EetqHfQuantizer,
     "higgs": HiggsHfQuantizer,
     "hqq": HqqHfQuantizer,
@@ -81,6 +85,7 @@
     "gptq": GPTQConfig,
     "aqlm": AqlmConfig,
     "quanto": QuantoConfig,
+    "quark": QuarkConfig,
     "hqq": HqqConfig,
     "compressed-tensors": CompressedTensorsConfig,
     "fbgemm_fp8": FbgemmFp8Config,
diff --git a/src/transformers/quantizers/quantizer_quark.py b/src/transformers/quantizers/quantizer_quark.py
@@ -0,0 +1,113 @@
+# coding=utf-8
+# Copyright 2025 Advanced Micro Devices, Inc. and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import TYPE_CHECKING, Any, Dict
+
+from ..file_utils import is_torch_available
+from .base import HfQuantizer
+
+
+if TYPE_CHECKING:
+    from ..modeling_utils import PreTrainedModel
+
+    if is_torch_available():
+        import torch
+
+from ..utils import is_accelerate_available, is_quark_available, logging
+
+
+if is_accelerate_available():
+    from accelerate.utils import set_module_tensor_to_device
+
+logger = logging.get_logger(__name__)
+
+
+CHECKPOINT_KEYS = {
+    "weight_scale": "weight_quantizer.scale",
+    "bias_scale": "bias_quantizer.scale",
+    "input_scale": "input_quantizer.scale",
+    "output_scale": "output_quantizer.scale",
+    "weight_zero_point": "weight_quantizer.zero_point",
+    "bias_zero_point": "bias_quantizer.zero_point",
+    "input_zero_point": "input_quantizer.zero_point",
+    "output_zero_point": "output_quantizer.zero_point",
+}
+
+
+class QuarkHfQuantizer(HfQuantizer):
+    """
+    Quark quantizer (https://quark.docs.amd.com/latest/).
+    """
+
+    requires_calibration = True  # On-the-fly quantization with quark is not supported for now.
+    required_packages = ["quark"]
+
+    # Checkpoints are expected to be already quantized when loading a quark model. However, as some keys from
+    # the checkpoint might mismatch the model parameters keys, we use the `create_quantized_param` method
+    # to load the checkpoints, remapping the keys.
+    requires_parameters_quantization = True
+
+    def __init__(self, quantization_config, **kwargs):
+        super().__init__(quantization_config, **kwargs)
+
+        self.json_export_config = quantization_config.json_export_config
+
+    def validate_environment(self, *args, **kwargs):
+        if not is_quark_available():
+            raise ImportError(
+                "Loading a Quark quantized model requires the `quark` library but it was not found in the environment. Please refer to https://quark.docs.amd.com/latest/install.html."
+            )
+
+    def _process_model_before_weight_loading(self, model: "PreTrainedModel", **kwargs):
+        from quark.torch.export.api import _map_to_quark
+
+        _map_to_quark(
+            model,
+            self.quantization_config.quant_config,
+            pack_method=self.json_export_config.pack_method,
+            custom_mode=self.quantization_config.custom_mode,
+        )
+
+        return model
+
+    def check_quantized_param(
+        self,
+        model: "PreTrainedModel",
+        param_value: "torch.Tensor",
+        param_name: str,
+        state_dict: Dict[str, Any],
+        **kwargs,
+    ) -> bool:
+        return True
+
+    def create_quantized_param(
+        self, model, param, param_name, param_device, state_dict, unexpected_keys
+    ) -> "torch.nn.Parameter":
+        postfix = param_name.split(".")[-1]
+
+        if postfix in CHECKPOINT_KEYS:
+            param_name = param_name.replace(postfix, CHECKPOINT_KEYS[postfix])
+
+        set_module_tensor_to_device(model, param_name, param_device, value=param)
+
+    def _process_model_after_weight_loading(self, model: "PreTrainedModel", **kwargs):
+        return model
+
+    def is_serializable(self, safe_serialization=None):
+        return False
+
+    @property
+    def is_trainable(self):
+        return False
diff --git a/src/transformers/testing_utils.py b/src/transformers/testing_utils.py
@@ -116,6 +116,7 @@
     is_pytesseract_available,
     is_pytest_available,
     is_pytorch_quantization_available,
+    is_quark_available,
     is_rjieba_available,
     is_sacremoses_available,
     is_safetensors_available,
@@ -1299,6 +1300,13 @@ def require_fbgemm_gpu(test_case):
     return unittest.skipUnless(is_fbgemm_gpu_available(), "test requires fbgemm-gpu")(test_case)
 
 
+def require_quark(test_case):
+    """
+    Decorator for quark dependency
+    """
+    return unittest.skipUnless(is_quark_available(), "test requires quark")(test_case)
+
+
 def require_flute_hadamard(test_case):
     """
     Decorator marking a test that requires higgs and hadamard
diff --git a/src/transformers/utils/__init__.py b/src/transformers/utils/__init__.py
@@ -181,6 +181,7 @@
     is_pytesseract_available,
     is_pytest_available,
     is_pytorch_quantization_available,
+    is_quark_available,
     is_rich_available,
     is_rjieba_available,
     is_sacremoses_available,
diff --git a/src/transformers/utils/import_utils.py b/src/transformers/utils/import_utils.py
@@ -45,6 +45,11 @@ def _is_package_available(pkg_name: str, return_version: bool = False) -> Union[
     package_version = "N/A"
     if package_exists:
         try:
+            # TODO: Once python 3.9 support is dropped, `importlib.metadata.packages_distributions()`
+            # should be used here to map from package name to distribution names
+            # e.g. PIL -> Pillow, Pillow-SIMD; quark -> amd-quark; onnxruntime -> onnxruntime-gpu.
+            # `importlib.metadata.packages_distributions()` is not available in Python 3.9.
+
             # Primary method to get the package version
             package_version = importlib.metadata.version(pkg_name)
         except importlib.metadata.PackageNotFoundError:
@@ -62,6 +67,12 @@ def _is_package_available(pkg_name: str, return_version: bool = False) -> Union[
                 except ImportError:
                     # If the package can't be imported, it's not available
                     package_exists = False
+            elif pkg_name == "quark":
+                # TODO: remove once `importlib.metadata.packages_distributions()` is supported.
+                try:
+                    package_version = importlib.metadata.version("amd-quark")
+                except Exception:
+                    package_exists = False
             else:
                 # For packages other than "torch", don't attempt the fallback and set as not available
                 package_exists = False
@@ -150,6 +161,7 @@ def _is_package_available(pkg_name: str, return_version: bool = False) -> Union[
 _gptqmodel_available = _is_package_available("gptqmodel")
 # `importlib.metadata.version` doesn't work with `awq`
 _auto_awq_available = importlib.util.find_spec("awq") is not None
+_quark_available = _is_package_available("quark")
 _is_optimum_quanto_available = False
 try:
     importlib.metadata.version("optimum_quanto")
@@ -1118,6 +1130,10 @@ def is_optimum_quanto_available():
     return _is_optimum_quanto_available
 
 
+def is_quark_available():
+    return _quark_available
+
+
 def is_compressed_tensors_available():
     return _compressed_tensors_available
 
diff --git a/src/transformers/utils/quantization_config.py b/src/transformers/utils/quantization_config.py
diff --git a/tests/quantization/quark_integration/__init__.py b/tests/quantization/quark_integration/__init__.py
diff --git a/tests/quantization/quark_integration/test_quark.py b/tests/quantization/quark_integration/test_quark.py