[DO NOT MERGE] Hf quantizer refactor #28703

Closed
wants to merge 78 commits into from
Changes from 56 commits
Commits (78)
e0650b2
squashed earlier commits for easier rebase
poedator Dec 23, 2023
42adf9d
rm rebase leftovers
poedator Dec 23, 2023
7f57f26
4bit save enabled @quantizers
poedator Dec 23, 2023
f1f5da0
TMP gptq test use exllama
poedator Dec 24, 2023
a94d3a7
fix AwqConfigTest::test_wrong_backend for A100
poedator Dec 25, 2023
0b30de4
quantizers AWQ fixes
poedator Dec 25, 2023
4cdaf0d
_load_pretrained_model low_cpu_mem_usage branch
poedator Dec 25, 2023
0db1107
quantizers style
poedator Dec 25, 2023
89d1177
remove require_low_cpu_mem_usage attr
poedator Dec 25, 2023
0c71b00
rm dtype arg from process_model_before_weight_loading
poedator Dec 25, 2023
2b4122a
rm config_origin from Q-config
poedator Dec 25, 2023
02ad562
rm inspect from q_config
poedator Dec 25, 2023
3e51d51
fixed docstrings in QuantizationConfigParser
poedator Dec 25, 2023
2569367
logger.warning fix
poedator Dec 25, 2023
3259243
mv is_loaded_in_4(8)bit to BnbHFQuantizer
poedator Dec 25, 2023
ab61417
is_accelerate_available error msg fix in quantizer
poedator Dec 25, 2023
95e44cd
split is_model_trainable in bnb quantizer class
poedator Dec 25, 2023
b936cfb
rm llm_int8_skip_modules as separate var in Q
poedator Dec 25, 2023
0b40d21
Q rm todo
poedator Dec 25, 2023
c53a3fb
fwd ref to HFQuantizer in type hint
poedator Dec 25, 2023
dbd93f2
rm note re optimum.gptq.GPTQQuantizer
poedator Dec 26, 2023
e34bd58
quantization_config in __init__ simplified
poedator Dec 26, 2023
fcd5a7a
replaced NonImplemented with create_quantized_param
poedator Dec 26, 2023
954c5e6
rm load_in_4/8_bit deprecation warning
poedator Dec 26, 2023
49e163f
QuantizationConfigParser refactoring
poedator Dec 26, 2023
f8b9e07
awq-related minor changes
poedator Jan 8, 2024
5eaf9ac
awq-related changes
poedator Jan 8, 2024
d678d99
awq config.modules_to_not_convert
poedator Jan 8, 2024
7c9c49b
raise error if no q-method in q-config in args
poedator Jan 8, 2024
0d739d3
minor cleanup
poedator Jan 10, 2024
b5f2bab
awq quantizer docstring
poedator Jan 10, 2024
af33463
combine common parts in bnb process_model_before_weight_loading
poedator Jan 10, 2024
d4af5f1
revert test_gptq
poedator Jan 10, 2024
94f2cc7
.process_model_ cleanup
poedator Jan 10, 2024
ec77d10
restore dict config warning
poedator Jan 10, 2024
f5b9849
removed typevars in quantizers.py
poedator Jan 16, 2024
fb37bb8
cleanup post-rebase 16 jan
poedator Jan 16, 2024
cdc71c8
QuantizationConfigParser classmethod refactor
poedator Jan 16, 2024
e6df6ed
rework of handling of unexpected aux elements of bnb weights
poedator Jan 17, 2024
1c433f5
moved q-related stuff from save_pretrained to quantizers
poedator Jan 17, 2024
60781dd
refactor v1
younesbelkada Jan 24, 2024
842391a
more changes
younesbelkada Jan 24, 2024
0803440
fix some tests
younesbelkada Jan 24, 2024
594d1a9
remove it from main init
younesbelkada Jan 24, 2024
a771ab7
ooops
younesbelkada Jan 24, 2024
aa4ec34
Apply suggestions from code review
younesbelkada Jan 25, 2024
53619de
fix awq issues
younesbelkada Jan 25, 2024
cd4aa90
Merge remote-tracking branch 'upstream/main' into hf-quantizer-work
younesbelkada Jan 25, 2024
a988d01
fix
younesbelkada Jan 25, 2024
a911e7d
fix
younesbelkada Jan 25, 2024
3886559
fix
younesbelkada Jan 25, 2024
43e5e70
fix
younesbelkada Jan 25, 2024
c1dcaa3
fix
younesbelkada Jan 25, 2024
b0ac4a7
fix
younesbelkada Jan 25, 2024
1575c47
Merge branch 'main' into hf-quantizer-work
younesbelkada Jan 25, 2024
ad8d7f6
add docs
younesbelkada Jan 25, 2024
89cf6cf
Apply suggestions from code review
younesbelkada Jan 26, 2024
0ebaf4e
Apply suggestions from code review
younesbelkada Jan 26, 2024
adaae05
Update docs/source/en/hf_quantizer.md
younesbelkada Jan 26, 2024
f0b5f96
address comments
younesbelkada Jan 26, 2024
30e1fc2
fix
younesbelkada Jan 26, 2024
3b7e625
Merge branch 'hf-quantizer-work' of https://github.com/younesbelkada/…
younesbelkada Jan 26, 2024
493d117
fixup
younesbelkada Jan 26, 2024
48c5761
Update src/transformers/modeling_utils.py
younesbelkada Jan 26, 2024
3744fb1
Update src/transformers/modeling_utils.py
younesbelkada Jan 26, 2024
c4995ab
address final comment
younesbelkada Jan 26, 2024
17f95bf
Merge branch 'hf-quantizer-work' of https://github.com/younesbelkada/…
younesbelkada Jan 26, 2024
abb4db3
update
younesbelkada Jan 26, 2024
7e5a5b8
Update src/transformers/quantizers/base.py
younesbelkada Jan 26, 2024
122b494
Update src/transformers/quantizers/auto.py
younesbelkada Jan 26, 2024
901ace5
fix
younesbelkada Jan 26, 2024
2da5233
Merge remote-tracking branch 'upstream/main' into hf-quantizer-work
younesbelkada Jan 29, 2024
2ab7fd5
add kwargs update
younesbelkada Jan 30, 2024
242682c
Merge remote-tracking branch 'upstream/main' into HEAD
younesbelkada Jan 30, 2024
e387f68
Merge branch 'quant' into hf-quantizer-work
younesbelkada Jan 30, 2024
4c0c33e
fixup
younesbelkada Jan 30, 2024
c37b222
add `optimum_quantizer` attribute
younesbelkada Jan 30, 2024
ca40b04
oops
younesbelkada Jan 30, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -137,6 +137,8 @@
title: Overview
- local: quantization
title: Quantization
- local: hf_quantizer
title: Build a new HFQuantizer class
- sections:
- local: perf_train_gpu_one
title: Methods and tools for efficient training on a single GPU
67 changes: 67 additions & 0 deletions docs/source/en/hf_quantizer.md
@@ -0,0 +1,67 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Build a new `HFQuantizer` class to add quantization support for a new quantization method.
Review comment (Collaborator):
Suggested change:
- # Build a new `HFQuantizer` class to add quantization support for a new quantization method.
+ # Build a new `HfQuantizer` class to add quantization support for a new quantization method.

do we need to make sure its HfQuantizer vs Quantizer?


Through this document, you will learn how to work on a transformers integration of a new quantization method. Note that currently the `HFQuantizer` is not meant to be used for any PyTorch module, but you should rather see it as an internal utility class that is used in the core modeling code to easily quantize transformers models with different SoTA approaches (e.g. QLoRA, GPTQ, LLM.int8, AWQ, ...).
Review comment (Collaborator):
Suggested change:
- Through this document, you will learn how to work on a transformers integration of a new quantization method. Note that currently the `HFQuantizer` is not meant to be used for any PyTorch module, but you should rather see it as an internal utility class that is used in the core modeling code to easily quantize transformers models with different SoTA approaches (e.g. QLoRA, GPTQ, LLM.int8, AWQ, ...).
+ Through this document, you will learn how to work on integrating a new quantization method to the `transformers` library. Note that currently the `HFQuantizer` is not meant to be used for every single PyTorch module, but you should rather see it as an internal utility class that is used in the core modeling code to easily quantize transformers models with different SoTA approaches (e.g. QLoRA, GPTQ, LLM.int8, AWQ, ...).



## Prerequisites

Before you start integrating a new quantization method into transformers, make sure that the method you are trying to add meets the following prerequisites. Note that we currently only support quantization methods that can be run with PyTorch modules.

- The quantization method is available through a Python package that is pip-installable by anyone (it is also fine if you can only install the package from source), ideally with pre-compiled kernels included in the pip package.
- The method can run at least on commonly-used hardware (CPU, GPU, ...).
- The method is wrapped in an `nn.Module` (e.g. `Linear8bitLt`, `Linear4bit`). Ideally, your quantized linear layer should have the following definition:
```py
class Linear4bit(nn.Module):
    def __init__(self, ...):
        ...

    def forward(self, x):
        return my_4bit_kernel(x, self.weight, self.bias)
```
That way, transformers models can be easily quantized by simply replacing some instances of `nn.Linear` with the target class (see the sketch after this list).
- Ideally, the quantization method should be serializable, i.e. you can save the quantized weights locally or push them to the Hub.
- Make sure the package that contains the quantization kernels / primitives is mature enough (e.g. no frequent breaking changes).
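
As mentioned in the list above, transformers models are quantized by swapping `nn.Linear` instances for the target class. Here is a minimal sketch of such a replacement pass; it is not transformers code, and all names (`replace_with_quantized_linear`, the `target_cls` argument and its `(in_features, out_features, bias=...)` constructor) are illustrative assumptions:

```py
import torch.nn as nn


def replace_with_quantized_linear(model, target_cls, modules_to_not_convert=None, prefix=""):
    """Recursively swap `nn.Linear` children of `model` for instances of `target_cls`."""
    modules_to_not_convert = modules_to_not_convert or []
    for name, module in model.named_children():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(module, nn.Linear) and not any(skip in full_name for skip in modules_to_not_convert):
            # The quantized layer is created empty here; the (quantized) weights are loaded later.
            setattr(model, name, target_cls(module.in_features, module.out_features, bias=module.bias is not None))
        else:
            replace_with_quantized_linear(module, target_cls, modules_to_not_convert, full_name)
    return model
```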

Note that for some quantization methods it is a strong requirement to "pre-quantize" the models through data calibration (e.g. AWQ). In that case, we prefer to support only inference through transformers and let third-party libraries maintained by the ML community deal with the model quantization itself.

## How should I get started?

0- 📕 Create a new quantization config class inside `src/transformers/utils/quantization_config.py`, and make sure to expose the new quantization config inside the transformers main init by adding it to the `_import_structure` object of `src/transformers/__init__.py`.
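
A minimal sketch of such a config class, assuming the existing `QuantizationConfigMixin` base and a hypothetical method called `my_method` (all class names and fields below are placeholders, not an actual transformers API):

```py
from transformers.utils.quantization_config import QuantizationConfigMixin


class MyMethodConfig(QuantizationConfigMixin):
    def __init__(self, bits: int = 4, group_size: int = 128, **kwargs):
        # In practice `quant_method` is a member of the `QuantizationMethod` enum.
        self.quant_method = "my_method"
        self.bits = bits
        self.group_size = group_size
        self.post_init()

    def post_init(self):
        # Basic sanity checks on user-provided values.
        if self.bits not in (2, 4, 8):
            raise ValueError(f"Only 2/4/8-bit quantization is supported, got bits={self.bits}")
```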

1- 🗃 Create a new file inside `src/transformers/quantizers/` named `quantizer_your_method.py` and make it inherit from `src/transformers/quantizers/base.py::HFQuantizer`. Make sure to add the new quantizer and quantization config to the quantization auto-mapping in `src/transformers/quantizers/auto.py`.
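
A rough skeleton for that new file, continuing the hypothetical `my_method` example (class names and the auto-mapping identifiers in the trailing comments are assumptions, and the base-class casing, `HFQuantizer` vs `HfQuantizer`, was still being discussed in this PR):

```py
from .base import HfQuantizer


class MyMethodHfQuantizer(HfQuantizer):
    # Class attributes / properties described in step 2 below.
    requires_calibration = True            # only pre-quantized checkpoints are supported
    required_packages = ["my_method_lib"]  # hypothetical pip package providing the kernels
    requires_parameters_quantization = False

    def __init__(self, quantization_config, **kwargs):
        super().__init__(quantization_config, **kwargs)

    def validate_environment(self, *args, **kwargs):
        ...  # see step 3

    def _process_model_before_weight_loading(self, model, **kwargs):
        ...  # see step 4

    def _process_model_after_weight_loading(self, model, **kwargs):
        ...  # see step 5

    @property
    def is_serializable(self):
        return True

    @property
    def is_trainable(self):
        return False


# In src/transformers/quantizers/auto.py, the new classes are then registered in the
# auto-mappings (mapping names assumed), e.g.:
#   AUTO_QUANTIZER_MAPPING["my_method"] = MyMethodHfQuantizer
#   AUTO_QUANTIZATION_CONFIG_MAPPING["my_method"] = MyMethodConfig
```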

2- 🔩 Define the following class attributes / property methods:

2.1. `requires_calibration`: Whether the quantization method requires a data-calibration process. If set to `True`, you will only be able to support inference with pre-quantized weights, not on-the-fly quantization at load time.
2.2. `required_packages`: A list of strings of the packages required to use the quantized weights. You might need to define some new utility methods, such as `is_auto_awq_available`, in `src/transformers/utils/import_utils.py`.
2.3. `requires_parameters_quantization`: (Advanced - defaults to `False`) Only required if your quantization method needs special care for the underlying `nn.Parameter` objects. For example, bitsandbytes uses `Params4bit` and `Int8Param`, which require special handling when quantizing the model. Most recent quantization methods pack int2 / int4 weights inside `torch.uint8` weights, so this flag should rarely be required (a tiny packing illustration follows this list).
2.4. `is_serializable`: A property method to determine whether the method is serializable or not.
2.5. `is_trainable`: A property method to determine whether you can fine-tune models on top of the quantization method (with or without PEFT approaches).
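
As a quick illustration of the packing mentioned in 2.3, two 4-bit values can be stored in a single `torch.uint8` element (a generic bit-packing example, not any specific library's layout):

```py
import torch

low, high = 0b0011, 0b1010  # two 4-bit values (3 and 10)
packed = torch.tensor([(high << 4) | low], dtype=torch.uint8)  # -> tensor([163])

# Unpacking recovers the original nibbles.
unpacked_low = packed & 0x0F          # -> tensor([3])
unpacked_high = (packed >> 4) & 0x0F  # -> tensor([10])
```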


3- 🪛 Write the `validate_environment` and `set_torch_dtype` methods. These methods are called before creating the quantized model to make sure the user's setup and configuration are valid. You can have a look at how this is done in other quantizers.
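
For the hypothetical quantizer above, `validate_environment` could look roughly like the following sketch (this method goes inside the quantizer class; `is_my_method_lib_available` is an assumed helper from step 2.2, and the GPU-only restriction is an example assumption):

```py
    def validate_environment(self, device_map=None, **kwargs):
        if not is_my_method_lib_available():
            raise ImportError(
                "Loading a `my_method`-quantized model requires the `my_method_lib` package. "
                "You can install it with `pip install my_method_lib`."
            )
        if isinstance(device_map, dict) and "cpu" in device_map.values():
            # Example assumption: the kernels are GPU-only, so refuse to offload quantized modules to CPU.
            raise ValueError("`my_method` quantized modules cannot be dispatched to the CPU.")
```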

4- 🖋 Write the `_process_model_before_weight_loading` method. In transformers, quantized models are first initialized on the `"meta"` device before the weights are loaded. Therefore, `_process_model_before_weight_loading` can take care of manipulating the model skeleton to replace some modules (e.g. `nn.Linear`) with the target modules (quantization modules). You can define the module replacement logic or any other utility method by creating a new file in `src/transformers/integrations/`, making sure to expose the relevant methods in that folder's `__init__.py` file. Again, the best starting point is to look at what is done for other quantization methods, such as `quantizer_awq.py`.
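
A sketch of what this could look like for the hypothetical quantizer (still inside the quantizer class; `replace_with_my_method_linear` is an assumed helper that would live under `src/transformers/integrations/`):

```py
    def _process_model_before_weight_loading(self, model, **kwargs):
        from ..integrations import replace_with_my_method_linear  # assumed helper

        # Keep e.g. the LM head in full precision; real quantizers usually read this from the config.
        self.modules_to_not_convert = getattr(self.quantization_config, "modules_to_not_convert", None) or ["lm_head"]

        # Rewire the "meta"-device skeleton so quantized weights can be loaded directly afterwards.
        model = replace_with_my_method_linear(
            model,
            quantization_config=self.quantization_config,
            modules_to_not_convert=self.modules_to_not_convert,
        )
        model.config.quantization_config = self.quantization_config
```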

5- 🖊 Write the `_process_model_after_weight_loading` method: if you want to implement additional features that require manipulating the model after the weights have been loaded, you can define that logic there!
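
For example, still with the hypothetical quantizer, any post-loading work such as module fusion could live there (`fuse_my_method_modules` and the `do_fuse` flag are assumptions, mirroring what the AWQ integration does):

```py
    def _process_model_after_weight_loading(self, model, **kwargs):
        if getattr(self.quantization_config, "do_fuse", False):
            from ..integrations import fuse_my_method_modules  # assumed helper

            model = fuse_my_method_modules(model, self.quantization_config)
        return model
```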

Review comment (Collaborator):
Would structure this in 4 sections, 1 per function, as those are the most important!

Review comment (Contributor @poedator, Jan 26, 2024):
I suggest to emphasize that the AWQ / GPTQ approach is preferred to that of bnb. It is nice to see that after _process_model_before_weight_loading() all required quantization params fall into proper buffers, and there is no need for manipulations in _load_state_dict_into_meta_model(). And whatever magic happens in create_quantized_param() would happen in the postprocessing phase.

Ideally, bnb quantizers could be rewritten with same interface as AWQ.

6- 📖 Document everything! Make sure that your quantization method is documented in the `docs/source/en/quantization.md` file.

7- 🟢 Add tests! You should add tests by first adding the package to our nightly Dockerfile inside `docker/transformers-all-latest-gpu`, then adding a new test file in `tests/quantization/xxx`. Feel free to check out how it is done for other quantization methods (e.g. AWQ).
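
A minimal test file sketch for the hypothetical method, loosely following the existing quantization test suites (the model id and any `my_method`-specific helpers are assumptions; `slow` and `require_torch_gpu` are the standard markers from `transformers.testing_utils`):

```py
import unittest

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.testing_utils import require_torch_gpu, slow


@slow
@require_torch_gpu
class MyMethodIntegrationTest(unittest.TestCase):
    # Placeholder: a small checkpoint pre-quantized with `my_method` and pushed to the Hub.
    model_id = "my-org/opt-125m-my-method"

    def test_quantized_model_generate(self):
        tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        model = AutoModelForCausalLM.from_pretrained(self.model_id, device_map="auto")

        inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=10)
        self.assertTrue(len(tokenizer.decode(output[0], skip_special_tokens=True)) > 0)
```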

4 changes: 4 additions & 0 deletions docs/source/en/quantization.md
@@ -20,6 +20,10 @@ Quantization techniques focus on representing data with less information while a

Transformers supports several quantization schemes to help you run inference with large language models (LLMs) and finetune adapters on quantized models. This guide will show you how to use Activation-aware Weight Quantization (AWQ), AutoGPTQ, and bitsandbytes.

## Adding new quantization methods

Interested in adding a new quantization method to transformers? Read the [HFQuantizer](./hf_quantizer) guide to learn more about it.

## AWQ

<Tip>
1 change: 1 addition & 0 deletions src/transformers/__init__.py
@@ -1002,6 +1002,7 @@
"pipeline",
],
"processing_utils": ["ProcessorMixin"],
"quantizers": [],
"testing_utils": [],
"tokenization_utils": ["PreTrainedTokenizer"],
"tokenization_utils_base": [
11 changes: 6 additions & 5 deletions src/transformers/integrations/awq.py
@@ -187,17 +187,18 @@ def fuse_awq_modules(model, quantization_config):
Args:
model (`~PreTrainedModel`):
The model to fuse - note this model should have been converted into AWQ format beforehand.
quantization_config (`dict`):
quantization_config (`Union[AwqConfig, dict]`):
The quantization configuration to use.
"""
# We need to convert it from dict in order to get an AwqConfig object
# otherwise the fields `backend` etc. will not be available
# https://github.com/huggingface/transformers/pull/27411#discussion_r1414044495
awq_config = AwqConfig.from_dict(quantization_config)
backend = awq_config.backend
if not isinstance(quantization_config, AwqConfig):
quantization_config = AwqConfig.from_dict(quantization_config)
backend = quantization_config.backend

modules_to_fuse = get_modules_to_fuse(model, awq_config)
modules_to_not_convert = getattr(awq_config, "modules_to_not_convert", None)
modules_to_fuse = get_modules_to_fuse(model, quantization_config)
modules_to_not_convert = getattr(quantization_config, "modules_to_not_convert", None)

if backend == AwqBackendPackingMethod.AUTOAWQ:
from awq.modules.fused.attn import QuantAttentionFused