FEAT : Adding BitNet quantization method to HFQuantizer #33410
Conversation
Force-pushed from ee0d770 to 3848966
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for your work @MekkCyber ! This looks pretty good ! I've left a few comments. Could you also add some tests and update the documentation about this new quantizer + fix CI ?
if not torch.cuda.is_available():
    raise RuntimeError("No GPU found. A GPU is needed for quantization.")
This should also run on CPU, no, since we are just using compile?
Yes it runs on cpu, but it's slow
if device_map is None:
    logger.warning_once(
        "You have loaded an BitNet model on CPU and have a CUDA device available, make sure to set "
        "your model on a GPU device in order to run your model."
    )
to update according to how we fix the above comment
Can you have a look at the BitLinear class, @dacorvo? I would love to have your insights!
Nice job for adding this @MekkCyber ! Just a few nits ! Could you also update the quantization overview and create a page for bitnet ? I would link to the nanotron PR for users who want to fine-tune their model, and link a script to perform the conversion to the right format (quantization + packing). This way, users will be able to load their 1.58-bit models with this quantizer.
packed_tensor_shape = (row_dim, *original_shape[1:])

packed = torch.zeros(packed_tensor_shape, device=quantized_weights.device, dtype=torch.uint8)
unpacked = quantized_weights.to(torch.uint8)
nit: for clarity I would have put this line immediately under line 46 as they are both related to the conversion from [-1, 0, 1]/int8 to [0, 1, 2]/uint8.
I did the opposite: I put the line where I add +1 immediately before this one, because `packed_tensor_shape` is not defined before.
packed = torch.zeros(packed_tensor_shape, device=quantized_weights.device, dtype=torch.uint8)
unpacked = quantized_weights.to(torch.uint8)

def lshift(t: torch.Tensor, bits: int):
This code reminds me of something ... ;-).
https://github.com/huggingface/optimum-quanto/blob/f62c887731cfc4800f930ba55c3da0262f10f84e/optimum/quanto/tensor/qbits/packed.py#L24
If you don't plan to support the MPS device, then you can inline the `<<` operation in the loop.
Yes it's based on quanto 😅
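For readers landing here, a minimal sketch of the quanto-style packing scheme this thread discusses (a rough reconstruction under stated assumptions, not the exact PR code; `pack_weights`, `lshift`, and `VALUES_PER_ITEM` are illustrative names):

```python
import torch

VALUES_PER_ITEM = 4  # four 2-bit values fit in one uint8 element


def lshift(t: torch.Tensor, bits: int) -> torch.Tensor:
    # Multiplying by a power of two instead of using `<<` keeps MPS compatibility, as in quanto;
    # if MPS is not a target, `t << bits` can be inlined directly in the loop below.
    return t * (2**bits)


def pack_weights(quantized_weights: torch.Tensor) -> torch.Tensor:
    # quantized_weights holds ternary int8 values in [-1, 0, 1]
    original_shape = quantized_weights.shape
    row_dim = (original_shape[0] + VALUES_PER_ITEM - 1) // VALUES_PER_ITEM
    packed_tensor_shape = (row_dim, *original_shape[1:])

    packed = torch.zeros(packed_tensor_shape, device=quantized_weights.device, dtype=torch.uint8)
    # shift [-1, 0, 1] to [0, 1, 2] so each value fits in two unsigned bits
    unpacked = (quantized_weights + 1).to(torch.uint8)

    for i in range(VALUES_PER_ITEM):
        start = i * row_dim
        end = min(start + row_dim, original_shape[0])
        packed[: end - start] |= lshift(unpacked[start:end], 2 * i)

    return packed
```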
self.register_buffer(
    "weight",
    torch.zeros(
        (out_features // 4, in_features),
Consider defining a constant for the number of values per item and use it in the whole file.
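Something along these lines, as a rough illustration of the suggestion (the skeleton class is hypothetical; the PR later adopts `VALUES_PER_ITEM` for this constant):

```python
import torch
import torch.nn as nn

VALUES_PER_ITEM = 4  # number of 2-bit ternary values packed into one uint8 element


class BitLinearSkeleton(nn.Module):
    # Illustrative skeleton only: the constant replaces the hard-coded `4` in the buffer shape.
    def __init__(self, in_features: int, out_features: int, device=None):
        super().__init__()
        self.register_buffer(
            "weight",
            torch.zeros((out_features // VALUES_PER_ITEM, in_features), dtype=torch.uint8, device=device),
        )
```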
        device=device,
    ),
)
self.register_buffer(
Quantizing per-tensor (i.e. using a single scale value for the whole tensor) at that level of quantization will greatly reduce the precision. Are the quantized weights supposed to be fine-tuned ?
Yes, BitNet is a QAT method, not PTQ, so the model is fine-tuned with fake quantization layers.
Qn = -(2 ** (num_bits - 1))
Qp = 2 ** (num_bits - 1) - 1
Note: this does not produce ternary outputs for bits=2 (but outputs in [-2, -1, 0, 1]).
Yes this is only for activations, as they are quantized in the range [-128, 127] for b = 8
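Concretely, the symmetric range implied by these two lines (a quick illustration, not PR code):

```python
# num_bits = 8 -> Qn = -128, Qp = 127  (the activation range mentioned above)
# num_bits = 2 -> Qn = -2,   Qp = 1    (hence {-2, -1, 0, 1}, not ternary)
for num_bits in (8, 2):
    Qn = -(2 ** (num_bits - 1))
    Qp = 2 ** (num_bits - 1) - 1
    print(f"{num_bits} bits: [{Qn}, {Qp}]")
```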
@torch.compile
def activation_quant(self, x, num_bits=8):
    """
    Activation function : Performs symmetric, per-channel quantization on the input activations.
per-channel is a bit misleading here: by convention, 'channel' usually denotes the last dimension, so one would expect the quantized output to have one scale per 'channel', i.e. per slice along the last dimension, whereas here you obtain one scale for each slice along the first dimension (typically the tokens in language models).
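To make the distinction concrete, a hedged sketch of per-token (rather than per-channel) symmetric activation quantization; the epsilon clamp value and helper name are assumptions for illustration, not taken from the PR:

```python
import torch


def activation_quant_sketch(x: torch.Tensor, num_bits: int = 8):
    # One scale per slice along the first dimension (per token for a [tokens, hidden] input),
    # not one scale per channel along the last dimension.
    Qn = -(2 ** (num_bits - 1))
    Qp = 2 ** (num_bits - 1) - 1
    scale = Qp / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    x_q = (x * scale).round().clamp(Qn, Qp)
    return x_q, scale


x = torch.randn(3, 8)            # 3 tokens, hidden size 8
x_q, scale = activation_quant_sketch(x)
print(scale.shape)               # torch.Size([3, 1]) -> one scale per token
```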
out = input / si
out = out / sw
This could be done more efficiently as:
out = input / (si * sw)
w_quant = unpack_weights(w, dtype=self.dtype)
x_quant, x_scale = self.activation_quant(x)
y = F.linear(x_quant.to(self.dtype), w_quant)
y = self.post_quant_process(y, self.weight_scale, x_scale)
This is prone to overflows in the matmul accumulator: when unpacking, you basically recreate a tensor of ternary values expressed in full precision, so not only does it occupy a large chunk of device memory, but it is also expressed in a range ([-1, 1]) that is usually larger than that of the original float16 weights.
You should therefore apply the weight scale immediately to come back into the expected computation range. Same thing for the activations.
In other words: apply the scales before the computation, not afterwards.
Yes, I agree, but for now we are doing the multiplication in bfloat16 or float32, which have a sufficient range not to overflow. The idea after that is to integrate a kernel for int8 × int2 multiplication, which is why we don't apply the scales directly after the unpacking.
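For clarity, a hedged sketch of the two orderings being discussed (function names are illustrative; both divide by the scales, since dequantization in this PR divides the output by the activation and weight scales):

```python
import torch
import torch.nn.functional as F


def linear_scales_after(x_q, x_scale, w_q, w_scale, dtype=torch.bfloat16):
    # Current approach: matmul on the raw quantized values, rescale afterwards.
    # Fine with bfloat16/float32 accumulation, and keeps the door open for an int8 x int2 kernel.
    y = F.linear(x_q.to(dtype), w_q.to(dtype))
    return y / (x_scale * w_scale)


def linear_scales_before(x_q, x_scale, w_q, w_scale, dtype=torch.bfloat16):
    # Reviewer's suggestion: bring operands back into their float range before the matmul,
    # which avoids accumulator overflow with lower-precision accumulation.
    return F.linear(x_q.to(dtype) / x_scale, w_q.to(dtype) / w_scale)
```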
def validate_environment(self, *args, **kwargs):
    if not is_accelerate_available():
        raise ImportError("Loading an BitNet quantized model requires accelerate (`pip install accelerate`)")
raise ImportError("Loading an BitNet quantized model requires accelerate (`pip install accelerate`)") | |
raise ImportError("Loading a BitNet quantized model requires accelerate (`pip install accelerate`)") |
if kwargs.get("from_tf", False) or kwargs.get("from_flax", False):
    raise ValueError(
        "Converting into 8-bit weights from tf/flax weights is currently not supported, please make"
Ternary ?
Thanks for adding this new quantizer ! Excited to see how this quantizer will evolve in the future ! Just a few nits.
X_{dequantized} = X_q * scale_x
$$

To learn more about how we trained and fine-tuned BitNet models, check out the blogpost [here](https://)
To update when the blogpost is released
""" | ||
Unpacks a tensor of quantized weights that were stored in a packed format using 2 bits per value. | ||
|
||
Parameters: | ||
----------- | ||
packed : torch.Tensor | ||
A tensor containing packed weights where each element represents 4 quantized values (using 2 bits per value). | ||
dtype : torch.dtype | ||
The dtype of the returned Tensor | ||
Returns: | ||
-------- | ||
torch.Tensor | ||
A tensor of unpacked weights, where each value is converted from its packed 2-bit representation. | ||
|
||
Example: | ||
-------- | ||
packed = torch.tensor([[0b10100001, 0b00011000], | ||
[0b10010000, 0b00001010]], dtype=torch.uint8) | ||
|
||
# Unpack the values | ||
unpacked = unpack_weights(packed) | ||
|
||
# Resulting unpacked tensor | ||
print(unpacked) | ||
# Output: tensor([[ 0, -1], | ||
[-1, 1], | ||
[-1, 1], | ||
[-1, 1], | ||
[ 1, 0], | ||
[ 0, -1], | ||
[ 1, -1], |
Nice ! Thanks for adding !
if isinstance(device_map, dict) and ("cpu" in device_map.values() or "disk" in device_map.values()):
    logger.warning_once(
        "You are attempting to load a BitNet model with a device_map that contains a CPU or disk device."
        "This will degrade the inference speed because of weight unpacking"
    )
Does it really work with cpu or disk offload? Since BitNetLinear is essentially composed of buffers, make sure to use `with init_empty_weights(include_buffers=True)` in `_replace_with_bitnet_linear`, plus `device_map_kwargs["offload_buffers"] = True` in `modeling_utils.py` (check the fbgemm-fp8 code); otherwise, the buffers won't be offloaded correctly.
We can do that in a follow-up PR if needed and just raise an error here for now.
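As a rough, hypothetical illustration of the `include_buffers=True` point (`DummyBitLinear` is a stand-in for the PR's BitLinear, not its actual code):

```python
import torch
import torch.nn as nn
from accelerate import init_empty_weights


class DummyBitLinear(nn.Module):
    # Stand-in for a buffer-only module like BitLinear: the packed weight is a buffer, not a Parameter.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.register_buffer("weight", torch.zeros(out_features // 4, in_features, dtype=torch.uint8))


# Without include_buffers=True, buffers are materialized immediately and are not handled
# like parameters during offload; with it, they stay on the meta device until dispatch.
with init_empty_weights(include_buffers=True):
    layer = DummyBitLinear(16, 16)

print(layer.weight.device)  # meta
# The other half of the suggestion is setting device_map_kwargs["offload_buffers"] = True
# when dispatching the model in modeling_utils.py (see the fbgemm-fp8 integration).
```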
original_row_dim = packed_shape[0] * VALUES_PER_ITEM
unpacked_shape = (original_row_dim, *packed_shape[1:])

unpacked = torch.zeros(unpacked_shape, device=packed.device, dtype=torch.uint8)
Wouldn't it make more sense to have `dtype=dtype` here and then change line 136 to:
`unpacked[start:end] = torch.tensor(((packed & mask) >> (2 * i)) - 1, dtype=dtype)`
This way you could generalize this to other types than int8 (and possibly avoid a copy).
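To tie the two suggestions together, a hedged sketch of an unpacking helper that allocates directly in the target dtype (a reconstruction for illustration only, with explicit parentheses around the shift since `-` binds tighter than `>>` in Python):

```python
import torch

VALUES_PER_ITEM = 4  # four 2-bit values per uint8 element


def unpack_weights(packed: torch.Tensor, dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # Recover 4 ternary values from every uint8 element, writing straight into `dtype`
    # so no int8 intermediate (and extra copy) is needed.
    packed_shape = packed.shape
    original_row_dim = packed_shape[0] * VALUES_PER_ITEM
    unpacked = torch.zeros((original_row_dim, *packed_shape[1:]), device=packed.device, dtype=dtype)

    for i in range(VALUES_PER_ITEM):
        mask = 3 << (2 * i)                                # select the i-th pair of bits
        start, end = i * packed_shape[0], (i + 1) * packed_shape[0]
        # shift back to [0, 1, 2], then subtract 1 to return to the ternary range [-1, 0, 1]
        unpacked[start:end] = ((packed & mask) >> (2 * i)).to(dtype) - 1

    return unpacked
```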
🚀 let's make sure we link the resources for training: a gist, or a link to a repo!
Thanks for iterating ! Can you fix the merge conflicts ? Also, in a previous PR, we changed the `is_serializable` method: it's not a property anymore, so you need to change this here as well. Thanks !
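A minimal sketch of the shape of that change (standalone and hypothetical; the exact signature should follow the updated `HfQuantizer` base class):

```python
class BitNetQuantizerSketch:
    # Previously exposed as a property (`quantizer.is_serializable`); after the refactor,
    # it is a plain method that callers invoke (`quantizer.is_serializable(...)`).
    def is_serializable(self, safe_serialization=None):
        return True
```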
IMO missing one small test and good to go!
@slow
@require_torch_gpu
@require_accelerate
class BitNetTest(unittest.TestCase):
let's test serialization as well! 🤗
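For example, something along these lines (a sketch only: the checkpoint id is a placeholder and the exact assertions should mirror the other quantizer test suites):

```python
import tempfile
import unittest

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class BitNetSerializationSketch(unittest.TestCase):
    def test_save_and_reload(self):
        model_id = "<pre-quantized-bitnet-checkpoint>"  # placeholder, not an actual repo id
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")

        with tempfile.TemporaryDirectory() as tmp_dir:
            model.save_pretrained(tmp_dir)
            reloaded = AutoModelForCausalLM.from_pretrained(tmp_dir, device_map="cuda")

            inputs = tokenizer("Hello", return_tensors="pt").to(reloaded.device)
            with torch.no_grad():
                logits = reloaded(**inputs).logits
            self.assertTrue(torch.isfinite(logits).all())
```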
Thanks for iterating ! Merging !
…33410)

* rebasing changes
* fixing style
* adding some doc to functions
* remove bitblas
* change dtype
* fixing check_code_quality
* fixing import order
* adding doc to tree
* Small update on BitLinear
* adding some tests
* sorting imports
* small update
* reformatting
* reformatting
* reformatting with ruff
* adding assert
* changes after review
* update disk offloading
* adapting after review
* Update after review
* add is_serializable back
* fixing style
* adding serialization test
* make style
* small updates after review
What does this PR do?
This pull request introduces a new quantization method: BitNet quantization at 1.58 bits. It enables users to load and utilize quantized & packed models with ternary weights directly in Transformers, providing out-of-the-box inference.
Who can review?
@SunMarc