Quantized KV cache: update quanto #31052

zucchini-nlp · 2024-05-27T07:56:45Z

What does this PR do?

The latest version of quanto was refactored in a way that broke the quantized cache. This PR updates QuantoQuantizedCache to work with the latest version of quanto.

younesbelkada

Thanks!
My only question is whether these changes are backward compatible: would this work with the previous quanto version? If not, we could raise an error when initializing the QuantoCache telling users to install quanto>=0.2.0. What do you think?

HuggingFaceDocBuilderDev · 2024-05-27T08:17:29Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp · 2024-05-27T08:30:51Z

@younesbelkada right, it's not compatible with older versions. I can add a warning or raise an error asking users to update quanto to the latest version, if that's the usual workflow in these situations? :)

younesbelkada · 2024-05-27T08:35:45Z

Since the quanto cache is not part of a release yet, I think we should just require quanto >= 0.2.0. For that, I would check the quanto version using something like this: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L1576C53-L1576C79 and raise an error at cache init if the minimum quanto version is not met.
With respect to the imports, we'll need to make sure to import AffineQuantizer and MaxOptimizer only if the quanto version is at least 0.2.0.
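
For readers following along, the version guard suggested above could be sketched roughly as below. This is an illustration only, not the code that was merged: `parse_release`, `is_at_least`, `require_quanto` and `VersionGuardedCache` are hypothetical names, the comparison looks only at the leading numeric components (so a pre-release such as 0.2.0.dev0 counts as 0.2.0), and the real check may well rely on a proper version parser instead.

```python
import importlib.metadata
import re
from typing import Optional, Tuple

MIN_QUANTO_VERSION = "0.2.0"  # minimum version named in this thread


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '0.2.0.dev0' -> (0, 2, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two versions after zero-padding them to the same length."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def require_quanto(installed: Optional[str] = None) -> None:
    """Raise ImportError unless quanto >= MIN_QUANTO_VERSION is available.

    `installed` exists only so the check can be exercised without quanto;
    by default the version is read from the installed package metadata.
    """
    if installed is None:
        try:
            installed = importlib.metadata.version("quanto")
        except importlib.metadata.PackageNotFoundError as exc:
            raise ImportError("quanto is not installed.") from exc
    if not is_at_least(installed, MIN_QUANTO_VERSION):
        raise ImportError(
            f"Found quanto {installed}, but quanto>={MIN_QUANTO_VERSION} is required. "
            "Please upgrade quanto."
        )


class VersionGuardedCache:
    """Hypothetical stand-in for a cache class that checks the version at init."""

    def __init__(self, installed_version: Optional[str] = None) -> None:
        require_quanto(installed_version)
        # Only after the guard passes would the version-specific names be
        # imported, e.g. AffineQuantizer and MaxOptimizer as mentioned above.
```

Running the check in the constructor means an outdated install fails immediately with an actionable message rather than later with an unrelated import or attribute error.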

SunMarc

LGTM! I think it would be interesting to rerun the benchmark with quanto for int4 KV-cache quantization. @dacorvo said that the latency of models quantized with qint4 weights is drastically reduced (almost divided by two).

younesbelkada

Clean work, thanks !

dacorvo · 2024-05-27T12:20:05Z

SunMarc wrote: "LGTM! I think it will be interesting to rerun the benchmark with quanto for int4 kv-cache quantization. @dacorvo said that the latency of models quantized with qint4 weights is drastically reduced (almost divided by two)."

The new kernels only work when doing a matmul between fp16 inputs and int4 weights. In this case the KV are just dequantized, and I haven't bound the kernel for that (the CUDA dequantize method is included in quanto though so maybe you can give it a try).

dacorvo · 2024-05-27T12:24:38Z

More specifically: https://github.com/huggingface/quanto/blob/f545e01443767b8920609bfcde5417fe064eedfd/quanto/library/ext/cuda/awq/dequantize.cuh#L14
This is the fast int4 -> fp16 dequantizer for packed int4 "a la AWQ". This should be bound in pybind and called here: https://github.com/huggingface/quanto/blob/f545e01443767b8920609bfcde5417fe064eedfd/quanto/tensor/qbits/awq/qbits.py#L32
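
To make the term concrete, here is a minimal pure-Python sketch of what dequantizing packed int4 values involves: two 4-bit codes share one byte, and each code is mapped back to a float with a scale and a zero-point. It is an illustration under assumptions, not quanto's implementation: the function names are hypothetical, the low-nibble-first packing is a simple layout chosen for clarity (the AWQ layout used by the CUDA kernel orders values differently), and the affine formula `(q - zero_point) * scale` is one common convention.

```python
from typing import List, Sequence


def pack_int4(values: Sequence[int]) -> bytes:
    """Pack an even number of 4-bit codes (0..15) two per byte, low nibble first."""
    if len(values) % 2 != 0:
        raise ValueError("Need an even number of values to pack two per byte.")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("int4 codes must be in the range 0..15.")
    return bytes((hi << 4) | lo for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed: bytes) -> List[int]:
    """Recover the 4-bit codes in their original order."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble was stored first
        codes.append(byte >> 4)    # high nibble second
    return codes


def dequantize_int4(packed: bytes, scale: float, zero_point: int) -> List[float]:
    """Affine dequantization: map each 4-bit code back to a float."""
    return [(code - zero_point) * scale for code in unpack_int4(packed)]


codes = [0, 15, 7, 8]
packed = pack_int4(codes)  # 4 codes stored in 2 bytes
restored = dequantize_int4(packed, scale=0.5, zero_point=8)
```

A fused kernel does the same unpack-and-rescale work directly on the GPU, which is why binding it could matter for the KV cache, where values are dequantized every time they are read.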

zucchini-nlp · 2024-05-27T13:25:27Z

@dacorvo i see, thanks! I'll check it out!

amyeroberts

Thanks for updating and adding the version guard!

src/transformers/cache_utils.py

quanto latest version was refactored

23aaf95

zucchini-nlp requested review from younesbelkada and SunMarc May 27, 2024 07:56

younesbelkada reviewed May 27, 2024

add error msg

f829acf

SunMarc approved these changes May 27, 2024

younesbelkada approved these changes May 27, 2024

incorrect compare sign

f851298

zucchini-nlp requested a review from amyeroberts May 27, 2024 13:35

amyeroberts approved these changes May 28, 2024

zucchini-nlp and others added 3 commits May 28, 2024 14:40

Merge branch 'huggingface:main' into quant

92aa648

Update src/transformers/cache_utils.py

3d6492d

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

Merge branch 'huggingface:main' into quant

d58b991

zucchini-nlp merged commit d521ba5 into huggingface:main May 29, 2024
21 checks passed
