Create INT8 KV Cache on Qserve #2446

Open
wants to merge 2 commits into base: main
Conversation

@dleunji commented Nov 14, 2024

Hi,

Thanks for your contributions and updates to Qserve.

I added an INT8 KV cache feature as well.

Previously, the scale factor was calculated using the maximum value among the outputs of q_proj, k_proj, and v_proj. (code)

However, I found that this does not work well with Qserve.

It only works well with Qserve when the scale factor is calculated solely from the outputs of k_proj and v_proj.
This differs from the INT8 KV cache in the Qserve paper, which uses a dynamic cache; still, this INT8 KV cache is a sufficient alternative for Qserve and preserves high accuracy.
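For illustration, here is a minimal sketch of how such a per-tensor INT8 scale could be derived from k_proj/v_proj calibration outputs only; the function name and calling convention are hypothetical, not the actual code in this PR:

```python
import torch

def kv_cache_scale_from_calib(k_out: torch.Tensor, v_out: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: k_out / v_out are k_proj and v_proj activations
    # collected during calibration; q_proj outputs are deliberately excluded.
    kv_amax = torch.max(k_out.abs().max(), v_out.abs().max())
    # Symmetric INT8 quantization maps values into [-127, 127].
    return kv_amax / 127.0
```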

[Reference]
Qserve computes the KV cache scales separately for k and v, giving each its own scale. (code)
TensorRT-LLM, however, merges the k and v scales into a single kv_cache_scaling_factor derived from the outputs of qkv_proj, which made it difficult to reproduce Qserve's KV cache scaling style in TensorRT-LLM. I therefore modified the approach to obtain the KV cache scale without considering q_proj, which is closer to Qserve.
With this change I got much higher output quality.
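As a rough sketch of how Qserve's separate k and v scales could be folded into the single kv_cache_scaling_factor that TensorRT-LLM expects: the merging rule shown here (keeping the larger, more conservative scale) and the function name are assumptions for illustration only.

```python
import torch

def merge_qserve_kv_scales(k_scale: torch.Tensor, v_scale: torch.Tensor) -> torch.Tensor:
    # QServe keeps separate per-tensor scales for k and v, while TensorRT-LLM
    # expects one kv_cache_scaling_factor. Keeping the larger scale is one
    # conservative way to fold them together (assumed rule, for illustration).
    return torch.max(k_scale, v_scale)

# Example with illustrative calibration values:
kv_cache_scaling_factor = merge_qserve_kv_scales(torch.tensor(0.021), torch.tensor(0.034))
```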

@lkm2835 (Contributor) commented Nov 14, 2024

This is related to #2444

@hello-11 added the triaged label (Issue has been triaged by maintainers) on Nov 18, 2024
@bobboli commented Nov 18, 2024

Hi,
Thank you for your contribution! The current checkpoint conversion is implemented in the legacy path, whereas we plan to migrate to the unified converter in the future. After that, we can handle the combination of KV cache quantization with w4a8 in a more unified way.

Since you heavily modified load_weights_from_lmquant, which will be deprecated, we will not proceed with this PR. However, we will take note of your observation about not using q_proj for calibration.

Thank you!
