
Separate kv_scale into k_scale and v_scale #25

Merged: 2 commits merged into main on Jul 23, 2024

Conversation

mgoin (Member) commented on Jul 3, 2024

Required for vllm-project/vllm#6081

Since we already quantize key_cache and value_cache separately in PagedAttention, there is free accuracy on the table for FP8 KV cache quantization: we can use a separate per-tensor scale for each cache instead of a single shared kv_scale.

The FlashInfer FP8 attention kernel also uses separate k_scale and v_scale values, so this PR is in preparation to enable that usage. Source: https://github.com/flashinfer-ai/flashinfer/blob/dc2c76f8577d8695112b61d1fd43ef88569272ef/python/flashinfer/decode.py#L98-L101

mgoin changed the title from "Separate kv_scale into key_scale and value_scale" to "Separate kv_scale into k_scale and v_scale" on Jul 16, 2024
mgoin merged commit 2cd265f into main on Jul 23, 2024 (4 checks passed)