
Conversation

@jmkuebler (Contributor) commented Sep 15, 2025

[WIP] see #24916 for context

Known limitations

Purpose

This PR adds a flag to skip the sliding-window (SW) layers when quantizing attention. The idea is that we cannot save much memory or latency in those layers anyway, yet quantizing adds overhead. Furthermore, quantization always carries additional quality risk. Not quantizing the SW layers leads to better performance, as outlined in #24916.
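For illustration, here is a minimal sketch of the kind of per-layer selection such a flag implies. The flag name `skip_swa_quant` and the `sliding_window` attribute are placeholders for this sketch, not the PR's actual API:

```python
# Illustrative sketch only: `skip_swa_quant` and `sliding_window` are
# placeholders, not the PR's actual flag or attribute names.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttnLayerCfg:
    name: str
    sliding_window: Optional[int]  # None -> full attention

def kv_cache_dtype_for(layer: AttnLayerCfg, skip_swa_quant: bool) -> str:
    """Pick the KV-cache dtype for one attention layer."""
    if skip_swa_quant and layer.sliding_window is not None:
        return "bfloat16"  # leave sliding-window layers unquantized
    return "fp8"           # quantize full-attention layers

layers = [AttnLayerCfg("layers.0.attn", sliding_window=128),
          AttnLayerCfg("layers.1.attn", sliding_window=None)]
print({l.name: kv_cache_dtype_for(l, skip_swa_quant=True) for l in layers})
# {'layers.0.attn': 'bfloat16', 'layers.1.attn': 'fp8'}
```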

Turning off quantization for all SW layers gives us two groups of attention configs:

  • Full attention layers in FP8
  • SW attention layers in BF16

We currently use 2x the block size for full attention in FP8 so that the page size is equal between both groups. However, as pointed out by @heheda12345, this currently makes it incompatible with prefix caching (which requires a uniform block size).
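A quick back-of-the-envelope check of the page-size matching (the head counts, head size, and block sizes below are illustrative, not taken from the PR):

```python
# Illustrative numbers only. Per-layer page bytes:
# block_size * num_kv_heads * head_size * 2 (K and V) * bytes_per_element.
def page_bytes(block_size, num_kv_heads, head_size, dtype_bytes):
    return block_size * num_kv_heads * head_size * 2 * dtype_bytes

swa_bf16 = page_bytes(block_size=16, num_kv_heads=8, head_size=64, dtype_bytes=2)
full_fp8 = page_bytes(block_size=32, num_kv_heads=8, head_size=64, dtype_bytes=1)
assert swa_bf16 == full_fp8  # equal page size, but the block sizes differ (16 vs 32)
```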

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
mergify bot added the gpt-oss (Related to GPT-OSS models) and v1 labels on Sep 15, 2025
mergify bot commented Sep 15, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jmkuebler.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@heheda12345 (Collaborator) left a comment


Is the different block_size of SWA and full attn compatible with prefix caching? We only have the BlockHash for one block_size. My plan is to combine the KV cache of two full-attn layers so that it has the same size as one sliding-window layer. I'm prototyping.

@robertgshaw2-redhat changed the title from "enable skipping of SW layers for attn quant" to "[HMA] Enabling Skipping SWA for Fp8 Quant" on Sep 16, 2025
@jmkuebler (Contributor, Author) commented

Is the different block_size of SWA and full attn compatible with prefix caching? We only have the BlockHash for one block_size. My plan is to combine the KV cache of two full-attn layers so that it has the same size as one sliding-window layer. I'm prototyping.

@heheda12345 Thanks a lot for looking into this. Great catch. I overlooked that because all the perf benchmarking I did had prefix caching disabled. But yeah, when prefix caching is enabled it breaks because of the non-uniform block sizes.

I was focusing on ensuring that the page size is uniform. If we also need the block size to be uniform, I cannot think of a way other than combining blocks either...
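For context, a simplified sketch of why prefix caching expects a uniform block size: block hashes are computed over fixed-size chunks of token ids and chained with their parent hash, so two cache groups with different block sizes would chunk (and hash) the same prefix differently. This is not vLLM's actual BlockHash code, just an illustration:

```python
# Simplified illustration, not vLLM's actual BlockHash implementation.
import hashlib

def block_hashes(token_ids, block_size):
    hashes, parent = [], ""
    # hash only full blocks, chaining each hash with its parent
    for i in range(0, len(token_ids) - len(token_ids) % block_size, block_size):
        chunk = token_ids[i:i + block_size]
        parent = hashlib.sha256((parent + str(chunk)).encode()).hexdigest()
        hashes.append(parent)
    return hashes

tokens = list(range(64))
print(block_hashes(tokens, 16)[:2])  # 16-token chunking
print(block_hashes(tokens, 32)[:2])  # 32-token chunking -> hashes don't line up
```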

@heheda12345 (Collaborator) commented

Here is my PR for supporting different data types by adjusting block_size. Can you give it a try? #24949

…_quant

# Conflicts:
#	vllm/v1/core/kv_cache_utils.py
mergify bot removed the needs-rebase label on Sep 17, 2025
@jmkuebler (Contributor, Author) commented Sep 17, 2025

Here is my PR for supporting different data types by adjusting block_size. Can you give it a try? #24949

@heheda12345
This is awesome. I merged your PR locally and it enables prefix caching. When sending the same request twice, it correctly registers the prefix cache hit and outputs the same generation 🙌.

Probably makes sense to update my PR only once yours is merged?
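A rough version of the repeat-request check described above. Assumptions: the model id is a placeholder, and the exact flag combination that will be valid once both PRs land is not confirmed here; greedy sampling keeps the two generations comparable:

```python
# Sketch of the repeat-request check; model id and flag combination are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b",
          enable_prefix_caching=True,
          kv_cache_dtype="fp8")
params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = "Explain sliding-window attention in one paragraph."
first = llm.generate([prompt], params)[0].outputs[0].text
second = llm.generate([prompt], params)[0].outputs[0].text  # should hit the prefix cache
assert first == second
```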

@heheda12345 (Collaborator) commented

Probably makes sense to update my PR only once yours is merged?

Sounds good.

mergify bot commented Sep 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jmkuebler.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Sep 21, 2025