
Conversation

@jmkuebler (Contributor) commented Sep 15, 2025

[WIP] see #24916 for context

Known limitations

Purpose

This PR adds a flag to skip the sliding-window (SW) layers when quantizing attention. The idea is that we cannot save much memory or latency in those layers anyway, yet quantizing adds overhead. Furthermore, quantization always carries additional quality risk. Not quantizing the SW layers leads to better performance, as outlined in #24916.
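For illustration, here is a minimal sketch of the kind of per-layer selection such a flag implies. The flag name `skip_swa_quant` and the `sliding_window` attribute are placeholders for this sketch, not the PR's actual API:

```python
# Illustrative sketch only: `skip_swa_quant` and `sliding_window` are
# placeholders, not the PR's actual flag or attribute names.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttnLayerCfg:
    name: str
    sliding_window: Optional[int]  # None -> full attention

def kv_cache_dtype_for(layer: AttnLayerCfg, skip_swa_quant: bool) -> str:
    """Pick the KV-cache dtype for one attention layer."""
    if skip_swa_quant and layer.sliding_window is not None:
        return "bfloat16"  # leave sliding-window layers unquantized
    return "fp8"           # quantize full-attention layers

layers = [AttnLayerCfg("layers.0.attn", sliding_window=128),
          AttnLayerCfg("layers.1.attn", sliding_window=None)]
print({l.name: kv_cache_dtype_for(l, skip_swa_quant=True) for l in layers})
# {'layers.0.attn': 'bfloat16', 'layers.1.attn': 'fp8'}
```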

Turning off quantization for all SW layers gives us two groups of attention configs:

  • Full attention layers in FP8
  • SW attention layers in BF16

We currently use 2x the block size for full attention in FP8 so that the page size is equal between both groups. However, as pointed out by @heheda12345, this currently makes it incompatible with prefix caching (which requires a uniform block size).
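A quick back-of-the-envelope check of the page-size matching (the head counts, head size, and block sizes below are illustrative, not taken from the PR):

```python
# Illustrative numbers only. Per-layer page bytes:
# block_size * num_kv_heads * head_size * 2 (K and V) * bytes_per_element.
def page_bytes(block_size, num_kv_heads, head_size, dtype_bytes):
    return block_size * num_kv_heads * head_size * 2 * dtype_bytes

swa_bf16 = page_bytes(block_size=16, num_kv_heads=8, head_size=64, dtype_bytes=2)
full_fp8 = page_bytes(block_size=32, num_kv_heads=8, head_size=64, dtype_bytes=1)
assert swa_bf16 == full_fp8  # equal page size, but the block sizes differ (16 vs 32)
```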

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
mergify bot added the gpt-oss (Related to GPT-OSS models) and v1 labels on Sep 15, 2025
mergify bot commented Sep 15, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jmkuebler.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@heheda12345 (Collaborator) left a comment


Is the different block_size of SWA and full attn compatible with prefix caching? We only have the BlockHash for one block_size. My plan is to combine the KV cache of two full-attn layers so that it has the same size as one sliding-window layer. I'm prototyping.

@robertgshaw2-redhat changed the title from "enable skipping of SW layers for attn quant" to "[HMA] Enabling Skipping SWA for Fp8 Quant" on Sep 16, 2025
@jmkuebler (Contributor, Author) commented

Is the different block_size of SWA and full attn compatible with prefix caching? We only have the BlockHash for one block_size. My plan is to combine the KV cache of two full-attn layers so that it has the same size as one sliding-window layer. I'm prototyping.

@heheda12345 Thanks a lot for looking into this. Great catch. I overlooked that because all the perf benchmarking I did had prefix caching disabled. But yeah, when prefix caching is enabled it breaks because of the non-uniform block sizes.

I was focusing on ensuring that the page size is uniform. If we also need the block size to be uniform, I cannot think of a way other than combining blocks either...
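For context, a simplified sketch of why prefix caching expects a uniform block size: block hashes are computed over fixed-size chunks of token ids and chained with their parent hash, so two cache groups with different block sizes would chunk (and hash) the same prefix differently. This is not vLLM's actual BlockHash code, just an illustration:

```python
# Simplified illustration, not vLLM's actual BlockHash implementation.
import hashlib

def block_hashes(token_ids, block_size):
    hashes, parent = [], ""
    # hash only full blocks, chaining each hash with its parent
    for i in range(0, len(token_ids) - len(token_ids) % block_size, block_size):
        chunk = token_ids[i:i + block_size]
        parent = hashlib.sha256((parent + str(chunk)).encode()).hexdigest()
        hashes.append(parent)
    return hashes

tokens = list(range(64))
print(block_hashes(tokens, 16)[:2])  # 16-token chunking
print(block_hashes(tokens, 32)[:2])  # 32-token chunking -> hashes don't line up
```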

@heheda12345 (Collaborator) commented

Here is my PR for supporting different data types by adjusting block_size. Can you give it a try? #24949

…_quant

# Conflicts:
#	vllm/v1/core/kv_cache_utils.py
mergify bot removed the needs-rebase label on Sep 17, 2025
@jmkuebler (Contributor, Author) commented Sep 17, 2025

Here is my PR for supporting different data types by adjusting block_size. Can you give it a try? #24949

@heheda12345
This is awesome. I merged your PR locally and it enables prefix caching. When sending the same request twice, it correctly registers the prefix cache hit and outputs the same generation 🙌.

Probably makes sense to update my PR only once yours is merged?
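A rough version of the repeat-request check described above. Assumptions: the model id is a placeholder, and the exact flag combination that will be valid once both PRs land is not confirmed here; greedy sampling keeps the two generations comparable:

```python
# Sketch of the repeat-request check; model id and flag combination are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b",
          enable_prefix_caching=True,
          kv_cache_dtype="fp8")
params = SamplingParams(temperature=0.0, max_tokens=64)

prompt = "Explain sliding-window attention in one paragraph."
first = llm.generate([prompt], params)[0].outputs[0].text
second = llm.generate([prompt], params)[0].outputs[0].text  # should hit the prefix cache
assert first == second
```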

@heheda12345 (Collaborator) commented

Probably makes sense to update my PR only once yours is merged?

Sounds good.

mergify bot commented Sep 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jmkuebler.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Sep 21, 2025