[Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention #25674

elvischenv · 2025-09-25T13:40:49Z

Purpose

Support FP8-qkv Flashinfer TRTLLM sinks attention.
Note: require flashinfer v0.4.0rc4(updating in #26326)
flashinfer-ai/flashinfer#1758

Test Plan && Test Result

Kernel unit test:

tests/kernels/attention/test_flashinfer_trtllm_attention.py

===== 224 passed, 16 skipped in 30.94s ====

E2E accuracy:

kv_cache_dtype=fp8

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20250925_081936', 'metric': 0.7904040404040404}]
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20250925_084243', 'metric': 0.925}]

kv_cache_dtype=auto

[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-high_temp1.0_20250925_090839', 'metric': 0.7910353535353535}]
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20250925_093344', 'metric': 0.9125}]

E2E perf:

kv_cache_dtype=fp8, 6.9% perf gain

Output token throughput (tok/s):         22427.23

kv_cache_dtype=auto

Output token throughput (tok/s):         20977.07

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify · 2025-10-07T01:56:00Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

nvpohanh · 2025-10-09T07:26:41Z

@elvischenv Please fix the conflict. thanks!

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

…on (vllm-project#25674) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

…on (vllm-project#25674) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>

…on (vllm-project#25674) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

…on (vllm-project#25674) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

…on (vllm-project#25674) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

mergify bot added the gpt-oss Related to GPT-OSS models label Sep 25, 2025

github-project-automation bot added this to gpt-oss Issues & Enhancements Sep 25, 2025

github-project-automation bot moved this to To Triage in gpt-oss Issues & Enhancements Sep 25, 2025

elvischenv force-pushed the elvischenv/support-fp8-sinks-trtllm-attn branch from 63d414b to ac1611c Compare September 26, 2025 07:19

mergify bot added the needs-rebase label Oct 7, 2025

elvischenv force-pushed the elvischenv/support-fp8-sinks-trtllm-attn branch from ac1611c to 8d6c871 Compare October 7, 2025 02:17

mergify bot removed the needs-rebase label Oct 7, 2025

elvischenv force-pushed the elvischenv/support-fp8-sinks-trtllm-attn branch from 8d6c871 to 9a53778 Compare October 7, 2025 09:06

mergify bot added ci/build v1 labels Oct 7, 2025

elvischenv marked this pull request as ready for review October 7, 2025 09:12

elvischenv requested review from WoosukKwon, mgoin, robertgshaw2-redhat, tlrmchlsmth and yewentao256 as code owners October 7, 2025 09:12

elvischenv force-pushed the elvischenv/support-fp8-sinks-trtllm-attn branch from 9a53778 to 8bfb46f Compare October 9, 2025 07:31

elvischenv added 2 commits October 9, 2025 00:32

support TRTLLM FP8 sinks attn kernel

b18c17f

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

add sinks attn unit tests

868f3de

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

elvischenv force-pushed the elvischenv/support-fp8-sinks-trtllm-attn branch from 8bfb46f to 868f3de Compare October 9, 2025 07:32

mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 9, 2025

Merge branch 'main' into elvischenv/support-fp8-sinks-trtllm-attn

c6a4290

mgoin approved these changes Oct 9, 2025

View reviewed changes

github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Oct 9, 2025

mgoin merged commit 44f633d into vllm-project:main Oct 9, 2025
49 checks passed

github-project-automation bot moved this from Ready to Done in gpt-oss Issues & Enhancements Oct 9, 2025

elvischenv deleted the elvischenv/support-fp8-sinks-trtllm-attn branch October 10, 2025 01:33

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

[Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attenti…

18fd37b

…on (vllm-project#25674) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025

[Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attenti…

9b7b8d0

…on (vllm-project#25674) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention #25674

[Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention #25674

Uh oh!

elvischenv commented Sep 25, 2025 •

edited by github-actions bot

Loading

Uh oh!

mergify bot commented Oct 7, 2025

Uh oh!

nvpohanh commented Oct 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

[Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention #25674

[Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention #25674

Uh oh!

Conversation

elvischenv commented Sep 25, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan && Test Result

Kernel unit test:

E2E accuracy:

E2E perf:

Uh oh!

mergify bot commented Oct 7, 2025

Uh oh!

nvpohanh commented Oct 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

elvischenv commented Sep 25, 2025 •

edited by github-actions bot

Loading