[0.9.1][2/N][Feat] Restore paged attention kernel in Full Graph for performence #1677

yiz-liu · 2025-07-08T12:15:48Z

What this PR does / why we need it?

Fixes a performance regression in Full Graph mode, where the FIA kernel underperformed the PA (paged attention) kernel. This PR restores the PA kernel in Full Graph and makes that possible by updating the PA parameters dynamically during graph replay.
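To illustrate what "dynamic updates of parameters during graph replay" means, here is a minimal, framework-free sketch. It is a toy model only: `ToyGraph`, `toy_attention`, and `update_params_and_replay` are hypothetical names invented for this example and are not part of torch_npu or vllm-ascend. The idea it shows is that a captured graph replays the same recorded ops against the same buffers, so per-step values such as sequence lengths have to be written into those buffers in place before each replay.

```python
# Toy model of graph capture/replay; NOT the torch_npu or vllm-ascend API.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyGraph:
    """Records zero-argument ops at capture time and replays them verbatim."""
    ops: List[Callable[[], None]] = field(default_factory=list)

    def capture(self, op: Callable[[], None]) -> None:
        self.ops.append(op)

    def replay(self) -> None:
        for op in self.ops:
            op()


def toy_attention(seq_lens: List[int], out: List[int]) -> None:
    """Stand-in for an attention kernel whose result depends on per-request lengths."""
    for i, n in enumerate(seq_lens):
        out[i] = n * 2


# Buffers are allocated once; the captured op holds references to them.
seq_lens = [0, 0]
out = [0, 0]

graph = ToyGraph()
graph.capture(lambda: toy_attention(seq_lens, out))


def update_params_and_replay(new_seq_lens: List[int]) -> List[int]:
    # Write the new parameters into the captured buffer in place (same list
    # object), then replay the recorded ops without re-capturing.
    seq_lens[:] = new_seq_lens
    graph.replay()
    return list(out)


first = update_params_and_replay([3, 5])
second = update_params_and_replay([4, 6])
```

Rebinding `seq_lens` to a new list instead of assigning through `seq_lens[:]` would leave the captured op reading the stale buffer, which is the failure mode an in-place parameter update avoids.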

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>


Note: This depends on [vLLM #25161](vllm-project/vllm#25161) and the torch\_npu release from September 30.

### What this PR does / why we need it?

This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as one static graph also mitigates the dispatch fluctuations across devices.
* **Stream/resource savings:** Consolidating graph captures frees up streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay; we’re working on a fix.

There may be other corner cases. This PR is the first in a planned series; we’ll continue to iterate and address remaining issues in follow-ups.

This is essentially a port of #1503 and #1677, but includes two major changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet.

### Does this PR introduce _any_ user-facing change?

```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?

Tests included.

- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@9607d5e

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
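As a rough illustration of letting a dispatcher, rather than the attention backend, decide the graph mode, the sketch below picks a runtime mode per batch from the configured `cudagraph_mode`. It is hypothetical: `RuntimeMode` and `pick_runtime_mode` are names made up for this example and are not the actual `graph_dispatcher` interface. It also assumes that `FULL_DECODE_ONLY` means full-graph replay for pure decode batches (one new token per request) and no graph for anything else.

```python
# Hypothetical dispatcher sketch; names and logic are assumptions, not vLLM's API.
from enum import Enum
from typing import List


class RuntimeMode(Enum):
    NONE = "none"  # run eagerly, no graph replay
    FULL = "full"  # replay one graph covering the whole model


def pick_runtime_mode(configured_mode: str,
                      new_tokens_per_request: List[int]) -> RuntimeMode:
    """Choose the per-batch runtime mode from the configured cudagraph_mode."""
    # Assumption: a "decode-only" batch schedules exactly one token per request.
    decode_only = bool(new_tokens_per_request) and all(
        n == 1 for n in new_tokens_per_request)
    if configured_mode == "FULL_DECODE_ONLY" and decode_only:
        return RuntimeMode.FULL
    return RuntimeMode.NONE


compilation_config = {"cudagraph_mode": "FULL_DECODE_ONLY"}

# Three requests, each decoding one token: eligible for full-graph replay.
decode_mode = pick_runtime_mode(compilation_config["cudagraph_mode"], [1, 1, 1])
# One request is still prefilling seven tokens: falls back to no graph.
mixed_mode = pick_runtime_mode(compilation_config["cudagraph_mode"], [1, 7, 1])
```

Keeping this decision in one place means the backend only needs to know which mode it was handed, not how to choose one.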


yiz-liu changed the title ~~[WIP][Feat] Restore paged attention kernel for performence~~ [WIP][Feat] Restore paged attention kernel in Full Graph for performence Jul 8, 2025

yiz-liu force-pushed the pa branch from 6595435 to 0c77677 Compare July 10, 2025 01:51

github-actions bot added the module:tests label Jul 10, 2025

yiz-liu mentioned this pull request Jul 7, 2025

[RFC]: Support Full Graph with multiple attention kernels #1649

Closed

wangxiyuan changed the title ~~[WIP][Feat] Restore paged attention kernel in Full Graph for performence~~ [0.9.1][WIP][Feat] Restore paged attention kernel in Full Graph for performence Jul 10, 2025

yiz-liu force-pushed the pa branch 3 times, most recently from 054b2db to c40c808 Compare July 11, 2025 04:12

[Feat] Restore paged attention kernel for performence

dc067c7

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

yiz-liu force-pushed the pa branch from c40c808 to dc067c7 Compare July 11, 2025 06:21

yiz-liu changed the title ~~[0.9.1][WIP][Feat] Restore paged attention kernel in Full Graph for performence~~ [0.9.1][2/N][Feat] Restore paged attention kernel in Full Graph for performence Jul 11, 2025

ganyi1996ppo approved these changes Jul 11, 2025

View reviewed changes

ganyi1996ppo merged commit df18f1d into vllm-project:v0.9.1-dev Jul 11, 2025
16 checks passed

Yikun added the no-main label Jul 12, 2025

yiz-liu deleted the pa branch July 14, 2025 03:19

Yikun added the no-test label Jul 16, 2025

yiz-liu mentioned this pull request Jul 31, 2025

[Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models #2128

Merged


3 participants
