Description
Motivation
Compared to piecewise graph capture, the “full graph” approach offers three primary advantages:
- Reduced dispatch latency: By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays.
- Stabilized multi-device performance: Capturing the whole model as one static graph also mitigates dispatch-time fluctuations across devices.
- Stream/resource savings: Consolidating graph captures frees up streams, allowing more graphs to be captured.
Proposed Change
Our implementation will differ slightly from vLLM’s native mechanism for two main reasons:
- Divergent Attention Operator Paths
  In vLLM‑Ascend, the attention operator path is not uniform and is selected at inference time based on the AttentionState. To accommodate this, we will hoist the control-flow logic outside the inference graph: we will pre‑compile a distinct graph for each possible attention state, then dispatch the appropriate graph at each step.
- Dynamic Attention Parameter Updates
  Parameters such as seq_lens and block_table must be updated at every decoding step to ensure correct tiling and maintain numerical accuracy. We will leverage graph_task_update to asynchronously re‑issue updated parameters for the attention operator on a separate stream, thereby minimizing compute bubbles caused by synchronous dispatch.
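The two points above can be illustrated with a minimal, device-free sketch. It is not the vLLM‑Ascend implementation: the class names (CapturedGraph, GraphDispatcher), the two enum members, and the update_params method are hypothetical stand-ins, and the asynchronous update on a separate stream is reduced to an ordinary method call. It only shows the shape of the idea: one pre-captured graph per attention state, selection done on the host outside the graph, and per-step parameters refreshed before replay.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List, Tuple


class AttentionState(Enum):
    # Hypothetical subset of states; the real enum in vLLM-Ascend may differ.
    PREFILL = auto()
    DECODE_ONLY = auto()


@dataclass
class AttnParams:
    # Per-step inputs that change between replays of the same graph.
    seq_lens: List[int] = field(default_factory=list)
    block_table: List[List[int]] = field(default_factory=list)


class CapturedGraph:
    """Stand-in for a captured device graph; no real capture happens here."""

    def __init__(self, state: AttentionState, batch_size: int) -> None:
        self.state = state
        self.batch_size = batch_size
        self.params = AttnParams()

    def update_params(self, seq_lens: List[int],
                      block_table: List[List[int]]) -> None:
        # Placeholder for the graph_task_update step: refresh the attention
        # parameters in place instead of re-capturing the graph.
        self.params.seq_lens = list(seq_lens)
        self.params.block_table = [list(row) for row in block_table]

    def replay(self) -> str:
        # A real replay would launch the whole model; here we return a tag.
        return (f"{self.state.name}:bs={self.batch_size}:"
                f"seq_lens={self.params.seq_lens}")


class GraphDispatcher:
    """Holds one graph per (attention state, batch size) pair."""

    def __init__(self) -> None:
        self._graphs: Dict[Tuple[AttentionState, int], CapturedGraph] = {}

    def capture(self, state: AttentionState, batch_size: int) -> None:
        self._graphs[(state, batch_size)] = CapturedGraph(state, batch_size)

    def run(self, state: AttentionState, seq_lens: List[int],
            block_table: List[List[int]]) -> str:
        # Control flow lives here on the host, outside any captured graph.
        graph = self._graphs[(state, len(seq_lens))]
        graph.update_params(seq_lens, block_table)
        return graph.replay()


# Usage: capture one graph per state ahead of time, then dispatch per step.
dispatcher = GraphDispatcher()
for attn_state in AttentionState:
    dispatcher.capture(attn_state, batch_size=2)

result = dispatcher.run(AttentionState.DECODE_ONLY,
                        seq_lens=[5, 9],
                        block_table=[[0, 1], [2, 3]])
```

Keying the table on both the state and the batch size is an assumption of this sketch; it reflects that a static graph is usually tied to fixed input shapes, so a lookup miss would need padding or a fallback path that is not shown here.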
Implementation Plan
- Decode‑Stage Full Graph
  Build the foundational framework and implement full‑graph capture for the DecodeOnly stage, where host dispatch latency is most critical. (See PRs [1/N][Feat] Implement primal full graph with limited scenario #1503 and [0.9.1][2/N][Feat] Restore paged attention kernel in Full Graph for performence #1677.)
- Prefill‑Stage Full Graph
  Extend the graph‑dispatching framework to support the Prefill stage, and refactor the implementation into its long‑term location within the codebase.
- Engineering Tasks
  Address additional concerns such as memory profiling, stream management, batch size adjustment, and integration testing.
Feedback Period.
2 weeks.
CC List.
Any Other Things.
No response