[Core] Nanoflow-style Computation-Communication Overlap #23592
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a Nanoflow-style computation-communication overlap optimization, which is a significant performance enhancement. The implementation is non-intrusive, leveraging Torch FX graph transformations to partition batches and overlap operations. The changes are well-structured, with clear separation of concerns in the new nanoflow module. My main feedback is regarding a limitation in the batch splitting logic that currently only supports up to two nano-batches, which contradicts the max_num_nano_batches configuration. Addressing this would make the feature more flexible and powerful for performance tuning.
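As a rough illustration of how the splitting could honor the configured limit rather than a fixed two-way split, here is a hypothetical helper (not the PR's actual code); `max_num_nano_batches` follows the configuration name mentioned above, `min_nano_split_tokens` mirrors the threshold described in the PR description further down, and the greedy per-request assignment is purely illustrative.

```python
# Hypothetical sketch, not the PR's actual splitting logic: assign contiguous
# requests to up to `max_num_nano_batches` nano-batches of roughly equal size.
from typing import List


def split_into_nano_batches(
    num_tokens_per_request: List[int],
    max_num_nano_batches: int,
    min_nano_split_tokens: int = 1024,
) -> List[List[int]]:
    total = sum(num_tokens_per_request)
    # Graceful degradation: keep a single batch when the input is too small.
    if total < min_nano_split_tokens or max_num_nano_batches <= 1:
        return [num_tokens_per_request]

    target = total / max_num_nano_batches
    batches: List[List[int]] = [[]]
    running = 0
    for num_tokens in num_tokens_per_request:
        # Open a new nano-batch once the current one reaches the target size.
        if running >= target and len(batches) < max_num_nano_batches:
            batches.append([])
            running = 0
        batches[-1].append(num_tokens)
        running += num_tokens
    return batches


if __name__ == "__main__":
    # Example: three nano-batches instead of a fixed two-way split.
    print(split_into_nano_batches([512, 256, 768, 128, 640], max_num_nano_batches=3))
```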
vllm/utils/nano_split.py (outdated)

    return cu_num_tokens, arange
    ...
    def prepare_nano_split_and_set_hooks(
We should unify the logic here with #21153, so there can be shared attention splitting between this and the upcoming MoE dual-batch overlap implementation.
Thanks for the suggestion, and this makes sense to me! The main challenge is that the current `split_attn_metadata` interface in that PR takes the original `common_attn_metadata` as input. This forces the splitting logic into `_prepare_inputs`, which couples it tightly with the existing preparation logic and makes the integration more intrusive. There are a few options to unify the logic while keeping things flexible:
- Add new interfaces that work directly from the scheduler output
- Have `_prepare_inputs` return the original `common_attn_metadata`
- Put the original `common_attn_metadata` into the builder-generated metadata, so it can be accessed later through the forward context (see the sketch below)
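As a rough sketch of the last option, using hypothetical stand-in types (`CommonAttnMetadata`, `BuiltAttnMetadata`, and `split_common_metadata` are illustrative, not vLLM's actual classes): the builder-generated metadata keeps a handle to the unsplit metadata, so code that reaches it later through the forward context can split it per nano-batch.

```python
# Hypothetical illustration of the third option above; these classes are
# stand-ins, not vLLM's real attention-metadata types.
from dataclasses import dataclass
from typing import List


@dataclass
class CommonAttnMetadata:
    # Cumulative token counts per request, e.g. [0, 512, 768, 1536].
    cu_num_tokens: List[int]


@dataclass
class BuiltAttnMetadata:
    # Backend-specific fields would live here, plus a handle back to the
    # unsplit metadata so nano-batch splitting can happen later (e.g. from
    # the forward context) rather than inside the input-preparation path.
    common: CommonAttnMetadata


def split_common_metadata(
    common: CommonAttnMetadata, boundary_token: int
) -> List[CommonAttnMetadata]:
    """Split the unsplit metadata into two nano-batches at a token boundary."""
    cu = common.cu_num_tokens
    split_idx = next(i for i, t in enumerate(cu) if t >= boundary_token)
    first = CommonAttnMetadata(cu[: split_idx + 1])
    second = CommonAttnMetadata([t - cu[split_idx] for t in cu[split_idx:]])
    return [first, second]


if __name__ == "__main__":
    built = BuiltAttnMetadata(common=CommonAttnMetadata([0, 512, 768, 1536]))
    print(split_common_metadata(built.common, boundary_token=768))
```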
I think moving the splitting logic into `_prepare_inputs` is fine; based on my understanding, for options 2 and 3 we still call `_prepare_inputs`, which means duplicating `builder.build` calls. Since this is on the hot path (it directly impacts TPOT in low-QPS regimes), we should minimize duplicated work as much as possible. I could potentially see option 1 being viable, but it would likely lead to duplicated code.
I think micro-batching will become fairly commonly used, both through nanoflow and the wide-EP micro-batching @SageMoore and I are working on, so I think it's fine for it to be a first-class citizen in the gpu_model_runner. We should have a draft PR up very soon so you can see our planned gpu_model_runner changes 👍 (cc @SageMoore)
Got it. Looking forward to seeing the planned changes!
This pull request has merge conflicts that must be resolved before it can be merged.
@Conless is this ready for review again?
Hi @ProExpertProg, the current version works well but has some conflicts with the latest DBO PRs #23693 and #24845. I'll fix them soon and ping you later!
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @ProExpertProg, it's ready for review again! The performance of the current version is even better with the attention-metadata splitting logic from the DBO PRs.
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for this contribution, and sorry our reviews have been so delayed. I took a look, and one high-level concern I have is that this fully hijacks the current Dynamo cudagraph partitioning logic by setting …. In the long run, we want to do the transformation directly on the torch.compile fx graph (once it supports different streams). In the short run, can you refactor the splitting logic so that it works with splitting on attention for piecewise cudagraphs? If you want to just wait for multistream support, that's fine too.
Sure! Let me refine the implementation to make it compatible with piecewise cudagraphs. Also, this computation-communication overlap could work by only adding the network operation into the …
Overview
Following discussions with Woosuk Kwon (woosuk.kwon@berkeley.edu), this PR integrates the computation-communication overlap technique from Nanoflow [1] into vLLM using a non-intrusive, compilation-based approach. The integration operates at the Torch FX graph level by partitioning input batches into nano-batches and duplicating selected operations to overlap compute and communication operations. Key features include:
[1] NanoFlow: Towards Optimal Large Language Model Serving Throughput, OSDI 2025
Design and Implementation
To enable seamless and transparent intra-device parallelism, this PR introduces a graph-level transformation applied during model compilation. The transformation operates on the traced `torch.fx.Graph`, rather than the original model source code, making it entirely transparent to model implementations. This allows for broad applicability across different models and deployment backends. The figure below illustrates the overall pipeline of our approach:

- Graph Transformation: During the compilation phase, the traced `torch.fx.Graph` is passed to a transformation function that partitions the graph into submodules based on resource usage patterns (e.g., computation vs. communication). These submodules are then duplicated to process different parts of the input batch, enabling pipelined execution with overlapping compute and communication (a simplified sketch follows this list). Notably, the graph is agnostic to the input batch size. The resulting transformed graphs are cached in a split manager to avoid runtime recompilation and thereby minimize CPU overhead.
- Attention Metadata Preparation: At run time, the model runner provides input batch information to a context preparation function. This function determines the nano-batch sizes, prepares the necessary attention metadata for each nano-batch, and stores all globally shared data (such as `vllm.ForwardContext` instances for each nano-batch) in the split manager.
- Run-time Hook: During execution, the model's `forward` method is redirected to a custom runtime callable. This callable retrieves the appropriate cached graph module and executes it using custom forward hooks. These hooks dynamically override the global forward context for each nano-batch to ensure correct and efficient execution.
- Graceful Degradation: Splitting into nano-batches incurs overhead for small input batches. To maintain robustness across varying workloads and avoid GPU underutilization, the system automatically skips nano-batch splitting when the total token batch size is below a threshold (`min_nano_split_tokens`, default: 1024). Additionally, because this is a graph-level optimization, the entire feature can be toggled via a simple configuration flag, ensuring no performance regressions when disabled.
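To make the Graph Transformation step more concrete, here is a heavily simplified, hypothetical sketch (not the PR's actual pass): it traces a toy tensor-parallel-style shard, classifies nodes as computation vs. communication, and uses `torch.fx.passes.split_module` to cut the graph at that boundary. `fake_all_reduce` is a stand-in for a real collective; the actual transformation additionally duplicates the resulting submodules per nano-batch, overlaps them across CUDA streams, and caches the transformed graphs in the split manager as described above.

```python
# Simplified sketch of partitioning an fx graph into computation vs.
# communication submodules; illustrative only, not the PR's implementation.
import torch
import torch.fx as fx
from torch.fx.passes.split_module import split_module


def fake_all_reduce(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for a tensor-parallel all-reduce (identity on one device)."""
    return x


# Keep fake_all_reduce as a single call_function node instead of tracing into it.
fx.wrap("fake_all_reduce")

COMM_FNS = {fake_all_reduce}


class ToyShard(torch.nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.up = torch.nn.Linear(dim, 4 * dim)
        self.down = torch.nn.Linear(4 * dim, dim)

    def forward(self, x):
        y = self.down(torch.relu(self.up(x)))  # computation
        return fake_all_reduce(y)              # communication


def partition_compute_vs_comm(gm: fx.GraphModule, root: torch.nn.Module) -> fx.GraphModule:
    # Partition 0 holds computation nodes, partition 1 holds communication nodes.
    def split_callback(node: fx.Node) -> int:
        is_comm = node.op == "call_function" and node.target in COMM_FNS
        return 1 if is_comm else 0

    return split_module(gm, root, split_callback)


if __name__ == "__main__":
    model = ToyShard()
    parts = partition_compute_vs_comm(fx.symbolic_trace(model), model)
    print(parts.submod_0.graph)  # computation submodule
    print(parts.submod_1.graph)  # communication submodule
    # With two nano-batches, the communication submodule of one nano-batch can
    # run on a side stream while the computation submodule of the other runs
    # on the default stream, which is the overlap this PR targets.
    x = torch.randn(8, 64)
    torch.testing.assert_close(parts(x), model(x))
```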
Evaluations

We tested the current implementation with the LLaMA 3-8B model on 2xH200 GPUs (TP=2, `use_cudagraph=False`) using `benchmark_throughput.py`. It reduces single-iteration latency by 13% and increases end-to-end throughput by up to 8%.

Input 512, output 0
Input 512, output 512
Input 1024, output 512
Discussion & Future Work
In the future, we plan to add these features on top of the current design:
Co-authored-by: Kan Zhu <kanzhu@cs.washington.edu>
Co-authored-by: Yilong Zhao <yilongzhao@berkeley.edu>
Co-authored-by: Ziren Wang <zirenw2@cs.washington.edu>
Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Baris Kasikci <baris@cs.washington.edu>