[Core] Nanoflow-style Computation-Communication Overlap #23592
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a Nanoflow-style computation-communication overlap optimization, which is a significant performance enhancement. The implementation is non-intrusive, leveraging Torch FX graph transformations to partition batches and overlap operations. The changes are well-structured, with clear separation of concerns in the new nanoflow module. My main feedback is regarding a limitation in the batch splitting logic that currently only supports up to two nano-batches, which contradicts the max_num_nano_batches configuration. Addressing this would make the feature more flexible and powerful for performance tuning.
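As a rough illustration of how the splitting could honor the configured limit rather than a fixed two-way split, here is a hypothetical helper (not the PR's actual code); `max_num_nano_batches` follows the configuration name mentioned above, `min_nano_split_tokens` mirrors the threshold described in the PR description further down, and the greedy per-request assignment is purely illustrative.

```python
# Hypothetical sketch, not the PR's actual splitting logic: assign contiguous
# requests to up to `max_num_nano_batches` nano-batches of roughly equal size.
from typing import List


def split_into_nano_batches(
    num_tokens_per_request: List[int],
    max_num_nano_batches: int,
    min_nano_split_tokens: int = 1024,
) -> List[List[int]]:
    total = sum(num_tokens_per_request)
    # Graceful degradation: keep a single batch when the input is too small.
    if total < min_nano_split_tokens or max_num_nano_batches <= 1:
        return [num_tokens_per_request]

    target = total / max_num_nano_batches
    batches: List[List[int]] = [[]]
    running = 0
    for num_tokens in num_tokens_per_request:
        # Open a new nano-batch once the current one reaches the target size.
        if running >= target and len(batches) < max_num_nano_batches:
            batches.append([])
            running = 0
        batches[-1].append(num_tokens)
        running += num_tokens
    return batches


if __name__ == "__main__":
    # Example: three nano-batches instead of a fixed two-way split.
    print(split_into_nano_batches([512, 256, 768, 128, 640], max_num_nano_batches=3))
```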
vllm/utils/nano_split.py (outdated)

    return cu_num_tokens, arange
    ...
    def prepare_nano_split_and_set_hooks(
We should unify the logic here with #21153, so there can be shared attention splitting between this and the upcoming MoE dual-batch overlap implementation.
Thanks for the suggestion, and this makes sense to me! The main challenge is that the current `split_attn_metadata` interface in that PR takes the original `common_attn_metadata` as input. This forces the splitting logic into `_prepare_inputs`, which couples it tightly with the existing preparation logic and makes the integration more intrusive. There are a few options to unify the logic while keeping things flexible:
- Add new interfaces that work directly from the scheduler output
- Have `_prepare_inputs` return the original `common_attn_metadata`
- Put the original `common_attn_metadata` into the builder-generated metadata, so it can be accessed later through the forward context (see the sketch below)
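As a rough sketch of the last option, using hypothetical stand-in types (`CommonAttnMetadata`, `BuiltAttnMetadata`, and `split_common_metadata` are illustrative, not vLLM's actual classes): the builder-generated metadata keeps a handle to the unsplit metadata, so code that reaches it later through the forward context can split it per nano-batch.

```python
# Hypothetical illustration of the third option above; these classes are
# stand-ins, not vLLM's real attention-metadata types.
from dataclasses import dataclass
from typing import List


@dataclass
class CommonAttnMetadata:
    # Cumulative token counts per request, e.g. [0, 512, 768, 1536].
    cu_num_tokens: List[int]


@dataclass
class BuiltAttnMetadata:
    # Backend-specific fields would live here, plus a handle back to the
    # unsplit metadata so nano-batch splitting can happen later (e.g. from
    # the forward context) rather than inside the input-preparation path.
    common: CommonAttnMetadata


def split_common_metadata(
    common: CommonAttnMetadata, boundary_token: int
) -> List[CommonAttnMetadata]:
    """Split the unsplit metadata into two nano-batches at a token boundary."""
    cu = common.cu_num_tokens
    split_idx = next(i for i, t in enumerate(cu) if t >= boundary_token)
    first = CommonAttnMetadata(cu[: split_idx + 1])
    second = CommonAttnMetadata([t - cu[split_idx] for t in cu[split_idx:]])
    return [first, second]


if __name__ == "__main__":
    built = BuiltAttnMetadata(common=CommonAttnMetadata([0, 512, 768, 1536]))
    print(split_common_metadata(built.common, boundary_token=768))
```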
I think moving the splitting logic into `_prepare_inputs` is fine; based on my understanding, for options 2 and 3 we still call `_prepare_inputs`, which means duplicating `builder.build` calls. Since this is on the hot path (it directly impacts TPOT in low-QPS regimes), we should minimize duplicated work as much as possible. I could potentially see option 1 being viable, but it would likely lead to duplicated code.
I think micro-batching will become fairly commonly used, both through nanoflow and the wide-EP micro-batching @SageMoore and I are working on, so I think it's fine for it to be a first-class citizen in the gpu_model_runner. We should have a draft PR up very soon so you can see our planned gpu_model_runner changes 👍 (cc @SageMoore)
Got it. Looking forward to seeing the planned changes!
This pull request has merge conflicts that must be resolved before it can be merged.
@Conless is this ready for review again?
Hi @ProExpertProg, the current version works well but has some conflicts with the latest DBO PRs #23693 and #24845. I'll fix them soon and ping you later!
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @ProExpertProg, it's ready for review again! The performance of the current version is even better with the attention-metadata splitting logic from the DBO PRs.
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for this contribution, and sorry our reviews have been so delayed. I took a look, and one high-level concern I have is that this fully hijacks the current Dynamo cudagraph partitioning logic by setting …. In the long run, we want to do the transformation directly on the torch.compile fx graph (once it supports different streams). In the short run, can you refactor the splitting logic so that it works with splitting on attention for piecewise cudagraphs? If you want to just wait for multistream support, that's fine too.
Sure! Let me refine the implementation to make it compatible with piecewise cudagraphs. Also, this computation-communication overlap could work by only adding the network operation into the …
Overview
Following discussions with Woosuk Kwon (woosuk.kwon@berkeley.edu), this PR integrates the computation-communication overlap technique from Nanoflow [1] into vLLM using a non-intrusive, compilation-based approach. The integration operates at the Torch FX graph level by partitioning input batches into nano-batches and duplicating selected operations to overlap compute and communication operations. Key features include:
[1] NanoFlow: Towards Optimal Large Language Model Serving Throughput, OSDI 2025
Design and Implementation
To enable seamless and transparent intra-device parallelism, this PR introduces a graph-level transformation applied during model compilation. The transformation operates on the traced `torch.fx.Graph`, rather than the original model source code, making it entirely transparent to model implementations. This allows for broad applicability across different models and deployment backends. The figure below illustrates the overall pipeline of our approach:

- Graph Transformation: During the compilation phase, the traced `torch.fx.Graph` is passed to a transformation function that partitions the graph into submodules based on resource usage patterns (e.g., computation vs. communication). These submodules are then duplicated to process different parts of the input batch, enabling pipelined execution with overlapping compute and communication (a simplified sketch follows this list). Notably, the graph is agnostic to the input batch size. The resulting transformed graphs are cached in a split manager to avoid runtime recompilation and thereby minimize CPU overhead.
- Attention Metadata Preparation: At run time, the model runner provides input batch information to a context preparation function. This function determines the nano-batch sizes, prepares the necessary attention metadata for each nano-batch, and stores all globally shared data (such as `vllm.ForwardContext` instances for each nano-batch) in the split manager.
- Run-time Hook: During execution, the model's `forward` method is redirected to a custom runtime callable. This callable retrieves the appropriate cached graph module and executes it using custom forward hooks. These hooks dynamically override the global forward context for each nano-batch to ensure correct and efficient execution.
- Graceful Degradation: Splitting into nano-batches incurs overhead for small input batches. To maintain robustness across varying workloads and avoid GPU underutilization, the system automatically skips nano-batch splitting when the total token batch size is below a threshold (`min_nano_split_tokens`, default: 1024). Additionally, because this is a graph-level optimization, the entire feature can be toggled via a simple configuration flag, ensuring no performance regressions when disabled.
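To make the Graph Transformation step more concrete, here is a heavily simplified, hypothetical sketch (not the PR's actual pass): it traces a toy tensor-parallel-style shard, classifies nodes as computation vs. communication, and uses `torch.fx.passes.split_module` to cut the graph at that boundary. `fake_all_reduce` is a stand-in for a real collective; the actual transformation additionally duplicates the resulting submodules per nano-batch, overlaps them across CUDA streams, and caches the transformed graphs in the split manager as described above.

```python
# Simplified sketch of partitioning an fx graph into computation vs.
# communication submodules; illustrative only, not the PR's implementation.
import torch
import torch.fx as fx
from torch.fx.passes.split_module import split_module


def fake_all_reduce(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for a tensor-parallel all-reduce (identity on one device)."""
    return x


# Keep fake_all_reduce as a single call_function node instead of tracing into it.
fx.wrap("fake_all_reduce")

COMM_FNS = {fake_all_reduce}


class ToyShard(torch.nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.up = torch.nn.Linear(dim, 4 * dim)
        self.down = torch.nn.Linear(4 * dim, dim)

    def forward(self, x):
        y = self.down(torch.relu(self.up(x)))  # computation
        return fake_all_reduce(y)              # communication


def partition_compute_vs_comm(gm: fx.GraphModule, root: torch.nn.Module) -> fx.GraphModule:
    # Partition 0 holds computation nodes, partition 1 holds communication nodes.
    def split_callback(node: fx.Node) -> int:
        is_comm = node.op == "call_function" and node.target in COMM_FNS
        return 1 if is_comm else 0

    return split_module(gm, root, split_callback)


if __name__ == "__main__":
    model = ToyShard()
    parts = partition_compute_vs_comm(fx.symbolic_trace(model), model)
    print(parts.submod_0.graph)  # computation submodule
    print(parts.submod_1.graph)  # communication submodule
    # With two nano-batches, the communication submodule of one nano-batch can
    # run on a side stream while the computation submodule of the other runs
    # on the default stream, which is the overlap this PR targets.
    x = torch.randn(8, 64)
    torch.testing.assert_close(parts(x), model(x))
```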
Evaluations

We tested the current implementation with the LLaMA 3-8B model on 2xH200 GPUs (TP=2, `use_cudagraph=False`) using `benchmark_throughput.py`. It reduces single-iteration latency by 13% and increases end-to-end throughput by up to 8%.

Input 512, output 0
Input 512, output 512
Input 1024, output 512
Discussion & Future Work
In the future, we plan to add these features on top of the current design:
Co-authored-by: Kan Zhu <kanzhu@cs.washington.edu>
Co-authored-by: Yilong Zhao <yilongzhao@berkeley.edu>
Co-authored-by: Ziren Wang <zirenw2@cs.washington.edu>
Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Baris Kasikci <baris@cs.washington.edu>