Conversation

@yx0716 (Contributor) commented Jul 28, 2025

Before refactoring cross-DP decoding metadata aggregation, this change cleans up the token-padding logic.

What this PR does:

  1. First checks whether any DP instance is in the prefill phase.

  2. If in the decode phase and torchair_graph_enabled is true, pads each DP instance’s token count up to the global maximum.

  3. If in the prefill phase, or in decode phase with graph mode disabled, returns each DP instance’s original token count without padding.

This reordering removes the previous two‐step padding/unpadding flow and ensures padding only occurs when strictly necessary.
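The three steps above can be sketched in Python. This is an illustrative sketch only; the names `num_tokens_across_dp`, `with_prefill`, and `torchair_graph_enabled` stand in for the corresponding concepts in the PR and are not necessarily the exact identifiers used in vllm-ascend.

```python
def pad_tokens_across_dp(num_tokens_across_dp: list[int],
                         with_prefill: bool,
                         torchair_graph_enabled: bool) -> list[int]:
    """Return per-DP-instance token counts, padded only when needed."""
    # Padding to a uniform shape is only required when every DP instance
    # is decoding under TorchAir graph mode (step 2).
    if not with_prefill and torchair_graph_enabled:
        max_tokens = max(num_tokens_across_dp)
        return [max_tokens] * len(num_tokens_across_dp)
    # Prefill anywhere, or graph mode disabled: keep the original
    # counts, so no pad/unpad round-trip is needed (step 3).
    return num_tokens_across_dp
```

Because the prefill check happens first, the pad branch is only ever taken when all instances are decoding, which is what makes the earlier two-step padding/unpadding flow unnecessary.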

@yx0716 yx0716 force-pushed the main branch 2 times, most recently from c4f2fcc to f3176d0 Compare July 28, 2025 12:00
@yx0716 yx0716 changed the title [Misc] Refractor forward metadata retrieval across DP nodes to reduce redundant padding. [main][refractor] Refractor forward metadata retrieval across DP nodes to reduce redundant padding. Jul 28, 2025
@yx0716 yx0716 marked this pull request as ready for review July 29, 2025 03:57
@yx0716 yx0716 force-pushed the main branch 5 times, most recently from c7a0720 to d58e96a Compare July 30, 2025 13:10

codecov bot commented Jul 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.34%. Comparing base (807f089) to head (b269550).
⚠️ Report is 613 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2062   +/-   ##
=======================================
  Coverage   76.34%   76.34%           
=======================================
  Files         110      110           
  Lines       12473    12473           
=======================================
  Hits         9522     9522           
  Misses       2951     2951           
Flag Coverage Δ
unittests 76.34% <ø> (ø)


@yx0716 yx0716 force-pushed the main branch 2 times, most recently from ca23f38 to 81cd9c5 Compare July 31, 2025 01:33
@ApsarasX (Collaborator)

I think a third scenario should be considered: in the decode phase with `torchair_graph_enabled` true, but `with_prefill` also true.


github-actions bot commented Aug 1, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@yx0716 yx0716 force-pushed the main branch 2 times, most recently from 5797eae to 7b133f6 Compare August 1, 2025 06:30
@yx0716 yx0716 force-pushed the main branch 2 times, most recently from 31b13af to 9ef1fb9 Compare August 5, 2025 02:31
…s to reduce redundant padding.

Signed-off-by: yx0716 <jinyx1007@foxmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
@yx0716 (Contributor, Author) commented Aug 5, 2025

I think a third scenario should be considered: in the decode phase with `torchair_graph_enabled` true, but `with_prefill` also true.

  • Within a single DP instance, TorchAir's graph mode and chunked-prefill mode are never enabled at the same time, so prefill and decode cannot coexist there.
  • Across multiple DP instances, if even one instance enters the prefill stage, all instances automatically fall back to non-graph mode.

Given these two scenarios, it seems unlikely that a third case would arise at this point.
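The cross-DP fallback rule described above can be sketched as follows. This is a hypothetical illustration: `resolve_graph_mode` and its parameters are invented names, and `any(...)` stands in for the collective all-reduce (logical OR) of each rank's prefill flag across the DP group.

```python
def resolve_graph_mode(with_prefill_per_rank: list[bool],
                       torchair_graph_enabled: bool) -> bool:
    """Decide whether graph mode stays active across the DP group."""
    # Stand-in for an all-reduce (logical OR) of the per-rank flags.
    any_prefill = any(with_prefill_per_rank)
    # Graph mode survives only when no DP instance is prefilling,
    # which rules out the "graph mode + with_prefill" combination.
    return torchair_graph_enabled and not any_prefill
```

Under this rule, the "decode with graph mode enabled but `with_prefill` true" combination cannot be observed on any rank, since a single prefilling instance already forces every rank into non-graph mode.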

@wangxiyuan wangxiyuan mentioned this pull request Aug 5, 2025
@wangxiyuan (Collaborator)

@ApsarasX I'm working on the torchair model runner refactor, let's do more work there #2204

@ApsarasX (Collaborator) commented Aug 5, 2025

@ApsarasX I'm working on the torchair model runner refactor, let's do more work there #2204

OK

@wangxiyuan wangxiyuan merged commit 583ad8f into vllm-project:main Aug 5, 2025
25 checks passed
zzhx1 pushed a commit to lidenghui1110/vllm-ascend that referenced this pull request Aug 11, 2025
…s to reduce redundant padding. (vllm-project#2062)

Before refactoring cross-DP decoding metadata aggregation, this change
cleans up the token-padding logic.
### What this PR does:

1. First checks whether any DP instance is in the prefill phase.

2. If in the `decode` phase and `torchair_graph_enabled` is true, pads
each DP instance's token count up to the global maximum.

3. If in the `prefill` phase, or in decode phase with graph mode
**disabled**, returns each DP instance’s original token count without
padding.

This reordering removes the previous two‐step padding/unpadding flow and
ensures padding only occurs when strictly necessary.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@bd3db7f

Signed-off-by: yx0716 <jinyx1007@foxmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
zzhx1 pushed a commit to lidenghui1110/vllm-ascend that referenced this pull request Aug 11, 2025
…s to reduce redundant padding. (vllm-project#2062)

(Commit message identical to the commit referenced above.)
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
…s to reduce redundant padding. (vllm-project#2062)

(Commit message identical to the commit referenced above.)
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…s to reduce redundant padding. (vllm-project#2062)

(Commit message identical to the commit referenced above.)
3 participants