[main][refactor] Refactor forward metadata retrieval across DP nodes to reduce redundant padding. #2062
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main    #2062   +/-   ##
=======================================
  Coverage   76.34%   76.34%
=======================================
  Files         110      110
  Lines       12473    12473
=======================================
  Hits         9522     9522
  Misses       2951     2951
=======================================
```
I think the third scenario should be considered, in the
This pull request has conflicts; please resolve them before we can evaluate the pull request.
…s to reduce redundant padding.

Signed-off-by: yx0716 <jinyx1007@foxmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Given these two scenarios, it seems unlikely that a third case would arise at this point.
…s to reduce redundant padding. (vllm-project#2062)

Before refactoring cross-DP decode metadata aggregation, clean up the token-padding logic.

### What this PR does:

1. First checks whether any DP instance is in the prefill phase.
2. If in the `decode` phase and `torchair_graph_enabled` is true, pads each DP instance's token count up to the global maximum.
3. If in the `prefill` phase, or in the decode phase with graph mode **disabled**, returns each DP instance's original token count without padding.

This reordering removes the previous two-step padding/unpadding flow and ensures padding only occurs when strictly necessary.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@bd3db7f

Signed-off-by: yx0716 <jinyx1007@foxmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Before refactoring cross-DP decode metadata aggregation, clean up the token-padding logic.

What this PR does:

1. First checks whether any DP instance is in the prefill phase.
2. If in the `decode` phase and `torchair_graph_enabled` is true, pads each DP instance's token count up to the global maximum.
3. If in the `prefill` phase, or in the decode phase with graph mode disabled, returns each DP instance's original token count without padding.

This reordering removes the previous two-step padding/unpadding flow and ensures padding only occurs when strictly necessary.
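The three-way decision above can be sketched as a small helper. This is a minimal illustration only, not the PR's actual implementation: the function name `padded_num_tokens` and the parameter names `with_prefill` and `torchair_graph_enabled` are assumptions chosen to mirror the description, not vLLM-Ascend's real API.

```python
def padded_num_tokens(num_tokens_across_dp: list[int],
                      with_prefill: bool,
                      torchair_graph_enabled: bool) -> list[int]:
    """Return the token count each DP rank should run with.

    Hypothetical sketch of the logic described in the PR: padding to the
    global maximum happens only for pure-decode batches when torchair
    graph mode is enabled; in every other case each rank keeps its
    original token count, so no unpadding step is needed later.
    """
    if not with_prefill and torchair_graph_enabled:
        # Pure-decode batch with graph mode on: pad every rank up to the
        # global maximum so the captured graph sees a uniform shape.
        max_tokens = max(num_tokens_across_dp)
        return [max_tokens] * len(num_tokens_across_dp)
    # Any prefill present, or graph mode disabled: no padding required.
    return list(num_tokens_across_dp)
```

For example, with per-rank counts `[3, 5, 2]`, a pure-decode batch under graph mode pads all ranks to 5, while the same counts with a prefill present (or with graph mode off) pass through unchanged.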