[0.9.1] DBO support EP parallel and optimize dual stream overlap #1589
Conversation
Signed-off-by: shikang-hangzhou <459956190@qq.com>
```python
attn_cls = CustomDeepseekDBOMLAAttention
else:
    attn_cls = DeepseekV2Attention
attn_cls = CustomDeepseekV2MLAAttention
```
why remove the branch when use_mla is False here?
Dual stream overlap is an optimized mode. DeepSeek MHA is not part of our application scenario and shows no improvement, so I think the MHA mode is unnecessary here.
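For readers following the thread, a minimal self-contained sketch of the selection being discussed; the placeholder class bodies exist only to make the snippet runnable, and the real classes live in vllm-ascend:

```python
# Placeholder stand-ins for the vllm-ascend classes named in the diff above.
class CustomDeepseekDBOMLAAttention: ...
class DeepseekV2Attention: ...


def pick_attention_cls(use_mla: bool):
    # The PR keeps only the MLA branch: dual stream overlap is only applied
    # with MLA attention, so the MHA fallback below is the branch that was
    # removed from the DBO model.
    if use_mla:
        return CustomDeepseekDBOMLAAttention
    return DeepseekV2Attention
```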
```python
hidden_states[i], router_logits[i], is_prefill, real_top_k,
enable_force_load_balance)

if global_num_experts == 256:
```
Please add a comment that we use 256 here because the op npu_moe_gating_top_k only supports this.
> Please add a comment that we use 256 here because the op npu_moe_gating_top_k only supports this.

Thanks for your review, we have added the comments.
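A sketch of the kind of comment being requested, assuming the check sits in the MoE gating path as the diff excerpt shows (the helper name is illustrative, not the PR's code):

```python
def can_use_fused_gating(global_num_experts: int) -> bool:
    # The fused NPU op npu_moe_gating_top_k only supports exactly 256
    # experts, so the fused gating path is taken only for that expert count;
    # other sizes fall back to the generic top-k implementation.
    return global_num_experts == 256
```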
```python
if self.dp_size > 1:
    if (self.tp_size > 1
            and fused_moe_state != FusedMoEState.AllGather):
        dist.all_gather(list(chunk_hidden_states[i]),
```
I recommend using tensor_model_parallel_all_gather directly.
Here we align with the deepseekv2 code.
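For context, the reviewer's suggestion would roughly swap the raw torch.distributed call for vLLM's tensor-parallel helper. A hedged sketch, assuming tensor_model_parallel_all_gather keeps its usual (input_, dim) signature and that the TP group is already initialized:

```python
import torch
from vllm.distributed import tensor_model_parallel_all_gather


def gather_chunk(hidden_states: torch.Tensor) -> torch.Tensor:
    # Instead of building an output list and calling dist.all_gather on a
    # group handle (as in the diff above), the helper gathers along a chosen
    # dim across the tensor-parallel group and returns a single tensor.
    return tensor_model_parallel_all_gather(hidden_states, dim=0)
```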
Signed-off-by: shikang-hangzhou <459956190@qq.com>
```python
    MSEventKey.MOE_ALL_TO_ALL_FINISH],
)
context.before_comm_event.record()
with torch.npu.stream(ms_metadata.communicate_stream):
```
This kind of stream control method doesn't seem like it can be captured in torchair, so this is just an eager-mode dual-batch implementation, right?
Yes, so dual stream overlap only takes effect in the prefill process.
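To make the eager-only point concrete: the overlap relies on imperative stream switches and event records at the Python level, which graph capture does not trace. A minimal sketch of the same record / side-stream / wait structure, written with torch.cuda APIs (torch.npu mirrors them via torch_npu); the tensor, stream, and function names here are illustrative:

```python
import torch


def overlapped_side_stream_work(x: torch.Tensor) -> torch.Tensor:
    # Side stream used for communication-like work, standing in for
    # ms_metadata.communicate_stream in the diff above.
    comm_stream = torch.cuda.Stream()
    before_comm = torch.cuda.Event()

    y = x * 2                      # compute on the default stream
    before_comm.record()           # like context.before_comm_event.record()

    with torch.cuda.stream(comm_stream):
        # Don't start until y is ready on the default stream.
        comm_stream.wait_event(before_comm)
        z = y + 1                  # stand-in for the all-to-all on the side stream

    # Consumers must wait for the side stream before reading z, which is what
    # the MOE_ALL_TO_ALL_FINISH event wait does in the model code.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return z
```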
```python
for i in range(num_micro_batchs):
    ms_metadata.try_wait_event(layer_index, i,
                               MSEventKey.MOE_ALL_TO_ALL_FINISH)
```
What's the difference between event.wait and try_wait_event?
It's the same here; I will modify it.
> What's the difference between event.wait and try_wait_event?

Sorry, there is a difference between event.wait and try_wait_event: the latter can specify how many micro-batches need to wait.
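To make the distinction concrete, a toy sketch (not the PR's metadata object): event.wait blocks the current stream on one fixed event, while a try_wait_event-style lookup lets the caller pick layer, micro-batch, and event key, so a loop can wait on exactly as many micro-batches as needed:

```python
import torch


class MicroBatchEventTable:
    """Toy stand-in for the PR's multistream metadata (illustrative only)."""

    def __init__(self, num_layers: int, num_micro_batches: int, keys):
        # One event per (layer, micro-batch, key) combination.
        self.events = {
            (layer, mb, key): torch.cuda.Event()
            for layer in range(num_layers)
            for mb in range(num_micro_batches)
            for key in keys
        }

    def try_wait_event(self, layer_index: int, micro_batch_index: int, key) -> None:
        # The caller chooses which micro-batch's event the current stream
        # waits on, e.g. looping `for i in range(num_micro_batchs)` as in the
        # diff above, instead of calling .wait() on a single hard-coded event.
        self.events[(layer_index, micro_batch_index, key)].wait()
```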
```python
ep_group.world_size, -1).sum(-1)
scatter_sizes.append(scatter_size)
gather_sizes = torch.empty_like(scatter_sizes[i])
dist.all_to_all_single(gather_sizes,
```
I wonder if my understanding is correct: you are trying to overlap the all_to_all with gating_topk, right? Since the second stream launch needs to wait for the end of gating.
And you overlap the combine phase of the all_to_all with the computation of the shared expert.
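For readers unfamiliar with the pattern in the diff: before the big token all-to-all, each rank tells every peer how many tokens it will send, using a small all_to_all_single on the per-rank counts. A self-contained sketch, assuming torch.distributed is already initialized (names are illustrative):

```python
import torch
import torch.distributed as dist


def exchange_split_sizes(tokens_per_rank: torch.Tensor) -> torch.Tensor:
    """tokens_per_rank[r] = number of tokens this rank will send to rank r."""
    # Each rank learns how many tokens it will *receive* from every peer,
    # which is needed to size the buffers of the token all-to-all that
    # follows (the dist.all_to_all_single call in the diff above).
    recv_counts = torch.empty_like(tokens_per_rank)
    dist.all_to_all_single(recv_counts, tokens_per_rank)
    return recv_counts
```

Per the discussion above, it is this dispatch all-to-all that the second stream overlaps with gating, and the later combine all-to-all that overlaps with the shared-expert computation.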
Signed-off-by: shikang-hangzhou <459956190@qq.com>
Signed-off-by: shikang-hangzhou <459956190@qq.com>
…overlap (vllm-project#1589)

1. DBO model support EP parallel
2. Optimize dual stream overlap

max tokens: 32784, input_len: 1024, bs: 32, dp2tp8ep16
Before enabling DBO: TTFT 4017 ms; after enabling DBO: TTFT 3017 ms

Signed-off-by: shikang-hangzhou <459956190@qq.com>
…m-project#1420 vllm-project#1328 from v0.9.1-dev to main Signed-off-by: 22dimensions <waitingwind@foxmail.com>
What this PR does / why we need it?
max tokens: 32784, input_len: 1024, bs: 32, dp2tp8ep16

Before enabling DBO: TTFT 4017 ms

After enabling DBO: TTFT 3017 ms
Does this PR introduce any user-facing change?
None
How was this patch tested?