[FusedMoE] Remove cuda hard-code in dual stream execution #27397
base: main
Conversation
Signed-off-by: Icey <1790571317@qq.com>
Code Review
This pull request removes hard-coded CUDA dependencies from the FusedMoE layer by replacing CUDA-specific stream operations with platform-agnostic abstractions, a valuable change for hardware portability. However, I've identified a critical issue in the new implementation that could lead to a runtime crash on TPU platforms, and I have suggested a fix below.
I hope this PR can be included in the next version.
```diff
  # and other execution modes
- if envs.VLLM_DISABLE_SHARED_EXPERTS_STREAM:
-     logger.info_once("Disabling MoE shared_experts cuda stream")
+ if envs.VLLM_DISABLE_SHARED_EXPERTS_STREAM or current_platform.is_tpu():
```
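For context, a minimal sketch of what the guarded initialization could look like with this suggestion applied. The `else` branch, the helper name, and the import paths are assumptions based on the visible hunk, not the PR's actual code:

```python
import torch

from vllm import envs
from vllm.logger import init_logger
from vllm.platforms import current_platform

logger = init_logger(__name__)


def make_shared_experts_stream():
    # Skip the side stream when explicitly disabled, or on TPU where a
    # torch.cuda.Stream() call would crash at runtime.
    if envs.VLLM_DISABLE_SHARED_EXPERTS_STREAM or current_platform.is_tpu():
        logger.info_once("Disabling MoE shared_experts stream")
        return None
    # Create the stream via the active device module rather than
    # hard-coding torch.cuda (torch.get_device_module needs PyTorch 2.3+).
    return torch.get_device_module(current_platform.device_type).Stream()
```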
Why add the TPU check here?
TPU uses XLA, and its namespace differs from `torch.cuda.xxx`; maybe we could use `torch_xla.core.xla_model` for the `__getattr__` of the TPU backend. I'm not sure about that, though; maybe @DarkLight1337 could help take a look?
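To illustrate the idea being floated here, a purely hypothetical shim (not vLLM's actual TPU platform code) that forwards attribute lookups to `torch_xla.core.xla_model`, giving callers a `torch.cuda`-like access pattern:

```python
class XlaDeviceNamespace:
    """Hypothetical stand-in for torch.cuda on TPU: attribute access is
    forwarded to torch_xla.core.xla_model via __getattr__."""

    def __getattr__(self, name: str):
        import torch_xla.core.xla_model as xm  # requires torch_xla

        return getattr(xm, name)
```

Whether `torch_xla` actually exposes stream primitives with semantics matching `torch.cuda` is exactly the open question in this thread.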
cc @NickLucche
Purpose
Remove the hard-coded CUDA stream usage from FusedMoE's dual-stream execution path.
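As a rough sketch of the dual-stream pattern this makes portable (the helper and call sites are illustrative assumptions, not this PR's diff):

```python
import torch

from vllm.platforms import current_platform  # assumed import path

# Resolve torch.cuda / torch.xpu / ... from the active platform instead of
# hard-coding torch.cuda (torch.get_device_module needs PyTorch 2.3+).
dev = torch.get_device_module(current_platform.device_type)


def run_experts_overlapped(shared_experts, routed_experts, hidden_states):
    """Run shared experts on a side stream, overlapped with routed experts."""
    side = dev.Stream()
    side.wait_stream(dev.current_stream())  # order after work queued so far
    with dev.stream(side):                  # was: torch.cuda.stream(side)
        shared_out = shared_experts(hidden_states)
    routed_out = routed_experts(hidden_states)
    dev.current_stream().wait_stream(side)  # join before reading shared_out
    return shared_out, routed_out
```

Real code would also need to manage cross-stream tensor lifetimes (e.g. `Tensor.record_stream`), omitted here for brevity.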
Test Plan
Covered by the existing tests.
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.