-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
[Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False
#24696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False
#24696
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly disables the VLLM_ALLREDUCE_USE_SYMM_MEM feature by default by changing its environment variable's default value from True to False. This is a sensible approach to temporarily mitigate a bug as described in the PR. The changes are consistently applied, and the existing tests for this feature are correctly configured to explicitly enable it, ensuring continued test coverage. The implementation looks good.
…-default Signed-off-by: yewentao256 <zhyanwentao@126.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, let's get this bugfix in for now to fix release
VLLM_ALLREDUCE_USE_SYMM_MEM default to FalseVLLM_ALLREDUCE_USE_SYMM_MEM default to False
…ject#24696) Signed-off-by: yewentao256 <zhyanwentao@126.com>
…ject#24696) Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
@ilmarkov Could you fix the issue and re-enable VLLM_ALLREDUCE_USE_SYMM_MEM by default so that we can benefit from the faster AllReduce without any env vars? Thanks! |
|
@nvpohanh Yes, I am working on this. The easiest solution would be disable symm mem when DP is used (i.e. all devices do only TP or PP) but I am trying to find a way to enable it for all. The problem is torch incorrectly detects overlapping devices here in case when multiple DP processes are running on the same node. |
…ject#24696) Signed-off-by: yewentao256 <zhyanwentao@126.com>
…ject#24696) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
…ject#24696) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
Fixes #24694
#24111 should be tested with DP case then open to default again
Test
(APIServer pid=454020) INFO: Started server process [454020] (APIServer pid=454020) INFO: Waiting for application startup. (APIServer pid=454020) INFO: Application startup complete.