Enable symmetric memory all reduce by default only enabling for TP #25070
Conversation
Code Review
This pull request enables symmetric memory all-reduce by default and fixes a crash when used with DataParallel. The core fix correctly sets TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES when a DataParallel setup is detected, which is a sound approach. The new test case validates this fix. The changes are generally good, but I have identified a critical bug in the test cleanup logic that could lead to resource leaks, and a high-severity issue regarding environment variable modification that should be addressed to prevent potential side effects.
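The summary above mentions setting TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES when a DataParallel setup is detected. A minimal sketch of that kind of guard, assuming a hypothetical helper and a simple dp_size check rather than the actual vLLM code:

```python
import os


def allow_overlapping_devices_for_dp(dp_size: int) -> None:
    """Hypothetical helper: with DataParallel, several workers can report the
    same local CUDA device index, so torch symmetric memory is told that
    overlapping devices are expected before it attempts rendezvous."""
    if dp_size > 1:
        os.environ["TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES"] = "1"
```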
    if val is not None:
        pytest.skip(val)
    cleanup_dist_env_and_memory()
There is a critical bug in the cleanup logic. pytest.skip() raises a special exception to stop the test execution. Because it's called before cleanup_dist_env_and_memory() within the finally block, the cleanup function will not be executed if a worker process sends a skip message. This will lead to resource leaks (e.g., dangling distributed processes), which can cause subsequent tests to fail or hang. The cleanup must be guaranteed to run before the test is skipped.
Suggested change:
    -    if val is not None:
    -        pytest.skip(val)
    -    cleanup_dist_env_and_memory()
    +    cleanup_dist_env_and_memory()
    +    if val is not None:
    +        pytest.skip(val)
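A minimal, self-contained sketch of the ordering the suggestion aims for; the worker and cleanup functions here are illustrative stand-ins, not the actual test code:

```python
import pytest


def _run_test_worker():
    """Illustrative stand-in: returns a skip reason string or None."""
    return None


def cleanup_dist_env_and_memory():
    """Illustrative stand-in for vLLM's distributed/memory cleanup helper."""


def test_symm_mem_allreduce_ordering_sketch():
    val = None
    try:
        val = _run_test_worker()
    finally:
        # Cleanup runs first: pytest.skip() raises a Skipped exception, so if
        # it came before this call the distributed environment would leak.
        cleanup_dist_env_and_memory()
        if val is not None:
            pytest.skip(val)
```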
LGTM!
Purpose
Enable torch symm mem by default, but only for the TP communicator, and fix the crash with DataParallel.
Conflict with DataParallel
In DataParallel, multiple workers are assigned to the same set of devices in their respective environments (each with a different CUDA_VISIBLE_DEVICES).
For the DP communicator this yields a wrong local_rank:rank mapping, e.g. unique_name: dp:0, device: cuda:0, rank: 1, world_size: 2. For the EP communicator in a DP setup, two workers in the same group end up with the same device but different ranks.
As a result, the same self.device ends up registered within the same group, which makes torch think the devices overlap, even though the allocated tensors live in different environments. Torch symm mem then fails the rendezvous, so symm mem cannot be used with DataParallel; this PR only avoids the crash.
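A small illustration (not the vLLM code path) of why two DP workers appear to share a device, assuming each worker is launched with its own CUDA_VISIBLE_DEVICES:

```python
import os

import torch

# Worker A is launched with CUDA_VISIBLE_DEVICES=0, worker B with
# CUDA_VISIBLE_DEVICES=1. Inside each process the single visible GPU has
# index 0, so both workers construct the same-looking device object.
device = torch.device("cuda", 0)
print(os.environ.get("CUDA_VISIBLE_DEVICES"), device)
# Worker A prints: 0 cuda:0
# Worker B prints: 1 cuda:0
# When both register this device in one symm-mem group, torch sees overlapping
# devices and the rendezvous fails unless overlapping devices are allowed.
```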
Fixes #24694.
Test Plan
Added a test to test_symm_mem_allreduce.py.
Test Result
The added test passes.