
Conversation

@Liangliang-Ma
Contributor

In recent PyTorch, torch.distributed.distributed_c10d.get_global_rank raises a ValueError when passed an invalid parameter. A few versions ago, it raised a RuntimeError instead. To run under all of those versions, I modified comm/ccl.py to handle both.

@delock
Collaborator

delock commented Oct 30, 2023

@Liangliang-Ma what is the specific case that would cause ValueError?

@Liangliang-Ma
Contributor Author

@Liangliang-Ma what is the specific case that would cause ValueError?

The function get_all_ranks_from_group loops over the local ranks in the group (0, 1, ...) and looks up the global rank of each with torch.distributed.distributed_c10d.get_global_rank. When passed a local rank greater than or equal to the group size, that call raises an error. In earlier PyTorch commits it was a RuntimeError; now it is a ValueError.
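The compatibility pattern described above can be sketched roughly as follows. This is an illustrative mock, not the actual comm/ccl.py code: the stand-in `_global_rank` helper is hypothetical and uses a plain list for the group, but the exception handling mirrors what the patch needs, since get_global_rank raises ValueError in recent PyTorch and RuntimeError in older releases.

```python
def _global_rank(group, local_rank):
    # Stand-in for torch.distributed.distributed_c10d.get_global_rank.
    # Here `group` is just a list of global ranks, purely for illustration;
    # like the real API, it raises when local_rank exceeds the group size.
    try:
        return group[local_rank]
    except IndexError:
        raise ValueError(f"local rank {local_rank} is not part of the group")

def get_all_ranks_from_group(group):
    """Collect the global rank of each local rank 0, 1, ... until the
    lookup fails, signalling we have walked past the end of the group."""
    ranks = []
    local_rank = 0
    try:
        while True:
            ranks.append(_global_rank(group, local_rank))
            local_rank += 1
    except (ValueError, RuntimeError):
        # Newer PyTorch raises ValueError here, older versions raised
        # RuntimeError; catching both keeps the loop version-agnostic.
        pass
    return ranks
```

Catching the tuple `(ValueError, RuntimeError)` avoids pinning the code to one PyTorch version while still letting unrelated exceptions propagate.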

@delock
Collaborator

delock commented Nov 17, 2023

@Liangliang-Ma After applying this patch I see a different error with PyTorch 2.1. Can you also fix this error in this PR? I think it is due to a difference between PyTorch 2.0 and PyTorch 2.1:
Traceback (most recent call last):
  File "/home/gma/ds_allreduce_bench/ds_comm_bench.py", line 80, in <module>
    t = test_allreduce(False, dtype, loop_count)
  File "/home/gma/ds_allreduce_bench/ds_comm_bench.py", line 55, in test_allreduce
    dist.barrier(t)
  File "/home/gma/DeepSpeed/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/gma/DeepSpeed/deepspeed/comm/comm.py", line 408, in barrier
    return cdb.barrier(group=group, async_op=async_op)
  File "/home/gma/DeepSpeed/deepspeed/comm/ccl.py", line 136, in barrier
    return self.run_collective(name="barrier", group=group, async_op=async_op)
  File "/home/gma/DeepSpeed/deepspeed/comm/ccl.py", line 64, in run_collective
    kwargs['group'] = self.get_all_ranks_from_group(kwargs['group'])
  File "/home/gma/DeepSpeed/deepspeed/comm/ccl.py", line 176, in get_all_ranks_from_group
    self._new_group(results, group)
  File "/home/gma/DeepSpeed/deepspeed/comm/ccl.py", line 158, in _new_group
    sub_main_kvs = self.ccl_comm_op.get_sub_kvs_addr(rank == ranks[0])
IndexError: list index out of range

@delock
Collaborator

delock commented Nov 20, 2023

It's a test case issue. No further comments on this PR.


@Liangliang-Ma
Contributor Author

Content duplicated with #4430. Closing this one.
