Pr 1668 #1680
base: pytorch
Conversation
sqchai commented on Aug 2, 2024
- Fixed 'store_snapshot'. Previously, in multi-node multi-GPU mode each node would try to write a snapshot to the same root directory. Restored the default value of 'store_snapshot' to True and modified it so that only the rank = 0 process writes a snapshot (see the sketch after this list).
- The previous change to train.py, lines 255 to 311 (the multi-GPU part), was not necessary. Restored the original code.
- Regarding the 'disallow_scheduler' part in 'process_environment.py': for some reason my previous local code base did not have the 'disallow_scheduler' function and did not import it in the script, so this part of the code always failed. I fixed it according to the latest alf implementation.
- Modified README.md to include instructions for launching multi-node multi-GPU training.
- Fixed code style problems.
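
A minimal sketch of the snapshot guard described in the first bullet, assuming the launcher exports the standard RANK environment variable; the helper name and the flag definition below are illustrative, not the actual alf code:

```python
import os

from absl import flags

FLAGS = flags.FLAGS

# Mirrors the existing flag, with its default restored to True.
flags.DEFINE_bool('store_snapshot', True,
                  'Whether to save a code snapshot under root_dir.')


def should_store_snapshot() -> bool:
    """Return True only for the process that should write the snapshot.

    In multi-node multi-GPU mode every process resolves the same root
    directory, so only the rank-0 process writes the snapshot; otherwise
    the nodes would all write into the same place.
    """
    # RANK is exported by the launcher (e.g. torchrun); fall back to 0 so
    # a plain single-process run still stores its snapshot.
    global_rank = int(os.environ.get('RANK', '0'))
    return FLAGS.store_snapshot and global_rank == 0
```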
@@ -176,13 +179,12 @@ def training_worker(rank: int,
     # Specialization for distributed mode
     dist.init_process_group('nccl', rank=rank, world_size=world_size)
     # Recover the flags when spawned as a sub process
-    if rank > 0:
This if condition, if present during "multi-gpu" mode, will raise a "root_dir has been defined twice" error. Removing it seems to solve the problem, but it needs further confirmation.
On the contrary, the if condition is needed so that the flags will not be redefined for rank 0, because the rank-0 process is the main process, which has already defined the flags.
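
For context, a sketch of the guard under discussion; the FLAGS(...) call is an assumption about how the flags get recovered in a spawned process, not necessarily alf's exact code:

```python
import sys

import torch.distributed as dist
from absl import flags

FLAGS = flags.FLAGS


def training_worker(rank: int, world_size: int):
    # Specialization for distributed mode.
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    # Recover the flags when spawned as a sub process. Rank 0 runs in the
    # main process, whose flags are already defined, so only ranks > 0
    # need to restore them.
    if rank > 0:
        FLAGS(sys.argv, known_only=True)
```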
@@ -215,7 +279,7 @@ def main(_):

     conf_file = common.get_conf_file()

-    if FLAGS.store_snapshot:
+    if FLAGS.store_snapshot and int(os.environ['RANK']) == 0:
There may not be a RANK environment variable for a single-process / single-node job.
@@ -264,6 +328,40 @@ def main(_):
         # exit code for the program.
         raise ChildProcessError(f'Training failed on subprocess exception')

+    elif FLAGS.distributed == 'multi-node-multi-gpu':
+        local_rank = int(os.environ['LOCAL_RANK'])
why not use FLAGS.local_rank?
+        rank = int(os.environ['RANK'])
+        world_size = int(os.environ['WORLD_SIZE'])
Also, I couldn't find documentation about the RANK or WORLD_SIZE environment variables for torch.distributed.launch. Can you give a link to the documentation?
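
For reference, a sketch of the per-worker environment as exported by torchrun (torch.distributed.run); whether LOCAL_RANK is exported by the older torch.distributed.launch depends on the --use_env option, so the launcher actually used should be confirmed:

```python
import os


def read_launcher_env():
    """Read the distributed environment exported by the launcher.

    torchrun exports RANK, WORLD_SIZE and LOCAL_RANK for every worker; the
    legacy torch.distributed.launch exports RANK and WORLD_SIZE, but passes
    the local rank as --local_rank unless started with --use_env.
    """
    rank = int(os.environ['RANK'])              # global rank across all nodes
    world_size = int(os.environ['WORLD_SIZE'])  # total number of worker processes
    local_rank = int(os.environ['LOCAL_RANK'])  # rank within the current node
    return rank, world_size, local_rank
```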
@@ -78,6 +81,10 @@ def is_distributed(self):
     def ddp_rank(self):
         return self._ddp_rank

+    @property
+    def local_rank(self):
Since there are now two ranks, ddp_rank and local_rank, it is better to clearly document what each of them means.
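
One possible way to document the two ranks, as a sketch; the class name and attribute layout are assumed from the diff context, and the docstrings use the usual DDP terminology:

```python
class PerProcessContext(object):
    """Distributed context of the current process (sketch)."""

    def __init__(self, ddp_rank: int = -1, local_rank: int = -1):
        self._ddp_rank = ddp_rank
        self._local_rank = local_rank

    @property
    def ddp_rank(self):
        """Global rank of this process across all nodes, in [0, world_size)."""
        return self._ddp_rank

    @property
    def local_rank(self):
        """Rank of this process within its own node, in [0, procs_per_node).

        Used, for example, to select the CUDA device on that node.
        """
        return self._local_rank
```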
@@ -31,6 +31,7 @@ class SpawnedProcessContext(NamedTuple):
     """
     ddp_num_procs: int
     ddp_rank: int
+    local_rank: int
Please add comments explaining ddp_rank and local_rank.
# in different work processes.
manager = mp.Manager()
paras_queue = manager.Queue()
training_worker_multi_node(
I guess the Queue does not work across multiple machines. If so, Trainer._check_dpp_paras_consistency needs to be updated for multi-node mode, perhaps by changing if self._rank > 0: to if local_rank > 0:. If so, the docstring of TrainerConfig.ddp_paras_check_interval also needs to be updated to say that it only checks consistency within one node. And please test whether TrainerConfig.check_dpp_paras_consistency=True works properly.
Actually, I think paras_queue doesn't work here at all. In multi-gpu mode, paras_queue works because it is created by the main process, so all the workers use the same paras_queue. Here, different workers would use different queues, which makes no sense. So it's better to just report an unsupported error for TrainerConfig.check_dpp_paras_consistency=True in multi-node-multi-gpu mode.
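
A sketch of the suggested guard, with the config and mode names taken from this discussion; where such a check would actually live in alf is an assumption:

```python
def reject_unsupported_paras_check(distributed_mode: str,
                                   check_dpp_paras_consistency: bool):
    """Report an unsupported error for the DDP parameter-consistency check.

    The check relies on a multiprocessing queue created by the main process,
    which cannot be shared across machines, so it is only meaningful within
    a single node.
    """
    if (distributed_mode == 'multi-node-multi-gpu'
            and check_dpp_paras_consistency):
        raise ValueError(
            'TrainerConfig.check_dpp_paras_consistency=True is not supported '
            'in multi-node-multi-gpu mode')
```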
try:
    _setup_logging(log_dir=root_dir, rank=rank)
    _setup_device(local_rank)
    if world_size > 1:
It's guaranteed that world_size > 1, right? So there is no need for the if condition.
if ddp_rank > -1 and local_rank > -1:
    ddp_rank = local_rank
What is the reason for this?