
Pr 1668 #1680 (Open)

sqchai wants to merge 5 commits into base: pytorch

Conversation

@sqchai commented Aug 2, 2024

  1. Fixed 'store_snapshot'. Previously, in multi-node multi-GPU mode, every node would try to write a snapshot to the same root directory. Restored the 'store_snapshot' default value to True and changed the logic so that only the node-rank-0 process writes a snapshot (see the sketch after this list).
  2. The previous change to train.py, lines 255 to 311 (the multi-GPU part), was not necessary. Restored the original code.
  3. For the 'disallow_scheduler' part in 'process_environment.py': for some reason my previous local code base did not have the 'disallow_scheduler' function and did not import it in the script, so that part of the code always failed. Fixed it according to the latest alf implementation.
  4. Modified README.md to include instructions for launching multi-node multi-GPU training.
  5. Fixed code style problems.
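
A minimal sketch of the guard from item 1, using illustrative names rather than the exact alf code: only the process with global rank 0 writes the snapshot, so nodes sharing a root directory do not race on the same files.

```python
def should_store_snapshot(store_snapshot: bool, global_rank: int) -> bool:
    """Decide whether this process should write the code snapshot.

    In multi-node multi-GPU mode every node sees the same root_dir, so only
    the single process with global rank 0 writes, to avoid clobbering files.
    """
    return store_snapshot and global_rank == 0


# Illustrative usage: only rank 0 writes, every other rank skips.
assert should_store_snapshot(True, 0)
assert not should_store_snapshot(True, 3)
```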

@@ -176,13 +179,12 @@ def training_worker(rank: int,
# Specialization for distributed mode
dist.init_process_group('nccl', rank=rank, world_size=world_size)
# Recover the flags when spawned as a sub process
if rank > 0:

@JingyuQian commented Aug 25, 2024

This if condition, when present in "multi-gpu" mode, raises a "root_dir has been defined twice" error. Removing it seems to solve the problem. Needs further confirmation?

Contributor:

On the contrary, the if condition is needed so that the flags are not redefined for rank 0: the rank-0 process is the main process, which has already defined the flags.
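
A minimal sketch of the failure mode being discussed, using absl flags directly rather than the actual alf recovery code: defining a flag a second time raises DuplicateFlagError, which is why the recovery step must be skipped in the main (rank-0) process.

```python
from absl import flags


def define_job_flags():
    # Defining a flag that already exists raises flags.DuplicateFlagError,
    # which is the "root_dir has been defined twice" failure seen above.
    flags.DEFINE_string('root_dir', None, 'Root directory of the job.')


def recover_flags_in_worker(rank: int):
    """Hypothetical helper mirroring the guarded block in training_worker."""
    # Rank 0 runs inside the main process, which already defined the flags
    # at startup; only freshly spawned workers (rank > 0) need to redefine.
    if rank > 0:
        define_job_flags()
```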

@@ -215,7 +279,7 @@ def main(_):

conf_file = common.get_conf_file()

-if FLAGS.store_snapshot:
+if FLAGS.store_snapshot and int(os.environ['RANK']) == 0:

Contributor:

There may not be a RANK environment variable for a single-process / single-node job.
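
A hedged way to make the guard safe for non-distributed runs (treating a missing RANK as rank 0, so single-process jobs still store snapshots):

```python
import os


def global_rank() -> int:
    """Global rank of the current process.

    Distributed launchers such as torchrun export RANK; a plain
    single-process run does not, so fall back to 0 in that case.
    """
    return int(os.environ.get('RANK', 0))


# The guard from the diff, rewritten so it cannot raise KeyError:
#   if FLAGS.store_snapshot and global_rank() == 0:
```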

@@ -264,6 +328,40 @@ def main(_):
# exit code for the program.
raise ChildProcessError(f'Training failed on subprocess exception')

elif FLAGS.distributed == 'multi-node-multi-gpu':
local_rank = int(os.environ['LOCAL_RANK'])

Contributor:

why not use FLAGS.local_rank?

Comment on lines +333 to +334
rank = int(os.environ['RANK'])
world_size = int(os.environ['WORLD_SIZE'])

Contributor:

Also, I couldn't find documentation of the RANK or WORLD_SIZE environment variables for torch.distributed.launch. Can you give a link to the documentation?
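
For reference, the torchrun documentation lists RANK, LOCAL_RANK and WORLD_SIZE among the environment variables exported to every worker; whether the older torch.distributed.launch wrapper does the same is what needs confirming here. A sketch of relying on those variables, with the default env:// initialization:

```python
import os

import torch.distributed as dist


def init_distributed_from_env():
    """Sketch: initialize from the env vars a launcher like torchrun exports.

    With the default init_method='env://', init_process_group reads RANK and
    WORLD_SIZE from the environment itself, so they need not be passed in.
    """
    local_rank = int(os.environ['LOCAL_RANK'])
    dist.init_process_group(backend='nccl')
    return dist.get_rank(), dist.get_world_size(), local_rank
```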

@@ -78,6 +81,10 @@ def is_distributed(self):
def ddp_rank(self):
return self._ddp_rank

@property
def local_rank(self):

Contributor:

Since there are now two ranks, ddp_rank and local_rank, it would be better to clearly document what each of them means.
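
A sketch of the kind of docstrings being asked for; the class and attribute names beyond those shown in the diff are illustrative, not the actual process_environment.py code.

```python
class RankDoc:
    """Illustrative only; mirrors the two rank properties in the diff."""

    def __init__(self, ddp_rank: int, local_rank: int):
        self._ddp_rank = ddp_rank
        self._local_rank = local_rank

    @property
    def ddp_rank(self) -> int:
        """Global rank of this process across all nodes (0 .. world_size - 1)."""
        return self._ddp_rank

    @property
    def local_rank(self) -> int:
        """Rank of this process within its own node, i.e. the index of the
        local GPU it drives (0 .. nproc_per_node - 1)."""
        return self._local_rank
```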

@@ -31,6 +31,7 @@ class SpawnedProcessContext(NamedTuple):
"""
ddp_num_procs: int
ddp_rank: int
local_rank: int

Contributor:

Please add comments for ddp_rank and local_rank.
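
A sketch of the requested field comments; the comment text is a suggestion, and only the three fields shown in the diff are included here.

```python
from typing import NamedTuple


class SpawnedProcessContext(NamedTuple):
    ddp_num_procs: int  # Total number of DDP worker processes (world size).
    ddp_rank: int       # Global rank of this worker across all nodes.
    local_rank: int     # Rank of this worker within its node (local GPU index).
```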

# in different work processes.
manager = mp.Manager()
paras_queue = manager.Queue()
training_worker_multi_node(

Contributor:

I guess the Queue does not work across multiple machines. If so, Trainer._check_dpp_paras_consistency needs to be updated for multi-node mode, perhaps by changing 'if self._rank > 0:' to 'if local_rank > 0:'. The doc-string for TrainerConfig.ddp_paras_check_interval would also need to say that it only checks consistency within one node. And please test whether TrainerConfig.check_dpp_paras_consistency=True works properly.

Contributor:

Actually, I think paras_queue doesn't work at all here. In multi-gpu mode, paras_queue works because it is created by the main process and all workers use the same queue. Here, different workers would use different queues, which makes no sense. So it is better to just report an unsupported error for TrainerConfig.check_dpp_paras_consistency=True in multi-node-multi-gpu mode.
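
A hypothetical guard implementing that suggestion; the mode string and config name follow the review comment and may not match the final code.

```python
def check_paras_consistency_supported(distributed_mode: str,
                                      check_dpp_paras_consistency: bool):
    """Reject the cross-worker parameter check when it cannot work.

    A multiprocessing.Manager queue lives inside one node's manager process,
    so the consistency check cannot compare parameters across machines.
    """
    if (distributed_mode == 'multi-node-multi-gpu'
            and check_dpp_paras_consistency):
        raise RuntimeError(
            'check_dpp_paras_consistency=True is not supported in '
            'multi-node-multi-gpu mode: the parameter queue cannot span nodes.')
```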

try:
_setup_logging(log_dir=root_dir, rank=rank)
_setup_device(local_rank)
if world_size > 1:

Contributor:

It's guaranteed that world_size > 1 here, right? If so, there is no need for the if.

Comment on lines +138 to +139
if ddp_rank > -1 and local_rank > -1:
ddp_rank = local_rank

Contributor:

what is the reason for this?
