Iterations completing out of order (possibly) in ddp with torchelastic? #3403

Closed
jloveric opened this issue Sep 8, 2020 · 2 comments · Fixed by #3819
Labels: bug, distributed, help wanted, waiting on author

Comments


jloveric commented Sep 8, 2020

This might be a bug or might be expected. I'm running PyTorch Lightning with torchelastic and DDP, and I'm noticing that iterations are being printed out of order (below, iteration 632 precedes iteration 574). This could be due to delays in parallel writing, or perhaps just an issue in logging. Is this expected behavior?

Validating: 60it [00:21,  3.61it/s]
Epoch 26: : 632it [08:13,  1.28it/s, loss=0.111, v_num=0]

Validating: 62it [00:22,  4.62it/s]
Validating: 0it [00:00, ?it/s]
Epoch 26: : 572it [07:51,  1.21it/s, loss=0.111, v_num=0]

Validating: 2it [00:00, 18.62it/s]
Epoch 26: : 574it [07:52,  1.22it/s, loss=0.111, v_num=0]

Running with 6 GPUs in DDP.

@jloveric jloveric added bug Something isn't working help wanted Open to be worked on labels Sep 8, 2020
@Borda Borda added the distributed Generic distributed-related topic label Sep 8, 2020
awaelchli (Contributor) commented

Does this happen only with torchelastic?
Can we check that the rank is set correctly, for example by printing trainer.global_rank somewhere in the training_step? I suspect these progress bars come from different ranks. The progress bar should only be shown on rank 0.
Which PL version are you using?
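A minimal sketch of the rank check suggested above, assuming the launcher exports the `RANK`, `LOCAL_RANK` and `WORLD_SIZE` environment variables (torchelastic does this for its workers). The helper names `describe_rank` and `is_rank_zero` are hypothetical and are not part of PyTorch Lightning. The comment at the bottom shows where the same print could go in a `training_step`.

```python
import os


def describe_rank(env=None):
    """Return a one-line summary of the rank variables visible to this process.

    Assumption: the launcher (e.g. torchelastic) exports RANK, LOCAL_RANK and
    WORLD_SIZE; any variable that is missing is reported as "unset".
    """
    env = os.environ if env is None else env
    names = ("RANK", "LOCAL_RANK", "WORLD_SIZE")
    return " ".join(f"{name}={env.get(name, 'unset')}" for name in names)


def is_rank_zero(env=None):
    """Return True if RANK is 0, or if RANK is unset (single-process run)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", 0)) == 0


if __name__ == "__main__":
    # Inside a LightningModule, the equivalent diagnostic could be placed in
    # training_step, for example:
    #     print(self.trainer.global_rank, describe_rank())
    print(f"pid={os.getpid()} {describe_rank()}")
```

If every worker reports a distinct `RANK` but several of them still draw a progress bar, the interleaved iteration counts in the log would be consistent with bars coming from more than one rank rather than from iterations finishing out of order.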

edenlightning (Contributor) commented

@jloveric mind giving more details?

4 participants