When using DDP with num_workers > 0, the training slows down between epochs. I tried using ddp_spawn with persistent_workers according to the docs:

> When using strategy="ddp_spawn" and num_workers>0, consider setting persistent_workers=True inside your DataLoader since it can result in data-loading bottlenecks and slowdowns.

But in the same document, they say num_workers > 0 should not be used with ddp_spawn, which is confusing.
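For reference, here is a minimal sketch of the data-loading setup I mean (the dataset, batch size, and worker count are placeholders for my real setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real one.
train_dataset = TensorDataset(torch.randn(1024, 3), torch.randint(0, 2, (1024,)))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,            # placeholder value
    num_workers=4,            # > 0, so the doc's persistent_workers advice applies
    persistent_workers=True,  # keep workers alive across epochs instead of respawning them
)
```

With persistent_workers=False (the default), the worker processes are torn down and respawned at every epoch boundary, which is presumably where the pause between epochs comes from.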
Also, when using the wandb logger to log images in ddp_spawn mode, the images are written into the root project dir and not sent to the server correctly. How can we fix this?
Normal DDP should work correctly now. Have you tried it? I have updated the default DDP config recently: #571
I'm not sure what's going on when you log images to wandb, but have you made sure to execute logging only on the rank 0 process? You don't want each DDP process to log the same image independently.
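Something along these lines is what I mean (a minimal sketch; `log_images` and its arguments are hypothetical, and `logger` is assumed to be the Lightning WandbLogger):

```python
import wandb
from pytorch_lightning.utilities import rank_zero_only

@rank_zero_only
def log_images(logger, images, caption="predictions"):
    # Runs only on the rank 0 process; every other DDP rank returns immediately.
    logger.experiment.log(
        {"examples": [wandb.Image(img, caption=caption) for img in images]}
    )
```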
> Normal DDP should work correctly now. Have you tried it? I have updated the default DDP config recently: #571
Yes, I used normal DDP before. When I observed the slowdown, I played around with many different settings to solve it, and finally found that with ddp_spawn the pause between epochs disappears.
I'm sure I log only on the rank 0 process; the function is decorated with rank_zero_only.
I think the problem is that the output dir of the wandb logger is set to `output_dir: ${hydra:runtime.output_dir}`, which doesn't work as desired in ddp_spawn mode.
> I think the problem is that the output dir of the wandb logger is set to `output_dir: ${hydra:runtime.output_dir}`, which doesn't work as desired in ddp_spawn mode.
It seems like it. I guess you could set `output_dir: ${paths.root_dir}/.wandb` as a fix for now, so the wandb dir will always be the same.
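Something like this in the wandb logger config, that is (a sketch only; the file path is an assumption, and the `output_dir` key mirrors the config line you quoted above):

```yaml
# configs/logger/wandb.yaml (path assumed)
output_dir: ${paths.root_dir}/.wandb  # fixed location instead of ${hydra:runtime.output_dir}
```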