When using DDP with num_workers > 0, the training slows down between epochs. I tried using ddp_spawn with persistent_workers according to the docs:

> When using strategy="ddp_spawn" and num_workers>0, consider setting persistent_workers=True inside your DataLoader since it can result in data-loading bottlenecks and slowdowns.

But in the same document, they say num_workers > 0 should not be used with ddp_spawn, which is confusing.
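For reference, here is a minimal sketch of the data-loading setup I mean (the dataset, batch size, and worker count are placeholders for my real setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real one.
train_dataset = TensorDataset(torch.randn(1024, 3), torch.randint(0, 2, (1024,)))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,            # placeholder value
    num_workers=4,            # > 0, so the doc's persistent_workers advice applies
    persistent_workers=True,  # keep workers alive across epochs instead of respawning them
)
```

With persistent_workers=False (the default), the worker processes are torn down and respawned at every epoch boundary, which is presumably where the pause between epochs comes from.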
Also, when using the wandb logger to log images in ddp_spawn mode, the images are written into the root project dir and not sent to the server correctly. How can we fix this?
Normal DDP should work correctly now. Have you tried it? I have updated the default DDP config recently: #571
I'm not sure what's going on when you log images to wandb, but have you made sure to execute logging only on the rank 0 process? You don't want each DDP process to log the same image independently.
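Something along these lines is what I mean (a minimal sketch; `log_images` and its arguments are hypothetical, and `logger` is assumed to be the Lightning WandbLogger):

```python
import wandb
from pytorch_lightning.utilities import rank_zero_only

@rank_zero_only
def log_images(logger, images, caption="predictions"):
    # Runs only on the rank 0 process; every other DDP rank returns immediately.
    logger.experiment.log(
        {"examples": [wandb.Image(img, caption=caption) for img in images]}
    )
```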
> Normal DDP should work correctly now. Have you tried it? I have updated the default DDP config recently: #571
Yes, I used normal DDP before. When I observed the slowdown, I played around with many different settings to solve it, and finally found that with ddp_spawn the pause between epochs disappears.
I'm sure I log only on the rank 0 process; the function is decorated with rank_zero_only.
I think the problem is that the output dir of the wandb logger is set to `output_dir: ${hydra:runtime.output_dir}`, which doesn't work as desired in ddp_spawn mode.
> I think the problem is that the output dir of the wandb logger is set to `output_dir: ${hydra:runtime.output_dir}`, which doesn't work as desired in ddp_spawn mode.
It seems like it. I guess you could set `output_dir: ${paths.root_dir}/.wandb` as a fix for now, so the wandb dir will always be the same.
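Something like this in the wandb logger config, that is (a sketch only; the file path is an assumption, and the `output_dir` key mirrors the config line you quoted above):

```yaml
# configs/logger/wandb.yaml (path assumed)
output_dir: ${paths.root_dir}/.wandb  # fixed location instead of ${hydra:runtime.output_dir}
```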