Problems with DDP + hydra #393
Thanks for the summary 👍🏻. Looking forward to future fixes.
@ashleve Can you explain why `ddp_spawn` is recommended here?
@turian As I mentioned, normal DDP generates multiple unwanted files. This is because ddp launches a new process for each GPU, which doesn't play well with the way hydra creates a different output dir each time a program is launched. The problem doesn't exist with `ddp_spawn`.
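For intuition, the clash comes from hydra's timestamped run dir. This sketch mirrors hydra's documented default (the template's actual values differ):

```yaml
# Hydra's default-style run dir, resolved at launch time.
hydra:
  run:
    dir: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}
# Each ddp process re-executes the script and resolves ${now:...} a moment
# apart, so every rank ends up writing to its own directory.
```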
@ashleve Just curious, because I am using hydra + DDP in a current project: how would I be able to detect if this issue is occurring for me? What evidence should I look for? Thank you for the tip.
@turian There will be more output directories, as explained in facebookresearch/hydra#2070. Just to make this clear: normal ddp actually computes correctly in hydra single-run mode, but you will have multiple output directories.
@ashleve woof, that's gross. If you have a good fix, we might consider seeing if we can push it upstream to Lightning.
@ashleve The Lightning team appears to be working on this issue: Lightning-AI/pytorch-lightning#11617 (comment). I've been lightly commenting in that PR.
@turian @ashleve What worked for me as a workaround is making the experiment dirs static (especially for multiruns/sweeps), e.g. as sketched below.
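A minimal sketch of that workaround (assumed config; `experiment_name` is a hypothetical key you would define yourself):

```yaml
hydra:
  run:
    dir: logs/${experiment_name}
  sweep:
    dir: logs/${experiment_name}
    # hydra.job.num numbers each job of a sweep, so jobs don't collide
    subdir: ${hydra.job.num}
```

Because the paths contain no `${now:...}` timestamp, every ddp process resolves the same directory.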
You would lose the ability to have a separate directory for sweeper results, but you could override this specifically for optuna optimization sweeps if you like.
It looks like the PR has been merged into Lightning main!
Has this issue been fixed? Can I use normal `ddp` now?
@ashleve Was this fixed by the newest release, which uses PyTorch 2.0 and PyTorch Lightning 2.0? Thank you for your time.
@AiEson @libokj It seems like the issue with ddp is indeed fixed. I've checked on a multi-GPU instance, and at first glance everything seems to be computed correctly, with no redundant logging directories. Issues with ddp are often hard to spot, though, so let me know if you encounter any problems. For reference, here are some of the commands I've checked:
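(A plausible sketch of such commands, assuming the template's standard `trainer=ddp` config; the exact flags below are my assumption, not the author's preserved list.)

```bash
# single run with DDP on 2 GPUs
python src/train.py trainer=ddp trainer.devices=2

# multirun sweep with DDP
python src/train.py -m trainer=ddp trainer.devices=2 seed=0,1,2
```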
I made the appropriate changes to the ddp config in #571.
I really appreciate your update! Thank you again.
Hi, when using ddp I still end up with two, sometimes three, directories per sweep under my log dir when executing a multirun. My Lightning version is 2.0.4. Is there anything I am missing? Thanks!
There are two files created in my case as well. Here is my current hydra config:

```yaml
run:
  dir: ${paths.log_dir}/${job_name}/runs/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
sweep:
  dir: ${paths.log_dir}/${job_name}/multiruns/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
  # Sanitize override_dirname by replacing '/' with '.' to avoid unintended subdirectory creation
  subdir: ${eval:'"${hydra.job.override_dirname}".replace("/", ".")'}
job_logging:
  handlers:
    file:
      filename: ${hydra:runtime.output_dir}/job.log
```

With this configuration, multiple folders are created for the same ddp job, one for each ddp process, with a slight time difference in the timestamped directory names.
@libokj What if you try suppressing hydra's per-process log files and output subdir? Something along the lines of the sketch below.
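A minimal sketch, assuming the suggestion was built from stock hydra overrides (the pairing with ddp here is my assumption):

```bash
# Suppress the .hydra config subdir and hydra's log files for each process.
python src/train.py trainer=ddp \
  hydra.output_subdir=null \
  hydra/job_logging=disabled \
  hydra/hydra_logging=disabled
```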
@hovnatan Is there any method to avoid creating a folder for the worker node? With the scripts above I could stop the log files from being generated, but a logging folder with a slight time difference is still created.
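One pattern that can address the folder itself (my own sketch, not confirmed in this thread): export a timestamp once before launch and resolve the run dir from it, so every ddp process computes the same path. `RUN_TS` is a hypothetical variable name.

```yaml
# Export once before launching, e.g.: export RUN_TS=$(date +%Y-%m-%d_%H-%M-%S)
# All ddp processes inherit the same value; the ${now:...} fallback keeps
# single-process runs working when RUN_TS is unset.
run:
  dir: ${paths.log_dir}/${job_name}/runs/${oc.env:RUN_TS,${now:%Y-%m-%d_%H-%M-%S}}
```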
There have been numerous issues about using DDP with hydra:
#231 #289 #229 #226 #194 #352
Current state of things is well described here:
facebookresearch/hydra#2070
tl;dr:
- You should be good when using the current lightning-hydra-template with `ddp_spawn`: this works correctly with normal runs as well as multiruns, as far as I'm aware. (`ddp_spawn` works a bit slower than normal `ddp` and should be run with `datamodule.num_workers=0` only; see the command sketch after this list.)
- Normal `ddp` computes correctly but generates multiple output directories.
- I have not tested what happens when using SLURM.
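For reference, a command sketch for the `ddp_spawn` setup described above (the exact config names are assumptions based on the template's conventions):

```bash
python src/train.py trainer.strategy=ddp_spawn trainer.devices=2 datamodule.num_workers=0
```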
For now, I don't see anything that can be done on the template's side to fix this. This might change with future hydra releases.
Update (April 2023):
Normal DDP seems to be working correctly with the current Lightning release (2.0.2). There are no longer multiple output directories.