Support fit with DDP then test without DDP #8375
Comments
Dear @collinmccarthy, thanks for sharing your thoughts here. You could use `ddp_spawn` and re-create a Trainer with different parameters. DDP creates subprocesses and I don't really see how we could make it work without killing them. Best,
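A minimal sketch of that suggestion, assuming the 1.x-era `accelerator`/`gpus` Trainer arguments; `model` and `dm` are illustrative placeholders, not names from this thread:

```python
from pytorch_lightning import Trainer

# `model` is a LightningModule and `dm` a LightningDataModule (placeholders).
# Fit with spawned DDP workers; they exit when fit() returns, so the main
# process is free to build a fresh Trainer afterwards.
trainer = Trainer(accelerator="ddp_spawn", gpus=2)
trainer.fit(model, datamodule=dm)

# Re-create a Trainer with different parameters for single-GPU testing.
test_trainer = Trainer(gpus=1)
test_trainer.test(model, datamodule=dm)
```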
@tchaton Thank you for the quick response. Do you mean use `ddp_spawn` for the initial `fit()` call, and then re-create the Trainer for testing?
And for the subprocesses, I wasn't thinking we would kill them, rather just let them exit the program normally and reset the DDP backend so that the subsequent `trainer.test()` call runs without it. This does assume that whatever happens after `fit()` only needs to run in a single process. Or another idea would be, rather than trying to remove the DDP context after `fit()`, …
Hey @collinmccarthy, I understand the use case better now.
@tchaton Nice! Thank you, this is what I was missing. Working on it now; I'll let you know soon, but I'm sure I can get this to work.
@tchaton Okay, I think I've found two pseudo-solutions here. First, a slight modification of what you have above:

```python
trainer = Trainer(accelerator="ddp", gpus=2)
trainer.fit(...)

torch.distributed.destroy_process_group()
if trainer.is_global_zero:
    trainer.training_type_plugin.num_nodes = 1
    trainer.training_type_plugin.num_processes = 1
else:
    return

# With 1 process, 1 node -> num replicas = 1, so ddp sampler is safe but not necessary
trainer.accelerator_connector.replace_sampler_ddp = False
trainer.test(...)
```

The big problem is that the state of the DDPPlugin (and the rest of the trainer internals) is still set up for distributed training, so reusing the same trainer this way feels fragile.

To me the safer solution is to just re-create the trainer with the correct parameters for single-GPU testing without any distributed backend. What my earlier attempt at this was missing is the call to `torch.distributed.destroy_process_group()`:

```python
trainer = Trainer(accelerator="ddp", gpus=2)
trainer.fit(...)

torch.distributed.destroy_process_group()
if trainer.is_global_zero:
    trainer = Trainer(gpus=1)
    model = MyModel.load_from_checkpoint(best_ckpt_path)
    trainer.test(model, datamodule=my_datamodule)
```

If you agree that the second method seems more "robust", one thing PTL could do is abstract away the "clean up" of the training_type_plugin (e.g. the call to `torch.distributed.destroy_process_group()`).

I'll circle back to this issue in a couple of weeks and see if there's anything I can do to help out. Maybe at a minimum I could add some documentation and an example with the second approach.

Thanks for your help!
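To make that "clean up" idea concrete, the requested abstraction could be as small as the hypothetical helper below. This is not an existing Lightning API, just an illustrative sketch:

```python
import torch.distributed as dist

def teardown_ddp_after_fit(trainer) -> bool:
    """Hypothetical helper: destroy the process group created for fit() and
    report whether this rank (global rank zero) should carry on with testing."""
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
    return trainer.is_global_zero
```

With something like that in place, the second snippet above reduces to a single rank-zero guard before building the new single-GPU Trainer.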
Hey @ananthsub, any thoughts on this?
Hey @collinmccarthy, I am not sure Lightning should provide this out of the box. @awaelchli Any thoughts there? Best,
I personally would never run the test on multiple GPUs. For correctness, one should always run on a single GPU when evaluating the test set. Uneven inputs are not supported and the distributed sampler will influence the metric computation. If running the test after fit is still desired, one could do

```python
torch.distributed.destroy_process_group()
if trainer.global_rank == 0:
    trainer = Trainer(gpus=1)
    trainer.test(model)
```

as @collinmccarthy suggested.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
#8632 addresses one of these concerns, which is that the model does not need to be wrapped with DDP for testing. However, I completely agree with @awaelchli that testing for evaluation purposes ought to be run in a separate script, on a single GPU, to avoid any (mis)handling of uneven inputs and the distributed sampler. The process group destruction also came up in #8080. I think this is potentially something the trainer could do if we are certain that it is the agent which created the initial process group. However, this would still be a breaking change and requires more consideration.
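A rough sketch of that "ownership" check; this is illustrative only, not Lightning's actual implementation, and `backend`, `rank`, and `world_size` are placeholders supplied by whatever launches the processes:

```python
import torch.distributed as dist

def init_process_group_if_needed(backend: str, rank: int, world_size: int) -> bool:
    """Create the default process group only if nobody else has, and report
    whether we "own" it (and are therefore responsible for destroying it)."""
    if dist.is_available() and dist.is_initialized():
        return False  # created externally, e.g. by the user or a launcher
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    return True  # we created it, so destroying it after fit() is safe
```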
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Hey @collinmccarthy, I believe we can close this issue as you have a workaround and Lightning won't support this out of the box. Best,
Any update on this?
🚀 Feature
This has been discussed a few times, it seems (#7929, #3325, #3600), but I haven't been able to find a clear response as to whether the PTL team would like to support this workflow.
I would like to train on a dataset with multiple GPUs, then test using a single GPU, without having to call the script multiple times.
Motivation
I run this fit-then-test sequence with every run because I want to get final, accurate validation results using the best checkpoint from training. Without a "join" context for validation, this means I need to run on a single GPU without a DistributedSampler, as discussed in #7929.
Even with a join context for validation, I usually create my own validation set and use the original validation set as a test set. In this case I want to be able to call `trainer.test()` on both the validation set and the test set with every run. I believe this workflow is not uncommon, and it should therefore be something PTL can handle.
Finally, calling the script multiple times is a significant hassle for me and likely for other people as well. For my own workflow I'm launching several grid searches from a single config file using Wandb sweeps behind the scenes. I don't want to wait until all the runs are finished to re-launch everything and generate testing results, since the sweeps can take several days. And adding logic to call the script twice and start testing while some runs are still training adds complexity that I really want to avoid, so the code stays as readable as possible.
Pitch
I have hacked together a solution that re-initializes the trainer for testing on a single GPU without any distributed backend, but it's a bit tricky because DDP is still initialized and things like `torchmetrics` still try to use the default process group to synchronize state during testing. So I need to pass in my own "dummy gather" function that bypasses the built-in gather function that `torchmetrics` uses with the default DDP process group. The whole workflow seems fragile at best, and if/when it breaks in future updates it will lead to deadlocks and block all of my runs.
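As a rough illustration of the "dummy gather" hack: the exact `dist_sync_fn` signature depends on the torchmetrics version, but it is expected to return a list of gathered tensors, so a sketch might look like this:

```python
import torch
from torchmetrics import Accuracy

def dummy_gather(tensor: torch.Tensor, group=None):
    # Behave as if this were the only rank: "gathering" just returns our own
    # tensor, so the stale default process group is never touched during test().
    return [tensor]

# dist_sync_fn overrides the all-gather torchmetrics would otherwise perform.
metric = Accuracy(dist_sync_fn=dummy_gather)
```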
Alternatives
One solution is a join context to enable testing with multiple GPUs. If this is robust enough to be used for research papers and production, that would be great. This is likely the best solution long-term, but it sounds like it won't be possible in the near future, as discussed in #3325.
A second solution is to somehow "clean up" the accelerator / DDP initialization so it's as if it were never initialized at all. Then the call to `trainer.test()` would be independent of any DDP environment created by `fit()`, and I believe everything should just work. My attempts at this failed, but I'm not very experienced with DDP, so it could be easy and I'm just missing something. To me this would be just as good as a join context, and perhaps it's easy to implement for each accelerator type, in which case this may be the best solution for now.

A third solution would be a more robust version of what I'm currently doing. This might involve wrapping any `barrier()` calls with a check to see if we're testing on a single GPU and, if so, skipping the barrier. Other synchronization methods such as gathers would have to be wrapped as well. This seems like the worst solution of the three since it's the most "hacky" to me.
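For the third option, the wrapping could look something like this sketch; real code would need the same treatment for gathers and reductions:

```python
import torch.distributed as dist

def maybe_barrier() -> None:
    # Skip the collective when there is no multi-rank process group to sync
    # with, e.g. when testing on a single GPU after tearing down DDP.
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.barrier()
```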
Thank you!
Side note, I don't think I'm comfortable with a PR for this one, but if I can help in any other way please let me know.
-Collin