
Support fit with DDP then test without DDP #8375

Closed
collinmccarthy opened this issue Jul 11, 2021 · 14 comments
Labels
feature (Is an improvement or enhancement) · help wanted (Open to be worked on) · won't fix (This will not be worked on)

Comments

@collinmccarthy

collinmccarthy commented Jul 11, 2021

🚀 Feature

This issue has been discussed a few times it seems like (#7929, #3325, #3600), but I haven't seen or been able to find a clear response as to whether the PTL team would like to support this workflow.

I would like to train on a dataset with multiple GPUs, then test using a single GPU, without having to call the script multiple times.

Motivation

I do this fit-test sequence with every run because I want the final, accurate validation results using the best checkpoint from training. Without a "join" context for validation, this means I need to run on a single GPU without a DistributedSampler, as discussed in #7929.

Even with a join context for validation, I usually create my own validation set and use the original validation set as a test set. In this case I want to be able to call trainer.test() on both the validation set and the test set with every run.

I believe the workflow I'm describing is not uncommon, and therefore something PTL should be able to handle.

Finally, calling the script multiple times is a significant hassle for me and likely for other people as well. For my own workflow I'm launching several grid searches from a single config file using Wandb sweeps behind the scenes. I don't want to wait until all the runs are finished to re-launch everything and generate testing results, since the sweeps can take several days. And adding logic to call the script twice and start testing while some runs are still training introduces extra complexity that I really want to avoid, so the code stays as readable as possible.

Pitch

I have hacked together a solution that re-initializes the trainer for testing on a single GPU without any distributed backend, but it's a bit tricky because DDP is still initialized and things like torchmetrics still try to use the default process group to synchronize during testing. So in that case I need to pass in my own "dummy gather" function (sketched below) that bypasses the built-in gather function torchmetrics uses with the default process group. The whole workflow seems fragile at best, and if/when it breaks in future updates it will lead to deadlocks and block all of my runs.
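
Roughly, the dummy gather I'm describing looks like this (just a sketch; it mimics the list-of-tensors output of torchmetrics' gather_all_tensors, and the exact constructor arguments may differ across torchmetrics versions):

import torch

def dummy_gather(tensor: torch.Tensor, group=None):
    # Mimic gather_all_tensors' output (a list with one tensor per rank)
    # without ever touching the default process group.
    return [tensor]

# e.g. metric = torchmetrics.Accuracy(dist_sync_fn=dummy_gather)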

Alternatives

One solution is a join context to enable testing with multiple GPUs. If this is robust enough to be used for research papers and production then that would be great. This is likely the best solution long-term but it sounds like this won't be possible in the near future, as discussed in #3325.

A second solution is to somehow "clean up" the accelerator / DDP initialization so it's as if it were never initialized at all. Then the call to trainer.test() would be independent of any DDP environment created by fit(), and I believe everything should just work. My attempts at this failed, but I'm not very experienced with DDP, so it could be easy and I'm just missing something. To me this would be just as good as a join context, and perhaps it's easy to implement for each accelerator type, in which case this may be the best solution for now.

A third solution would be a more robust version of what I'm currently doing. This might involve wrapping any barrier() calls with a check to see whether we're testing on a single GPU and, if so, skipping the barrier (see the sketch below). Other synchronization methods such as gathers would have to be wrapped as well. This seems like the worst of the three solutions, since it's the most "hacky" to me.
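
A guarded barrier might look something like this (only a sketch of the wrapping idea, not PTL's actual implementation):

import torch.distributed as dist

def safe_barrier():
    # Skip the barrier entirely when no multi-process group is active,
    # e.g. when testing on a single GPU after the DDP group has been torn down.
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.barrier()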

Thank you!

Side note, I don't think I'm comfortable with a PR for this one, but if I can help in any other way please let me know.

-Collin

@collinmccarthy added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Jul 11, 2021
@tchaton
Contributor

tchaton commented Jul 11, 2021

Dear @collinmccarthy,

Thanks for sharing your thoughts there. You could use ddp_spawn and re-create a Trainer with different parameters.

DDP creates subprocesses, and I don't really see how we could make it work without killing them.

Best,
T.C

@collinmccarthy
Author

@tchaton Thank you for the quick response. Do you mean using ddp_spawn for the initial fit(), or just for the test() part? I want to avoid using ddp_spawn for fit because I need the absolute best performance that comes from ddp and ddp2, so I can scale this to many nodes and many GPUs.

@collinmccarthy
Author

collinmccarthy commented Jul 11, 2021

As for the subprocesses, I wasn't thinking we would kill them, but rather let them exit the program normally and reset the DDP backend so that torch.distributed.is_initialized() returns False, and future calls to trainer.test() don't try to use a stale DDP context.

This does assume that whatever happens after trainer.fit() no longer needs the DDP context. I'm sure there are use cases I'm not thinking about here, but it should at least work for the case where all you want is to call fit() once with multiple GPUs and then test() once with one GPU.

Another idea would be, rather than trying to remove the DDP context after fit(), to simply let it be re-initialized (e.g. in test()) with a different number of GPUs. This would require some updates to DDPPlugin.setup_distributed(), but at first glance it seems possible, as long as PyTorch allows you to call torch.distributed.init_process_group() after DDP is already initialized? (A rough sketch of that low-level sequence is below.)
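
At the raw PyTorch level the sequence would be roughly the following sketch; note that torch.distributed requires the old group to be destroyed before a new one is created, and this assumes the env:// init method with MASTER_ADDR / MASTER_PORT still available from the original launch:

import os
import torch.distributed as dist

# Tear down the process group created during fit() ...
if dist.is_initialized():
    dist.destroy_process_group()

# ... then re-initialize it with a world size of 1 for single-GPU testing.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)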

@tchaton
Contributor

tchaton commented Jul 12, 2021

Hey @collinmccarthy,

I understand better the use-case now.
Here is some pseudo-code; with a bit of hacking, I think it could work! Mind giving it a try?

trainer = Trainer(accelerator="ddp", gpus=2)
trainer.fit(...)

# Tear down the process group that fit() created
torch.distributed.destroy_process_group()

if trainer.is_global_zero:
    # Only rank 0 continues; tell the plugin the world size is now 1
    trainer.training_type_plugin.cluster_environment.set_world_size(1)
else:
    return  # non-zero ranks stop here

# With a single process the DDP sampler is no longer needed
trainer.accelerator_connector.replace_sampler_ddp = False
trainer.test(...)

@collinmccarthy
Author

@tchaton Nice! Thank you, this is what I was missing. Working on it now, I'll let you know soon but I'm sure I can get this to work.

@collinmccarthy
Author

@tchaton Okay, I think I've found two pseudo-solutions here.

First, a slight modification of what you have above:

trainer = Trainer(accelerator="ddp", gpus=2)
trainer.fit(...)

# Tear down the process group that fit() created
torch.distributed.destroy_process_group()

if trainer.is_global_zero:
    # Shrink the plugin's view of the cluster to a single process on a single node
    trainer.training_type_plugin.num_nodes = 1
    trainer.training_type_plugin.num_processes = 1
else:
    return  # non-zero ranks stop here

# With 1 process and 1 node -> num_replicas = 1, so the DDP sampler is safe but not necessary
trainer.accelerator_connector.replace_sampler_ddp = False
trainer.test(...)

The big problem is that the state of the DDPPlugin (trainer.training_type_plugin) is mostly "wrong": some environment variables, the interactive DDP processes, the parallel devices, etc. It just so happens that this state isn't used in test(), or at least didn't cause any obvious issues for me. It feels fragile at best.

To me the safer solution is to just re-create the trainer with the correct parameters for single-GPU testing, without any distributed backend. What my earlier attempt at this was missing is the call to torch.distributed.destroy_process_group(), which is key to avoiding issues in torchmetrics without resorting to things like a dummy gather function (for dist_sync_fn arguments).

trainer = Trainer(accelerator="ddp", gpus=2)
trainer.fit(...)

# Destroying the process group keeps torchmetrics from trying to sync across ranks
torch.distributed.destroy_process_group()
if trainer.is_global_zero:
    # Rank 0 builds a fresh single-GPU trainer and tests the best checkpoint
    trainer = Trainer(gpus=1)
    model = MyModel.load_from_checkpoint(best_ckpt_path)
    trainer.test(model, datamodule=my_datamodule)

If you agree that the second method seems more "robust", one thing PTL could do is abstract away the "clean up" of the training_type_plugin (e.g. the call to torch.distributed.destroy_process_group()) and require each plugin type to implement something like it. Then the user could pass a flag to trainer.test() to request a single GPU / accelerator type, and the trainer would call this "clean up" method, initialize a SingleDevicePlugin instead, and then start testing / predicting. Just some thoughts (a rough sketch of what that could look like is below).
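
Purely to illustrate the shape of that proposal, the user-facing call might look something like this; the use_single_device flag and the teardown hook are hypothetical names, not existing PTL APIs:

# Hypothetical usage, assuming PTL added a plugin-level cleanup hook and a test() flag.
trainer = Trainer(accelerator="ddp", gpus=2)
trainer.fit(model, datamodule=dm)

# Inside test(), the trainer would call something like
# trainer.training_type_plugin.teardown_distributed() (hypothetical)
# and swap in a SingleDevicePlugin before running the test loop.
trainer.test(model, datamodule=dm, use_single_device=True)  # hypothetical flag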

I'll circle back to this issue in a couple of weeks and see if there's anything I can do to help out. Maybe at a minimum I could add some documentation and an example with the BoringModel.

Thanks for your help!
-Collin

@tchaton
Contributor

tchaton commented Jul 13, 2021

Hey @ananthsub, any thoughts on this ?

@tchaton
Contributor

tchaton commented Jul 14, 2021

Hey @collinmccarthy,

I am not sure Lightning should provide this out of the box.
But this would definitely be a great addition to the doc.

@awaelchli Any thoughts there ?

Best,
T.C

@awaelchli
Contributor

@awaelchli Any thoughts there ?

I personally would never run the test on multiple GPUs. For correctness, one should always run on a single GPU when evaluating the test set: uneven inputs are not supported, and the distributed sampler will influence the metric computation (as the toy example below shows).
Evaluating the test error in a separate script is the only and best option, imo.
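
For illustration (a toy sketch, not taken from PTL itself), DistributedSampler pads the dataset so every rank gets the same number of samples, which means some samples are evaluated twice and averaged metrics are skewed:

from torch.utils.data import DistributedSampler

dataset = list(range(10))  # 10 samples, 3 replicas -> padded to 12 indices
for rank in range(3):
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank, shuffle=False)
    print(rank, list(sampler))
# Two indices appear twice across the ranks, so those samples are double-counted.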

If running the test after fit is still desired, one could do

torch.distributed.destroy_process_group()
if trainer.global_rank == 0:
    trainer = Trainer(gpus=1)
    trainer.test(model)

as @collinmccarthy suggested.

@stale

stale bot commented Aug 14, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Aug 14, 2021
@awaelchli removed the won't fix (This will not be worked on) label on Aug 14, 2021
@ananthsub
Contributor

#8632 addresses one of these concerns, which is the model does not need to be wrapped with DDP for testing.

However, I completely agree with @awaelchli that testing for evaluation purposes ought to be run in a separate script, on a single GPU, to avoid any (mis)handling of uneven inputs and the distributed sampler.

The process group destruction also came up in #8080. I think this is potentially something the trainer could do, if we are certain that it is the agent which created the initial process group. However, this would still be a breaking change and requires more consideration.

@stale

stale bot commented Sep 14, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Sep 14, 2021
@tchaton
Contributor

tchaton commented Sep 14, 2021

Hey @collinmccarthy,

I believe we can close this issue, as you have a workaround and Lightning won't support this out of the box.

Best,
T.C

@mfoglio

mfoglio commented May 1, 2024

Any update on this?
