Simple fix for memory leak on GPU0 #1094
Conversation
Hello @shubhamagarwal92! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-03-31 16:50:39 UTC
@williamFalcon Could you suggest why these tests are failing?
@shubhamagarwal92 likely because you are calling torch.cuda.* when cuda is not available, or the test expects it to run on CPU.
@awaelchli I am assuming that we would set the device here. Omitting this usually causes this memory leak on GPU0, as suggested by Soumith here. I am not sure if this is the right place to set the device though. Any suggestions?
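For readers following along: the leak pattern being discussed is that the first CUDA call in a process allocates a context on cuda:0 unless another device has been made current first. A minimal sketch of the workaround (illustrative only; `set_root_device` and the index used below are not names from this PR):

```python
import torch

def set_root_device(root_gpu):
    """Make the root GPU the current CUDA device before any CUDA work happens.

    Illustrative helper, not the PR's code. Without this, the first CUDA call
    in the process may create a context on cuda:0 even when training is
    pinned to another GPU, which shows up as extra memory on GPU 0.
    """
    if (root_gpu is not None
            and torch.cuda.is_available()
            and root_gpu < torch.cuda.device_count()):
        torch.cuda.set_device(root_gpu)

# e.g. training pinned to GPU 2 (gpus=[2]): GPU 0 never gets a context;
# on a CPU-only machine this is a no-op, so tests without CUDA still pass.
set_root_device(2)
```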
hey there, we have added a GPU CI test, so could we kindly ask you to rebase/merge master, which will trigger these tests so we do not need to run them manually... Thx for your understanding 🤖
@Borda I have rebased/merged master! Could you suggest any reason?
@shubhamagarwal92 could you please rather rebase on master, because now it shows all the recent master edits, which makes it harder to check your changes...
(Force-pushed from 52778a8 to 83c291d)
# set cuda device to root gpu
# related to https://github.com/PyTorchLightning/pytorch-lightning/issues/958
# Refer solution: https://github.com/pytorch/pytorch/issues/9871#issuecomment-408304190
# root_device = torch.device("cuda", root_gpu)
We can remove this now
# related to https://github.com/PyTorchLightning/pytorch-lightning/issues/958
# Refer solution: https://github.com/pytorch/pytorch/issues/9871#issuecomment-408304190
# root_device = torch.device("cuda", root_gpu)
root_device = (torch.device("cuda", root_gpu) if root_gpu >= 0 else torch.device("cpu"))
What would root_device be set to if the user wants CPU? None? -1? Maybe we should check for that explicitly.
If the user wants CPU, the function determine_root_gpu_device should not be called (?)
in this case it is getting called with gpus=None, and returns None (see the first lines of determine_root_gpu_device), so your else torch.device("cpu") branch is never relevant.
I don't see anything wrong with just
root_device = torch.device("cuda", root_gpu)
but I think the device should be set outside this function anyway
@awaelchli where do you suggest?
I would search the code base for occurrences of self.root_gpu and check whether it is needed to set the device in each case.
Maybe consider setting it directly after here, but I am not sure if that's the best place.
better to ask the core team on this :)
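To make the thread above concrete, here is a hedged sketch of the behaviour being described, with a simplified stand-in for determine_root_gpu_device (not the library's actual implementation): it returns None when no GPUs are requested, so the caller only builds a CUDA device when a real index comes back.

```python
from typing import List, Optional

import torch

def determine_root_gpu_device(gpus: Optional[List[int]]) -> Optional[int]:
    # Simplified stand-in: no GPUs requested -> no root GPU.
    if not gpus:
        return None
    # The first requested GPU acts as the root.
    return gpus[0]

root_gpu = determine_root_gpu_device([1, 3])  # -> 1
if root_gpu is not None:
    # Only reached when a real index came back; the CPU case (gpus=None)
    # never gets here, so no "cpu" fallback is needed in this branch.
    root_device = torch.device("cuda", root_gpu)
```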
@Borda is this dead?
@shubhamagarwal92 @awaelchli how is it going?
If I understand correctly, the fix to the memory leak is simply to add a torch.cuda.set_device(...) call.
@awaelchli @Borda @williamFalcon I have incorporated the suggestions; see the latest commit 6cba621. Could you please have a look? Thanks.
seems to fail on returned None.
@shubhamagarwal92 this is great. Mind fixing the issue so we can get it into 0.7.2? :)
@williamFalcon I updated the if-check in 2c7b802, but I do not understand why the automated checks are failing; for example, one requires a CUDA device. I can only tell that this is probably not the right place to set the device (?). Any suggestions?
if isinstance(self.data_parallel_device_ids, list):
    root_gpu = self.data_parallel_device_ids[0]
    root_device = (torch.device("cuda", root_gpu)
                   if root_gpu else torch.device("cpu"))
    torch.cuda.set_device(root_device)
also need to add tpu device...
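A hedged side note on the hunk above: `if root_gpu` treats GPU index 0 like None, since both are falsy, so a machine training on cuda:0 would silently fall back to CPU. An explicit None check (roughly what the later "check if root gpu exists or available" commit describes) avoids that; the names below are illustrative.

```python
import torch

root_gpu = 0  # a valid CUDA index, but falsy in a bare `if`

# Truthiness check: silently picks CPU for GPU 0.
device_wrong = torch.device("cuda", root_gpu) if root_gpu else torch.device("cpu")

# Explicit check: only falls back to CPU when there is no root GPU at all.
device_right = (torch.device("cuda", root_gpu)
                if root_gpu is not None and torch.cuda.is_available()
                else torch.device("cpu"))

print(device_wrong)  # cpu      <- unintended
print(device_right)  # cuda:0 on a CUDA machine, cpu otherwise
```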
@shubhamagarwal92 that's not where the device should be set. I put it in the correct place in #1349. Let's just merge that one. I think it will still credit you as a co-author.
Thanks for doing this. Ahh, now I see where the issue was. Should we close this PR then?
* SA: for #958: set torch cuda device when finding root
* SA: for #958: removing root gpu hack in trainer/evaluation_loop
* SA: setting torch cuda device
* comment line too long
* check if root gpu exists or available
* Incorporating suggestions on #1094
* since root gpu returns none instead of -1 for cpu
* undo changes
* fixed dp memory thing
Co-authored-by: Shubham Agarwal <shubhamagarwal92@gmail.com>
What does this PR do?
Fixes #958, related to a memory leak; a pretty old PyTorch issue related to setting the device. The fix sets the CUDA device in determine_root_gpu_device in trainer.distrib_parts.py.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃 lol. Yeah I did!