Simple fix for memory leak on GPU0 #1094
Conversation
Hello @shubhamagarwal92! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-03-31 16:50:39 UTC
@williamFalcon Could you suggest why these tests are failing?
@shubhamagarwal92 likely because you are calling torch.cuda.* when cuda is not available, or the test expects it to run on CPU.
@awaelchli I am assuming that we would set the device here. Omitting this usually causes this memory leak on GPU0, as suggested by Soumith here. I am not sure if this is the right place to set the device though. Any suggestions?
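For readers following along: the leak pattern being discussed is that the first CUDA call in a process allocates a context on cuda:0 unless another device has been made current first. A minimal sketch of the workaround (illustrative only; `set_root_device` and the index used below are not names from this PR):

```python
import torch

def set_root_device(root_gpu):
    """Make the root GPU the current CUDA device before any CUDA work happens.

    Illustrative helper, not the PR's code. Without this, the first CUDA call
    in the process may create a context on cuda:0 even when training is
    pinned to another GPU, which shows up as extra memory on GPU 0.
    """
    if (root_gpu is not None
            and torch.cuda.is_available()
            and root_gpu < torch.cuda.device_count()):
        torch.cuda.set_device(root_gpu)

# e.g. training pinned to GPU 2 (gpus=[2]): GPU 0 never gets a context;
# on a CPU-only machine this is a no-op, so tests without CUDA still pass.
set_root_device(2)
```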
hey there, we have added a GPU CI test, so could we kindly ask you to rebase/merge master, which will trigger these tests so we do not need to run them manually... Thx for your understanding 🤖
@Borda I have rebased/merged master! Could you suggest any reason?
@shubhamagarwal92 could you please rather rebase on master, because now it shows all the recent master edits, which makes it harder to check your changes...
(Force-pushed from 52778a8 to 83c291d)
# set cuda device to root gpu
# related to https://github.com/PyTorchLightning/pytorch-lightning/issues/958
# Refer solution: https://github.com/pytorch/pytorch/issues/9871#issuecomment-408304190
# root_device = torch.device("cuda", root_gpu)
We can remove this now
# related to https://github.com/PyTorchLightning/pytorch-lightning/issues/958
# Refer solution: https://github.com/pytorch/pytorch/issues/9871#issuecomment-408304190
# root_device = torch.device("cuda", root_gpu)
root_device = (torch.device("cuda", root_gpu) if root_gpu >= 0 else torch.device("cpu"))
What would root_device be set to if the user wants CPU? None? -1? Maybe we should check for that explicitly.
If the user wants CPU, the function determine_root_gpu_device should not be called (?)
in this case it is getting called with gpus=None, and returns None (see the first lines of determine_root_gpu_device), so your else torch.device("cpu") branch is never relevant.
I don't see anything wrong with just
root_device = torch.device("cuda", root_gpu)
but I think the device should be set outside this function anyway
@awaelchli where do you suggest?
I would search the code base for occurrences of self.root_gpu and check whether it is needed to set the device in each case.
Maybe consider setting it directly after here, but I am not sure if that's the best place.
better to ask the core team on this :)
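To make the thread above concrete, here is a hedged sketch of the behaviour being described, with a simplified stand-in for determine_root_gpu_device (not the library's actual implementation): it returns None when no GPUs are requested, so the caller only builds a CUDA device when a real index comes back.

```python
from typing import List, Optional

import torch

def determine_root_gpu_device(gpus: Optional[List[int]]) -> Optional[int]:
    # Simplified stand-in: no GPUs requested -> no root GPU.
    if not gpus:
        return None
    # The first requested GPU acts as the root.
    return gpus[0]

root_gpu = determine_root_gpu_device([1, 3])  # -> 1
if root_gpu is not None:
    # Only reached when a real index came back; the CPU case (gpus=None)
    # never gets here, so no "cpu" fallback is needed in this branch.
    root_device = torch.device("cuda", root_gpu)
```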
@Borda is this dead?
@shubhamagarwal92 @awaelchli how is it going?
If I understand correctly, the fix to the memory leak is simply to add a torch.cuda.set_device(...) call.
@awaelchli @Borda @williamFalcon I have incorporated the suggestions; see the latest commit 6cba621. Could you please have a look? Thanks.
seems to fail on returned None.
@shubhamagarwal92 this is great. Mind fixing the issue so we can get it into 0.7.2? :)
@williamFalcon I updated the if-check in 2c7b802, but I do not understand why the automated checks are failing; for example, one requires a CUDA device. I can only tell that this is probably not the right place to set the device (?). Any suggestions?
if isinstance(self.data_parallel_device_ids, list):
    root_gpu = self.data_parallel_device_ids[0]
    root_device = (torch.device("cuda", root_gpu)
                   if root_gpu else torch.device("cpu"))
    torch.cuda.set_device(root_device)
also need to add tpu device...
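A hedged side note on the hunk above: `if root_gpu` treats GPU index 0 like None, since both are falsy, so a machine training on cuda:0 would silently fall back to CPU. An explicit None check (roughly what the later "check if root gpu exists or available" commit describes) avoids that; the names below are illustrative.

```python
import torch

root_gpu = 0  # a valid CUDA index, but falsy in a bare `if`

# Truthiness check: silently picks CPU for GPU 0.
device_wrong = torch.device("cuda", root_gpu) if root_gpu else torch.device("cpu")

# Explicit check: only falls back to CPU when there is no root GPU at all.
device_right = (torch.device("cuda", root_gpu)
                if root_gpu is not None and torch.cuda.is_available()
                else torch.device("cpu"))

print(device_wrong)  # cpu      <- unintended
print(device_right)  # cuda:0 on a CUDA machine, cpu otherwise
```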
@shubhamagarwal92 that's not where the device should be set. I put it in the correct place in #1349. Let's just merge that one. I think it will still credit you as a co-author.
Thanks for doing this. Ahh, now I see where the issue was. Should we close this PR then?
* SA: for #958: set torch cuda device when finding root
* SA: for #958: removing root gpu hack in trainer/evaluation_loop
* SA: setting torch cuda device
* comment line too long
* check if root gpu exists or available
* Incorporating suggestions on #1094
* since root gpu returns none instead of -1 for cpu
* undo changes
* fixed dp memory thing
Co-authored-by: Shubham Agarwal <shubhamagarwal92@gmail.com>
What does this PR do?
Fixes #958, related to a memory leak; a pretty old PyTorch issue related to setting the device. The fix sets the CUDA device in determine_root_gpu_device in trainer.distrib_parts.py.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃 lol. Yeah I did!