Long time between calls to training_step when there are multiple optimizers #1354

Closed
karlinjf opened this issue Apr 3, 2020 · 12 comments
karlinjf (Contributor) commented Apr 3, 2020

I have a GAN model with two optimizers that runs 30-40% slower in Lightning than without it. I've discovered that the lost time falls between the end of training_step for optimizer_idx 0 and the start of the call for optimizer_idx 1. About 120 ms of CPU time (not wall time) is spent there; 30 ms of that is the backward step, and the other 90 ms is unaccounted for. Note that after optimizer_idx 1 runs, there are only about 20 ms of CPU time before optimizer_idx 0 is called again for the next batch.

So why might there be extra time between the optimizers?

This is happening in both the latest release as well as master.

Thanks!
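
For context, here is a minimal sketch of the kind of two-optimizer GAN LightningModule being described, using the training_step(batch, batch_idx, optimizer_idx) signature from that era of Lightning. The generator/discriminator modules, latent size, and losses below are placeholders for illustration, not the reporter's actual model.

```python
import torch
import pytorch_lightning as pl


class SketchGAN(pl.LightningModule):
    def __init__(self, generator, discriminator):
        super().__init__()
        self.generator = generator          # assumed nn.Module producing fake samples
        self.discriminator = discriminator  # assumed nn.Module scoring samples

    def training_step(self, batch, batch_idx, optimizer_idx):
        real, _ = batch
        z = torch.randn(real.size(0), 100, device=real.device)  # latent size 100 is arbitrary
        if optimizer_idx == 0:
            # Generator step: the reported timing ends when this call returns.
            g_loss = -self.discriminator(self.generator(z)).mean()
            return {'loss': g_loss}
        # Discriminator step: the unexplained ~90 ms gap falls before this call starts.
        fake = self.generator(z).detach()
        d_loss = self.discriminator(fake).mean() - self.discriminator(real).mean()
        return {'loss': d_loss}

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        return [opt_g, opt_d]
```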

karlinjf added the "bug" and "help wanted" labels on Apr 3, 2020
github-actions bot commented Apr 3, 2020

Hi! Thanks for your contribution, great first issue!

Borda added the "question" label and removed the "bug" label on Apr 3, 2020
karlinjf (Author) commented Apr 3, 2020

I think this might actually be more of a bug than a question. I can reproduce it on the sample gan.py as well: each forward step takes 130 ms, and there are 270 ms between forward steps, of which only 180 ms goes to the backward step, again leaving 90 ms of processing overhead, which is simply too much for real use.

Looking at cProfile data, I'm seeing most of this time spent in add_tqdm_metrics, detect_nan_tensors, and process_output in decreasing order of time.
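
For anyone wanting to reproduce the measurement, here is a hedged sketch of how the cProfile data can be gathered; `trainer` and `model` stand in for whatever Trainer and LightningModule you are timing.

```python
import cProfile
import pstats

# `trainer` and `model` are placeholders for your own Trainer / LightningModule,
# which must be defined in the __main__ namespace for cProfile.run to see them.
cProfile.run('trainer.fit(model)', 'fit.prof')

stats = pstats.Stats('fit.prof')
# Sorting by cumulative time surfaces hot spots such as the
# detect_nan_tensors, add_tqdm_metrics and process_output calls mentioned above.
stats.sort_stats('cumulative').print_stats(20)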

williamFalcon (Contributor) commented:

Ummm, thanks for highlighting the potential speed issues. @Borda @jeremyjordan mind looking into this?

williamFalcon added this to the 0.7.2 milestone on Apr 4, 2020
williamFalcon added the "priority: 0" label on Apr 4, 2020
Borda (Member) commented Apr 7, 2020

Well, all of the mentioned functions contain a loop that converts Tensors to native Python types, i.e. v.item().
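
An illustrative sketch (not Lightning's actual code) of the pattern being pointed at here: converting every logged Tensor metric to a native Python number, where each .item() triggers a device-to-host copy and, on CUDA, a synchronization.

```python
import torch


def tensors_to_scalars(metrics):
    """Roughly the shape of the loops in the functions named above (illustrative only)."""
    out = {}
    for k, v in metrics.items():
        # Each .item() copies the value to the host and, for CUDA tensors,
        # blocks until all queued kernels finish, which adds per-step latency.
        out[k] = v.item() if isinstance(v, torch.Tensor) else v
    return out
```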

Borda (Member) commented Apr 8, 2020

@karlinjf talking about the GAN, are you using the PL example or your own GAN?

Borda (Member) commented Apr 8, 2020

Name                  Call count   Time (ms)   Own Time (ms)
detect_nan_tensors    3752         88229       639
format_dict           2842         182         31
add_tqdm_metrics      3752         93          30
get_tqdm_dict         1876         127         30
training_tqdm_dict    1876         153         16

[Screenshot: profiler output, 2020-04-08]

Borda modified the milestones: 0.7.2, 0.7.3 on Apr 8, 2020
karlinjf (Author) commented:

Both the PL example and my own GAN had the 90 ms of overhead between iterations.

karlinjf (Author) commented Apr 12, 2020

Here is what happens with generative_adversarial_network.py on tip-of-tree 3f1e4b9:

master: 14.3 it/s
Comment out self.detect_nan_tensors(loss): 16.4 it/s

And on my own GAN in Lightning:

master: 6.78 it/s
Comment out self.detect_nan_tensors(loss): 8.08 it/s

So it seems like detect_nan_tensors should either be optimized or dropped, as it's very expensive.

That change closes half of the gap between my setup without Lightning and my setup with Lightning. Will continue digging.
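
To make the cost concrete, here is an illustrative sketch (not the library's exact implementation) of the kind of per-tensor finiteness check a detect_nan_tensors-style helper performs, which is why it can dominate the profile.

```python
import torch


def check_for_nans(loss, model):
    # Illustrative sketch only. Each torch.isfinite(...).all() launches a
    # reduction kernel, and evaluating the result in a Python `if` forces a
    # sync, so a model with many parameter tensors pays this on every step.
    if not torch.isfinite(loss).all():
        raise ValueError("Loss contains NaN or inf")
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            raise ValueError(f"Parameter {name} contains NaN or inf")
```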

awaelchli (Contributor) commented:

Sorry about the NaN detection perf issue; it is now disabled by default.
Were you able to further close the performance gap?

karlinjf (Author) commented May 4, 2020

No worries. There is still a good ways to go. #1576 was another big chunk. I believe the rest comes from .item() calls in add_tqdm_metrics and the like.

awaelchli (Contributor) commented:

I think there is no way around calling .item(). However, we could update the progress bar metrics with the refresh rate, and then it would be possible to set the refresh rate and only get calls to .item() when the bar is actually getting updated. It would require some refactoring though, because the progress bar is now a callback.
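
A hedged sketch of that idea follows; the names and refresh rate are invented for illustration and are not the eventual implementation. The point is to pay for the Tensor-to-float conversions only when the bar actually redraws.

```python
REFRESH_RATE = 20  # assumed refresh rate; in practice it would come from the progress bar callback


def maybe_refresh_progress_bar(batch_idx, metrics, progress_bar):
    """Only perform .item() conversions on refresh steps (illustrative sketch)."""
    if batch_idx % REFRESH_RATE != 0:
        return  # skip the expensive Tensor -> float conversions on non-refresh steps
    scalars = {k: (v.item() if hasattr(v, "item") else v) for k, v in metrics.items()}
    progress_bar.set_postfix(scalars)  # assumes a tqdm-like bar with set_postfix
```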

Borda modified the milestones: 0.7.6, 0.8.0, 0.7.7 on May 12, 2020
Borda modified the milestones: 0.7.7, 0.8.0 on May 26, 2020
Borda modified the milestones: 0.8.0, 0.9.0 on Jun 9, 2020
stale bot commented Aug 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the "won't fix" label on Aug 8, 2020
The stale bot closed this as completed on Aug 17, 2020