Long time between calls to training_step when there are multiple optimizers #1354
Comments
Hi! Thanks for your contribution, great first issue!
I think this might actually be more of a bug than a question. I can reproduce this on the sample gan.py as well. Each forward step takes 130ms, and there is 270ms between forward steps, of which only 180ms is used for the backward step. That again leaves 90ms of processing overhead, which is simply too much for real use. Looking at cProfile data, I'm seeing most of this time spent in
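For reference, one minimal way to collect the kind of cProfile data mentioned above is to wrap a short training run directly; this is a sketch only, and `trainer` / `model` are assumed to be an existing `pl.Trainer` and `LightningModule` rather than anything from this issue:

```python
import cProfile
import pstats

# Profile a short training run to see where the time between
# training_step calls goes (backward, logging, progress bar, ...).
profiler = cProfile.Profile()
profiler.enable()
trainer.fit(model)   # assumed: an already-constructed Trainer and LightningModule
profiler.disable()

# Show the 30 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)
```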
ummm thanks for highlighting the potential speed issues. @Borda @jeremyjordan mind looking into this?
well, all of the mentioned functions contain loops that convert from Tensor to native types, aka `.item()`
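For context on why these conversions matter: each `.item()` call copies a scalar back into Python and, when the tensor lives on the GPU, forces a device synchronization, so doing it for every logged metric on every step adds up. A rough, machine-dependent illustration (not code from Lightning itself):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
metrics = {f"metric_{i}": torch.rand(1, device=device) for i in range(100)}

# .item() pays a Python round-trip (and a GPU sync) per metric per step.
start = time.perf_counter()
as_floats = {k: v.item() for k, v in metrics.items()}
print(f".item() on 100 tensor metrics: {(time.perf_counter() - start) * 1e3:.2f} ms")
```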
@karlinjf talking about the GAN, are you using the PL example or your own GAN?
Both the PL example and my own GAN had the 90ms of overhead between iterations.
Here is what happens with generative_adversarial_network.py on tip-of-tree 3f1e4b9:

master: 14.3 it/s

And on my own GAN on lightning: that change fixes half of the gap between my own setup and my setup with lightning. Will continue digging.
Sorry about the NaN detection perf issue, but it is now disabled by default.
No worries. There is still a good way to go. #1576 was another big chunk. I believe the rest comes from `.item()` calls in `add_tqdm_metrics` and the like.
I think there is no way around calling `.item()`. However, we could update the progress bar metrics at the refresh rate; then it would be possible to set the refresh rate so that `.item()` is only called when the bar is actually updated. It would require some refactoring, though, because the progress bar is now a callback.
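A rough sketch of that idea (the class and method names here are hypothetical, not the actual Lightning internals): keep logged values as tensors and only convert them with `.item()` when the bar actually refreshes.

```python
import torch


class DeferredBarMetrics:
    """Hypothetical holder that postpones Tensor -> float conversion
    until the progress bar is actually redrawn."""

    def __init__(self, refresh_rate: int = 20):
        self.refresh_rate = refresh_rate
        self._raw = {}  # metric name -> latest value, kept as a Tensor

    def update(self, metrics: dict):
        # Called every step: only store references, no conversion yet.
        self._raw.update(metrics)

    def render(self, step: int):
        # Pay for .item() only when the bar would actually refresh.
        if step % self.refresh_rate != 0:
            return None
        return {k: (v.item() if torch.is_tensor(v) else v)
                for k, v in self._raw.items()}


bar = DeferredBarMetrics(refresh_rate=20)
for step in range(100):
    bar.update({"loss": torch.rand(1)})
    shown = bar.render(step)
    if shown is not None:
        print(step, shown)
```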
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I have a GAN model with two optimizers that runs 30-40% slower in Lightning than without it. I've discovered that the lost time falls between the end of training_step for optimizer_idx 0 and the start of the call for optimizer_idx 1. There is 120ms of time (CPU, not wall) spent there; 30ms of that is the backward step, and the other 90ms is unaccounted for. Note that after optimizer_idx 1 runs, there is only 20ms of CPU time before optimizer_idx 0 is called again for the next batch.
So why might there be extra time between the optimizers?
This is happening in both the latest release as well as master.
Thanks!
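For reference, the setup being described follows the standard Lightning multi-optimizer pattern. The skeleton below is a minimal sketch with placeholder networks and losses, not the reporter's actual model:

```python
import torch
import pytorch_lightning as pl


class GAN(pl.LightningModule):
    """Two-optimizer skeleton; generator/discriminator are placeholders."""

    def __init__(self):
        super().__init__()
        self.generator = torch.nn.Linear(100, 784)
        self.discriminator = torch.nn.Linear(784, 1)

    def training_step(self, batch, batch_idx, optimizer_idx):
        # optimizer_idx 0: generator step. The reported 90ms gap falls
        # between the end of this call and the start of optimizer_idx 1.
        if optimizer_idx == 0:
            z = torch.randn(batch.size(0), 100, device=batch.device)
            fake = self.generator(z)
            return -self.discriminator(fake).mean()

        # optimizer_idx 1: discriminator step.
        real_score = self.discriminator(batch).mean()
        with torch.no_grad():
            fake = self.generator(torch.randn(batch.size(0), 100, device=batch.device))
        fake_score = self.discriminator(fake).mean()
        return fake_score - real_score

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        return [opt_g, opt_d], []
```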