Long time between calls to training_step when there are multiple optimizers #1354

Closed
karlinjf opened this issue Apr 3, 2020 · 12 comments
karlinjf (Contributor) commented Apr 3, 2020

I have a GAN model with two optimizers that runs 30-40% slower in Lightning than without it. I've discovered that the lost time falls between the end of training_step for optimizer_idx 0 and the start of the call for optimizer_idx 1. About 120 ms of CPU time (not wall time) is spent there; 30 ms of that is the backward step, and the other 90 ms is unaccounted for. Note that after optimizer_idx 1 runs, there are only about 20 ms of CPU time before optimizer_idx 0 is called again for the next batch.

So why might there be extra time between the optimizers?

This is happening in both the latest release as well as master.

Thanks!
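
For context, here is a minimal sketch of the kind of two-optimizer GAN LightningModule being described, using the training_step(batch, batch_idx, optimizer_idx) signature from that era of Lightning. The generator/discriminator modules, latent size, and losses below are placeholders for illustration, not the reporter's actual model.

```python
import torch
import pytorch_lightning as pl


class SketchGAN(pl.LightningModule):
    def __init__(self, generator, discriminator):
        super().__init__()
        self.generator = generator          # assumed nn.Module producing fake samples
        self.discriminator = discriminator  # assumed nn.Module scoring samples

    def training_step(self, batch, batch_idx, optimizer_idx):
        real, _ = batch
        z = torch.randn(real.size(0), 100, device=real.device)  # latent size 100 is arbitrary
        if optimizer_idx == 0:
            # Generator step: the reported timing ends when this call returns.
            g_loss = -self.discriminator(self.generator(z)).mean()
            return {'loss': g_loss}
        # Discriminator step: the unexplained ~90 ms gap falls before this call starts.
        fake = self.generator(z).detach()
        d_loss = self.discriminator(fake).mean() - self.discriminator(real).mean()
        return {'loss': d_loss}

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        return [opt_g, opt_d]
```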

karlinjf added the "bug" and "help wanted" labels on Apr 3, 2020
github-actions bot commented Apr 3, 2020

Hi! Thanks for your contribution, great first issue!

Borda added the "question" label and removed the "bug" label on Apr 3, 2020
karlinjf (Author) commented Apr 3, 2020

I think this might actually be more of a bug than a question. I can reproduce it on the sample gan.py as well: each forward step takes 130 ms, and there are 270 ms between forward steps, of which only 180 ms goes to the backward step, again leaving 90 ms of processing overhead, which is simply too much for real use.

Looking at cProfile data, I'm seeing most of this time spent in add_tqdm_metrics, detect_nan_tensors, and process_output in decreasing order of time.
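
For anyone wanting to reproduce the measurement, here is a hedged sketch of how the cProfile data can be gathered; `trainer` and `model` stand in for whatever Trainer and LightningModule you are timing.

```python
import cProfile
import pstats

# `trainer` and `model` are placeholders for your own Trainer / LightningModule,
# which must be defined in the __main__ namespace for cProfile.run to see them.
cProfile.run('trainer.fit(model)', 'fit.prof')

stats = pstats.Stats('fit.prof')
# Sorting by cumulative time surfaces hot spots such as the
# detect_nan_tensors, add_tqdm_metrics and process_output calls mentioned above.
stats.sort_stats('cumulative').print_stats(20)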

williamFalcon (Contributor) commented:

Ummm, thanks for highlighting the potential speed issues. @Borda @jeremyjordan mind looking into this?

williamFalcon added this to the 0.7.2 milestone on Apr 4, 2020
williamFalcon added the "priority: 0" label on Apr 4, 2020
Borda (Member) commented Apr 7, 2020

Well, all of the mentioned functions contain a loop that converts Tensors to native Python types, i.e. v.item().
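
An illustrative sketch (not Lightning's actual code) of the pattern being pointed at here: converting every logged Tensor metric to a native Python number, where each .item() triggers a device-to-host copy and, on CUDA, a synchronization.

```python
import torch


def tensors_to_scalars(metrics):
    """Roughly the shape of the loops in the functions named above (illustrative only)."""
    out = {}
    for k, v in metrics.items():
        # Each .item() copies the value to the host and, for CUDA tensors,
        # blocks until all queued kernels finish, which adds per-step latency.
        out[k] = v.item() if isinstance(v, torch.Tensor) else v
    return out
```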

Borda (Member) commented Apr 8, 2020

@karlinjf talking about the GAN, are you using the PL example or your own GAN?

Borda (Member) commented Apr 8, 2020

Name                  Call count   Time (ms)   Own Time (ms)
detect_nan_tensors    3752         88229       639
format_dict           2842         182         31
add_tqdm_metrics      3752         93          30
get_tqdm_dict         1876         127         30
training_tqdm_dict    1876         153         16

[Screenshot: profiler output, 2020-04-08]

Borda modified the milestones: 0.7.2, 0.7.3 on Apr 8, 2020
karlinjf (Author) commented:

Both the PL example and my own GAN had the 90 ms of overhead between iterations.

karlinjf (Author) commented Apr 12, 2020

Here is what happens with generative_adversarial_network.py on tip-of-tree 3f1e4b9:

master: 14.3 it/s
Comment out self.detect_nan_tensors(loss): 16.4 it/s

And on my own GAN in Lightning:

master: 6.78 it/s
Comment out self.detect_nan_tensors(loss): 8.08 it/s

So it seems like detect_nan_tensors should either be optimized or dropped, as it's very expensive.

That change closes half of the gap between my setup without Lightning and my setup with Lightning. Will continue digging.
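
To make the cost concrete, here is an illustrative sketch (not the library's exact implementation) of the kind of per-tensor finiteness check a detect_nan_tensors-style helper performs, which is why it can dominate the profile.

```python
import torch


def check_for_nans(loss, model):
    # Illustrative sketch only. Each torch.isfinite(...).all() launches a
    # reduction kernel, and evaluating the result in a Python `if` forces a
    # sync, so a model with many parameter tensors pays this on every step.
    if not torch.isfinite(loss).all():
        raise ValueError("Loss contains NaN or inf")
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            raise ValueError(f"Parameter {name} contains NaN or inf")
```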

awaelchli (Contributor) commented:

Sorry about the NaN detection perf issue; it is now disabled by default.
Were you able to further close the performance gap?

karlinjf (Author) commented May 4, 2020

No worries. There is still a good ways to go. #1576 was another big chunk. I believe the rest comes from .item() calls in add_tqdm_metrics and the like.

awaelchli (Contributor) commented:

I think there is no way around calling .item(). However, we could update the progress bar metrics with the refresh rate, and then it would be possible to set the refresh rate and only get calls to .item() when the bar is actually getting updated. It would require some refactoring though, because the progress bar is now a callback.
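
A hedged sketch of that idea follows; the names and refresh rate are invented for illustration and are not the eventual implementation. The point is to pay for the Tensor-to-float conversions only when the bar actually redraws.

```python
REFRESH_RATE = 20  # assumed refresh rate; in practice it would come from the progress bar callback


def maybe_refresh_progress_bar(batch_idx, metrics, progress_bar):
    """Only perform .item() conversions on refresh steps (illustrative sketch)."""
    if batch_idx % REFRESH_RATE != 0:
        return  # skip the expensive Tensor -> float conversions on non-refresh steps
    scalars = {k: (v.item() if hasattr(v, "item") else v) for k, v in metrics.items()}
    progress_bar.set_postfix(scalars)  # assumes a tqdm-like bar with set_postfix
```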

Borda modified the milestones: 0.7.6, 0.8.0, 0.7.7 on May 12, 2020
Borda modified the milestones: 0.7.7, 0.8.0 on May 26, 2020
Borda modified the milestones: 0.8.0, 0.9.0 on Jun 9, 2020
stale bot commented Aug 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the "won't fix" label on Aug 8, 2020
The stale bot closed this as completed on Aug 17, 2020