Performance loss between 0.6.0 and 0.7.1 #1136
Comments
Could you be more precise about what you mean by performance?
|
Yeah true, sorry.
Training, validation and testing losses are worse with 0.7.1, that's what I meant.
|
@mpariente could you pls give us some numbers? |
just came across another comment about slower performance #525 (comment) |
I also noticed a similar issue with my code after upgrading to 0.7.1 from 0.6, so I tried running the MNIST example and confirmed the performance difference (both of my environments used torch==1.4.0 and torchvision==0.5.0). Orange curve is version 0.6, pink curve is version 0.7.1. |
If it even happens with MNIST, we have to find the bug asap! |
Just to be sure, @mpariente and @gwichern did you make the training deterministic before running this comparison? |
there are test cases to make sure performance doesn't change. please rerun using the exact same seed and only change the versions. |
In my case yes. The pytorch-lightning MNIST example sets both the numpy and torch seeds to 2334. Everything is the same except environments. |
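For context, a minimal sketch of what making the runs deterministic means here, assuming only the Python, NumPy, and torch RNGs are involved; the seed value 2334 comes from the MNIST example mentioned above, and the cuDNN flags only matter on GPU:

```python
import random

import numpy as np
import torch

SEED = 2334  # value used by the pl_examples MNIST template, per the comment above

# Fix every RNG the run touches so that the only remaining variable is the
# pytorch-lightning version installed in the environment.
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# For a strict GPU comparison, cuDNN must also be forced into deterministic mode.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```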
@gwichern can you post a colab here? we can test it there. |
Agreed, on that point about the accuracy getting to the same value, but I just ran pl_examples/basic_examples/cpu_template.py in two different environments (passing --hidden_dim 500 from the command line in both cases). The seed is set to the same value in both versions, so I would have expected things to match better |
No, my bad. But I did run each twice, and the difference between 0.6.0 and 0.7.1 is consistent. |
It looks like the MNIST example doesn't set shuffle=True in the dataloader, that could be the cause of the poor MNIST performance. @mpariente is data being shuffled in your case? |
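To illustrate the shuffling point (a sketch, not the example's actual code): if `shuffle` is left at its default of `False`, every epoch iterates the training set in the same order, which by itself can hurt MNIST convergence.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transforms.ToTensor())

# DataLoader defaults to shuffle=False; pass it explicitly for training.
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
```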
ok, yeah... finding the same. Let's dig into this a bit. |
@PyTorchLightning/core-contributors |
the @pl.dataloader decorator got removed in 0.7. could it be related? (can't test right now). |
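For readers following along, a minimal sketch of the 0.7-style API, where the dataloader hooks are plain methods on the LightningModule and no decorator is involved (the dummy data and hyperparameters are illustrative only):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    """train_dataloader is a plain method returning a DataLoader: no decorator."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        # dummy data keeps the sketch self-contained
        x = torch.randn(256, 32)
        y = torch.randint(0, 2, (256,))
        return DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)
```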
I didn't try to find the cause of this yet, I thought I should report first.
Yes it is. |
might be the refresh rate of the progress bar. maybe that also changed the update freq of the loggers by mistake |
Had considered this possibility as well. But for a given epoch (while training), the results are significantly degraded. |
Using this colab (https://colab.research.google.com/drive/1NUrJ7LZqblKW_OIpiGYVaOLGJ2l_tFxs), comparing screenshots: 0.6.0 at the end of 1 epoch; 0.7.1; removing the decorators has no effect; using the new epoch_end signature; when I decrease the refresh rate it gets closest to the 0.6.0 value (actually, a lower loss than 0.6.0). So, I don't see a huge difference here. Mind playing with it for a bit? |
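For anyone repeating the refresh-rate experiment, a hedged sketch of how those frequencies are typically set on the Trainer; the exact argument names changed between releases around 0.6/0.7, so both shown here are assumptions to check against the installed version:

```python
import pytorch_lightning as pl

# Argument names are assumptions for the 0.7.x-era Trainer; confirm them with
# `help(pl.Trainer.__init__)` in the environment being tested.
trainer = pl.Trainer(
    max_epochs=1,
    progress_bar_refresh_rate=1,  # redraw the progress bar every batch
    row_log_interval=1,           # log metrics every step rather than every N steps
)
```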
We'll try to see if we get the same problem when we are setting seeds. |
@mpariente was thinking about this. maybe it has to do with DistributedSampler if you're using that? since we now inject that automatically, it may be that your effective batch size is different now, and thus if you use the same learning rate, you won't get the best results? It might be that you have to readjust your learning rate (from this graph, I would make it slower). |
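A back-of-the-envelope version of that hypothesis, with illustrative numbers only: once a DistributedSampler is injected, each process draws its own batch, so one optimizer step effectively sees world_size times as many samples, and a learning rate tuned for the single-process run may need readjusting.

```python
# Illustrative numbers only: 4 processes (e.g. 4 GPUs with ddp), each drawing
# its own per-process batch, means one optimizer step sees 4x the samples.
per_process_batch = 32
world_size = 4
effective_batch = per_process_batch * world_size  # 128

# The learning rate tuned for the single-process run no longer transfers;
# one common heuristic is to scale it linearly with the effective batch size,
# but whichever direction, it needs re-tuning.
base_lr = 1e-3
scaled_lr = base_lr * effective_batch / per_process_batch  # 4e-3
print(f"effective batch={effective_batch}, candidate lr={scaled_lr}")
```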
Hmm, we use |
Well, this persists. Training is not over but the differences are already not negligible. The script to reproduce is here, but the training dataset is under license. Info that might be useful: the distributed backend is |
ok awesome. will look into this. @mpariente i'm looking into this today and tomorrow, will push a fix if I find something. It's weird because the tests do test a specific performance goal |
IIRC 0.7.0 was not backward compatible because of
I know you're doing the best you can about this, no worries. For now, both architectures involved LSTMs; did you change anything about BPTT? |
ummm... i don't think we did but that's good to know. Maybe it is RNN related. I want to create the following tests:
|
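One possible shape for the proposed RNN test, sketched with the toy adding problem that comes up later in the thread; the dataset generator, model sizes, and hyperparameters are illustrative, not the project's actual test:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


def make_adding_data(n=1024, seq_len=50, seed=2334):
    """Toy 'adding problem': predict the sum of the two marked values."""
    g = torch.Generator().manual_seed(seed)
    values = torch.rand(n, seq_len, 1, generator=g)
    markers = torch.zeros(n, seq_len, 1)
    idx = torch.multinomial(torch.ones(n, seq_len), 2, generator=g)
    markers.scatter_(1, idx.unsqueeze(-1), 1.0)
    targets = (values * markers).sum(dim=1)
    return TensorDataset(torch.cat([values, markers], dim=-1), targets)


class AddingLSTM(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict from the last time step

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": nn.functional.mse_loss(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return DataLoader(make_adding_data(), batch_size=64, shuffle=True)
```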
Tried on two convolutional architectures and the training and validation curves are a perfect match. |
Any update on that please? |
will do an rnn test. however, we now have a parity test between pure pytorch and lightning with convnets in continuous integration. the test forces a match across trials to 5 decimal points. i'll add an rnn test as well |
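To make the shape of such a parity check concrete, a hedged sketch (not the repository's actual test): train the same model and data with a hand-written loop and with Lightning under identical seeds, then require the recorded losses to agree to five decimal places.

```python
import torch
from torch import nn


def train_vanilla(model, loader, epochs=10, lr=0.1):
    """Hand-written PyTorch reference loop used as the parity baseline."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = nn.functional.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses


def assert_parity(vanilla_losses, lightning_losses, decimals=5):
    """The thread reports the CI test requires agreement to 5 decimal places."""
    for step, (a, b) in enumerate(zip(vanilla_losses, lightning_losses)):
        assert abs(a - b) < 10 ** (-decimals), f"step {step}: {a:.6f} vs {b:.6f}"
```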
Sounds great! |
this runs on every PR to make sure no PR breaks parity. so, speedwise it's not a fair comparison because the pure pytorch version has no logging or any of that, whereas lightning does |
Oh I didn't see the PR, I thought you'd ping this Issue with it. |
yeah, that would be super helpful! maybe the addition task is a good dataset to test? Can do the colab here: |
I took the code from #1351 and ran it for 10 epochs and 3 runs on CPU first (because @mpariente also has no GPU), then I noticed that there is a performance gap between 0.6.0 and 0.7.1 in the third decimal place. |
Thanks for looking into this, could you grant me access to the colab please? Did you also try on GPU? |
Try again, I had sharing turned off. |
Ok, I can reproduce the same results as you, and I checked that the pytorch vanilla_loop also passes, which it does. |
@williamFalcon have you seen this?
And the results @awaelchli mentioned show exactly this, right? The parity test passes but the performance is different; how can we explain that? |
we should also probably include |
Sorry to ping you @williamFalcon, but this is not resolved. |
we have parity tests now with an exact performance match... |
I would not call it an exact performance match, actually: performance matches between the vanilla loop and the lightning loop, so something is happening under the hood. See those lines for example: in 0.6.0, the train dataloader is also changed by lightning, right? I don't think these parity tests are as valuable as they should be. |
the performance comparison has to be against pure pytorch because that’s the bound for speed and accuracy. comparing across lightning versions makes no sense. again, can’t help without a real example that breaks on colab. every other time anyone has brought up a performance difference they’ve ended up finding a bug in their code. happy to fix if something is broken, but we need tangible proof to find a possible problem. |
But I don't think it qualifies as pure pytorch, everything still comes from a lightning module.
I understand, I'll try to build an example that fails next week, thanks again |
it’s literally the same code. it’s like saying 2 = (2) are different haha. it’s written this way for convenience because the pytorch code is exactly the same... |
I've tried with 0.7.5 against 0.6.0 and got the same results on several of our architectures. We'll finally upgrade and get all the new features you integrated 😀 |
🐛 Bug
When I train exactly the same model with pl 0.7.1, I get worse performance compared to pl 0.6.0.
I did a fresh install of Asteroid with both versions and ran exactly the same script on the same hardware.
I get significantly worse performance with pl 0.7.1.
Are there some known issues I should be aware of? In the meantime, I'll have to downgrade to 0.6.0.
Environment
PL 0.6.0
Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Debian GNU/Linux 10 (buster)
GCC version: (Debian 8.3.0-6) 8.3.0
CMake version: version 3.14.0
Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Versions of relevant libraries:
[pip3] numpy==1.18.1
[pip3] pytorch-lightning==0.6.0
[pip3] torch==1.4.0
[pip3] torchvision==0.4.2
[conda] blas 1.0 mkl
[conda] mkl 2019.4 243
[conda] mkl-include 2020.0 166
[conda] mkl-service 2.3.0 py36he904b0f_0
[conda] mkl_fft 1.0.14 py36ha843d7b_0
[conda] mkl_random 1.1.0 py36hd6b4f25_0
[conda] torch 1.3.1 pypi_0 pypi
[conda] torchvision 0.4.2 pypi_0 pypi
Diff between 0.6.0 and 0.7.1 envs
diff env_0.7 env_0.6