Trouble tracing why convergence is slower in Lightning #2643
Comments
Hi! Thanks for your contribution, and great first issue!
Convergence problem.
Thank you for the quick reply and suggestion. I changed that and tracked the results, but I do not believe it fixed the problem. It appears to have improved things slightly, though that may just be variance, since I am currently using Colab (differing hardware) and not seeding. If there is anything you'd suggest I run to ensure reproducibility, please let me know. Here are the results: as you can see, train loss vs. step seems to be about the same as before. Train loss vs. duration improves and more closely matches the non-Lightning version, but it could also be that Colab allocated better hardware for that session (or that in the previous training I was running both at once; I'm not sure what the impact of that is).
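(For anyone reproducing this: a minimal seeding setup is sketched below. It assumes pytorch_lightning's `seed_everything` and a Lightning version that supports the `deterministic` Trainer flag; the seed value and `max_epochs` are placeholders.)

```python
# Sketch only: pin seeds so the Lightning and non-Lightning runs are comparable
# across Colab sessions. Hardware differences will still affect wall-clock time,
# but the loss-vs-step curves should become reproducible.
import torch
import pytorch_lightning as pl

pl.seed_everything(7)                          # seeds Python, NumPy, and torch
torch.backends.cudnn.deterministic = True      # trade some speed for determinism
torch.backends.cudnn.benchmark = False

trainer = pl.Trainer(deterministic=True, max_epochs=10)
```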
Ok, found 2 more mistakes.
Thank you Rohit, you are a lifesaver. It was #2: updating the Lightning version by changing steps_per_epoch in the scheduler to the length of the train loader fixed the problem. I really appreciate you taking the time to help.
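(For anyone hitting the same issue, a rough sketch of the fix is below, assuming the tutorial's OneCycleLR schedule; the class name, hyperparameters, and AdamW optimizer are illustrative assumptions, not the notebook's exact code.)

```python
import pytorch_lightning as pl
from torch import optim

class SpeechRecognitionLightning(pl.LightningModule):  # illustrative name only
    def __init__(self, model, train_loader, learning_rate=5e-4, epochs=10):
        super().__init__()
        self.model = model                # placeholder for the tutorial's acoustic model
        self.train_loader = train_loader
        self.lr = learning_rate
        self.epochs = epochs

    def train_dataloader(self):
        return self.train_loader

    def configure_optimizers(self):
        optimizer = optim.AdamW(self.model.parameters(), lr=self.lr)
        scheduler = optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=self.lr,
            # The fix discussed above: steps_per_epoch must be the actual number
            # of batches per epoch, i.e. len(train_loader). A stale or hard-coded
            # value makes the one-cycle schedule anneal at the wrong pace.
            steps_per_epoch=len(self.train_loader),
            epochs=self.epochs,
        )
        # Step the scheduler every batch, matching the per-batch scheduler.step()
        # in the plain-PyTorch training loop.
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
```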
Np. BTW, thanks for the blog link above; I was looking for something similar to get started with speech. It is quite good.
I recently refactored some code from [this tutorial](https://www.assemblyai.com/blog/end-to-end-speech-recognition-pytorch) (which trains speech-to-text on LibriSpeech 100 hr) into Lightning and found it to converge more slowly, never reaching the same level of loss. I made a lot of changes when I refactored into PyTorch Lightning, and I slowly undid them until I was left with the original code and the Lightning version. I even set Comet to be the logger so that both had the same logging, and the results are the same. I can't figure out what's going on. As far as I can tell, the code is identical except for the training loop, and I don't know how to proceed. (Edit: I've now added the training loop code from both notebooks at the bottom of this post for ease of access.)
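(For context, the refactor amounts to moving the body of the tutorial's per-batch train loop into a LightningModule `training_step`. A rough sketch under that assumption is below; the class name and the CTC-loss setup are illustrative, not the notebooks' exact code.)

```python
import torch.nn.functional as F
import pytorch_lightning as pl

class SpeechModule(pl.LightningModule):  # illustrative name only
    def __init__(self, model):
        super().__init__()
        self.model = model  # placeholder for the tutorial's acoustic model

    def training_step(self, batch, batch_idx):
        # One iteration of the tutorial's loop: forward pass, CTC loss, log the loss.
        spectrograms, labels, input_lengths, label_lengths = batch
        output = self.model(spectrograms)                       # (batch, time, n_class)
        output = F.log_softmax(output, dim=2).transpose(0, 1)   # (time, batch, n_class)
        loss = F.ctc_loss(output, labels, input_lengths, label_lengths)
        self.log("train_loss", loss)  # logging API as in recent Lightning versions
        return loss
```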
Here are the notebooks; the final few cells (the train/test/main functions) are the relevant, distinct portion.
Non-lightning Notebook - converges faster
Lightning Notebook - converges slower
Non-Lightning Version:
Lightning Version: