
Implementation question #5

Open
dhpollack opened this issue Jun 21, 2019 · 5 comments

@dhpollack

I noticed that in your implementation you've clamped the weight_norm to a min of 0 and a max of 10. I have seen this 10 in other implementations as well, and it comes from the first version of the LAMB paper. However, that number refers to the trust_ratio, not the weight_norm. Have you done any further experiments with this, or did you take the 10 from other implementations of the paper? I implemented LAMB with both the v1 and the latest version of the clipping and didn't notice a difference. I just wanted to know whether you did additional testing or were aware of this issue.
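To make the difference concrete, here's a minimal sketch of the two rules I'm comparing. The function name and the `clamp_weight_norm` flag are made up purely for illustration, not taken from this repo:

```python
import torch

def lamb_scale(weight: torch.Tensor, update: torch.Tensor,
               clamp_weight_norm: bool = True) -> float:
    """Layer-wise scale in the LAMB step: w <- w - lr * scale * update.

    clamp_weight_norm=True clamps ||w|| to [0, 10] before taking the ratio
    (what this repo does); clamp_weight_norm=False clips the ratio itself at
    10, which is what v1 of the paper describes for the trust ratio.
    """
    w_norm = weight.norm().item()
    u_norm = update.norm().item()

    if clamp_weight_norm:
        w_norm = max(0.0, min(w_norm, 10.0))

    # Most implementations fall back to 1.0 when either norm is zero.
    if w_norm == 0.0 or u_norm == 0.0:
        return 1.0

    ratio = w_norm / u_norm
    return ratio if clamp_weight_norm else min(ratio, 10.0)
```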

@8enmann
Contributor

8enmann commented Jun 21, 2019

I haven't tried different values, but yes I took 10 from v1 of the paper and am aware of the discrepancy. I have tested the algorithm on a large scale language model and it seems to scale well. I've also tracked the values of the weight norm of different layers and didn't see a clear reason to use a number other than 10. Let me know if you experiment and find a better value!
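If you want to repeat that check yourself, something like the snippet below is all I mean by tracking the norms; the tiny model here is a throwaway stand-in for whatever you're actually training:

```python
import torch
from torch import nn

# Throwaway model purely for demonstration; swap in your own network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Print the L2 norm of every parameter tensor to see whether a cap at 10
# would even bind for your layers.
for name, param in model.named_parameters():
    print(f"{name}: weight norm = {param.data.norm().item():.4f}")
```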

@Tony-Y

Tony-Y commented Aug 28, 2019

@8enmann
Contributor

8enmann commented Sep 2, 2019

@Tony-Y's link shows a comment from the original author saying that they use the identity function instead of clipping. Thanks Tony!

@hitvoice

hitvoice commented Apr 9, 2020

DeepSpeed trains BERT with LAMB and clips to [0.08, 0.5]: https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/bert-pretraining.md#reproducing-bert-training-results-with-deepspeed

It's quite interesting and confusing that such different values are used in different implementations.
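Treating all of these as different clamps in the same ratio computation, a toy comparison (all numbers made up, and assuming the DeepSpeed bounds are applied to the ratio itself, which the tutorial doesn't spell out) shows how far apart the resulting scale factors can end up:

```python
import torch

torch.manual_seed(0)
w = torch.randn(1024, 1024) * 0.02   # made-up small-init weight matrix
u = torch.randn(1024, 1024) * 1e-3   # made-up Adam-style update

w_norm, u_norm = w.norm().item(), u.norm().item()
raw_ratio = w_norm / u_norm

print("identity (no clipping):       ", raw_ratio)
print("clamp ||w|| to [0, 10]:       ", min(w_norm, 10.0) / u_norm)
print("clip ratio at 10 (paper v1):  ", min(raw_ratio, 10.0))
print("clamp ratio to [0.08, 0.5]:   ", max(0.08, min(raw_ratio, 0.5)))
```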

@8enmann
Contributor

8enmann commented Apr 24, 2020

The author open-sourced theirs!
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
