[Appendix D] apply gradient clipping after warmup steps to avoid exploding gradient. #445
Unanswered · Nevermetyou65 asked this question in Q&A · Replies: 1 comment
Nevermetyou65 asked:

Hi,

Appendix D's modified `train_model` function includes code for applying gradient clipping, and the accompanying note says to apply it after the warm-up steps to avoid exploding gradients. I can understand why we want to apply this technique after the warm-up period, but I don't understand why doing so avoids an exploding gradient. Could someone explain the reason to me?
Reply:

Good question. It's basically to keep the gradients from becoming too large, because gradients that are too large cause weight updates that are too large, which can overshoot the (local) loss minimum you are aiming for. During warm-up the learning rate is smaller, so the updates are smaller anyway; with the larger learning rate after warm-up, large gradients become more of a problem, which is why clipping is applied then.
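
To make the mechanics concrete, here is a minimal sketch of this pattern, not the book's exact `train_model` code: the names `peak_lr` and `warmup_steps`, the linear warm-up schedule, and the `(batch, seq_len, vocab)` loss interface are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, num_epochs=3, peak_lr=5e-4, warmup_steps=20):
    # Hypothetical training loop: linear warm-up, then gradient clipping.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    step = 0
    for epoch in range(num_epochs):
        for inputs, targets in train_loader:
            step += 1

            # Linear warm-up: ramp the learning rate from near 0 to peak_lr.
            # While the learning rate is tiny, even a large gradient only
            # produces a small weight update.
            lr = peak_lr * min(step / warmup_steps, 1.0)
            for param_group in optimizer.param_groups:
                param_group["lr"] = lr

            optimizer.zero_grad()
            logits = model(inputs)  # assumed shape: (batch, seq_len, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
            loss.backward()

            # Clip only once warm-up is over: at the peak learning rate, a
            # large gradient would translate into a large weight update and
            # overshoot the minimum, so cap the global gradient norm at 1.0.
            if step >= warmup_steps:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()
```

With this ordering, the warm-up phase itself acts as a soft safeguard (the small learning rate keeps updates small), and clipping takes over as the guard once the learning rate reaches its peak.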