[Appendix D] apply gradient clipping after warmup steps to avoid exploding gradient. #445
Unanswered · Nevermetyou65 asked this question in Q&A · Replies: 1 comment
Nevermetyou65 asked:

Hi,

Appendix D's modified `train_model` function includes code for applying gradient clipping, and the accompanying note says to apply it after the warm-up steps to avoid exploding gradients. I can understand why we want to apply this technique after the warm-up period, but I don't understand why doing so avoids an exploding gradient. Could someone explain the reason to me?
Reply:

Good question. It's basically to keep the gradients from becoming too large, because gradients that are too large cause weight updates that are too large, which can overshoot the (local) loss minimum you are aiming for. During warm-up the learning rate is smaller, so the updates are smaller anyway; with the larger learning rate after warm-up, large gradients become more of a problem, which is why clipping is applied then.
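
To make the mechanics concrete, here is a minimal sketch of this pattern, not the book's exact `train_model` code: the names `peak_lr` and `warmup_steps`, the linear warm-up schedule, and the `(batch, seq_len, vocab)` loss interface are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, num_epochs=3, peak_lr=5e-4, warmup_steps=20):
    # Hypothetical training loop: linear warm-up, then gradient clipping.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    step = 0
    for epoch in range(num_epochs):
        for inputs, targets in train_loader:
            step += 1

            # Linear warm-up: ramp the learning rate from near 0 to peak_lr.
            # While the learning rate is tiny, even a large gradient only
            # produces a small weight update.
            lr = peak_lr * min(step / warmup_steps, 1.0)
            for param_group in optimizer.param_groups:
                param_group["lr"] = lr

            optimizer.zero_grad()
            logits = model(inputs)  # assumed shape: (batch, seq_len, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
            loss.backward()

            # Clip only once warm-up is over: at the peak learning rate, a
            # large gradient would translate into a large weight update and
            # overshoot the minimum, so cap the global gradient norm at 1.0.
            if step >= warmup_steps:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()
```

With this ordering, the warm-up phase itself acts as a soft safeguard (the small learning rate keeps updates small), and clipping takes over as the guard once the learning rate reaches its peak.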