generated from ApolloResearch/sample
Open
Description
Gradient norm clipping slows training by 20-25% for local runs and by 5-15% for e2e runs. We have not thoroughly tested whether it is worth that cost in these experiments.
We can either:
- Remove it completely. This should be tested by running a small sweep and comparing the results to the runs reported in the paper (i.e. any run in https://wandb.ai/sparsify/gpt2/).
- Only use it at the start of training when grad norms are high.
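
As a rough illustration of the second option, here is a minimal sketch of step-gated clipping in plain Python. The names `clip_by_global_norm`, `maybe_clip`, and `clip_until_step`, as well as the default values, are hypothetical and not taken from this repo. In an actual PyTorch training loop the same gate would presumably wrap a call to `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


def maybe_clip(grads, step, max_norm=1.0, clip_until_step=1000):
    """Clip only during the first `clip_until_step` steps (hypothetical cutoff).

    After the cutoff the gradients are returned untouched, so the cost of
    computing the global norm is skipped for the rest of training.
    """
    if step < clip_until_step:
        clipped, _ = clip_by_global_norm(grads, max_norm)
        return clipped
    return grads
```

A fixed step cutoff is the simplest gate. An alternative would be to log grad norms for a short run first and pick the cutoff from where they settle, but that is a design choice this issue leaves open.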