adding wsd schedule with (1-sqrt) decay #508
Conversation
hmm it doesn't really appear better
Yes, it's not really the performance that's the main argument. The nice thing is that you can train your model without fixing the number of training steps beforehand, unlike with cosine, which is quite annoying, I think. (It's also a bit less LR-sensitive according to our experiments, but that's just a bonus.) I'm also training the 124M model for ~30B tokens; it will probably show a more significant difference. (I'm using only one H100, so it takes a bit of time :( )
Comparison on a longer run (50k steps). To be honest, I think the main reason for the huge difference here is that we decay to 0 with the cosine schedule (and I think that's also one of the reasons why it works better with a 3x higher learning rate: https://x.com/karpathy/status/1797078746350207182).
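For context, here is a rough sketch of the kind of cosine-with-warmup schedule being compared against; the function and parameter names are illustrative assumptions, not the exact llm.c code. With a final learning rate fraction of 0.0 (as in the -q 0.0 flag in the command further down), the learning rate decays all the way to 0 on the last step.

#include <math.h>

// Sketch of a cosine schedule with linear warmup; final_lr_frac = 0.0
// means the learning rate reaches 0 by the final step.
float get_cosine_lr(int step, int total_steps, int warmup_steps,
                    float max_lr, float final_lr_frac) {
    float min_lr = final_lr_frac * max_lr;
    if (step < warmup_steps) {
        // linear warmup from 0 to max_lr
        return max_lr * ((float)(step + 1) / (float)warmup_steps);
    }
    float decay_ratio = (float)(step - warmup_steps) / (float)(total_steps - warmup_steps);
    float coeff = 0.5f * (1.0f + cosf(3.14159265f * decay_ratio));  // goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr);
}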
@eliebak I'm warming up to this idea a bit, planning to read the paper in full. Especially now that we have a nice
Yes seems good to me! Looking at it today. |
Update the constant lr schedule branch with the master branch
A few things:
Wdyt @karpathy? The experiment I did with
adding wsd here and closing this one
Adding new learning rate schedule support:
WSD learning rate schedule:
Not sure if it's in the philosophy of the repo, but I wanted to implement it in llm.c and compare it to the classical cosine schedule (experiment below). It's also very convenient for training a model (and requires very few code changes), as you can start from a previous checkpoint taken in the stable phase if you want to extend the training (something you can't do with cosine). So, I guess it can save some compute for those experimenting with the codebase! :)
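For reference, a minimal sketch of what a WSD (warmup-stable-decay) schedule with (1-sqrt) decay looks like; the function name, the 20% decay fraction, and the parameter names are illustrative assumptions rather than the exact code added in this PR.

#include <math.h>

// Sketch of a warmup-stable-decay (WSD) learning rate schedule with (1-sqrt) decay.
float get_wsd_lr(int step, int total_steps, int warmup_steps,
                 float max_lr, float final_lr_frac, float decay_frac) {
    int decay_steps = (int)(decay_frac * total_steps);  // e.g. decay over the last 20% of steps
    int decay_start = total_steps - decay_steps;
    float min_lr = final_lr_frac * max_lr;
    if (step < warmup_steps) {
        // linear warmup from 0 to max_lr
        return max_lr * ((float)(step + 1) / (float)warmup_steps);
    } else if (step < decay_start) {
        // stable phase: constant learning rate
        return max_lr;
    } else {
        // decay phase: (1 - sqrt) anneal from max_lr down to min_lr
        float decay_ratio = (float)(step - decay_start) / (float)decay_steps;
        float coeff = 1.0f - sqrtf(decay_ratio);
        return min_lr + coeff * (max_lr - min_lr);
    }
}

The stable phase is what makes extending a run cheap: only the short decay tail depends on the final step count, so you can resume from any stable-phase checkpoint and just redo the decay.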
Training command for the experiment:
mpirun ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o test_pr \
    -v 250 -s 20000 -g 144 \
    -h 1 \
    -b 64 -t 1024 \
    -d 524288 \
    -r 0 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -p 1 \
    -k 3865 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -y 1 \
    -e "d12"