
adding wsd schedule with (1-sqrt) decay #508

Closed · wants to merge 12 commits

Conversation

@eliebak (Contributor) commented Jun 1, 2024

Adding new learning rate schedule support:

WSD learning rate schedule:

Not sure if it's in the philosophy of the repo, but I wanted to implement it in llm.c and compare it to the classical cosine schedule (experiment below). It's also very convenient for training a model (and requires very little code change), as you can start from a previous checkpoint in the stable phase if you want to extend the training (something you can't do with cosine). So I guess it can save some compute for those experimenting with the codebase! :)
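
For context, here is a minimal sketch of the warmup-stable-decay shape with a (1-sqrt) final decay, in llm.c-style C; the function and parameter names (get_lr_wsd, decay_start, etc.) are illustrative and not the exact code in this PR:

    #include <math.h>

    // Sketch of a WSD learning rate schedule: linear warmup, constant "stable" phase,
    // then a (1 - sqrt) decay from max_lr down to min_lr over the last part of training.
    float get_lr_wsd(int step, int warmup_steps, int decay_start, int total_steps,
                     float max_lr, float min_lr) {
        if (step < warmup_steps) {
            // linear warmup from ~0 to max_lr
            return max_lr * (float)(step + 1) / (float)warmup_steps;
        } else if (step < decay_start) {
            // stable phase: constant learning rate, so training can be extended
            // from any checkpoint taken here without redoing the schedule
            return max_lr;
        } else {
            // (1 - sqrt) decay: factor goes 1 -> 0 as step goes decay_start -> total_steps
            float progress = (float)(step - decay_start) / (float)(total_steps - decay_start);
            float factor = 1.0f - sqrtf(progress);
            return min_lr + (max_lr - min_lr) * factor;
        }
    }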

Training command for the experiment:

    mpirun ./train_gpt2cu \
        -i "dev/data/fineweb10B/fineweb_train_*.bin" \
        -j "dev/data/fineweb10B/fineweb_val_*.bin" \
        -o test_pr \
        -v 250 -s 20000 -g 144 \
        -h 1 \
        -b 64 -t 1024 \
        -d 524288 \
        -r 0 \
        -z 1 \
        -c 0.1 \
        -l 0.0006 \
        -p 1 \
        -k 3865 \
        -q 0.0 \
        -u 700 \
        -n 5000 \
        -y 1 \
        -e "d12"
[Screenshot 2024-05-31 at 21:40:30: experiment plot]

@karpathy (Owner) commented Jun 1, 2024

hmm it doesn't really appear better

@eliebak (Contributor, Author) commented Jun 1, 2024

Yes, it's not really the performance that's the main argument. The nice thing is that you can train your model without fixing the number of training steps beforehand, unlike with cosine, which is quite annoying I think. (It's also a bit less LR-sensitive according to our experiments, but that's just a bonus.)

I'm also training the 124M model for ~30B tokens; it will probably show a more significant difference. (I'm using only one H100, so it takes a bit of time :( )

@eliebak (Contributor, Author) commented Jun 2, 2024

Comparison on a longer run (50k steps). To be honest, I think the main reason for the huge difference here is that we decay to 0 with the cosine schedule (and I think that's also one of the reasons it works better with a 3x higher learning rate: https://x.com/karpathy/status/1797078746350207182).
[Image: loss comparison over the 50k-step run]
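
For reference, a hedged sketch of a cosine schedule that decays to a configurable fraction of the peak LR (if -q in the command above is the final LR fraction, -q 0.0 corresponds to decaying all the way to zero); names like final_lr_frac are illustrative, not necessarily llm.c's exact variables:

    #include <math.h>

    // Cosine decay from max_lr down to max_lr * final_lr_frac after a linear warmup.
    // With final_lr_frac = 0.0 the schedule decays to zero, the case discussed above.
    float get_lr_cosine(int step, int warmup_steps, int total_steps,
                        float max_lr, float final_lr_frac) {
        if (step < warmup_steps) {
            return max_lr * (float)(step + 1) / (float)warmup_steps;
        }
        float progress = (float)(step - warmup_steps) / (float)(total_steps - warmup_steps);
        float coeff = 0.5f * (1.0f + cosf(3.14159265f * progress)); // 1 -> 0 over the decay
        float min_lr = max_lr * final_lr_frac;
        return min_lr + (max_lr - min_lr) * coeff;
    }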

@karpathy (Owner) commented Jun 3, 2024

@eliebak I'm warming up to this idea a bit, planning to read the paper in full. Especially now that we have a nice llmc library directory, I could see us supporting this as an option in a scheduler.h. Agree with the abstract re: the cosine schedule.
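
As a rough illustration of what such an option could look like (purely hypothetical; the type and function names below are not llm.c's actual API), building on the two sketches earlier in this thread:

    // Prototypes for the earlier sketches in this thread (illustrative only).
    float get_lr_wsd(int step, int warmup_steps, int decay_start, int total_steps,
                     float max_lr, float min_lr);
    float get_lr_cosine(int step, int warmup_steps, int total_steps,
                        float max_lr, float final_lr_frac);

    // Hypothetical scheduler.h-style option: pick the schedule type at init time,
    // then query the learning rate per step.
    typedef enum { LR_SCHEDULE_COSINE, LR_SCHEDULE_WSD } LRScheduleType;

    typedef struct {
        LRScheduleType type;
        float max_lr;
        float final_lr_frac;  // fraction of max_lr to decay to (0.0 = decay to zero)
        int warmup_steps;
        int decay_start;      // WSD only: step where the (1 - sqrt) decay phase begins
        int total_steps;
    } LRScheduler;

    float scheduler_get_lr(const LRScheduler *s, int step) {
        if (s->type == LR_SCHEDULE_WSD) {
            return get_lr_wsd(step, s->warmup_steps, s->decay_start, s->total_steps,
                              s->max_lr, s->max_lr * s->final_lr_frac);
        }
        return get_lr_cosine(step, s->warmup_steps, s->total_steps,
                             s->max_lr, s->final_lr_frac);
    }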

@eliebak (Contributor, Author) commented Jun 4, 2024

Yes, seems good to me! Looking at it today.

@eliebak (Contributor, Author) commented Jun 4, 2024

A few things:

  • For now, there is only the learning rate schedule in schedule.h, but I plan to add a batch size schedule when I have time (it seems to have some impact according to the DeepSeek papers and this tweet: https://x.com/andrew_n_carr/status/1788220414067605796).
  • I could also add different decay functions, such as classic linear decay, but in the experiments for the paper I've found that the (1-sqrt) function is consistently better, so I don't really see the point (see the comparison sketch right after this list).
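
To make the decay-function comparison in the last bullet concrete, here is a tiny standalone snippet (not from the PR) that prints the two decay factors over the decay phase, where p is the fraction of the decay phase completed:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        // Compare the classic linear decay factor (1 - p) with the (1 - sqrt(p)) factor.
        // (1 - sqrt(p)) drops faster early in the decay phase and flattens toward the end.
        for (float p = 0.0f; p <= 1.0f; p += 0.25f) {
            printf("p=%.2f  linear=%.3f  (1-sqrt)=%.3f\n", p, 1.0f - p, 1.0f - sqrtf(p));
        }
        return 0;
    }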

Wdyt @karpathy?

The experiment I did with learning_rate=1e-3 to validate is below (on 100B FineWeb tokens):
[Image: validation run with learning_rate=1e-3]

@karpathy (Owner) commented:

adding wsd here and closing this one: #627

@karpathy closed this Jun 21, 2024
2 participants