
adding wsd schedule with (1-sqrt) decay #508

Closed · wants to merge 12 commits

Conversation

@eliebak (Contributor) commented Jun 1, 2024

Adding new learning rate schedule support:

WSD learning rate schedule:

Not sure if it's in the philosophy of the repo, but I wanted to implement it in llm.c and compare it to the classical cosine schedule (experiment below). It's also very convenient for training a model (and requires very little code change), as you can start from a previous checkpoint in the stable phase if you want to extend the training (something you can't do with cosine). So I guess it can save some compute for those experimenting with the codebase! :)
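
For context, here is a minimal sketch of the warmup-stable-decay shape with a (1-sqrt) final decay, in llm.c-style C; the function and parameter names (get_lr_wsd, decay_start, etc.) are illustrative and not the exact code in this PR:

    #include <math.h>

    // Sketch of a WSD learning rate schedule: linear warmup, constant "stable" phase,
    // then a (1 - sqrt) decay from max_lr down to min_lr over the last part of training.
    float get_lr_wsd(int step, int warmup_steps, int decay_start, int total_steps,
                     float max_lr, float min_lr) {
        if (step < warmup_steps) {
            // linear warmup from ~0 to max_lr
            return max_lr * (float)(step + 1) / (float)warmup_steps;
        } else if (step < decay_start) {
            // stable phase: constant learning rate, so training can be extended
            // from any checkpoint taken here without redoing the schedule
            return max_lr;
        } else {
            // (1 - sqrt) decay: factor goes 1 -> 0 as step goes decay_start -> total_steps
            float progress = (float)(step - decay_start) / (float)(total_steps - decay_start);
            float factor = 1.0f - sqrtf(progress);
            return min_lr + (max_lr - min_lr) * factor;
        }
    }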

Training command for the experiment:

    mpirun ./train_gpt2cu \
        -i "dev/data/fineweb10B/fineweb_train_*.bin" \
        -j "dev/data/fineweb10B/fineweb_val_*.bin" \
        -o test_pr \
        -v 250 -s 20000 -g 144 \
        -h 1 \
        -b 64 -t 1024 \
        -d 524288 \
        -r 0 \
        -z 1 \
        -c 0.1 \
        -l 0.0006 \
        -p 1 \
        -k 3865 \
        -q 0.0 \
        -u 700 \
        -n 5000 \
        -y 1 \
        -e "d12"
[Screenshot 2024-05-31 at 21:40:30: experiment plot]

@karpathy (Owner) commented Jun 1, 2024

hmm it doesn't really appear better

@eliebak (Contributor, Author) commented Jun 1, 2024

Yes, it's not really the performance that's the main argument. The nice thing is that you can train your model without fixing the number of training steps beforehand, unlike with cosine, which is quite annoying I think. (It's also a bit less LR-sensitive according to our experiments, but that's just a bonus.)

I'm also training the 124M model for ~30B tokens; it will probably show a more significant difference. (I'm using only one H100, so it takes a bit of time :( )

@eliebak (Contributor, Author) commented Jun 2, 2024

Comparison on a longer run (50k steps). To be honest, I think the main reason for the huge difference here is that we decay to 0 with the cosine schedule (and I think that's also one of the reasons it works better with a 3x higher learning rate: https://x.com/karpathy/status/1797078746350207182).
[Image: loss comparison over the 50k-step run]
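
For reference, a hedged sketch of a cosine schedule that decays to a configurable fraction of the peak LR (if -q in the command above is the final LR fraction, -q 0.0 corresponds to decaying all the way to zero); names like final_lr_frac are illustrative, not necessarily llm.c's exact variables:

    #include <math.h>

    // Cosine decay from max_lr down to max_lr * final_lr_frac after a linear warmup.
    // With final_lr_frac = 0.0 the schedule decays to zero, the case discussed above.
    float get_lr_cosine(int step, int warmup_steps, int total_steps,
                        float max_lr, float final_lr_frac) {
        if (step < warmup_steps) {
            return max_lr * (float)(step + 1) / (float)warmup_steps;
        }
        float progress = (float)(step - warmup_steps) / (float)(total_steps - warmup_steps);
        float coeff = 0.5f * (1.0f + cosf(3.14159265f * progress)); // 1 -> 0 over the decay
        float min_lr = max_lr * final_lr_frac;
        return min_lr + (max_lr - min_lr) * coeff;
    }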

@karpathy (Owner) commented Jun 3, 2024

@eliebak I'm warming up to this idea a bit, planning to read the paper in full. Especially now that we have a nice llmc library directory, I could see us supporting this as an option in a scheduler.h. Agree with the abstract re: the cosine schedule.
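
As a rough illustration of what such an option could look like (purely hypothetical; the type and function names below are not llm.c's actual API), building on the two sketches earlier in this thread:

    // Prototypes for the earlier sketches in this thread (illustrative only).
    float get_lr_wsd(int step, int warmup_steps, int decay_start, int total_steps,
                     float max_lr, float min_lr);
    float get_lr_cosine(int step, int warmup_steps, int total_steps,
                        float max_lr, float final_lr_frac);

    // Hypothetical scheduler.h-style option: pick the schedule type at init time,
    // then query the learning rate per step.
    typedef enum { LR_SCHEDULE_COSINE, LR_SCHEDULE_WSD } LRScheduleType;

    typedef struct {
        LRScheduleType type;
        float max_lr;
        float final_lr_frac;  // fraction of max_lr to decay to (0.0 = decay to zero)
        int warmup_steps;
        int decay_start;      // WSD only: step where the (1 - sqrt) decay phase begins
        int total_steps;
    } LRScheduler;

    float scheduler_get_lr(const LRScheduler *s, int step) {
        if (s->type == LR_SCHEDULE_WSD) {
            return get_lr_wsd(step, s->warmup_steps, s->decay_start, s->total_steps,
                              s->max_lr, s->max_lr * s->final_lr_frac);
        }
        return get_lr_cosine(step, s->warmup_steps, s->total_steps,
                             s->max_lr, s->final_lr_frac);
    }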

@eliebak (Contributor, Author) commented Jun 4, 2024

Yes, seems good to me! Looking at it today.

@eliebak (Contributor, Author) commented Jun 4, 2024

A few things:

  • For now, there is only the learning rate schedule in schedule.h, but I plan to add a batch size schedule when I have time (it seems to have some impact according to the DeepSeek papers and this tweet: https://x.com/andrew_n_carr/status/1788220414067605796).
  • I could also add different decay functions, such as classic linear decay, but in the experiments for the paper I've found that the (1-sqrt) function is consistently better, so I don't really see the point (see the comparison sketch right after this list).
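
To make the decay-function comparison in the last bullet concrete, here is a tiny standalone snippet (not from the PR) that prints the two decay factors over the decay phase, where p is the fraction of the decay phase completed:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        // Compare the classic linear decay factor (1 - p) with the (1 - sqrt(p)) factor.
        // (1 - sqrt(p)) drops faster early in the decay phase and flattens toward the end.
        for (float p = 0.0f; p <= 1.0f; p += 0.25f) {
            printf("p=%.2f  linear=%.3f  (1-sqrt)=%.3f\n", p, 1.0f - p, 1.0f - sqrtf(p));
        }
        return 0;
    }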

Wdyt @karpathy?

The experiment I did with learning_rate=1e-3 to validate is below (on 100B FineWeb tokens):
[Image: validation run with learning_rate=1e-3]

@karpathy (Owner) commented:

adding wsd here and closing this one: #627

@karpathy closed this Jun 21, 2024
2 participants