
Builds up the code step-by-step following the video #14

Draft
wants to merge 15 commits into main

Conversation

llewelld

This is the code developing as the video progresses.

The code was entered while following along with the videos, so there are probably mistakes (but it does at least run). Please say if you notice any.

Done for my benefit, but maybe it's useful to others.

This is the code for the video up to 40:00

https://youtu.be/l8pRSuU81PU?t=2400

The functionality loads in the weights from Hugging Face and outputs
some inferences using them.
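
A minimal sketch of what this step does, using the transformers library directly for illustration (the code in this PR copies the same weights into its own GPT class); the prompt and sampling settings are just examples:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")   # downloads the GPT-2 (124M) weights
model.eval()

prompt = tokenizer.encode("Hello, I'm a language model,", return_tensors="pt")
with torch.no_grad():
    out = model.generate(prompt, max_length=30, do_sample=True, top_k=50,
                         num_return_sequences=5,
                         pad_token_id=tokenizer.eos_token_id)
for seq in out:
    print(">", tokenizer.decode(seq))
```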
Switches to using random initialisation for the output rather than the pre-trained model.

This is the code up to 42:40:

https://youtu.be/l8pRSuU81PU?t=2560
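
For illustration, the same switch expressed with the transformers API (the PR does the equivalent with its own GPT class and config): constructing the model from a bare config gives random weights, whereas from_pretrained loads the trained checkpoint.

```python
from transformers import GPT2Config, GPT2LMHeadModel

pretrained = GPT2LMHeadModel.from_pretrained("gpt2")   # trained checkpoint
scratch = GPT2LMHeadModel(GPT2Config())                # random initialisation
print(scratch.transformer.wte.weight.std().item())     # init-time statistics
print(pretrained.transformer.wte.weight.std().item())  # learned statistics
```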
Adds code to load in the Shakespeare training data as batches.

Takes us to 55:00 in the video:

https://youtu.be/l8pRSuU81PU?t=3301
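
A minimal sketch of the batching step, assuming the tiny-Shakespeare text is saved as input.txt and tokenised with tiktoken's GPT-2 encoding as in the video; the batch dimensions are illustrative:

```python
import torch
import tiktoken

with open("input.txt", "r") as f:
    text = f.read()

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode(text)

B, T = 4, 32                            # batch size and sequence length
buf = torch.tensor(tokens[: B * T + 1])
x = buf[:-1].view(B, T)                 # inputs
y = buf[1:].view(B, T)                  # targets: the same stream shifted by one token
```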
Adds 50 optimisation steps so that we can follow the resulting loss.

Brings us to 1:01:00 in the video:

https://youtu.be/l8pRSuU81PU?t=3661
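
The loop is roughly the sketch below; model, x and y are assumed from the earlier steps, and the forward pass returning (logits, loss) follows the convention used in the video:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(50):
    optimizer.zero_grad()
    logits, loss = model(x, y)          # loss is cross-entropy against the shifted targets
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```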
Adds a dataloader for loading batches.

Takes us to 1:05:40 in the video.

https://youtu.be/l8pRSuU81PU?t=3941
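
A sketch of a lightweight loader along these lines (named DataLoaderLite after the video): it walks through the token stream, serving successive (B, T) batches and wrapping around at the end.

```python
import torch
import tiktoken

class DataLoaderLite:
    def __init__(self, B, T, path="input.txt"):
        self.B, self.T = B, T
        with open(path, "r") as f:
            text = f.read()
        self.tokens = torch.tensor(tiktoken.get_encoding("gpt2").encode(text))
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)                    # inputs
        y = buf[1:].view(B, T)                     # targets
        self.pos += B * T
        if self.pos + B * T + 1 > len(self.tokens):
            self.pos = 0                           # wrap around to the start
        return x, y

train_loader = DataLoaderLite(B=4, T=32)
x, y = train_loader.next_batch()
```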
Ensures the wte and lm_head weights are shared, to align with the original
GPT-2 implementation.

Takes us to 1:12:50 in the video:

https://youtu.be/l8pRSuU81PU?t=4370
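
The sharing itself is a one-line tie inside the model constructor; here is a self-contained sketch of the idea:

```python
import torch.nn as nn

n_embd, vocab_size = 768, 50257
wte = nn.Embedding(vocab_size, n_embd)                # token embedding
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # output projection

# Tie the two: both modules now point at the same parameter tensor, as in GPT-2,
# saving roughly 38M parameters and coupling input and output token representations.
wte.weight = lm_head.weight
assert wte.weight.data_ptr() == lm_head.weight.data_ptr()
```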
Applies the same statistical parameters to the layer initialisation
values as in the original GPT-2.

Takes us to 1:17:17:

https://youtu.be/l8pRSuU81PU?t=4637
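
The rule from the original GPT-2, as reproduced in the video, is roughly the following; it is applied with model.apply(...) in the constructor:

```python
import torch
import torch.nn as nn

def _init_weights(module):
    # GPT-2 draws Linear and Embedding weights from N(0, 0.02) and zeros the biases
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
```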
Adds weight scaling to the c_proj layer.

Takes us to 1:22:14 in the video:

https://youtu.be/l8pRSuU81PU?t=4934
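
This extends the initialisation sketch above: the c_proj layers feed the residual stream, so their standard deviation is scaled by 1/sqrt(2 * n_layer) to stop the residual activations growing with depth (the factor 2 because each block contributes twice, via attention and the MLP). A flag on the layer marks which weights to scale; the flag name here is illustrative.

```python
import torch
import torch.nn as nn

def _init_weights(module, n_layer=12):
    if isinstance(module, nn.Linear):
        std = 0.02
        if getattr(module, "SCALE_INIT", False):     # flag set on c_proj when it is built
            std *= (2 * n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
```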
llewelld marked this pull request as draft October 24, 2024 11:40
Adds timing/token throughput code.

Takes us to 1:40:00 in the video:

https://youtu.be/l8pRSuU81PU?t=5978
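
The measurement wraps each training step, roughly as below; device, B and T are assumed from earlier, and the CUDA synchronise makes sure the GPU has finished before the clock is read:

```python
import time
import torch

t0 = time.time()
# ... one training step: forward, backward, optimizer.step() ...
if device == "cuda":
    torch.cuda.synchronize()            # wait for queued GPU work to finish
t1 = time.time()
dt = (t1 - t0) * 1000                   # milliseconds per step
tokens_per_sec = (B * T) / (t1 - t0)
print(f"dt: {dt:.2f} ms | tok/sec: {tokens_per_sec:.2f}")
```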
Adds optimisations to use autocast and model compilation, but these don't
seem to work with MPS (Mac) acceleration.

Takes us to 1:49:53:

https://youtu.be/l8pRSuU81PU?t=6593
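
The two optimisations look roughly like this on a CUDA device (on MPS neither appeared to help); model, x, y and optimizer are assumed from the earlier steps:

```python
import torch

model = torch.compile(model)                     # graph capture and kernel fusion

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits, loss = model(x, y)                   # forward pass in bfloat16 where safe
loss.backward()                                  # backward runs outside the autocast context
optimizer.step()
```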
Since torch.compile doesn't pick up the code that can be optimised for
flash attention, we have to make a manual change so that flash attention
is used when compiling the model.
Increases the vocab size to something nicer (divisible by many powers of
two), which improves the training speed.
Adds a function to adjust the learning rate during training.

Takes us to 2:24:55 in the video:

https://youtu.be/l8pRSuU81PU?t=8695
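
Sketches of the three changes: the flash-attention fix swaps the manual attention maths for the fused kernel, the vocab size is padded from 50257 up to 50304 (divisible by 128), and the learning-rate function does a linear warmup followed by cosine decay. The constants below are illustrative small-run values as used in the video.

```python
import math
import torch.nn.functional as F

# inside the attention forward pass, replacing the manual softmax(QK^T)V:
# y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# in the model config: vocab_size = 50304 instead of 50257

max_lr, min_lr = 6e-4, 6e-5
warmup_steps, max_steps = 10, 50

def get_lr(step):
    if step < warmup_steps:                              # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:                                 # after the decay, hold the floor
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))      # goes from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)
```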
Adds a weight decay optimiser and the ability to split batches into
minibatches, so as to support batch sizes larger than the GPU can handle
in one go.

Takes us to 2:46:53 in the video:

https://youtu.be/l8pRSuU81PU?t=10013
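
Roughly, the two additions look like this; model, B, T and the loader are assumed from the earlier steps, and total_batch_size is the desired effective batch in tokens:

```python
import torch

# Weight decay only on the 2-D parameters (matmul weights and embeddings);
# biases and LayerNorm gains are left undecayed.
decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=6e-4, betas=(0.9, 0.95), eps=1e-8)

# Gradient accumulation: split the large effective batch into micro-batches
# that fit on the GPU, accumulating gradients before a single optimiser step.
grad_accum_steps = total_batch_size // (B * T)
optimizer.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()
    logits, loss = model(x, y)
    loss = loss / grad_accum_steps          # average the loss across micro-batches
    loss.backward()                         # gradients accumulate across the loop
optimizer.step()
```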
A couple of small errors in the requirements file and the training code
prevented this from working on CUDA devices. This change fixes them.