
Builds up the code step-by-step following the video #14

Draft
wants to merge 15 commits into main

Conversation

llewelld

This is the code developing as the video progresses.

The code was entered while following along with the videos, so there are probably mistakes (but it does at least run). Please say if you notice any.

Done for my benefit, but maybe it's useful to others.

This is the code for the video up to 40:00

https://youtu.be/l8pRSuU81PU?t=2400

The functionality loads in the weights from Hugging Face and outputs
some inferences using them.
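
A minimal sketch of what this step does, using the transformers library directly for illustration (the code in this PR copies the same weights into its own GPT class); the prompt and sampling settings are just examples:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")   # downloads the GPT-2 (124M) weights
model.eval()

prompt = tokenizer.encode("Hello, I'm a language model,", return_tensors="pt")
with torch.no_grad():
    out = model.generate(prompt, max_length=30, do_sample=True, top_k=50,
                         num_return_sequences=5,
                         pad_token_id=tokenizer.eos_token_id)
for seq in out:
    print(">", tokenizer.decode(seq))
```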
Switches to using random initialisation for the output rather than the pre-trained model.

This is the code up to 42:40:

https://youtu.be/l8pRSuU81PU?t=2560
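
For illustration, the same switch expressed with the transformers API (the PR does the equivalent with its own GPT class and config): constructing the model from a bare config gives random weights, whereas from_pretrained loads the trained checkpoint.

```python
from transformers import GPT2Config, GPT2LMHeadModel

pretrained = GPT2LMHeadModel.from_pretrained("gpt2")   # trained checkpoint
scratch = GPT2LMHeadModel(GPT2Config())                # random initialisation
print(scratch.transformer.wte.weight.std().item())     # init-time statistics
print(pretrained.transformer.wte.weight.std().item())  # learned statistics
```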
Adds code to load in the Shakespeare training data as batches.

Takes us to 55:00 in the video:

https://youtu.be/l8pRSuU81PU?t=3301
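
A minimal sketch of the batching step, assuming the tiny-Shakespeare text is saved as input.txt and tokenised with tiktoken's GPT-2 encoding as in the video; the batch dimensions are illustrative:

```python
import torch
import tiktoken

with open("input.txt", "r") as f:
    text = f.read()

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode(text)

B, T = 4, 32                            # batch size and sequence length
buf = torch.tensor(tokens[: B * T + 1])
x = buf[:-1].view(B, T)                 # inputs
y = buf[1:].view(B, T)                  # targets: the same stream shifted by one token
```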
Adds 50 optimisation steps so that we can follow the resulting loss.

Brings us to 1:01:00 in the video:

https://youtu.be/l8pRSuU81PU?t=3661
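
The loop is roughly the sketch below; model, x and y are assumed from the earlier steps, and the forward pass returning (logits, loss) follows the convention used in the video:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(50):
    optimizer.zero_grad()
    logits, loss = model(x, y)          # loss is cross-entropy against the shifted targets
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```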
Adds a dataloader for loading batches.

Takes us to 1:05:40 in the video.

https://youtu.be/l8pRSuU81PU?t=3941
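
A sketch of a lightweight loader along these lines (named DataLoaderLite after the video): it walks through the token stream, serving successive (B, T) batches and wrapping around at the end.

```python
import torch
import tiktoken

class DataLoaderLite:
    def __init__(self, B, T, path="input.txt"):
        self.B, self.T = B, T
        with open(path, "r") as f:
            text = f.read()
        self.tokens = torch.tensor(tiktoken.get_encoding("gpt2").encode(text))
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)                    # inputs
        y = buf[1:].view(B, T)                     # targets
        self.pos += B * T
        if self.pos + B * T + 1 > len(self.tokens):
            self.pos = 0                           # wrap around to the start
        return x, y

train_loader = DataLoaderLite(B=4, T=32)
x, y = train_loader.next_batch()
```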
Ensures the wte and lm_head weights are shared, to align with the original
GPT-2 implementation.

Takes us to 1:12:50 in the video:

https://youtu.be/l8pRSuU81PU?t=4370
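
The sharing itself is a one-line tie inside the model constructor; here is a self-contained sketch of the idea:

```python
import torch.nn as nn

n_embd, vocab_size = 768, 50257
wte = nn.Embedding(vocab_size, n_embd)                # token embedding
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # output projection

# Tie the two: both modules now point at the same parameter tensor, as in GPT-2,
# saving roughly 38M parameters and coupling input and output token representations.
wte.weight = lm_head.weight
assert wte.weight.data_ptr() == lm_head.weight.data_ptr()
```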
Applies the same statistical parameters to the layer initialisation
values as in the original GPT-2.

Takes us to 1:17:17:

https://youtu.be/l8pRSuU81PU?t=4637
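
The rule from the original GPT-2, as reproduced in the video, is roughly the following; it is applied with model.apply(...) in the constructor:

```python
import torch
import torch.nn as nn

def _init_weights(module):
    # GPT-2 draws Linear and Embedding weights from N(0, 0.02) and zeros the biases
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
```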
Adds weight scaling to the c_proj layer.

Takes us to 1:22:14 in the video:

https://youtu.be/l8pRSuU81PU?t=4934
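
This extends the initialisation sketch above: the c_proj layers feed the residual stream, so their standard deviation is scaled by 1/sqrt(2 * n_layer) to stop the residual activations growing with depth (the factor 2 because each block contributes twice, via attention and the MLP). A flag on the layer marks which weights to scale; the flag name here is illustrative.

```python
import torch
import torch.nn as nn

def _init_weights(module, n_layer=12):
    if isinstance(module, nn.Linear):
        std = 0.02
        if getattr(module, "SCALE_INIT", False):     # flag set on c_proj when it is built
            std *= (2 * n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
```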
llewelld marked this pull request as draft October 24, 2024 11:40
Adds timing/token throughput code.

Takes us to 1:40:00 in the video:

https://youtu.be/l8pRSuU81PU?t=5978
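
The measurement wraps each training step, roughly as below; device, B and T are assumed from earlier, and the CUDA synchronise makes sure the GPU has finished before the clock is read:

```python
import time
import torch

t0 = time.time()
# ... one training step: forward, backward, optimizer.step() ...
if device == "cuda":
    torch.cuda.synchronize()            # wait for queued GPU work to finish
t1 = time.time()
dt = (t1 - t0) * 1000                   # milliseconds per step
tokens_per_sec = (B * T) / (t1 - t0)
print(f"dt: {dt:.2f} ms | tok/sec: {tokens_per_sec:.2f}")
```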
Adds optimisations to use autocast and model compilation, but these don't
seem to work with MPS (Mac) acceleration.

Takes us to 1:49:53:

https://youtu.be/l8pRSuU81PU?t=6593
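
The two optimisations look roughly like this on a CUDA device (on MPS neither appeared to help); model, x, y and optimizer are assumed from the earlier steps:

```python
import torch

model = torch.compile(model)                     # graph capture and kernel fusion

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits, loss = model(x, y)                   # forward pass in bfloat16 where safe
loss.backward()                                  # backward runs outside the autocast context
optimizer.step()
```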
Since torch.compile doesn't pick up the code that can be optimised for
flash attention, we have to make a manual change so that flash attention
is used when compiling the model.
Increases the vocab size to something nicer (divisible by many powers of
two), which improves the training speed.
Adds a function to adjust the learning rate during training.

Takes us to 2:24:55 in the video:

https://youtu.be/l8pRSuU81PU?t=8695
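
Sketches of the three changes: the flash-attention fix swaps the manual attention maths for the fused kernel, the vocab size is padded from 50257 up to 50304 (divisible by 128), and the learning-rate function does a linear warmup followed by cosine decay. The constants below are illustrative small-run values as used in the video.

```python
import math
import torch.nn.functional as F

# inside the attention forward pass, replacing the manual softmax(QK^T)V:
# y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# in the model config: vocab_size = 50304 instead of 50257

max_lr, min_lr = 6e-4, 6e-5
warmup_steps, max_steps = 10, 50

def get_lr(step):
    if step < warmup_steps:                              # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:                                 # after the decay, hold the floor
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))      # goes from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)
```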
Adds a weight decay optimiser and the ability to split batches into
minibatches, so as to support batch sizes larger than the GPU can handle
in one go.

Takes us to 2:46:53 in the video:

https://youtu.be/l8pRSuU81PU?t=10013
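
Roughly, the two additions look like this; model, B, T and the loader are assumed from the earlier steps, and total_batch_size is the desired effective batch in tokens:

```python
import torch

# Weight decay only on the 2-D parameters (matmul weights and embeddings);
# biases and LayerNorm gains are left undecayed.
decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=6e-4, betas=(0.9, 0.95), eps=1e-8)

# Gradient accumulation: split the large effective batch into micro-batches
# that fit on the GPU, accumulating gradients before a single optimiser step.
grad_accum_steps = total_batch_size // (B * T)
optimizer.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()
    logits, loss = model(x, y)
    loss = loss / grad_accum_steps          # average the loss across micro-batches
    loss.backward()                         # gradients accumulate across the loop
optimizer.step()
```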
A couple of small errors in the requirements file and the training code
prevented this from working on CUDA devices. This change fixes them.