[LLM] llm.c training for GPT 2 #3611

Michaelvll · 2024-05-29T02:36:40Z

This PR adds a reproducible task yaml for karpathy/llm.c#481

~~Blocked by skypilot-org/skypilot-catalog#71 and #3610, as the g++ version and C++ libraries are more recent on the Ubuntu distribution on GCP.~~

We now use a Docker image to solve the g++ version issue.

Future TODO:

- Make bucket more general
- Add to AI gallery
- Fix the disconnection issue with the Ubuntu image (changed to Docker instead)

Tested (run the relevant ones):

- Code formatting: `bash format.sh`
- Any manual or new tests for this PR (please specify below)
  - Only keep GCP: `sky launch -c gpt2 llm/gpt-2/gpt2-train.yaml --use-spot --env BUCKET_NAME=gpt2-data-skypilot`
  - Only keep lambda: `sky launch -c gpt2 llm/gpt-2/gpt2-train.yaml --use-spot --env BUCKET_NAME=gpt2-data-skypilot`
  - Only keep aws: `sky launch --gpus A10g llm/gpt-2/gpt2-train.yaml --env BUCKET_NAME=gpt2-data-skypilot`
- All smoke tests: `pytest tests/test_smoke.py`
- Relevant individual smoke tests: `pytest tests/test_smoke.py::test_fill_in_the_name`
- Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.sh`


romilbhardwaj

This is awesome @Michaelvll! 🔥 Left some comments

llm/gpt-2/README.md

llm/gpt-2/gpt2-train.yaml

llm/gpt-2/gpt2-data.yaml

llm/gpt-2/gpt2-train.yaml

romilbhardwaj · 2024-05-30T23:11:35Z

llm/gpt-2/README.md

+After the data is processed, you can then train the model on a GPU VM with 8 A100 GPUs (replace `your-bucket-name` with your bucket name):
+
+```bash
+sky launch -c gpt2-train --detach-setup gpt2-train.yaml --env BUCKET_NAME=your-bucket-name


Since cost + time seems like a big motivation behind this work ("Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20"), should we mention that here? Perhaps we can show the optimizer output?

Good point. Added the comparison in the sentence. How does it look to you?
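To make the cost comparison concrete, the quoted headline figures can be turned into an implied hourly price. This is only a back-of-the-envelope sketch: it assumes the "$20" and "90 minutes" in the quoted title refer to the 8 A100 GPU run mentioned in the README hunk above, and the variable names are illustrative. Under that assumption the result is roughly $13.33 per hour for the cluster, or about $1.67 per GPU-hour.

```bash
# Figures quoted in this thread: 90 minutes, $20 total, 8 A100 GPUs.
# Assumption: all three numbers describe the same training run.
total_cost_usd=20
minutes=90
num_gpus=8

# awk handles the floating-point division that plain shell arithmetic lacks.
hourly_rate=$(awk -v c="$total_cost_usd" -v m="$minutes" 'BEGIN { printf "%.2f", c / (m / 60) }')
per_gpu_hour=$(awk -v r="$hourly_rate" -v g="$num_gpus" 'BEGIN { printf "%.2f", r / g }')

echo "Implied cluster price: \$${hourly_rate}/hour"
echo "Implied per-GPU price: \$${per_gpu_hour}/GPU-hour"
```

Comparing these implied rates against the prices the optimizer prints for a given cloud is one way to check whether a launch is in the same cost range.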

romilbhardwaj · 2024-05-30T23:15:15Z

llm/gpt-2/README.md

+wget https://github.com/skypilot-org/skypilot/blob/master/llm/gpt-2/gpt2-train.yaml
+```
+
+## Data processing


nit: I like the fact that we have different YAMLs for pre-processing and training, but one concern is it may alienate lambda-only, azure-only or fluidstack-only users since they won't have any cloud object store access to write preprocessed data to.

If it's not too complicated, can we add a one-shot YAML that does all pre-processing and training in a single YAML?

Good point! I just separated it into two different sections: one for a combined YAML, another for the pipeline. Wdyt?
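A side note on the `wget` line in the hunk above: a `github.com/.../blob/...` URL serves the HTML page for the file, not the file itself, so the download would not be a usable YAML. A minimal sketch of mapping such a URL to its raw-content form follows; `to_raw_url` is a hypothetical helper for illustration, not something in SkyPilot or the README.

```bash
# Convert a GitHub "blob" page URL into the matching raw-content URL.
# to_raw_url is a hypothetical helper, not part of SkyPilot or the README.
to_raw_url() {
  printf '%s\n' "$1" \
    | sed -E 's#^https://github\.com/([^/]+)/([^/]+)/blob/#https://raw.githubusercontent.com/\1/\2/#'
}

url="https://github.com/skypilot-org/skypilot/blob/master/llm/gpt-2/gpt2-train.yaml"
raw_url="$(to_raw_url "$url")"
echo "$raw_url"

# To download the YAML itself rather than the HTML page:
#   wget "$raw_url"
```

URLs that do not match the `blob` pattern pass through unchanged, so the helper is safe to apply to links that are already raw.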

romilbhardwaj · 2024-05-30T23:16:23Z

llm/gpt-2/README.md

+```bash
+sky launch -c gpt2-train --detach-setup gpt2-train.yaml --gpu A100 --env BUCKET_NAME=your-bucket-name
+```
+


nit: if we have graphs similar to karpathy's, it will be nice to put them here :) No worries if not.

Added the loss curve for training. The eval figure requires an additional dependency, which I unfortunately did not install before the current training run.


romilbhardwaj

Thanks @Michaelvll!

Michaelvll added 8 commits May 28, 2024 23:00

add gpt-2 example

374f0f2

Use ubuntu for GCP

79323f7

fix ncl

03623ee

Fix GPT-2

3636ea6

add train and data

1694ecd

use 8 gpus

2c80dcb

revert gcp change

1bef798

update readme

9282873

Michaelvll mentioned this pull request May 29, 2024

[GCP] Switch to Ubuntu image for GCP for better alignment with other clouds #3610

Closed

7 tasks

Michaelvll added 13 commits May 29, 2024 03:00

Add GCP image

0ee942c

make file_mounts more general

5af0d93

avoid any_of

71bcdd0

change back to use ubuntu image with wait for GPU

488347f

Merge branch 'gpt-2' of https://github.com/skypilot-org/skypilot into gpt-2

8ec06a8

wait cuda installation

2e5bacf

Add retry for file mount and use env for bucket name

c070da0

revert retries

87d2a3c

update the image

d6e9554

Merge branch 'master' of https://github.com/skypilot-org/skypilot into gpt-2

0c2d799

change to docker for better dependency

ef26ecd

revert changes in gcp template

2b0a085

avoid using docker on lambda

aa8ecfe

Michaelvll requested review from concretevitamin and romilbhardwaj May 30, 2024 22:12

Michaelvll added 2 commits May 30, 2024 22:17

Add single GPU

265e43c

Elaborate readme

598dca5

romilbhardwaj reviewed May 30, 2024


Michaelvll and others added 3 commits May 30, 2024 16:31

Update llm/gpt-2/README.md

3056c2c

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>

fix

815d23c

Merge branch 'gpt-2' of https://github.com/skypilot-org/skypilot into gpt-2

faf63d8

Michaelvll added 6 commits May 31, 2024 00:02

address comments

4c44935

Fix data fetching

3b7312e

Add visualization

b6566d7

update

bea72d5

reduce cpu cost

8887435

update loss curve

7609990

Michaelvll requested a review from romilbhardwaj May 31, 2024 01:25

romilbhardwaj approved these changes May 31, 2024


Michaelvll merged commit ea5aa50 into master May 31, 2024
20 checks passed

Michaelvll deleted the gpt-2 branch May 31, 2024 07:21

Michaelvll mentioned this pull request May 31, 2024

[Core] Allow override env for all task in dag #3623

Merged

5 tasks
