Memory, tokens per sec, and MFU behavior in train_gpt2cu #503

chinthysl · 2024-05-31T08:16:10Z

chinthysl
May 31, 2024

Following results are acquired to get a better understanding of training in a single node before expanding into multiple nodes.

I generated the results in a single DGX H100 node using all 8 GPUs.
Total batch size is set to 524288. Varied the model from D12 to D48 and batch size from 2 until GPU memory max out.

Major insights,

MFU and tokens per second start to plateau when batch size reaches around 32, 64.
For D48 model we can go up to batch size 16 before GPU OOM happens.
If we keep using micro batch size 16 for D48, single node can process 16 * 1024 * 8 (524288 / 4) tokens in 4 grad accumulation step. So we can reach no grad accumulation by using 4 nodes.
For better gpu utilization, can we keep a higher global_batch_size at 16 * 1024 * 8 * n_nodes ???

Sample script I used to sweep the results,

 mpirun -np 8 ./train_gpt2cu \
    -i "dev/data/tinystories/TinyStories_train*.bin" \
    -j "dev/data/tinystories/TinyStories_val*.bin" \
    -o log_d12_b64 \
    -t 1024 \
    -d 524288 \
    -r 0 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 \
    -s 20000 \
    -h 0 \
    -e "d12" \
    -b 64

karpathy · 2024-06-04T22:27:59Z

karpathy
Jun 4, 2024
Maintainer

Very cool thank you for posting! I am really eager to also get a multi-node setup sometime soon for myself to run similar things.
(And yes it is quite likely that the optimization can tolerate larger batch size.)

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory, tokens per sec, and MFU behavior in train_gpt2cu #503

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

Memory, tokens per sec, and MFU behavior in train_gpt2cu #503

chinthysl May 31, 2024

Replies: 1 comment

karpathy Jun 4, 2024 Maintainer

chinthysl
May 31, 2024

karpathy
Jun 4, 2024
Maintainer