Large Language Model - GPT-3 175B
Our codebase is capable of training large language models with both model and data parallelism.
To use this repository, please install a supported version of PyTorch with GPU support (python 3.8, pytorch 1.12, cuda 11.6.2, and nccl 2.12.10 and above) and NVIDIA APEX. We recommend using one of NGC's PyTorch containers (the latest tested compatible version is nvcr.io/nvidia/pytorch:24.04-py3).
To train GPT-3, set COM_DIR in gpt3_blend.sh to point to the C4 dataset location which contains the dataset after preprocessing.
sbatch run_gpt3.sh <path to log directory> <path to BPE processed directory> <container>
Use the run_gpt3.sh script as shown above to run GPT-3 175B on clusters using Slurm. You can adjust the number of nodes (tested only with nodes>=8) and the job run time in the sbatch command in line #3 of the run_gpt3.sh script.
Note that the model trains for 15 minutes less than the actual run time, because the last 15 minutes are set aside for storing a checkpoint of the last iteration.
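For illustration, a launch could look like the following; the log directory, BPE directory, and container reference are placeholders that depend on your cluster setup:
# Hypothetical example invocation (all three arguments are placeholders):
sbatch run_gpt3.sh /lustre/logs/gpt3 /lustre/data/bpe nvcr.io/nvidia/pytorch:24.04-py3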
Command line arguments are described in detail in the source file arguments.py.
We use the c4/en/3.0.1 dataset from HuggingFace/AllenAI.
For training in the benchmarking region, only a quarter of the 1024 original json.gz files is used. Specifically, the last quarter of the files (indices 768 through 1023) is required.
For validation, a subset of the validation dataset has been selected by randomly choosing 24,567 examples (see select_example.md) to get a smaller evaluation dataset.
The dataset is preprocessed using a SentencePiece model (SPM). These instructions were used to train the SPM.
The preprocessed dataset in a training-ready format is available in the S3 bucket at the gpt3/megatron-lm/dataset_c4_spm.tar path (the S3 download instructions are available at the end of this README in the "S3 artifacts download" section). After downloading and unpacking the tarball, the preprocessed_c4_spm directory contains the preprocessed dataset; this is where the COM_DIR env variable (explained above) should point.
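For example, assuming the tarball was downloaded to the current directory, a minimal sketch of the unpacking step is:
# Illustrative: unpack the preprocessed dataset tarball
tar -xf dataset_c4_spm.tar
# COM_DIR (set in gpt3_blend.sh, see above) should then point to $PWD/preprocessed_c4_spm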
Additionally, the training script expects BPE vocab.json and merges.txt files. These files are used to create a BPE tokenizer which, since tokenization is already done in the preprocessed dataset, is used for only two things at this point in the code (a quick sanity check is sketched after this list):
- To find out the eod entry index (value is 50256)
- To find out the vocab size (value is 50257)
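As a quick, illustrative sanity check (not part of the reference scripts), both values can be read from a standard GPT-2 style vocab.json, where <|endoftext|> is the end-of-document token:
# Illustrative check of the BPE vocab file (run in the directory containing vocab.json):
python -c 'import json; v = json.load(open("vocab.json")); print(len(v), v["<|endoftext|>"])'
# With the standard GPT-2 vocabulary this prints: 50257 50256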
Correctness of the dataset can be verified by comparing the checksums provided here.
Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.
The model largely follows the GPT-3 paper; some of the modifications are described below:
- Tokenizer is changed from BPE to SentencePiece with BPE.
- Alternating sparse attention layers are not used.
- Model parameters are set here.
In the benchmarking region, training should resume from a checkpoint that was trained with a global batch size of 1536 for 4000 iterations.
The FP32 Megatron checkpoint is available in the S3 bucket at the gpt3/megatron-lm/checkpoint_megatron_fp32.tar path (the S3 download instructions are available at the end of this README in the "S3 artifacts download" section). After downloading and unpacking the tarball, the ckpt4000_fp32 directory containing the checkpoint should be set as the EXTERNAL_MODEL_CHECKPOINT_DIR env variable.
Correctness of the Megatron format checkpoint can be verified by comparing the checksums provided here. Checksums for two files (metadata.json and common.pt) may not match; these files are provided here for verification.
Validation log perplexity can also be used as a metric to verify the correctness of the checkpoint and the loading scripts. To do this, the model should be evaluated on the entire validation dataset after loading weights from the checkpoint. We have observed an average log perplexity of 2.7767 and a standard deviation of 0.00035 (data obtained from 16 runs).
BF16 training requires a different checkpoint, available in the S3 bucket at the gpt3/megatron-lm/checkpoint_nemo_bf16.tar path.
The BF16 checkpoint is not ready to train in Megatron out of the box, but requires only a very simple postprocessing step:
# Add framework-specific common.pt file to the checkpoint (instantaneous):
python json_to_torch.py -i common_bf16.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/common.pt
where EXTERNAL_MODEL_CHECKPOINT_DIR points to the ckpt4000-consumed_samples=0 directory available after unpacking the tarball.
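For example, assuming the tarball was downloaded to the current directory, the postprocessing might be run as follows (paths are illustrative):
# Illustrative: unpack the BF16 checkpoint and run the postprocessing step
tar -xf checkpoint_nemo_bf16.tar
export EXTERNAL_MODEL_CHECKPOINT_DIR=$PWD/ckpt4000-consumed_samples=0
python json_to_torch.py -i common_bf16.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/common.pt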
There are four groups of parameters in the checkpoint:
- model FP32 weights (or BF16 weights)
- first moments of the optimizer state
- second moments of the optimizer state
- model FP32 weights copy (created only for BF16 training)
For each model layer, we store a separate directory for each of those groups, e.g. for position embeddings:
- language_model.embedding.position_embeddings.weight (model weights)
- optimizer.state.exp_avg.language_model.embedding.position_embeddings.weight (first moments of the optimizer state)
- optimizer.state.exp_avg_sq.language_model.embedding.position_embeddings.weight (second moments of the optimizer state)
- optimizer.state.fp32_from_fp16.language_model.embedding.position_embeddings.weight (model FP32 weights copy, created only for BF16 training)
Each directory contains a single Zarr array (see Zarr section below) and corresponds to a single parameter tensor
(that might be split into different devices during model training).
Pipeline parallel layers are stacked together in a single array.
E.g. for a model with 96 transformer layers, the array corresponding to the self-attention QKV bias (language_model.encoder.layers.self_attention.query_key_value.bias) has shape [96, 36864, 12288].
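As an aside (not part of the reference scripts), the stacked shape can be inspected from inside the checkpoint directory with the zarr Python package; importing tensorstore first registers the BF16 dtype with NumPy (see the note below):
# Illustrative shape check, run from inside the checkpoint directory:
python -c 'import tensorstore, zarr; print(zarr.open("language_model.encoder.layers.self_attention.query_key_value.bias").shape)'
# For the 96-layer model this should print (96, 36864, 12288)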
All non-parameter data is stored in a common.pt torch file that contains framework-specific information.
Example content of a Megatron-specific common.pt file is presented in the scripts/common_bf16.json file.
Apart from that, the checkpoint metadata is stored in the metadata.json file.
Each parameter is stored in a separate directory as a Zarr array to allow parallel access.
The content of a single directory is an array fragmented into multiple files (e.g. 0.0, 0.1, ...) and should be manipulated only with Zarr or Zarr-compatible libraries such as TensorStore.
Megatron features a small library in megatron.core.dist_checkpointing
that builds on the Zarr and TensorStore primitives
and allows operating on arrays split into different devices (in tensor or pipeline parallel groups).
We recommend familiarizing yourself with the aforementioned libraries, but for convenience here is a snippet that reads a single layer array into a NumPy array with either tensorstore or zarr:
import tensorstore as ts
import zarr

def open_with_ts(layer_dir):
    spec = {'driver': 'zarr',
            'metadata_key': '.zarray',
            'kvstore': {'driver': 'file', 'path': layer_dir}}
    return ts.open(ts.Spec(spec), open=True).result().read().result()

def open_with_zarr(layer_dir):
    return zarr.open(layer_dir)[:]

# e.g.
layer_norm_weights_optim_state = open_with_ts('/llm_checkpoint/optimizer.state.exp_avg.language_model.encoder.final_layernorm.weight')
Currently NumPy does not support the BF16 datatype natively, but it can be added by simply importing the tensorstore library (import tensorstore).
To load an external Megatron format checkpoint (in this case, a PAXML checkpoint converted to Megatron format) before training, set the following env variables:
- EXTERNAL_MODEL_CHECKPOINT_DIR pointing to the checkpoint directory
- EXTERNAL_TRAINING_ITERATIONS set to the number of iterations the external checkpoint was trained for (default: 4000)
- EXTERNAL_GBS set to the global batch size the external checkpoint was trained with, used to determine the number of samples already consumed (default: 1536)
Note that using an external checkpoint is needed only when training from a checkpoint that was not generated during the current training process in the benchmarking region. When resuming Megatron training (e.g. after hitting a preset node time limit), EXTERNAL_MODEL_CHECKPOINT_DIR should not be set.
- Set the USE_BF16 env variable to true for BF16 training (see the example below).
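Putting this together, a minimal environment setup for resuming from the provided external checkpoint with BF16 training might look like the following (the checkpoint path is illustrative; the numeric values are the defaults listed above):
# Illustrative environment setup for loading the external checkpoint:
export EXTERNAL_MODEL_CHECKPOINT_DIR=/checkpoints/ckpt4000-consumed_samples=0
export EXTERNAL_TRAINING_ITERATIONS=4000  # iterations the external checkpoint was trained for
export EXTERNAL_GBS=1536                  # global batch size used to train the external checkpoint
export USE_BF16=true                      # enable BF16 training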
Quality metric: Log Perplexity
Quality target: 2.69
Evaluation frequency: after every 24576 samples (=50.33B tokens)
Evaluation thoroughness: evaluation on the validation subset that consists of 24567 examples
The dataset and the checkpoints are available to download from an S3 bucket. You can download this data from the bucket using Rclone as follows:
To run Rclone on Windows, you can download the executable here. To install Rclone on Linux/macOS/BSD systems, run:
sudo -v ; curl https://rclone.org/install.sh | sudo bash
Once Rclone is installed, run the following command to authenticate with the bucket:
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
You can then navigate in the terminal to your desired download directory and run the following commands to download the dataset and checkpoints:
dataset_c4_spm.tar:
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/dataset_c4_spm.tar ./ -P
checkpoint_megatron_fp32.tar:
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_megatron_fp32.tar ./ -P
checkpoint_nemo_bf16.tar:
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_nemo_bf16.tar ./ -P
Alternatively to downloading the checkpoint in Megatron-ready format, it can be obtained by converting a Paxml checkpoint. A script has been provided to convert a Paxml checkpoint to Megatron's format:
# Convert model and optimizer parameters to Megatron format (runs in ~40 minutes on DGXA100, requires 1TB of CPU memory):
python -u convert_paxml_to_megatron_distributed.py -gckpt $PAXML_CKPT_PATH -o $EXTERNAL_MODEL_CHECKPOINT_DIR --dtype fp32 # or `--dtype bf16` for BF16 checkpoint
# Add framework-specific common.pt file to the checkpoint (instantaneous):
python json_to_torch.py -i common_fp32.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/common.pt # or `-i common_bf16.json` for BF16 checkpoint
This should result in the same checkpoint as described in the "Checkpoint download" section above.
Here are the instructions to prepare the preprocessed dataset from scratch. Note that data preprocessing has already been done, and the final dataset can be accessed by following the instructions in the "S3 artifacts download" section.
Training dataset -
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
mkdir ${C4_PATH}
cd ${C4_PATH}
git lfs pull --include "en/c4-train.007*.json.gz"
git lfs pull --include "en/c4-train.008*.json.gz"
git lfs pull --include "en/c4-train.009*.json.gz"
git lfs pull --include "en/c4-train.01*.json.gz"
The validation data subset can be downloaded from gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json to ${C4_PATH}.
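One way to fetch it, assuming the Google Cloud SDK (gsutil) is installed, is:
# Download the validation subset into ${C4_PATH} (requires gsutil):
gsutil cp gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json ${C4_PATH}/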
Run the following commands to merge these 256 files into 2 json.gz files. Each of the json.gz files will be preprocessed into a pair of megatron dataset files (.bin and .idx).
cd ${C4_PATH}
# create softlinks to store each shard before merging
mkdir -p softlinks
for shard in {6..7}; do
  start=$((shard * 128))
  end=$((shard * 128 + 127))
  mkdir -p softlinks/en_$shard
  for ind in $(seq -f "%05g" $start $end); do
    ln -s ../../en/c4-train.${ind}-of-01024.json.gz softlinks/en_${shard}/c4-train.${ind}-of-01024.json.gz
  done
done
# merge
mkdir -p en_merge
for shard in {6..7}; do
  cat softlinks/en_${shard}/*gz > en_merge/c4-train.en_${shard}.json.gz
done
After preparing the data folder, download the tokenizer model. The tokenizer model c4_en_301_5Mexp2_spm.model can be downloaded by following the instructions in the "S3 artifacts download" section and should be renamed to ${C4_PATH}/tokenizers/c4_spm/sentencepiece.model. Make sure the output directory ${C4_PATH}/preprocessed_c4_spm exists before the next step.
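For example (illustrative commands, assuming the tokenizer model was downloaded to the current directory):
# Place the tokenizer model and create the output directory (illustrative):
mkdir -p ${C4_PATH}/tokenizers/c4_spm
mv c4_en_301_5Mexp2_spm.model ${C4_PATH}/tokenizers/c4_spm/sentencepiece.model
mkdir -p ${C4_PATH}/preprocessed_c4_spm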
Modify C4_PATH in preprocess.sh and preprocess_val.sh to specify the correct input/output paths, then run preprocessing as follows:
cd scripts
sbatch preprocess.sh <path to c4>
sbatch preprocess_val.sh <path to c4> <path to validation json>