Large Language Model - GPT-3 175B
Our codebase is capable of training large language models with both model and data parallelism.
To use this repository, please install a supported version of PyTorch with GPU support (python 3.8, pytorch 1.12, cuda 11.6.2, and nccl 2.12.10 and above) and NVIDIA APEX. We recommend using one of NGC's PyTorch containers (the latest tested compatible version is nvcr.io/nvidia/pytorch:24.04-py3).
To train GPT-3, set COM_DIR in gpt3_blend.sh to point to the C4 dataset location which contains the dataset after preprocessing.
sbatch run_gpt3.sh <path to log directory> <path to BPE processed directory> <container>
Use the run_gpt3.sh script as shown above to run GPT-3 175B on clusters using Slurm. You can adjust the number of nodes (tested only with nodes>=8) and the job run time in the sbatch command in line #3 of the run_gpt3.sh script.
Note that the model trains for 15 minutes less than the actual run time, because the last 15 minutes are set aside for storing a checkpoint of the last iteration.
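For illustration, a launch could look like the following; the log directory, BPE directory, and container reference are placeholders that depend on your cluster setup:
# Hypothetical example invocation (all three arguments are placeholders):
sbatch run_gpt3.sh /lustre/logs/gpt3 /lustre/data/bpe nvcr.io/nvidia/pytorch:24.04-py3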
Command line arguments are described in detail in the source file arguments.py.
We use the c4/en/3.0.1 dataset from HuggingFace/AllenAI.
For training in the benchmarking region, only a quarter of the 1024 original json.gz files is used. Specifically, the last quarter of the files (indices 768 through 1023) is required.
For validation, a subset of the validation dataset has been selected by randomly choosing 24,567 examples (see select_example.md) to get a smaller evaluation dataset.
The dataset is preprocessed using a SentencePiece model (SPM). These instructions were used to train the SPM.
The preprocessed dataset in a training-ready format is available in the S3 bucket at the gpt3/megatron-lm/dataset_c4_spm.tar path (the S3 download instructions are available at the end of this README in the "S3 artifacts download" section). After downloading and unpacking the tarball, the preprocessed_c4_spm directory contains the preprocessed dataset; this is where the COM_DIR env variable (explained above) should point.
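For example, assuming the tarball was downloaded to the current directory, a minimal sketch of the unpacking step is:
# Illustrative: unpack the preprocessed dataset tarball
tar -xf dataset_c4_spm.tar
# COM_DIR (set in gpt3_blend.sh, see above) should then point to $PWD/preprocessed_c4_spm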
Additionally, the training script expects BPE vocab.json and merges.txt files. These files are used to create a BPE tokenizer which, since tokenization is already done in the preprocessed dataset, is used for only two things at this point in the code (a quick sanity check is sketched after this list):
- To find out the eod entry index (value is 50256)
- To find out the vocab size (value is 50257)
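As a quick, illustrative sanity check (not part of the reference scripts), both values can be read from a standard GPT-2 style vocab.json, where <|endoftext|> is the end-of-document token:
# Illustrative check of the BPE vocab file (run in the directory containing vocab.json):
python -c 'import json; v = json.load(open("vocab.json")); print(len(v), v["<|endoftext|>"])'
# With the standard GPT-2 vocabulary this prints: 50257 50256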
Correctness of the dataset can be verified by comparing the checksums provided here.
Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.
The model largely follows the GPT-3 paper; some of the modifications are described below:
- Tokenizer is changed from BPE to SentencePiece with BPE.
- Alternating sparse attention layers are not used.
- Model parameters are set here.
In the benchmarking region, training should resume from a checkpoint that was trained with a global batch size of 1536 for 4000 iterations.
The FP32 Megatron checkpoint is available in the S3 bucket at the gpt3/megatron-lm/checkpoint_megatron_fp32.tar path (the S3 download instructions are available at the end of this README in the "S3 artifacts download" section). After downloading and unpacking the tarball, the ckpt4000_fp32 directory containing the checkpoint should be set as the EXTERNAL_MODEL_CHECKPOINT_DIR env variable.
Correctness of the Megatron format checkpoint can be verified by comparing the checksums provided here. Checksums for two files (metadata.json and common.pt) may not match; these files are provided here for verification.
Validation log perplexity can also be used as a metric to verify the correctness of the checkpoint and the loading scripts. To do this, the model should be evaluated on the entire validation dataset after loading weights from the checkpoint. We have observed an average log perplexity of 2.7767 and a standard deviation of 0.00035 (data obtained from 16 runs).
BF16 training requires a different checkpoint, available in the S3 bucket at the gpt3/megatron-lm/checkpoint_nemo_bf16.tar path.
The BF16 checkpoint is not ready to train in Megatron out of the box, but requires only a very simple postprocessing step:
# Add framework-specific common.pt file to the checkpoint (instantaneous):
python json_to_torch.py -i common_bf16.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/common.pt
where EXTERNAL_MODEL_CHECKPOINT_DIR points to the ckpt4000-consumed_samples=0 directory available after unpacking the tarball.
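For example, assuming the tarball was downloaded to the current directory, the postprocessing might be run as follows (paths are illustrative):
# Illustrative: unpack the BF16 checkpoint and run the postprocessing step
tar -xf checkpoint_nemo_bf16.tar
export EXTERNAL_MODEL_CHECKPOINT_DIR=$PWD/ckpt4000-consumed_samples=0
python json_to_torch.py -i common_bf16.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/common.pt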
There are four groups of parameters in the checkpoint:
- model FP32 weights (or BF16 weights)
- first moments of the optimizer state
- second moments of the optimizer state
- model FP32 weights copy (created only for BF16 training)
For each model layer, we store a separate directory for each of those groups, e.g. for position embeddings:
- language_model.embedding.position_embeddings.weight (model weights)
- optimizer.state.exp_avg.language_model.embedding.position_embeddings.weight (first moments of the optimizer state)
- optimizer.state.exp_avg_sq.language_model.embedding.position_embeddings.weight (second moments of the optimizer state)
- optimizer.state.fp32_from_fp16.language_model.embedding.position_embeddings.weight (model FP32 weights copy, created only for BF16 training)
Each directory contains a single Zarr array (see Zarr section below) and corresponds to a single parameter tensor
(that might be split into different devices during model training).
Pipeline parallel layers are stacked together in a single array.
E.g. for a model with 96 transformer layers, the array corresponding to the self-attention QKV bias (language_model.encoder.layers.self_attention.query_key_value.bias) has shape [96, 36864, 12288].
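As an aside (not part of the reference scripts), the stacked shape can be inspected from inside the checkpoint directory with the zarr Python package; importing tensorstore first registers the BF16 dtype with NumPy (see the note below):
# Illustrative shape check, run from inside the checkpoint directory:
python -c 'import tensorstore, zarr; print(zarr.open("language_model.encoder.layers.self_attention.query_key_value.bias").shape)'
# For the 96-layer model this should print (96, 36864, 12288)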
All non-parameter data is stored in a common.pt torch file that contains framework-specific information.
Example content of a Megatron-specific common.pt file is presented in the scripts/common_bf16.json file.
Apart from that, the checkpoint metadata is stored in the metadata.json file.
Each parameter is stored in a separate directory as a Zarr array to allow parallel access.
The content of a single directory is an array fragmented into multiple files (e.g. 0.0, 0.1, ...) and should be manipulated only with Zarr or Zarr-compatible libraries such as TensorStore.
Megatron features a small library in megatron.core.dist_checkpointing
that builds on the Zarr and TensorStore primitives
and allows operating on arrays split into different devices (in tensor or pipeline parallel groups).
We recommend familiarizing yourself with the aforementioned libraries, but for convenience here is a snippet that reads a single layer array into a NumPy array with either tensorstore or zarr:
import tensorstore as ts
import zarr

def open_with_ts(layer_dir):
    spec = {'driver': 'zarr',
            'metadata_key': '.zarray',
            'kvstore': {'driver': 'file', 'path': layer_dir}}
    return ts.open(ts.Spec(spec), open=True).result().read().result()

def open_with_zarr(layer_dir):
    return zarr.open(layer_dir)[:]

# e.g.
layer_norm_weights_optim_state = open_with_ts('/llm_checkpoint/optimizer.state.exp_avg.language_model.encoder.final_layernorm.weight')
Currently NumPy does not support the BF16 datatype natively, but it can be added by simply importing the tensorstore library (import tensorstore).
To load an external Megatron format checkpoint (in this case, a PAXML checkpoint converted to Megatron format) before training, set the following env variables:
- EXTERNAL_MODEL_CHECKPOINT_DIR pointing to the checkpoint directory
- EXTERNAL_TRAINING_ITERATIONS set to the number of iterations the external checkpoint was trained for (default: 4000)
- EXTERNAL_GBS set to the global batch size the external checkpoint was trained with, used to determine the number of samples already consumed (default: 1536)
Note that using an external checkpoint is needed only when training from a checkpoint that was not generated during the current training process in the benchmarking region. When resuming Megatron training (e.g. after hitting a preset node time limit), EXTERNAL_MODEL_CHECKPOINT_DIR should not be set.
- Set the USE_BF16 env variable to true for BF16 training (see the example below).
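Putting this together, a minimal environment setup for resuming from the provided external checkpoint with BF16 training might look like the following (the checkpoint path is illustrative; the numeric values are the defaults listed above):
# Illustrative environment setup for loading the external checkpoint:
export EXTERNAL_MODEL_CHECKPOINT_DIR=/checkpoints/ckpt4000-consumed_samples=0
export EXTERNAL_TRAINING_ITERATIONS=4000  # iterations the external checkpoint was trained for
export EXTERNAL_GBS=1536                  # global batch size used to train the external checkpoint
export USE_BF16=true                      # enable BF16 training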
Quality metric: Log Perplexity
Quality target: 2.69
Evaluation frequency: after every 24576 samples (=50.33B tokens)
Evaluation thoroughness: evaluation on the validation subset that consists of 24567 examples
The dataset and the checkpoints are available to download from an S3 bucket. You can download this data from the bucket using Rclone as follows:
To run Rclone on Windows, you can download the executable here. To install Rclone on Linux/macOS/BSD systems, run:
sudo -v ; curl https://rclone.org/install.sh | sudo bash
Once Rclone is installed, run the following command to authenticate with the bucket:
rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
You can then navigate in the terminal to your desired download directory and run the following commands to download the dataset and checkpoints:
dataset_c4_spm.tar:
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/dataset_c4_spm.tar ./ -P
checkpoint_megatron_fp32.tar:
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_megatron_fp32.tar ./ -P
checkpoint_nemo_bf16.tar:
rclone copy mlc-training:mlcommons-training-wg-public/gpt3/megatron-lm/checkpoint_nemo_bf16.tar ./ -P
Alternatively to downloading the checkpoint in Megatron-ready format, it can be obtained by converting a Paxml checkpoint. A script has been provided to convert a Paxml checkpoint to Megatron's format:
# Convert model and optimizer parameters to Megatron format (runs in ~40 minutes on DGXA100, requires 1TB of CPU memory):
python -u convert_paxml_to_megatron_distributed.py -gckpt $PAXML_CKPT_PATH -o $EXTERNAL_MODEL_CHECKPOINT_DIR --dtype fp32 # or `--dtype bf16` for BF16 checkpoint
# Add framework-specific common.pt file to the checkpoint (instantaneous):
python json_to_torch.py -i common_fp32.json -o $EXTERNAL_MODEL_CHECKPOINT_DIR/common.pt # or `-i common_bf16.json` for BF16 checkpoint
This should result in the same checkpoint as described in the "Checkpoint download" section above.
Here are the instructions to prepare the preprocessed dataset from scratch. Note that data preprocessing has already been done, and the final dataset can be accessed by following the instructions in the "S3 artifacts download" section.
Training dataset -
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
mkdir ${C4_PATH}
cd ${C4_PATH}
git lfs pull --include "en/c4-train.007*.json.gz"
git lfs pull --include "en/c4-train.008*.json.gz"
git lfs pull --include "en/c4-train.009*.json.gz"
git lfs pull --include "en/c4-train.01*.json.gz"
The validation data subset can be downloaded from gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json to ${C4_PATH}.
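One way to fetch it, assuming the Google Cloud SDK (gsutil) is installed, is:
# Download the validation subset into ${C4_PATH} (requires gsutil):
gsutil cp gs://mlperf-llm-public2/c4/en_val_subset_json/c4-validation_24567exp.json ${C4_PATH}/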
Run the following commands to merge these 256 files into 2 json.gz files. Each of the json.gz files will be preprocessed into a pair of megatron dataset files (.bin and .idx).
cd ${C4_PATH}
# create softlinks to store each shard before merging
mkdir -p softlinks
for shard in {6..7}; do
  start=$((shard * 128))
  end=$((shard * 128 + 127))
  mkdir -p softlinks/en_$shard
  for ind in $(seq -f "%05g" $start $end); do
    ln -s ../../en/c4-train.${ind}-of-01024.json.gz softlinks/en_${shard}/c4-train.${ind}-of-01024.json.gz
  done
done
# merge
mkdir -p en_merge
for shard in {6..7}; do
  cat softlinks/en_${shard}/*gz > en_merge/c4-train.en_${shard}.json.gz
done
After preparing the data folder, download the tokenizer model. The tokenizer model c4_en_301_5Mexp2_spm.model can be downloaded by following the instructions in the "S3 artifacts download" section and should be renamed to ${C4_PATH}/tokenizers/c4_spm/sentencepiece.model. Make sure the output directory ${C4_PATH}/preprocessed_c4_spm exists before the next step.
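For example (illustrative commands, assuming the tokenizer model was downloaded to the current directory):
# Place the tokenizer model and create the output directory (illustrative):
mkdir -p ${C4_PATH}/tokenizers/c4_spm
mv c4_en_301_5Mexp2_spm.model ${C4_PATH}/tokenizers/c4_spm/sentencepiece.model
mkdir -p ${C4_PATH}/preprocessed_c4_spm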
Modify C4_PATH in preprocess.sh and preprocess_val.sh to specify the correct input/output paths, then run preprocessing as follows:
cd scripts
sbatch preprocess.sh <path to c4>
sbatch preprocess_val.sh <path to c4> <path to validation json>