This repository contains a PyTorch implementation of GPT-2 training, fine-tuning, and evaluation, with optional multi-GPU support via Distributed Data Parallel (DDP).
To run this script, you'll need:
- Python 3.x
- PyTorch
- Hugging Face Transformers
- NumPy
- tiktoken (for tokenization)
- The HellaSwag dataset (optional, for evaluation)

Install the Python dependencies using:

```
pip install torch transformers numpy tiktoken
```
- GPT-2 Architecture: The script implements the GPT-2 architecture, with support for different GPT-2 variants (`gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`).
- Self-Attention: Uses scaled dot-product attention for faster computations, especially on larger models.
- Distributed Training: Supports distributed training using PyTorch's DistributedDataParallel (DDP) with multi-GPU capabilities.
- Gradient Accumulation: Accumulates gradients over several micro-batches to reach a large effective batch size (see the sketch after this list).
- Learning Rate Scheduler: Implements cosine decay learning rate schedule with warmup.
- Validation and HellaSwag Evaluation: Supports validation using cross-entropy loss and optional evaluation on the HellaSwag dataset.
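As a rough illustration, this is how gradient accumulation typically combines the micro-batch size `B`, sequence length `T`, and `total_batch_size` listed further below. The stand-in model, loss, and data here are placeholders, not the script's actual code:

```python
import torch
import torch.nn as nn

# defaults from the hyperparameter list further below
total_batch_size = 32768        # desired tokens per optimizer step
B, T = 4, 1024                  # micro-batch size and sequence length
world_size = 1                  # number of DDP processes (1 on a single GPU)
grad_accum_steps = total_batch_size // (B * T * world_size)   # 32768 / 4096 = 8

# stand-ins; the real script uses the GPT-2 module and its own data loader
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

optimizer.zero_grad()
for micro_step in range(grad_accum_steps):
    x = torch.randn(B, 16)                    # placeholder micro-batch
    loss = model(x).pow(2).mean()             # placeholder loss
    # scale so the summed gradients equal the mean over the full batch
    (loss / grad_accum_steps).backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```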
The architecture is based on the GPT-2 model with self-attention blocks. It includes the following components:
- Self-Attention Layers: Multi-head attention mechanism.
- MLP Layers: Feedforward neural network layers with GELU activation.
- Layer Normalization: Applied before the attention and MLP sub-layers (pre-norm), with a final layer norm after the last block.
- Positional and Token Embeddings: Learned token and position embeddings map input token IDs into the model's embedding space.
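As a hedged sketch of what one such block looks like with `F.scaled_dot_product_attention`, assuming the standard pre-norm GPT-2 layout (class and attribute names here are illustrative and may not match the script exactly):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused scaled dot-product attention with a causal mask
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    """Pre-norm GPT-2 block: LayerNorm -> attention -> residual, LayerNorm -> MLP -> residual."""
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(approximate="tanh"),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```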
To train the model, run the script with the following command:

```
python train-gpt2.py
```

For distributed training with multiple GPUs, use the following command:

```
torchrun --standalone --nproc_per_node=<NUM_GPUS> train-gpt2.py
```
The main configurable hyperparameters and their defaults are:
- `block_size`: Maximum length of the input sequence (default: 1024).
- `n_emb`: Dimensionality of token embeddings (default: 768).
- `vocab_size`: Vocabulary size (default: 50,257).
- `n_head`: Number of attention heads (default: 12).
- `n_layers`: Number of Transformer layers (default: 12).
- `max_steps`: Number of training steps (default: 19,073).
- `warmup_step`: Steps for learning rate warmup (default: 715).
- `B`: Micro-batch size (default: 4).
- `T`: Sequence length (default: 1024).
- `total_batch_size`: Total batch size in tokens per optimizer step (default: 32,768).
- `learning_rate`: Initial learning rate (default: 6e-4).
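For illustration, a cosine-decay-with-warmup schedule built from these defaults might look like the following. The minimum-learning-rate floor of 10% is an assumption, not necessarily the script's exact value:

```python
import math

max_lr = 6e-4          # learning_rate default
min_lr = max_lr * 0.1  # assumed floor for the decay (common choice)
warmup_steps = 715     # warmup_step default
max_steps = 19073      # max_steps default

def get_lr(step):
    # 1) linear warmup from 0 to max_lr over warmup_steps
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) after max_steps, hold at the minimum learning rate
    if step > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr in between
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```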
The script uses the AdamW optimizer with the following configuration:
- Weight decay: 0.1
- Betas: (0.9, 0.95)
- Epsilon: 1e-8
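Expressed as code, that configuration corresponds to roughly this (the stand-in model and any parameter-group handling, such as applying weight decay only to matrix weights, are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)   # stand-in; the script passes the full GPT-2 module
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,              # initial learning rate (decayed by the schedule above)
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)
```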
The script evaluates the model on the validation set every 250 steps and also supports evaluation on the HellaSwag dataset. To enable HellaSwag evaluation, download the dataset and update the paths in the script.
- Validation is performed using cross-entropy loss on the validation split of the dataset.
- Logs validation loss every 250 steps.
- The script computes accuracy on the HellaSwag dataset every 250 steps and logs the results.
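HellaSwag is a multiple-choice completion benchmark: each example pairs a context with four candidate endings, and the usual evaluation picks the ending the model assigns the lowest average per-token loss. A sketch of that scoring step, assuming a model that returns raw logits of shape `(batch, time, vocab)` (function and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def most_likely_ending(model, tokens, mask):
    """Pick the candidate ending with the lowest average loss.

    tokens: (4, T) -- context + candidate ending, one row per choice
    mask:   (4, T) -- 1 on ending tokens, 0 on context/padding tokens
    """
    logits = model(tokens)                      # (4, T, vocab_size)
    # shift so position t predicts token t+1
    shift_logits = logits[:, :-1, :]
    shift_tokens = tokens[:, 1:]
    shift_mask = mask[:, 1:].float()
    losses = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_tokens.reshape(-1),
        reduction="none",
    ).view(tokens.size(0), -1)
    # average loss over ending tokens only; the lowest average wins
    avg_loss = (losses * shift_mask).sum(dim=1) / shift_mask.sum(dim=1)
    return int(avg_loss.argmin().item())
```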
The script also supports text generation using the trained GPT-2 model. You can generate text by running the inference step in the script, which outputs generated sequences based on a given prompt.
To generate text:
- Adjust the `tokens` variable in the generation block with your input prompt.
- The script will generate multiple sequences and print them to the console.
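A minimal sketch of such a generation loop with top-k sampling; it uses the pretrained Hugging Face `gpt2` model as a stand-in for the checkpoint trained by this script, and the prompt, sequence count, and sampling parameters are illustrative:

```python
import torch
import torch.nn.functional as F
import tiktoken
from transformers import GPT2LMHeadModel   # stand-in; swap in the trained model from this script

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

enc = tiktoken.get_encoding("gpt2")
prompt = "Hello, I'm a language model,"           # illustrative prompt
tokens = torch.tensor(enc.encode(prompt), dtype=torch.long)
x = tokens.unsqueeze(0).repeat(4, 1)              # generate 4 sequences in parallel

torch.manual_seed(42)
with torch.no_grad():
    while x.size(1) < 32:                         # generate up to 32 tokens total
        logits = model(x).logits                  # (B, T, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)
        # top-k sampling: keep the 50 most likely tokens and sample among them
        topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)
        next_tok = torch.gather(topk_idx, -1, torch.multinomial(topk_probs, 1))
        x = torch.cat((x, next_tok), dim=1)

for row in x:
    print(enc.decode(row.tolist()))
```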
Checkpoints are saved every 5,000 steps and at the end of training. The checkpoint includes:
- Model state dictionary
- Training step
- Validation loss
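A sketch of what saving and reloading such a checkpoint can look like (the file naming scheme and stand-in model are illustrative):

```python
import os
import torch
import torch.nn as nn

raw_model = nn.Linear(8, 8)       # stand-in; under DDP the script would save model.module
step, val_loss = 5000, 3.21       # illustrative values

os.makedirs("log", exist_ok=True)
checkpoint = {
    "model": raw_model.state_dict(),
    "step": step,
    "val_loss": val_loss,
}
path = os.path.join("log", f"model_{step:05d}.pt")
torch.save(checkpoint, path)

# later, to resume training or run inference:
ckpt = torch.load(path, map_location="cpu")
raw_model.load_state_dict(ckpt["model"])
```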
The script logs training and validation losses to `log/log.txt`. It logs:
- Training loss every step
- Validation loss every 250 steps
- HellaSwag accuracy (if enabled)
This script supports multi-GPU training using PyTorch's Distributed Data Parallel (DDP). Use the `torchrun` command to launch the script across multiple GPUs. The script automatically detects available GPUs and sets the device accordingly.
- Set the `device` and `ddp` flags in the script to enable DDP.
- Use `torchrun` for launching the training process.
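A minimal sketch of the DDP initialization that `torchrun` expects, using the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables it sets (the stand-in model and variable names are illustrative):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it launches
ddp = int(os.environ.get("RANK", -1)) != -1
if ddp:
    dist.init_process_group(backend="nccl")
    ddp_rank = int(os.environ["RANK"])
    ddp_local_rank = int(os.environ["LOCAL_RANK"])
    ddp_world_size = int(os.environ["WORLD_SIZE"])
    device = f"cuda:{ddp_local_rank}"
    torch.cuda.set_device(device)
else:
    ddp_rank, ddp_local_rank, ddp_world_size = 0, 0, 1
    device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(8, 8).to(device)    # stand-in for the GPT-2 module
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])
```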
The script is compatible with:
- CUDA: Multi-GPU support for distributed training.
- MPS: For Apple Silicon devices.
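A sketch of the usual auto-detection logic (variable names are illustrative):

```python
import torch

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
print(f"using device: {device}")
```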
This project is open-source under the MIT License.