- September 10th: We have added support for multi-task training on GLUE. Simply select `all` as the dataset after specifying `glue` as the benchmark. More detailed instructions coming soon.
This repository provides a complete toolkit for training and fine-tuning T5 models using span masking (an extension of masked language modeling, a.k.a. MLM, as described in BERT). Leveraging libraries such as Hugging Face's `transformers` and `accelerate`, Microsoft's `DeepSpeed`, and `wandb`, it offers a robust and user-friendly platform for experimentation with state-of-the-art models.
- Tips for environment setup (within PyCharm)
- Setup and Installation
- Distributed Training
- Pipelines
- Design Decisions
- Expanding Fine-Tuning Capabilities
- Troubleshooting and FAQs
- Acknowledgments and References
- Create a new project in PyCharm, and use the Conda interpreter with Python 3.10. See a useful guide linked here.
- Use PyCharm's VCS functionality to clone this repository into your project as described here.
- Install the required packages (see Setup and Installation below).
- Within your distributed computing server, set up a new conda virtual environment with Python 3.10 as you did locally.
- Set up a deployment configuration in PyCharm such that local changes are automatically uploaded to the server.
You can find a useful guide here.
- It is recommended to work with GitHub Copilot for continued development locally. It is free for students (as described on this page)!
- On the remote server, install the required packages (as you did above).
- Run `accelerate config` to set up the distributed training configuration as described in Distributed Training below.
- Run `wandb login` to set up the Weights and Biases integration.
- Run a training script:
  - Run `accelerate launch fine_tune_t5.py` to start fine-tuning with accelerate, or run `accelerate launch pre_train_t5.py` to start pre-training with accelerate.
  - If you want to use DeepSpeed instead, run `deepspeed <script_name>` (make sure you specified the correct settings in the configuration step above). You will need to point to the correct DeepSpeed configuration file (`zero_stage2_config.json`).

Note: At this point we only support ZeRO-2 with DeepSpeed. We are working on adding support for ZeRO-3 in the near future. For more information about these tools for distributed training, see the ZeRO paper.
First, install Anaconda or Miniconda. Then, create a new conda environment and install the required packages with the following commands (see reference here):
conda env create -f conda_environment_slurm.yml
conda activate ml_training
We currently support only single-node multi-GPU training. To train on a single node with 4 GPUs, run:
accelerate config
When prompted, select the following options:
In which compute environment are you running? <This machine>
Which type of machine are you using? <multi-GPU>
How many different machines will you use (use more than 1 for multi-node distributed training)? <1>
Do you wish to optimize your script with torch dynamo? <no>
Do you want to use DeepSpeed? <no>
Do you want to use FullyShardedDataParallel? <no>
Do you want to use Megatron-LM? <no>
How many GPUs should be used for distributed training? <4>
What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all] <enter>
Do you wish to use FP16 or BF16? <FP16>
Next, make sure you are logged into wandb so that you can track your training runs (if prompted, follow the
instructions to create a free account):
wandb login
Once you've configured the accelerator, and set up wandb, you can run a training script such as:
accelerate launch fine_tune_t5.py
To train with DeepSpeed, you must first install it:
pip install deepspeed
Then, you can configure the accelerator with:
accelerate config
When prompted, select the following options:
In which compute environment are you running? <This machine>
Which type of machine are you using? <multi-GPU>
How many different machines will you use (use more than 1 for multi-node distributed training)? <1>
Do you wish to optimize your script with torch dynamo? <no>
Do you want to use DeepSpeed? [yes/NO]: <yes>
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: <yes>
Please enter the path to the json DeepSpeed config file: <zero_stage2_config.json>
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: <no>
How many GPU(s) should be used for distributed training? [1]: <4>
Next, make sure you are logged into wandb so that you can track your training runs (if prompted, follow the
instructions to create a free account):
wandb login
Once you've configured the accelerator, and set up wandb, you can run a training script such as:
deepspeed fine_tune_t5.py
Note: ZeRO-3 is not yet supported.
- Flag Parsing: Parse the necessary flags using `HfArgumentParser`.
- Accelerator Initialization: Initialize `accelerate.Accelerator`.
- Model and Tokenizer Loading: Load the T5 model and tokenizer.
- W&B Initialization: Set up Weights & Biases for experiment tracking.
- Data Loading and Preprocessing: Load datasets and tokenize.
- Data Collator Initialization: Utilize `T5DataCollator` with span corruption.
- Metrics Configuration: Set up evaluation metrics.
- Trainer Initialization: Initialize `transformers.Seq2SeqTrainer`.
- Training Execution: Start pre-training (see the sketch after this list).
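Putting these steps together, the wiring looks roughly like the following. This is a condensed sketch, not the actual script: `ModelArguments`, the import path and parameter names of `T5DataCollator`, and the choice of corpus are assumptions made for illustration; `pre_train_t5.py` is the source of truth.

```python
# Condensed sketch of the pre-training wiring. ModelArguments, the import path
# and parameters of T5DataCollator, and the corpus choice are assumptions made
# for illustration; the repository's pre_train_t5.py is the source of truth.
from dataclasses import dataclass, field

import accelerate
import wandb
from datasets import load_dataset
from transformers import (
    HfArgumentParser,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
    T5TokenizerFast,
)

from data_collator import T5DataCollator  # repository-internal; module path assumed


@dataclass
class ModelArguments:
    model_name_or_path: str = field(default="t5-small")


# 1. Flag parsing.
parser = HfArgumentParser((ModelArguments, Seq2SeqTrainingArguments))
model_args, training_args = parser.parse_args_into_dataclasses()

# 2. Accelerator initialization (the script is started via `accelerate launch`).
accelerator = accelerate.Accelerator()

# 3. Model and tokenizer loading.
tokenizer = T5TokenizerFast.from_pretrained(model_args.model_name_or_path)
model = T5ForConditionalGeneration.from_pretrained(model_args.model_name_or_path)

# 4. W&B initialization (main process only).
if accelerator.is_main_process:
    wandb.init(project="t5-pre-training")

# 5. Data loading and tokenization (the corpus shown here is illustrative).
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = raw.map(
    lambda examples: tokenizer(examples["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
)

# 6. Data collator applying span corruption (parameter names assumed).
data_collator = T5DataCollator(
    tokenizer=tokenizer, noise_density=0.15, mean_noise_span_length=3.0
)

# 7. Trainer initialization and 8. training execution.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)
trainer.train()
```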
- Flag Parsing: Parse the necessary flags.
- Seed Setting: Set the random seed.
- Accelerator Initialization: Initialize `accelerate.Accelerator`.
- Model and Tokenizer Loading: Load the T5 model and tokenizer.
- W&B Initialization: Set up Weights & Biases.
- Data Loading and Preprocessing: Load the dataset and encode it.
- Metrics Configuration: Set up evaluation metrics.
- Trainer Initialization: Initialize `transformers.Seq2SeqTrainer`.
- Training Execution: Start fine-tuning (see the sketch after this list).
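The task-specific pieces of this pipeline, namely the prefix-based tokenization and the metrics hook, look roughly like the sketch below for GLUE SST-2. The prefix string, label words, and sequence lengths follow common T5 conventions and are assumptions; the repository's `constants.py` may define them differently.

```python
# Sketch of the task-specific parts of fine-tuning on GLUE SST-2: prefix-based
# tokenization and the compute_metrics hook. Prefix, label words, and lengths
# are assumptions based on T5 conventions, not the repository's exact constants.
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
dataset = load_dataset("glue", "sst2")

PREFIX = "sst2 sentence: "               # task prefix prepended to every input
LABELS = {0: "negative", 1: "positive"}  # integer label -> target word
WORD_TO_ID = {v: k for k, v in LABELS.items()}


def preprocess(examples):
    # Prepend the task prefix, tokenize the inputs, and encode label words as targets.
    model_inputs = tokenizer(
        [PREFIX + sentence for sentence in examples["sentence"]],
        truncation=True, max_length=128,
    )
    targets = tokenizer([LABELS[label] for label in examples["label"]],
                        truncation=True, max_length=8)
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs


# Only train/validation are encoded here; the GLUE test split has hidden labels.
train_ds = dataset["train"].map(preprocess, batched=True,
                                remove_columns=dataset["train"].column_names)
eval_ds = dataset["validation"].map(preprocess, batched=True,
                                    remove_columns=dataset["validation"].column_names)

accuracy = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    # Decode generated tokens back to label words, then score against gold labels.
    preds, labels = eval_pred
    if isinstance(preds, tuple):
        preds = preds[0]
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_words = tokenizer.batch_decode(preds, skip_special_tokens=True)
    gold_words = tokenizer.batch_decode(labels, skip_special_tokens=True)
    pred_ids = [WORD_TO_ID.get(word.strip(), -1) for word in pred_words]
    gold_ids = [WORD_TO_ID.get(word.strip(), -1) for word in gold_words]
    return accuracy.compute(predictions=pred_ids, references=gold_ids)

# compute_metrics is then passed to Seq2SeqTrainer (with predict_with_generate=True).
```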
We use Weights & Biases sweeps to run hyperparameter optimization.
To run a sweep, first set up the sweep configuration file (`sweep_config.yaml`) with the desired hyperparameters.
Then, launch the sweep with `wandb sweep sweep_config.yaml`.
Finally, run the sweep agent with `wandb agent <sweep_id>`.
We have provided sweep configurations for fine-tuning T5 on GLUE tasks (see the `sweeps` directory).
If you are running sweeps on a remote server, you can run `wandb sweep <sweep_config_file>` without `srun` or `sbatch`.
However, you will need to run `wandb agent <sweep_id>` with `srun` or `sbatch` to ensure that the sweep agent runs on the correct machine.
See the following example sequence of commands:
wandb sweep sweeps/glue_sst2_sweep.yaml
srun --account <account_name> --partition <partition_name> --gres=gpu:<num_gpus> wandb agent <sweep_id>
The sweep ID is printed by the `wandb sweep` command. Make sure that the number of GPUs you request is the same as the number of GPUs you specified when running `accelerate config`.
See an example yaml file for a sweep configuration below:
program: fine_tune_t5.py
project: "T5 Evaluation"
name: "T5 Evaluation -- GLUE: SST-2"
method: bayes
metric:
  name: eval/accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: uniform
    min: 1e-5
    max: 1e-3
  lr_scheduler_type:
    values: [
      "constant_with_warmup",
      "linear",
      "cosine",
    ]
  benchmark:
    value: 'glue'
  dataset_name:
    value: "sst2"
command:
  - accelerate
  - launch
  - ${program}
  - ${args}
Note: We use the `accelerate` launcher to run the training script. This is necessary for distributed training.
Note: We focus specifically on the learning rate and scheduler type hyperparameters since these are the most important for fine-tuning T5. However, you can add more hyperparameters as needed.
- Weights & Biases (W&B): For experiment tracking and logging.
- Accelerate: For distributed training.
- DeepSpeed: An alternative for distributed training, with the flexibility to support larger models.
- Abseil (absl-py) Parameterized Unit Tests: For efficient and readable testing (see the short example after this list).
- Lambda Functions: For dynamic function instantiation (metric computation and tokenizer functions). Used with the `map` function for dataset tokenization and with the metric-computation hook executed by Hugging Face trainers.
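As a small, generic illustration of the parameterized-test style (not a test taken from this repository):

```python
# Generic example of an absl-py parameterized test; not a test from this repository.
from absl.testing import absltest, parameterized


class TokenCountTest(parameterized.TestCase):

    @parameterized.named_parameters(
        ("two_words", "hello world", 2),
        ("empty_string", "", 0),
    )
    def test_whitespace_token_count(self, text, expected_count):
        # Each named parameter tuple becomes its own test case.
        self.assertEqual(len(text.split()), expected_count)


if __name__ == "__main__":
    absltest.main()
```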
The fine-tuning process in this repository is designed to be extensible, allowing users to easily add more datasets for task-specific training. Here's how you can expand the fine-tuning capabilities to more datasets:
The `constants.py` file includes a dictionary `DATASET_VALS` that defines the configurations for various datasets, such as prefixes, column names, metrics, and labels. You can add new datasets by extending this dictionary with the required details.
The fine-tuning script (`fine_tune_t5.py`) leverages these constants to dynamically instantiate functions for tokenization and metric computation using lambda functions.
This approach enhances flexibility and modularity, allowing seamless integration of new datasets.
DATASET_VALS = {
'DATASET_NAME': {
'task_name': {
'prefix': 'your_prefix',
'text_column_name': 'text_column',
'label_column_name': 'label_column',
'metric_to_optimize': 'chosen_metric',
'metric_names': ['metric1', 'metric2'],
'labels': {
0: 'label0',
1: 'label1',
},
},
},
}
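For instance, a filled-in entry for a sentiment task such as GLUE SST-2 could look like the snippet below. This is a hypothetical example; the repository's actual entries (prefixes, metric names, and label words) may differ.

```python
# Hypothetical example entry; the actual entries in constants.py may differ.
DATASET_VALS = {
    'glue': {
        'sst2': {
            'prefix': 'sst2 sentence: ',
            'text_column_name': 'sentence',
            'label_column_name': 'label',
            'metric_to_optimize': 'accuracy',
            'metric_names': ['accuracy'],
            'labels': {
                0: 'negative',
                1: 'positive',
            },
        },
    },
}
```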
This section provides guidance on common issues and frequently asked questions:
- Q: How can I set up distributed training?
  A: Follow the guide provided in the Distributed Training section above.
- Q: Can I use this repository for other benchmarks and tasks?
  A: Yes, the repository is designed to be extensible and can be adapted for various NLP benchmarks and tasks.
- Q: What is the maximum batch size for various T5 model sizes?
  A: The maximum batch size for T5 models is as follows:
  - T5-small: 8
  - T5-base: 4
- Q: What is the ideal learning rate range for each T5 model size?
  A: The ideal learning rate range for T5 models is as follows:
  - T5-small: 1e-5 - 1e-3
  - T5-base: ??? - ???
- Q: What are the ideal hyper-parameters to train T5-Base?
  A: Optimal hyperparameters for T5-Base on GLUE tasks:
| Task | Learning Rate | Scheduler Type | Batch Size | Epochs | Gradient Accumulation Steps | Eval Steps | Logging Steps |
|---|---|---|---|---|---|---|---|
| SST-2 | 5e-4 | linear (10% warmup) | 4 | 10 | 4 | 500 | 50 |
| MRPC | 5e-4 | linear (10% warmup) | 4 | 50 | 4 | 100 | 50 |
| RTE | 5e-4 | linear (10% warmup) | 4 | 50 | 4 | 100 | 50 |
| QNLI | 5e-4 | linear (10% warmup) | 4 | 5 | 4 | 2,000 | 50 |
| QQP | 5e-4 | linear (10% warmup) | 4 | 3 | 4 | | 50 |
| MNLI | 5e-4 | linear (10% warmup) | 4 | 3 | 4 | | 50 |
| CoLA | 5e-4 | linear (10% warmup) | 4 | 20 | 4 | | 50 |
| STS-B | 5e-4 | linear (10% warmup) | 4 | 20 | 4 | | 50 |
| WNLI | 5e-4 | linear (10% warmup) | 4 | 20 | 4 | | 50 |

Note that for all datasets, we use a batch size of 4 and a learning rate of 5e-4 with a linear scheduler with 10% warmup. We also always log every 50 steps and accumulate gradients every 4 steps. The number of epochs and evaluation steps are dataset-specific.
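Expressed with the `transformers` API, the SST-2 row above corresponds roughly to the training arguments below. The script builds these values from its command-line flags; this direct construction (and the `output_dir` value) is shown only for illustration.

```python
# Rough translation of the SST-2 row above into transformers training arguments.
# The repository's script builds these from command-line flags; this direct
# construction is for illustration only.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="outputs/sst2",           # illustrative output path
    learning_rate=5e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                    # 10% warmup
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=50,
    predict_with_generate=True,
    report_to="wandb",
)
```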
For more specific inquiries or troubleshooting, please feel free to open an issue on GitHub or contact the maintainers.
- Original T5 Paper: link
- Contributors and Collaborators: Zach Bamberger, Ofek Glick, Jonny Gal
- Supervision: Yonatan Belinkov and Chaim Baskin.