Add T5-Small training model #201

Merged: 8 commits, Sep 18, 2023
2 changes: 1 addition & 1 deletion README.md
@@ -290,7 +290,7 @@ under review indicates that support for the corresponding case has been developed and is being reviewed; Incoming
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="33.60" rowspan="1" style='height:33.60pt;border-right:none;border-bottom:none;' x:str>T5_small</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>under review</td>
<td class="xl69" x:str><a href="https://github.com/FlagOpen/FlagPerf/tree/main/training/nvidia/t5_small-pytorch" style="text-decoration:none" target="_parent">✅</a></td>
<td class="xl69" x:str>Incoming</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</a></td>
68 changes: 68 additions & 0 deletions training/benchmarks/t5_small/README.md
@@ -0,0 +1,68 @@

## Model Introduction
### What is T5-Small (Text-To-Text Transfer Transformer)?
The developers of the Text-To-Text Transfer Transformer (T5) [write](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html):

> With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task.

T5-Small is the checkpoint with 60 million parameters.

- **Developed by:** Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. See [associated paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) and [GitHub repo](https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints)
- **Model type:** Language model
- **Language(s) (NLP):** English, French, Romanian, German
- **License:** Apache 2.0
- **Related Models:** [All T5 Checkpoints](https://huggingface.co/models?search=t5)
- Resources for more information:
- [Research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf)
- [Google's T5 Blog Post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)
- [GitHub Repo](https://github.com/google-research/text-to-text-transfer-transformer)
- [Hugging Face T5 Docs](https://huggingface.co/docs/transformers/model_doc/t5)

## Model and Training Scripts Source Code
PyTorch case:
This repository includes software from https://github.com/huggingface/transformers/blob/v4.31.0/examples/pytorch/summarization/run_summarization_no_trainer.py
licensed under the Apache License 2.0.

Some of the files in this directory were modified by BAAI in 2023 to support FlagPerf.

## Dataset and Model Checkpoints

> Dataset website: https://huggingface.co/datasets/cnn_dailymail and https://github.com/abisee/cnn-dailymail

> Model checkpoint website: https://huggingface.co/t5-small/tree/main

We have already preprocessed the dataset and the model checkpoint files (the preprocessing script is `training/benchmarks/t5_small/pytorch/data_preprocessing/create_train_eval_data.py`).
The preprocessed data can be downloaded directly from https://bd.bcebos.com/klx-pytorch-ipipe-bd/flagperf/datasets/t5_small_train.tar.
No additional preprocessing steps are required.
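
For reference, a minimal sketch of fetching and unpacking the archive with the Python standard library (the target directory and file name below are examples only; any download tool works equally well):

```python
import tarfile
import urllib.request

# Download the preprocessed archive and unpack it next to the training scripts.
url = "https://bd.bcebos.com/klx-pytorch-ipipe-bd/flagperf/datasets/t5_small_train.tar"
archive = "t5_small_train.tar"

urllib.request.urlretrieve(url, archive)
with tarfile.open(archive) as tar:
    tar.extractall(".")  # creates the t5_small_train/ directory described below
```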

After decompressing, the dataset and model checkpoint files are organized as follows:

```
t5_small_train
├── dataset                    # dataset files
│   ├── eval_dataset.npz
│   └── train_dataset.npz
├── metrics                    # metrics for evaluation
│   └── rouge
│       └── rouge.py
├── model                      # model checkpoint and config files
│   ├── config.json
│   ├── generation_config.json
│   ├── model.safetensors
│   ├── spiece.model
│   ├── tokenizer.json
│   └── tokenizer_config.json
└── nltk_data                  # nltk data for evaluation
    └── tokenizers
        └── punkt
```
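
The two `.npz` files hold the tokenized arrays written by the preprocessing script. A quick sanity check, assuming the archive was unpacked into the current directory:

```python
import numpy as np

data = np.load("t5_small_train/dataset/eval_dataset.npz")
print(data.files)               # expected keys: input_ids, attention_mask, labels
print(data["input_ids"].shape)  # (num_examples, 1024) -- sources padded to max_source_length
print(data["labels"].shape)     # (num_examples, 128)  -- targets padded to max_target_length
```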

## Benchmark Task and Target Accuracy
This experiment fine-tunes the t5-small pretrained checkpoint on a summarization task over the CNN/Daily Mail dataset.
After fine-tuning for 3 epochs, the t5-small model reaches a ROUGE-1 score of 41+, which matches the evaluation result reported in the [paper](https://arxiv.org/abs/1910.10683).
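
Evaluation uses the `rouge` metric script bundled in the archive. A hedged sketch of the scoring step (loading the bundled script by path and the toy predictions are assumptions, not the exact benchmark code):

```python
import evaluate

# Load the metric script shipped under t5_small_train/metrics/rouge/.
rouge = evaluate.load("t5_small_train/metrics/rouge/rouge.py")

predictions = ["the cat sat on the mat"]        # decoded model summaries
references = ["a cat was sitting on the mat"]   # decoded reference highlights
scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
# scores["rouge1"] is a fraction in [0, 1]; the benchmark reports it scaled by 100
# and targets rouge1 >= 40.5 (see config/_base.py).
print(scores)
```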

## AI Frameworks && Accelerators Support

| | Pytorch | Paddle | TensorFlow2 |
| ---------- | ------- | ------ | ----------- |
| Nvidia GPU | [✅](../../nvidia/t5_small-pytorch/README.md) | N/A | N/A |
2 changes: 2 additions & 0 deletions training/benchmarks/t5_small/pytorch/config/__init__.py
@@ -0,0 +1,2 @@
from ._base import *
from .mutable_params import mutable_params
46 changes: 46 additions & 0 deletions training/benchmarks/t5_small/pytorch/config/_base.py
@@ -0,0 +1,46 @@
# DO NOT MODIFY THESE REQUIRED PARAMETERS

# Required parameters
vendor: str = None
data_dir: str = None
name: str = "t5_small"
cudnn_benchmark: bool = False
cudnn_deterministic: bool = True

# Optional parameters

# =========================================================
# optimizer & learning rate
# =========================================================
lr: float = 5e-5
weight_decay = 0.0

# =========================================================
# train && evaluate
# =========================================================
train_batch_size: int = 32
eval_batch_size: int = 32

max_epoch: int = 3
target_rouge1: float = 40.5

do_train = True
distributed: bool = True

# =========================================================
# utils
# =========================================================
seed: int = 0
dist_backend: str = 'nccl'
device: str = None

# =========================================================
# for driver
# =========================================================
local_rank: int = -1
use_env: bool = True
log_freq: int = 500
print_freq: int = 500
n_device: int = 1
sync_bn: bool = False
gradient_accumulation_steps: int = 1
5 changes: 5 additions & 0 deletions training/benchmarks/t5_small/pytorch/config/mutable_params.py
@@ -0,0 +1,5 @@
mutable_params = [
'vendor', 'data_dir', 'lr', 'weight_decay', 'train_batch_size',
'eval_batch_size', 'do_train', 'distributed', 'dist_backend', 'device',
'cudnn_benchmark', 'cudnn_deterministic'
]
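
Only the names listed in `mutable_params` may be overridden by a vendor configuration on top of `_base.py`. A minimal sketch of how such an override pass could look, assuming it runs from the `pytorch` directory so that the `config` package above is importable (the `apply_overrides` helper is hypothetical; FlagPerf's driver performs the real merging):

```python
import config  # the package above: everything from _base plus mutable_params


def apply_overrides(cfg_module, overrides: dict):
    """Apply vendor overrides, but only for whitelisted parameter names."""
    for name, value in overrides.items():
        if name not in cfg_module.mutable_params:
            raise KeyError(f"{name} cannot be overridden by a vendor config")
        setattr(cfg_module, name, value)


# e.g. a vendor config bumping the per-device batch size and learning rate
apply_overrides(config, {"train_batch_size": 64, "lr": 1e-4})
print(config.train_batch_size, config.lr)  # 64 0.0001
```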
89 changes: 89 additions & 0 deletions training/benchmarks/t5_small/pytorch/data_preprocessing/create_train_eval_data.py
@@ -0,0 +1,89 @@
import os

import numpy as np
import datasets
from transformers import AutoTokenizer


def save_dataset(ds, save_path):
    np.savez(save_path,
             input_ids=ds['input_ids'],
             attention_mask=ds['attention_mask'],
             labels=ds['labels'])


def main():
    data_prefix = 't5_small_train/dataset'
    os.makedirs(data_prefix, exist_ok=True)
    train_datapath = os.path.join(data_prefix, 'train_dataset.npz')
    eval_datapath = os.path.join(data_prefix, 'eval_dataset.npz')

    tokenizer = AutoTokenizer.from_pretrained('t5-small',
                                              use_fast=True,
                                              revision='main')

    raw_datasets = datasets.load_dataset('cnn_dailymail', '3.0.0')

    def preprocess_function(examples):
        # remove pairs where at least one record is None
        text_column = 'article'
        summary_column = 'highlights'
        prefix = 'summarize: '
        max_source_length = 1024
        max_target_length = 128
        ignore_pad_token_for_loss = True
        padding = "max_length"

        inputs, targets = [], []
        for i in range(len(examples[text_column])):
            if examples[text_column][i] and examples[summary_column][i]:
                inputs.append(examples[text_column][i])
                targets.append(examples[summary_column][i])

        inputs = [prefix + inp for inp in inputs]
        model_inputs = tokenizer(inputs,
                                 max_length=max_source_length,
                                 padding=padding,
                                 truncation=True)

        # Tokenize targets with the `text_target` keyword argument
        labels = tokenizer(text_target=targets,
                           max_length=max_target_length,
                           padding=padding,
                           truncation=True)

        # If we are padding here, replace all tokenizer.pad_token_id in the labels
        # by -100 when we want to ignore padding in the loss.
        if padding == "max_length" and ignore_pad_token_for_loss:
            labels["input_ids"] = [[
                (l if l != tokenizer.pad_token_id else -100) for l in label
            ] for label in labels["input_ids"]]

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    train_dataset = raw_datasets["train"]
    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=32,
        remove_columns=raw_datasets["train"].column_names,
        load_from_cache_file=True,
        desc="Running tokenizer on train dataset",
    ).with_format('numpy')
    save_dataset(train_dataset, train_datapath)

    eval_dataset = raw_datasets["validation"]
    eval_dataset = eval_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=32,
        remove_columns=raw_datasets["validation"].column_names,
        load_from_cache_file=True,
        desc="Running tokenizer on eval dataset",
    ).with_format('numpy')
    save_dataset(eval_dataset, eval_datapath)


if __name__ == "__main__":
    main()
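
The `-100` substitution above works because PyTorch's cross-entropy loss, which the Hugging Face T5 head uses internally, skips targets equal to its `ignore_index` (default `-100`). A tiny sketch of the effect:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 3, 10)         # (batch, seq_len, vocab_size)
labels = torch.tensor([[4, 7, -100]])  # last position is padding -> excluded from the loss

loss = F.cross_entropy(logits.view(-1, 10), labels.view(-1), ignore_index=-100)
print(loss)  # averaged over the two non-padding positions only
```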
1 change: 1 addition & 0 deletions training/benchmarks/t5_small/pytorch/dataloaders/__init__.py
@@ -0,0 +1 @@
from .dataloader import build_train_dataloader, build_eval_dataloader
83 changes: 83 additions & 0 deletions training/benchmarks/t5_small/pytorch/dataloaders/dataloader.py
@@ -0,0 +1,83 @@
import os
import numpy as np
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate


class T5Dataset(Dataset):

    def __init__(self, filepath):
        origin_data = np.load(filepath)
        self.input_ids = origin_data['input_ids']
        self.attention_mask = origin_data['attention_mask']
        self.labels = origin_data['labels']

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        sample = {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }
        return sample


def _prepare_decoder_input_ids_from_labels(input_ids):
    """
    https://github.com/huggingface/transformers/blob/v4.31.0/src/transformers/models/t5/modeling_t5.py#L1800
    https://github.com/huggingface/transformers/blob/v4.31.0/src/transformers/models/t5/modeling_t5.py#L851
    """
    decoder_start_token_id = 0
    pad_token_id = 0

    # Shift labels one position to the right and prepend the decoder start token.
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
    shifted_input_ids[..., 0] = decoder_start_token_id

    # Replace the -100 loss-masking value with the pad token id.
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)

    return shifted_input_ids


def my_collate(batch):
    """
    https://github.com/huggingface/transformers/blob/v4.31.0/src/transformers/data/data_collator.py#L600
    """
    new_batch = default_collate(batch)
    new_batch["decoder_input_ids"] = _prepare_decoder_input_ids_from_labels(
        new_batch["labels"])
    return new_batch


def build_train_dataloader(config):
    train_dataset = T5Dataset(
        os.path.join(config.data_dir, 'dataset', 'train_dataset.npz'))

    data_loader = torch.utils.data.DataLoader(
        train_dataset,
        shuffle=True,
        batch_size=config.train_batch_size,
        collate_fn=my_collate)
    return data_loader


def build_eval_dataloader(config):
    eval_dataset = T5Dataset(
        os.path.join(config.data_dir, 'dataset', 'eval_dataset.npz'))

    data_loader = torch.utils.data.DataLoader(
        eval_dataset, batch_size=config.eval_batch_size, collate_fn=my_collate)
    return data_loader


if __name__ == '__main__':
    from collections import namedtuple
    Config = namedtuple(
        'Config',
        ['data_dir', 'distributed', 'train_batch_size', 'eval_batch_size'])
    # data_dir is the unpacked archive root; the loaders append 'dataset/' themselves.
    config = Config('t5_small_train', False, 4, 4)
    eval_dataloader = build_eval_dataloader(config)
    for i, batch in enumerate(eval_dataloader):
        break
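
`_prepare_decoder_input_ids_from_labels` hands the decoder the label sequence delayed by one step, with the T5 decoder start token (id 0) prepended and any `-100` loss-masking values mapped back to the pad id. A toy check, assuming it is run from this directory so that `dataloader` is importable:

```python
import torch
from dataloader import _prepare_decoder_input_ids_from_labels

labels = torch.tensor([[100, 200, 300, -100]])
decoder_input_ids = _prepare_decoder_input_ids_from_labels(labels)
print(decoder_input_ids)  # tensor([[  0, 100, 200, 300]])
```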
19 changes: 19 additions & 0 deletions training/benchmarks/t5_small/pytorch/model/__init__.py
@@ -0,0 +1,19 @@
import os
from transformers import T5Config, T5ForConditionalGeneration, T5TokenizerFast


def create_model(config):
    model_path = os.path.join(config.data_dir, 'model')
    hfconfig = T5Config.from_pretrained(model_path)
    model = T5ForConditionalGeneration.from_pretrained(model_path,
                                                       config=hfconfig)
    tokenizer = T5TokenizerFast.from_pretrained(model_path)
    return model, hfconfig, tokenizer


if __name__ == '__main__':
    from collections import namedtuple
    Config = namedtuple('Config', ['data_dir'])
    config = Config('t5_small_train')
    model, hfconfig, tokenizer = create_model(config)
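
A short usage sketch for `create_model`: load the bundled checkpoint and summarize one article. The import path, generation parameters, and example text are illustrative only, not part of the benchmark:

```python
from collections import namedtuple

from model import create_model  # the package defined above

Config = namedtuple("Config", ["data_dir"])
model, hfconfig, tokenizer = create_model(Config("t5_small_train"))
model.eval()

article = ("summarize: The quick brown fox jumped over the lazy dog "
           "near the river bank on a sunny afternoon.")
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```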
27 changes: 27 additions & 0 deletions training/benchmarks/t5_small/pytorch/optimizers/__init__.py
@@ -0,0 +1,27 @@
import torch


def create_optimizer(model, args):
    # Optimizer
    # Split weights in two groups, one with weight decay and the other not.
    no_decay = ["bias", "LayerNorm.weight", "layer_norm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [
                p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": args.weight_decay,
        },
        {
            "params": [
                p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]
    optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.lr)
    return optimizer