Add T5-Small training model #201

Merged: 8 commits, Sep 18, 2023
2 changes: 1 addition & 1 deletion README.md
@@ -290,7 +290,7 @@ under review indicates that support for the corresponding case has been developed and is being reviewed; Incoming
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="33.60" rowspan="1" style='height:33.60pt;border-right:none;border-bottom:none;' x:str>T5_small</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>under review</td>
<td class="xl69" x:str><a href="https://github.com/FlagOpen/FlagPerf/tree/main/training/nvidia/t5_small-pytorch" style="text-decoration:none" target="_parent">✅</a></td>
<td class="xl69" x:str>Incoming</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</a></td>
68 changes: 68 additions & 0 deletions training/benchmarks/t5_small/README.md
@@ -0,0 +1,68 @@

## Model Introduction
### What is T5-Small (Text-To-Text Transfer Transformer)?
The developers of the Text-To-Text Transfer Transformer (T5) [write](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html):

> With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task.

T5-Small is the checkpoint with 60 million parameters.

- **Developed by:** Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. See [associated paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf) and [GitHub repo](https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints)
- **Model type:** Language model
- **Language(s) (NLP):** English, French, Romanian, German
- **License:** Apache 2.0
- **Related Models:** [All T5 Checkpoints](https://huggingface.co/models?search=t5)
- Resources for more information:
- [Research paper](https://jmlr.org/papers/volume21/20-074/20-074.pdf)
- [Google's T5 Blog Post](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)
- [GitHub Repo](https://github.com/google-research/text-to-text-transfer-transformer)
- [Hugging Face T5 Docs](https://huggingface.co/docs/transformers/model_doc/t5)

## Model and Training Scripts Source Code
PyTorch case:
This repository includes software from https://github.com/huggingface/transformers/blob/v4.31.0/examples/pytorch/summarization/run_summarization_no_trainer.py
licensed under the Apache License 2.0.

Some of the files in this directory were modified by BAAI in 2023 to support FlagPerf.

## Dataset and Model Checkpoints

> Dataset website: https://huggingface.co/datasets/cnn_dailymail and https://github.com/abisee/cnn-dailymail

> Model checkpoint website: https://huggingface.co/t5-small/tree/main

We have already preprocessed the dataset and the model checkpoint files (the preprocessing script is `training/benchmarks/t5_small/pytorch/data_preprocessing/create_train_eval_data.py`).
The preprocessed data can be downloaded directly from https://bd.bcebos.com/klx-pytorch-ipipe-bd/flagperf/datasets/t5_small_train.tar.
No additional preprocessing steps are required.
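
For reference, a minimal sketch of fetching and unpacking the archive with the Python standard library (the target directory and file name below are examples only; any download tool works equally well):

```python
import tarfile
import urllib.request

# Download the preprocessed archive and unpack it next to the training scripts.
url = "https://bd.bcebos.com/klx-pytorch-ipipe-bd/flagperf/datasets/t5_small_train.tar"
archive = "t5_small_train.tar"

urllib.request.urlretrieve(url, archive)
with tarfile.open(archive) as tar:
    tar.extractall(".")  # creates the t5_small_train/ directory described below
```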

After decompressing, the dataset and model checkpoint files are organized as follows:

```
t5_small_train
├── dataset                    # dataset files
│   ├── eval_dataset.npz
│   └── train_dataset.npz
├── metrics                    # metrics for evaluation
│   └── rouge
│       └── rouge.py
├── model                      # model checkpoint and config files
│   ├── config.json
│   ├── generation_config.json
│   ├── model.safetensors
│   ├── spiece.model
│   ├── tokenizer.json
│   └── tokenizer_config.json
└── nltk_data                  # nltk data for evaluation
    └── tokenizers
        └── punkt
```
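
The two `.npz` files hold the tokenized arrays written by the preprocessing script. A quick sanity check, assuming the archive was unpacked into the current directory:

```python
import numpy as np

data = np.load("t5_small_train/dataset/eval_dataset.npz")
print(data.files)               # expected keys: input_ids, attention_mask, labels
print(data["input_ids"].shape)  # (num_examples, 1024) -- sources padded to max_source_length
print(data["labels"].shape)     # (num_examples, 128)  -- targets padded to max_target_length
```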

## Benchmark Task and Target Accuracy
This experiment fine-tunes the t5-small pretrained checkpoint on a summarization task over the CNN/Daily Mail dataset.
After fine-tuning for 3 epochs, the t5-small model reaches a ROUGE-1 score of 41+, which matches the evaluation result reported in the [paper](https://arxiv.org/abs/1910.10683).
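
Evaluation uses the `rouge` metric script bundled in the archive. A hedged sketch of the scoring step (loading the bundled script by path and the toy predictions are assumptions, not the exact benchmark code):

```python
import evaluate

# Load the metric script shipped under t5_small_train/metrics/rouge/.
rouge = evaluate.load("t5_small_train/metrics/rouge/rouge.py")

predictions = ["the cat sat on the mat"]        # decoded model summaries
references = ["a cat was sitting on the mat"]   # decoded reference highlights
scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
# scores["rouge1"] is a fraction in [0, 1]; the benchmark reports it scaled by 100
# and targets rouge1 >= 40.5 (see config/_base.py).
print(scores)
```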

## AI Frameworks && Accelerators Support

| | Pytorch | Paddle | TensorFlow2 |
| ---------- | ------- | ------ | ----------- |
| Nvidia GPU | [✅](../../nvidia/t5_small-pytorch/README.md) | N/A | N/A |
2 changes: 2 additions & 0 deletions training/benchmarks/t5_small/pytorch/config/__init__.py
@@ -0,0 +1,2 @@
from ._base import *
from .mutable_params import mutable_params
46 changes: 46 additions & 0 deletions training/benchmarks/t5_small/pytorch/config/_base.py
@@ -0,0 +1,46 @@
# DO NOT MODIFY THESE REQUIRED PARAMETERS

# Required parameters
vendor: str = None
data_dir: str = None
name: str = "t5_small"
cudnn_benchmark: bool = False
cudnn_deterministic: bool = True

# Optional parameters

# =========================================================
# optimizer & learning rate
# =========================================================
lr: float = 5e-5
weight_decay = 0.0

# =========================================================
# train && evaluate
# =========================================================
train_batch_size: int = 32
eval_batch_size: int = 32

max_epoch: int = 3
target_rouge1: float = 40.5

do_train = True
distributed: bool = True

# =========================================================
# utils
# =========================================================
seed: int = 0
dist_backend: str = 'nccl'
device: str = None

# =========================================================
# for driver
# =========================================================
local_rank: int = -1
use_env: bool = True
log_freq: int = 500
print_freq: int = 500
n_device: int = 1
sync_bn: bool = False
gradient_accumulation_steps: int = 1
5 changes: 5 additions & 0 deletions training/benchmarks/t5_small/pytorch/config/mutable_params.py
@@ -0,0 +1,5 @@
mutable_params = [
'vendor', 'data_dir', 'lr', 'weight_decay', 'train_batch_size',
'eval_batch_size', 'do_train', 'distributed', 'dist_backend', 'device',
'cudnn_benchmark', 'cudnn_deterministic'
]
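
Only the names listed in `mutable_params` may be overridden by a vendor configuration on top of `_base.py`. A minimal sketch of how such an override pass could look, assuming it runs from the `pytorch` directory so that the `config` package above is importable (the `apply_overrides` helper is hypothetical; FlagPerf's driver performs the real merging):

```python
import config  # the package above: everything from _base plus mutable_params


def apply_overrides(cfg_module, overrides: dict):
    """Apply vendor overrides, but only for whitelisted parameter names."""
    for name, value in overrides.items():
        if name not in cfg_module.mutable_params:
            raise KeyError(f"{name} cannot be overridden by a vendor config")
        setattr(cfg_module, name, value)


# e.g. a vendor config bumping the per-device batch size and learning rate
apply_overrides(config, {"train_batch_size": 64, "lr": 1e-4})
print(config.train_batch_size, config.lr)  # 64 0.0001
```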
89 changes: 89 additions & 0 deletions training/benchmarks/t5_small/pytorch/data_preprocessing/create_train_eval_data.py
@@ -0,0 +1,89 @@
import os

import numpy as np
import datasets
from transformers import AutoTokenizer


def save_dataset(ds, save_path):
    np.savez(save_path,
             input_ids=ds['input_ids'],
             attention_mask=ds['attention_mask'],
             labels=ds['labels'])


def main():
    data_prefix = 't5_small_train/dataset'
    os.makedirs(data_prefix, exist_ok=True)
    train_datapath = os.path.join(data_prefix, 'train_dataset.npz')
    eval_datapath = os.path.join(data_prefix, 'eval_dataset.npz')

    tokenizer = AutoTokenizer.from_pretrained('t5-small',
                                              use_fast=True,
                                              revision='main')

    raw_datasets = datasets.load_dataset('cnn_dailymail', '3.0.0')

    def preprocess_function(examples):
        # remove pairs where at least one record is None
        text_column = 'article'
        summary_column = 'highlights'
        prefix = 'summarize: '
        max_source_length = 1024
        max_target_length = 128
        ignore_pad_token_for_loss = True
        padding = "max_length"

        inputs, targets = [], []
        for i in range(len(examples[text_column])):
            if examples[text_column][i] and examples[summary_column][i]:
                inputs.append(examples[text_column][i])
                targets.append(examples[summary_column][i])

        inputs = [prefix + inp for inp in inputs]
        model_inputs = tokenizer(inputs,
                                 max_length=max_source_length,
                                 padding=padding,
                                 truncation=True)

        # Tokenize targets with the `text_target` keyword argument
        labels = tokenizer(text_target=targets,
                           max_length=max_target_length,
                           padding=padding,
                           truncation=True)

        # If we are padding here, replace all tokenizer.pad_token_id in the labels
        # by -100 when we want to ignore padding in the loss.
        if padding == "max_length" and ignore_pad_token_for_loss:
            labels["input_ids"] = [[
                (l if l != tokenizer.pad_token_id else -100) for l in label
            ] for label in labels["input_ids"]]

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    train_dataset = raw_datasets["train"]
    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=32,
        remove_columns=raw_datasets["train"].column_names,
        load_from_cache_file=True,
        desc="Running tokenizer on train dataset",
    ).with_format('numpy')
    save_dataset(train_dataset, train_datapath)

    eval_dataset = raw_datasets["validation"]
    eval_dataset = eval_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=32,
        remove_columns=raw_datasets["validation"].column_names,
        load_from_cache_file=True,
        desc="Running tokenizer on eval dataset",
    ).with_format('numpy')
    save_dataset(eval_dataset, eval_datapath)


if __name__ == "__main__":
    main()
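
The `-100` substitution above works because PyTorch's cross-entropy loss, which the Hugging Face T5 head uses internally, skips targets equal to its `ignore_index` (default `-100`). A tiny sketch of the effect:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 3, 10)         # (batch, seq_len, vocab_size)
labels = torch.tensor([[4, 7, -100]])  # last position is padding -> excluded from the loss

loss = F.cross_entropy(logits.view(-1, 10), labels.view(-1), ignore_index=-100)
print(loss)  # averaged over the two non-padding positions only
```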
1 change: 1 addition & 0 deletions training/benchmarks/t5_small/pytorch/dataloaders/__init__.py
@@ -0,0 +1 @@
from .dataloader import build_train_dataloader, build_eval_dataloader
83 changes: 83 additions & 0 deletions training/benchmarks/t5_small/pytorch/dataloaders/dataloader.py
@@ -0,0 +1,83 @@
import os
import numpy as np
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate


class T5Dataset(Dataset):

    def __init__(self, filepath):
        origin_data = np.load(filepath)
        self.input_ids = origin_data['input_ids']
        self.attention_mask = origin_data['attention_mask']
        self.labels = origin_data['labels']

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        sample = {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }
        return sample


def _prepare_decoder_input_ids_from_labels(input_ids):
    """
    https://github.com/huggingface/transformers/blob/v4.31.0/src/transformers/models/t5/modeling_t5.py#L1800
    https://github.com/huggingface/transformers/blob/v4.31.0/src/transformers/models/t5/modeling_t5.py#L851
    """
    decoder_start_token_id = 0
    pad_token_id = 0

    # Shift labels one position to the right and prepend the decoder start token.
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
    shifted_input_ids[..., 0] = decoder_start_token_id

    # Replace the -100 loss-masking value with the pad token id.
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)

    return shifted_input_ids


def my_collate(batch):
    """
    https://github.com/huggingface/transformers/blob/v4.31.0/src/transformers/data/data_collator.py#L600
    """
    new_batch = default_collate(batch)
    new_batch["decoder_input_ids"] = _prepare_decoder_input_ids_from_labels(
        new_batch["labels"])
    return new_batch


def build_train_dataloader(config):
    train_dataset = T5Dataset(
        os.path.join(config.data_dir, 'dataset', 'train_dataset.npz'))

    data_loader = torch.utils.data.DataLoader(
        train_dataset,
        shuffle=True,
        batch_size=config.train_batch_size,
        collate_fn=my_collate)
    return data_loader


def build_eval_dataloader(config):
    eval_dataset = T5Dataset(
        os.path.join(config.data_dir, 'dataset', 'eval_dataset.npz'))

    data_loader = torch.utils.data.DataLoader(
        eval_dataset, batch_size=config.eval_batch_size, collate_fn=my_collate)
    return data_loader


if __name__ == '__main__':
    from collections import namedtuple
    Config = namedtuple(
        'Config',
        ['data_dir', 'distributed', 'train_batch_size', 'eval_batch_size'])
    # data_dir is the unpacked archive root; the loaders append 'dataset/' themselves.
    config = Config('t5_small_train', False, 4, 4)
    eval_dataloader = build_eval_dataloader(config)
    for i, batch in enumerate(eval_dataloader):
        break
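
`_prepare_decoder_input_ids_from_labels` hands the decoder the label sequence delayed by one step, with the T5 decoder start token (id 0) prepended and any `-100` loss-masking values mapped back to the pad id. A toy check, assuming it is run from this directory so that `dataloader` is importable:

```python
import torch
from dataloader import _prepare_decoder_input_ids_from_labels

labels = torch.tensor([[100, 200, 300, -100]])
decoder_input_ids = _prepare_decoder_input_ids_from_labels(labels)
print(decoder_input_ids)  # tensor([[  0, 100, 200, 300]])
```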
19 changes: 19 additions & 0 deletions training/benchmarks/t5_small/pytorch/model/__init__.py
@@ -0,0 +1,19 @@
import os
from transformers import T5Config, T5ForConditionalGeneration, T5TokenizerFast


def create_model(config):
    model_path = os.path.join(config.data_dir, 'model')
    hfconfig = T5Config.from_pretrained(model_path)
    model = T5ForConditionalGeneration.from_pretrained(model_path,
                                                       config=hfconfig)
    tokenizer = T5TokenizerFast.from_pretrained(model_path)
    return model, hfconfig, tokenizer


if __name__ == '__main__':
    from collections import namedtuple
    Config = namedtuple('Config', ['data_dir'])
    config = Config('t5_small_train')
    model, hfconfig, tokenizer = create_model(config)
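
A short usage sketch for `create_model`: load the bundled checkpoint and summarize one article. The import path, generation parameters, and example text are illustrative only, not part of the benchmark:

```python
from collections import namedtuple

from model import create_model  # the package defined above

Config = namedtuple("Config", ["data_dir"])
model, hfconfig, tokenizer = create_model(Config("t5_small_train"))
model.eval()

article = ("summarize: The quick brown fox jumped over the lazy dog "
           "near the river bank on a sunny afternoon.")
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```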
27 changes: 27 additions & 0 deletions training/benchmarks/t5_small/pytorch/optimizers/__init__.py
@@ -0,0 +1,27 @@
import torch


def create_optimizer(model, args):
    # Optimizer
    # Split weights in two groups, one with weight decay and the other not.
    no_decay = ["bias", "LayerNorm.weight", "layer_norm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [
                p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": args.weight_decay,
        },
        {
            "params": [
                p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]
    optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.lr)
    return optimizer