Develop #71

Merged
merged 28 commits on Jul 21, 2022
2763b5d
Opt 30b (#16)
920232796 Jul 1, 2022
3e52907
fix bert tokenizer issue (#18)
Anhforth Jul 1, 2022
3a0c8cb
Opt 66b (#19)
920232796 Jul 6, 2022
4f8d715
updated release version
Anhforth Jul 6, 2022
efc1310
fix tokenizer issue
Anhforth Jul 6, 2022
9b81869
fix bug multi_gpu_training
920232796 Jul 8, 2022
7ad38a0
Merge pull request #21 from baai-open-internal/fix_multi_gpu_training
Anhforth Jul 8, 2022
72ffd6a
changed the version
Anhforth Jul 8, 2022
e6f89a6
fix_validation_bug (#24)
920232796 Jul 11, 2022
29ea850
updated the version
Anhforth Jul 11, 2022
8d44329
add vit and examples
920232796 Jul 15, 2022
81c438d
vit and examples
920232796 Jul 15, 2022
da24628
Update base_model.py
marscrazy Jul 15, 2022
aff728b
Update vit.py
marscrazy Jul 15, 2022
e5a0ddb
modify readme.md
920232796 Jul 15, 2022
fe56b8b
modify readme.md
920232796 Jul 15, 2022
fc6c32e
delete annotating code
920232796 Jul 15, 2022
cd45e5c
Vit xzh (#25)
920232796 Jul 15, 2022
faee281
Merge branch 'develop' into vit_xzh
BAAI-OpenPlatform Jul 19, 2022
06f0b69
Merge pull request #28 from baai-open-internal/vit_xzh
BAAI-OpenPlatform Jul 19, 2022
deaa120
Merge pull request #27 from baai-open-internal/develop
marscrazy Jul 20, 2022
9558a47
env trainer
920232796 Jul 20, 2022
c35d4b6
Merge pull request #29 from baai-open-internal/env_args
marscrazy Jul 20, 2022
dc5a84d
Create README.md
marscrazy Jul 21, 2022
437caa4
vit-checkpoint-activations
920232796 Jul 21, 2022
dc6fc3d
vit-checkpoint-activations
920232796 Jul 21, 2022
c1cec9f
Merge pull request #33 from baai-open-internal/vit-checkpointing-acti…
marscrazy Jul 21, 2022
dc98332
Merge remote-tracking branch 'upstream/develop' into develop
marscrazy Jul 21, 2022
5 changes: 2 additions & 3 deletions README.md
@@ -7,7 +7,7 @@
--------------------------------------------------------------------------------


FlagAI (Fast LArge-scale General AI models) is an fast, easy-to-use and extensible toolkit for large-scale model. Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality. Currently, we are focusing on NLP models and tasks. In near futher, we will support for other modalities.
FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale model. Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality. Currently, we are focusing on NLP models and tasks. In near futher, we will support for other modalities.

* Now it supports **WuDao GLM** with a maximum of 10 billion parameters (see [Introduction to GLM](/docs/GLM.md)). It also supports **BERT**, **RoBERTa**, **GPT2**, **T5**, and models from Huggingface Transformers.

@@ -18,7 +18,7 @@ FlagAI (Fast LArge-scale General AI models) is an fast, easy-to-use and extensib
* FlagAI is backed by the three most popular data/model parallel libraries — [PyTorch](https://pytorch.org/)/[Deepspeed](https://www.deepspeed.ai/)/[Megatron-LM](https://github.com/NVIDIA/Megatron-LM) — with seamless integration between them. Users can parallel their training/testing process with less than ten lines of code.


The code is partially based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).
The code is partially based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers), [timm](https://github.com/rwightman/pytorch-image-models) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).


<!-- toc -->
@@ -156,7 +156,6 @@ start with our [contributor guidelines](CONTRIBUTING.md) and then
check these [open issues](https://github.com/BAAI-Open/FlagAI/issues) for specific tasks.

## Contact us
Scan wechat QR code

<img src="./flagai_wechat.png" width = "200" height = "200" align=center />

2 changes: 1 addition & 1 deletion README_zh.md
@@ -18,7 +18,7 @@
* FlagAI is backed by the three most popular data/model parallel libraries ([PyTorch](https://pytorch.org/)/[Deepspeed](https://www.deepspeed.ai/)/[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)), with seamless integration between them. You can parallelize your training/testing process with less than ten lines of code.


Part of the code in this project is based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).
Part of the code in this project is based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers), [timm](https://github.com/rwightman/pytorch-image-models) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).

<!-- toc -->

71 changes: 70 additions & 1 deletion doc_zh/TUTORIAL_4_TRAINER.md
@@ -13,7 +13,7 @@
- [deepspeed](#deepspeed)
- [pytorchDDP](#pytorchddp)
- [deepspeed + megatron-lm](#deepspeed--megatron-lm)

- [EnvTrainer](#EnvTrainer)

The Trainer class provides APIs for training with multiple parallel frameworks. The APIs support distributed training with PyTorch DDP/Deepspeed on multiple GPUs, hybrid parallel distributed training with Megatron-LM + Deepspeed, and mixed precision via NVIDIA Apex.
## Getting Started
@@ -335,3 +335,72 @@ trainer = MyTrainer(
)
```

# EnvTrainer

To make it easier to pass in parameters, we provide the EnvTrainer as a replacement for the original Trainer.
For example:
```python
# train.py
import torch
from flagai.env_args import EnvArgs
from flagai.env_trainer import EnvTrainer

lr = 2e-5
n_epochs = 50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env_args = EnvArgs(
env_type="pytorch",
experiment_name="vit-cifar100-single_gpu",
batch_size=150,
num_gpus=1,
gradient_accumulation_steps=1,
lr=lr,
weight_decay=1e-5,
epochs=n_epochs,
log_interval=100,
eval_interval=1000,
load_dir=None,
pytorch_device=device,
save_dir="checkpoints_vit_cifar100_single_gpu",
save_interval=1000,
num_checkpoints=1,
)

env_args.add_arg(arg_name="test1", default=0, type=int, )
env_args_parse = env_args.parse_args()
trainer = EnvTrainer(env_args)
```

When you run the train.py file, you can override the input parameters from the command line.
```commandline
python train.py --batch_size=8 --epochs=10
```
If you need to add extra parameters, you can call this function:
```python
env_args.add_arg(arg_name="test1", default=0, type=int, )
```
Then you can run train.py with the following command:
```commandline
python train.py --test1=1
```
More examples can be found in:

1. [vit-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/vit_cifar100/train_env_trainer.py)

2. [glm-title-generation-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/glm_title_generation/train_env_trainer.py)


# Run with the pytorchDDP launcher or deepspeed launcher
If you use multiple GPUs to train a model, you can run train.py directly, which calls the launcher built into the FlagAI Trainer.
```commandline
python train.py
```
Alternatively, you can also run it with the pytorchDDP or deepspeed launcher, for example:
### pytorchDDP
```commandline
python -m torch.distributed.launch --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 17750 train_env_trainer.py --not_call_launch
```
### deepspeed
```commandline
python -m deepspeed.launcher.launch --master_addr=172.31.125.121 --master_port=17500 train.py --not_call_launch
```
76 changes: 76 additions & 0 deletions docs/TUTORIAL_4_TRAINER.md
@@ -13,6 +13,9 @@
- [deepspeed](#deepspeed)
- [pytorchDDP](#pytorchddp)
- [deepspeed + megatron-lm](#deepspeed--megatron-lm)
- [EnvTrainer](#EnvTrainer)


The Trainer class provides APIs for training with multiple parallel frameworks. The API supports distributed training with Pytorch DDP/Deepspeed on multiple GPUs, as well as mixed parallel distributed training with Megatron-LM+Deepspeed, and mixed precision via NVIDIA Apex.

## Getting Started
@@ -341,3 +344,76 @@ trainer = MyTrainer(
model_paralle_size = 2
)
```

# EnvTrainer

To make it easier to pass in parameters, we provide the EnvTrainer as a replacement for the original Trainer.

Take the following code as an example:
```python
# train.py
import torch
from flagai.env_args import EnvArgs
from flagai.env_trainer import EnvTrainer

lr = 2e-5
n_epochs = 50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env_args = EnvArgs(
env_type="pytorch",
experiment_name="vit-cifar100-single_gpu",
batch_size=150,
num_gpus=1,
gradient_accumulation_steps=1,
lr=lr,
weight_decay=1e-5,
epochs=n_epochs,
log_interval=100,
eval_interval=1000,
load_dir=None,
pytorch_device=device,
save_dir="checkpoints_vit_cifar100_single_gpu",
save_interval=1000,
num_checkpoints=1,
)

env_args.add_arg(arg_name="test1", default=0, type=int, )
env_args_parse = env_args.parse_args()
trainer = EnvTrainer(env_args)
```
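The snippet above stops after constructing the trainer. As a rough sketch of what follows, the model and datasets are handed to `trainer.train`, mirroring the `train_env_trainer.py` example added in this PR (the model, datasets, and collate function below are placeholders you supply yourself):
```python
# Sketch only: `model`, `train_dataset`, `val_dataset` and `my_collate_fn` are
# placeholders built elsewhere (e.g. via AutoLoader and a torch Dataset),
# following the pattern in examples/glm_title_generation/train_env_trainer.py.
trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=val_dataset,
              collate_fn=my_collate_fn)
```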

When you run the train.py file, you can override the input parameters from the command line.
```commandline
python train.py --batch_size=8 --epochs=10
```
If you need to add additional parameters, you can call this function:
```python
env_args.add_arg(arg_name="test1", default=0, type=int, )
```
Then you can run the train.py file with the following command:
```commandline
python train.py --test1=1
```
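This PR's examples call `env_args.parse_args()` but never show the added value being read back. A minimal sketch, assuming the returned object exposes each added argument as an attribute (the `test1` attribute access below is an assumption, not a documented API):
```python
# Assumption: parse_args() returns an object carrying each added argument as an attribute.
env_args.add_arg(arg_name="test1", default=0, type=int)
env_args_parse = env_args.parse_args()
print(env_args_parse.test1)  # hypothetical access; would print 1 when launched with --test1=1
```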

More examples can be found in:

1. [vit-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/vit_cifar100/train_env_trainer.py)

2. [glm-title-generation-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/glm_title_generation/train_env_trainer.py)


# Run with the pytorchDDP launcher or deepspeed launcher
If you use multiple GPUs to train a model, you can run train.py directly, which calls the launcher built into the FlagAI Trainer.
```commandline
python train.py
```
Alternatively, you can also use the pytorchDDP or deepspeed launcher, for example:

### pytorchDDP
```commandline
python -m torch.distributed.launch --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 17750 train_env_trainer.py --not_call_launch
```
### deepspeed
```commandline
python -m deepspeed.launcher.launch --master_addr=172.31.125.121 --master_port=17500 train.py --not_call_launch
```
144 changes: 144 additions & 0 deletions examples/glm_title_generation/train_env_trainer.py
@@ -0,0 +1,144 @@
# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import os
import numpy as np
import torch
from torch.utils.data import Dataset
from flagai.auto_model.auto_loader import AutoLoader
from flagai.env_trainer import EnvTrainer
from flagai.env_args import EnvArgs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# You can set all parameters from the command line.
# For example: python train_env_trainer.py --epochs=300 --batch_size=4 --env_type=pytorch
env_args = EnvArgs()
trainer = EnvTrainer(env_args)

cur_dir = os.path.dirname(os.path.abspath(__file__))
src_dir = cur_dir + '/data/train.src'
tgt_dir = cur_dir + '/data/train.tgt'

maxlen = 256
auto_loader = AutoLoader("lm",
model_name="GLM-large-ch",
model_dir="./state_dict/")
model = auto_loader.get_model()
tokenizer = auto_loader.get_tokenizer()

def read_file():
src = []
tgt = []

with open(src_dir, 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in lines:
src.append(line.strip('\n').lower())

with open(tgt_dir, 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in lines:
tgt.append(line.strip('\n').lower())

return src, tgt


class GLMSeq2seqDataset(Dataset):

def __init__(self,
sents_src,
sents_tgt,
tokenizer,
max_src_length=300,
max_tgt_length=200):
super(GLMSeq2seqDataset, self).__init__()
self.sents_src = sents_src
self.sents_tgt = sents_tgt
self.tokenizer = tokenizer
self.max_src_length = max_src_length
self.max_tgt_length = max_tgt_length
self.no_block_position = False

def __getitem__(self, i):
source_text = self.sents_src[i]
target_text = self.sents_tgt[i]
data = self.tokenizer.encode_plus(source_text, target_text)

return data

def __len__(self):

return len(self.sents_src)


class GLMPoetryDynamicCollateFN(): #padding process in each batch

def __init__(self, pad_id):
self.pad_id = pad_id

def pad_token(self, tokens, max_length):
pad_len = max_length - len(tokens)
tokens += [self.pad_id] * pad_len
return tokens

def pad_position_ids(self, position_ids, max_length):
pad_len = max_length - len(position_ids[0])
position_ids[0] += [len(position_ids[0]) + x for x in range(pad_len)]
position_ids[1] += [1] * pad_len
return position_ids

def pad_loss_mask(self, loss_mask, max_length):
pad_len = max_length - len(loss_mask)
loss_mask += [0] * pad_len
return loss_mask

def __call__(self, batch):
input_ids = [data["input_ids"] for data in batch]
target_ids = [data["target_ids"] for data in batch]
position_ids = [data["position_ids"] for data in batch]
attention_mask = [data['attention_mask'] for data in batch]
loss_mask = [data['loss_mask'] for data in batch]

max_length = max([len(t) for t in input_ids])
for i in range(len(input_ids)):
input_ids[i] = self.pad_token(input_ids[i], max_length)
target_ids[i] = self.pad_token(target_ids[i], max_length)
position_ids[i] = self.pad_position_ids(position_ids[i],
max_length)
loss_mask[i] = self.pad_loss_mask(loss_mask[i], max_length)
return {
'input_ids': torch.LongTensor(input_ids),
'labels': torch.LongTensor(target_ids),
'position_ids': torch.LongTensor(position_ids),
'attention_mask': torch.LongTensor(attention_mask),
'loss_mask': torch.LongTensor(loss_mask)
}


sents_src, sents_tgt = read_file()
my_collate_fn = GLMPoetryDynamicCollateFN(
pad_id=tokenizer.get_command('pad').Id)

data_len = len(sents_tgt)
train_size = int(data_len * 0.8)
train_src = sents_src[:train_size][:2000]
train_tgt = sents_tgt[:train_size][:2000]

val_src = sents_src[train_size:]
val_tgt = sents_tgt[train_size:]

train_dataset = GLMSeq2seqDataset(train_src,
train_tgt,
tokenizer=tokenizer,
max_src_length=300,
max_tgt_length=200)
val_dataset = GLMSeq2seqDataset(val_src,
val_tgt,
tokenizer=tokenizer,
max_src_length=300,
max_tgt_length=200)

trainer.train(model,
train_dataset=train_dataset,
valid_dataset=val_dataset,
collate_fn=my_collate_fn)