Develop #71

Merged
28 commits merged on Jul 21, 2022
Commits
2763b5d
Opt 30b (#16)
920232796 Jul 1, 2022
3e52907
fix bert tokenizer issue (#18)
Anhforth Jul 1, 2022
3a0c8cb
Opt 66b (#19)
920232796 Jul 6, 2022
4f8d715
updated release version
Anhforth Jul 6, 2022
efc1310
fix tokenizer issue
Anhforth Jul 6, 2022
9b81869
fix bug multi_gpu_training
920232796 Jul 8, 2022
7ad38a0
Merge pull request #21 from baai-open-internal/fix_multi_gpu_training
Anhforth Jul 8, 2022
72ffd6a
changed the version
Anhforth Jul 8, 2022
e6f89a6
fix_validation_bug (#24)
920232796 Jul 11, 2022
29ea850
updated the version
Anhforth Jul 11, 2022
8d44329
add vit and examples
920232796 Jul 15, 2022
81c438d
vit and examples
920232796 Jul 15, 2022
da24628
Update base_model.py
marscrazy Jul 15, 2022
aff728b
Update vit.py
marscrazy Jul 15, 2022
e5a0ddb
modify readme.md
920232796 Jul 15, 2022
fe56b8b
modify readme.md
920232796 Jul 15, 2022
fc6c32e
delete annotating code
920232796 Jul 15, 2022
cd45e5c
Vit xzh (#25)
920232796 Jul 15, 2022
faee281
Merge branch 'develop' into vit_xzh
BAAI-OpenPlatform Jul 19, 2022
06f0b69
Merge pull request #28 from baai-open-internal/vit_xzh
BAAI-OpenPlatform Jul 19, 2022
deaa120
Merge pull request #27 from baai-open-internal/develop
marscrazy Jul 20, 2022
9558a47
env trainer
920232796 Jul 20, 2022
c35d4b6
Merge pull request #29 from baai-open-internal/env_args
marscrazy Jul 20, 2022
dc5a84d
Create README.md
marscrazy Jul 21, 2022
437caa4
vit-checkpoint-activations
920232796 Jul 21, 2022
dc6fc3d
vit-checkpoint-activations
920232796 Jul 21, 2022
c1cec9f
Merge pull request #33 from baai-open-internal/vit-checkpointing-acti…
marscrazy Jul 21, 2022
dc98332
Merge remote-tracking branch 'upstream/develop' into develop
marscrazy Jul 21, 2022
5 changes: 2 additions & 3 deletions README.md
@@ -7,7 +7,7 @@
--------------------------------------------------------------------------------


FlagAI (Fast LArge-scale General AI models) is an fast, easy-to-use and extensible toolkit for large-scale model. Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality. Currently, we are focusing on NLP models and tasks. In near futher, we will support for other modalities.
FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale models. Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality. Currently, we are focusing on NLP models and tasks. In the near future, we will support other modalities.

* Now it supports **WuDao GLM** with a maximum of 10 billion parameters (see [Introduction to GLM](/docs/GLM.md)). It also supports **BERT**, **RoBERTa**, **GPT2**, **T5**, and models from Huggingface Transformers.

@@ -18,7 +18,7 @@ FlagAI (Fast LArge-scale General AI models) is an fast, easy-to-use and extensib
* FlagAI is backed by the three most popular data/model parallel libraries — [PyTorch](https://pytorch.org/)/[Deepspeed](https://www.deepspeed.ai/)/[Megatron-LM](https://github.com/NVIDIA/Megatron-LM) — with seamless integration between them. Users can parallel their training/testing process with less than ten lines of code.


The code is partially based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).
The code is partially based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers), [timm](https://github.com/rwightman/pytorch-image-models) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).


<!-- toc -->
@@ -156,7 +156,6 @@ start with our [contributor guidelines](CONTRIBUTING.md) and then
check these [open issues](https://github.com/BAAI-Open/FlagAI/issues) for specific tasks.

## Contact us
Scan wechat QR code

<img src="./flagai_wechat.png" width = "200" height = "200" align=center />

2 changes: 1 addition & 1 deletion README_zh.md
@@ -18,7 +18,7 @@
* FlagAI is backed by the three most popular data/model parallel libraries ([PyTorch](https://pytorch.org/)/[Deepspeed](https://www.deepspeed.ai/)/[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)), with seamless integration between them. You can parallelize your training/testing process with less than ten lines of code.


Part of this project's code is based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).
Part of this project's code is based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers), [timm](https://github.com/rwightman/pytorch-image-models) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).

<!-- toc -->

71 changes: 70 additions & 1 deletion doc_zh/TUTORIAL_4_TRAINER.md
@@ -13,7 +13,7 @@
- [deepspeed](#deepspeed)
- [pytorchDDP](#pytorchddp)
- [deepspeed + megatron-lm](#deepspeed--megatron-lm)

- [EnvTrainer](#EnvTrainer)

The Trainer class provides APIs for training with multiple parallel frameworks. The APIs support distributed training with PyTorch DDP/Deepspeed on multiple GPUs, hybrid parallel distributed training with Megatron-LM + Deepspeed, and mixed precision via NVIDIA Apex.
## Getting Started
@@ -335,3 +335,72 @@ trainer = MyTrainer(
)
```

# EnvTrainer

To make parameter input easier, we provide EnvTrainer as a replacement for the original Trainer.
For example:
```python
# train.py
import torch
from flagai.env_args import EnvArgs
from flagai.env_trainer import EnvTrainer

lr = 2e-5
n_epochs = 50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env_args = EnvArgs(
env_type="pytorch",
experiment_name="vit-cifar100-single_gpu",
batch_size=150,
num_gpus=1,
gradient_accumulation_steps=1,
lr=lr,
weight_decay=1e-5,
epochs=n_epochs,
log_interval=100,
eval_interval=1000,
load_dir=None,
pytorch_device=device,
save_dir="checkpoints_vit_cifar100_single_gpu",
save_interval=1000,
num_checkpoints=1,
)

env_args.add_arg(arg_name="test1", default=0, type=int, )
env_args_parse = env_args.parse_args()
trainer = EnvTrainer(env_args)
```
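
Once constructed, the trainer is used in the same way as the regular Trainer. A minimal sketch, assuming a `model`, datasets, and a `collate_fn` prepared as in the example scripts linked further below (the variable names here are placeholders):
```python
# Placeholder names: model, train_dataset, val_dataset and my_collate_fn are
# assumed to be built as in the linked example scripts.
trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=val_dataset,
              collate_fn=my_collate_fn)
```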

When you run the train.py file, you can modify the input parameters from the command line.
```commandline
python train.py --batch_size=8 --epochs=10
```
If you need to add extra parameters, you can call this function:
```python
env_args.add_arg(arg_name="test1", default=0, type=int, )
```
Then you can run the train.py file with the following command:
```commandline
python train.py --test1=1
```
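
For reference, a minimal sketch of reading the extra argument back after parsing, continuing the train.py example above; it assumes the object returned by `parse_args()` exposes arguments as attributes, argparse-style (an assumption, not something this tutorial states):
```python
env_args.add_arg(arg_name="test1", default=0, type=int)
env_args_parse = env_args.parse_args()
# Assumed argparse-style attribute access; prints 1 when launched with --test1=1.
print(env_args_parse.test1)
```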
More examples can be found in:

1. [vit-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/vit_cifar100/train_env_trainer.py)

2. [glm-title-generation-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/glm_title_generation/train_env_trainer.py)


# Run with pytorchDDP launcher or deepspeed launcher
If you use multiple GPUs to train models, you can run train.py directly, which calls the launcher built into the FlagAI Trainer.
```commandline
python train.py
```
In addition, you can also run with the pytorchDDP or deepspeed launcher, for example:
### pytorchDDP
```commandline
python -m torch.distributed.launch --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 17750 train_env_trainer.py --not_call_launch
```
### deepspeed
```commandline
python -m deepspeed.launcher.launch --master_addr=172.31.125.121 --master_port=17500 train.py --not_call_launch
```
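
For multi-GPU runs, the `EnvArgs` shown earlier would typically be adjusted as well. A minimal sketch; the exact `env_type` strings ("pytorchDDP", "deepspeed") are assumptions based on this tutorial's section names and should be checked against `flagai.env_args`:
```python
from flagai.env_args import EnvArgs

# Sketch only: the env_type values below are assumptions; verify against flagai.env_args.
env_args = EnvArgs(
    env_type="pytorchDDP",   # or "deepspeed"
    experiment_name="vit-cifar100-ddp",
    batch_size=150,
    num_gpus=2,              # matches --nproc_per_node in the launcher command above
    lr=2e-5,
    epochs=50,
    save_dir="checkpoints_vit_cifar100_ddp",
)
```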
76 changes: 76 additions & 0 deletions docs/TUTORIAL_4_TRAINER.md
@@ -13,6 +13,9 @@
- [deepspeed](#deepspeed)
- [pytorchDDP](#pytorchddp)
- [deepspeed + megatron-lm](#deepspeed--megatron-lm)
- [EnvTrainer](#EnvTrainer)


The Trainer class provides APIs for training with multiple parallel frameworks. The API supports distributed training with Pytorch DDP/Deepspeed on multiple GPUs, as well as mixed parallel distributed training with Megatron-LM+Deepspeed, and mixed precision via NVIDIA Apex.

## Getting Started
@@ -341,3 +344,76 @@ trainer = MyTrainer(
model_paralle_size = 2
)
```

# EnvTrainer

To make parameter input easier, we provide EnvTrainer as a replacement for the original Trainer.

Take the following code as an example:
```python
# train.py
import torch
from flagai.env_args import EnvArgs
from flagai.env_trainer import EnvTrainer

lr = 2e-5
n_epochs = 50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env_args = EnvArgs(
env_type="pytorch",
experiment_name="vit-cifar100-single_gpu",
batch_size=150,
num_gpus=1,
gradient_accumulation_steps=1,
lr=lr,
weight_decay=1e-5,
epochs=n_epochs,
log_interval=100,
eval_interval=1000,
load_dir=None,
pytorch_device=device,
save_dir="checkpoints_vit_cifar100_single_gpu",
save_interval=1000,
num_checkpoints=1,
)

env_args.add_arg(arg_name="test1", default=0, type=int, )
env_args_parse = env_args.parse_args()
trainer = EnvTrainer(env_args)
```
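
Once constructed, the trainer is used in the same way as the regular Trainer. A minimal sketch, assuming a `model`, datasets, and a `collate_fn` prepared as in the example scripts linked further below (the variable names here are placeholders):
```python
# Placeholder names: model, train_dataset, val_dataset and my_collate_fn are
# assumed to be built as in the linked example scripts.
trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=val_dataset,
              collate_fn=my_collate_fn)
```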

When you run the train.py file, you can modify the input parameters from the command line.
```commandline
python train.py --batch_size=8 --epochs=10
```
If you need to add additional parameters, you can call the function:
```python
env_args.add_arg(arg_name="test1", default=0, type=int, )
```
Then you can run the train.py file with the following command:
```commandline
python train.py --test1=1
```
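
For reference, a minimal sketch of reading the extra argument back after parsing, continuing the train.py example above; it assumes the object returned by `parse_args()` exposes arguments as attributes, argparse-style (an assumption, not something this tutorial states):
```python
env_args.add_arg(arg_name="test1", default=0, type=int)
env_args_parse = env_args.parse_args()
# Assumed argparse-style attribute access; prints 1 when launched with --test1=1.
print(env_args_parse.test1)
```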

More examples can be found in:

1. [vit-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/vit_cifar100/train_env_trainer.py)

2. [glm-title-generation-env-trainer](https://github.com/BAAI-Open/FlagAI/tree/master/examples/glm_title_generation/train_env_trainer.py)


# Run with pytorchDDP launcher or deepspeed launcher
If you use multiple GPUs to train models, you can run train.py directly, which calls the launcher built into the FlagAI Trainer.
```commandline
python train.py
```
In addition, you can also run with the pytorchDDP or deepspeed launcher, for example:

### pytorchDDP
```commandline
python -m torch.distributed.launch --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 17750 train_env_trainer.py --not_call_launch
```
### deepspeed
```commandline
python -m deepspeed.launcher.launch --master_addr=172.31.125.121 --master_port=17500 train.py --not_call_launch
```
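
For multi-GPU runs, the `EnvArgs` shown earlier would typically be adjusted as well. A minimal sketch; the exact `env_type` strings ("pytorchDDP", "deepspeed") are assumptions based on this tutorial's section names and should be checked against `flagai.env_args`:
```python
from flagai.env_args import EnvArgs

# Sketch only: the env_type values below are assumptions; verify against flagai.env_args.
env_args = EnvArgs(
    env_type="pytorchDDP",   # or "deepspeed"
    experiment_name="vit-cifar100-ddp",
    batch_size=150,
    num_gpus=2,              # matches --nproc_per_node in the launcher command above
    lr=2e-5,
    epochs=50,
    save_dir="checkpoints_vit_cifar100_ddp",
)
```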
144 changes: 144 additions & 0 deletions examples/glm_title_generation/train_env_trainer.py
@@ -0,0 +1,144 @@
# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import os
import numpy as np
import torch
from torch.utils.data import Dataset
from flagai.auto_model.auto_loader import AutoLoader
from flagai.env_trainer import EnvTrainer
from flagai.env_args import EnvArgs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# You can input all parameters by the command line.
# For example: python train_env_trainer.py --epochs=300 --batch_size=4 --env_type=pytorch
env_args = EnvArgs()
trainer = EnvTrainer(env_args)

cur_dir = os.path.dirname(os.path.abspath(__file__))
src_dir = cur_dir + '/data/train.src'
tgt_dir = cur_dir + '/data/train.tgt'

maxlen = 256
auto_loader = AutoLoader("lm",
model_name="GLM-large-ch",
model_dir="./state_dict/")
model = auto_loader.get_model()
tokenizer = auto_loader.get_tokenizer()

def read_file():
src = []
tgt = []

with open(src_dir, 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in lines:
src.append(line.strip('\n').lower())

with open(tgt_dir, 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in lines:
tgt.append(line.strip('\n').lower())

return src, tgt


class GLMSeq2seqDataset(Dataset):

def __init__(self,
sents_src,
sents_tgt,
tokenizer,
max_src_length=300,
max_tgt_length=200):
super(GLMSeq2seqDataset, self).__init__()
self.sents_src = sents_src
self.sents_tgt = sents_tgt
self.tokenizer = tokenizer
self.max_src_length = max_src_length
self.max_tgt_length = max_tgt_length
self.no_block_position = False

def __getitem__(self, i):
source_text = self.sents_src[i]
target_text = self.sents_tgt[i]
data = self.tokenizer.encode_plus(source_text, target_text)

return data

def __len__(self):

return len(self.sents_src)


class GLMPoetryDynamicCollateFN(): #padding process in each batch
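    # Pads input_ids, target_ids, position_ids and loss_mask to the longest
    # sequence in the batch: ids are padded with pad_id, loss_mask with 0, and
    # GLM's 2-D position ids are extended (dim 0: token positions, dim 1: block positions).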

def __init__(self, pad_id):
self.pad_id = pad_id

def pad_token(self, tokens, max_length):
pad_len = max_length - len(tokens)
tokens += [self.pad_id] * pad_len
return tokens

def pad_position_ids(self, position_ids, max_length):
pad_len = max_length - len(position_ids[0])
position_ids[0] += [len(position_ids[0]) + x for x in range(pad_len)]
position_ids[1] += [1] * pad_len
return position_ids

def pad_loss_mask(self, loss_mask, max_length):
pad_len = max_length - len(loss_mask)
loss_mask += [0] * pad_len
return loss_mask

def __call__(self, batch):
input_ids = [data["input_ids"] for data in batch]
target_ids = [data["target_ids"] for data in batch]
position_ids = [data["position_ids"] for data in batch]
attention_mask = [data['attention_mask'] for data in batch]
loss_mask = [data['loss_mask'] for data in batch]

max_length = max([len(t) for t in input_ids])
for i in range(len(input_ids)):
input_ids[i] = self.pad_token(input_ids[i], max_length)
target_ids[i] = self.pad_token(target_ids[i], max_length)
position_ids[i] = self.pad_position_ids(position_ids[i],
max_length)
loss_mask[i] = self.pad_loss_mask(loss_mask[i], max_length)
return {
'input_ids': torch.LongTensor(input_ids),
'labels': torch.LongTensor(target_ids),
'position_ids': torch.LongTensor(position_ids),
'attention_mask': torch.LongTensor(attention_mask),
'loss_mask': torch.LongTensor(loss_mask)
}


sents_src, sents_tgt = read_file()
my_collate_fn = GLMPoetryDynamicCollateFN(
pad_id=tokenizer.get_command('pad').Id)

data_len = len(sents_tgt)
train_size = int(data_len * 0.8)
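# Use only the first 2,000 pairs of the 80% training split.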
train_src = sents_src[:train_size][:2000]
train_tgt = sents_tgt[:train_size][:2000]

val_src = sents_src[train_size:]
val_tgt = sents_tgt[train_size:]

train_dataset = GLMSeq2seqDataset(train_src,
train_tgt,
tokenizer=tokenizer,
max_src_length=300,
max_tgt_length=200)
val_dataset = GLMSeq2seqDataset(val_src,
val_tgt,
tokenizer=tokenizer,
max_src_length=300,
max_tgt_length=200)

trainer.train(model,
train_dataset=train_dataset,
valid_dataset=val_dataset,
collate_fn=my_collate_fn)