DeepSpeed support
UER-py integrates DeepSpeed and supports pre-training, fine-tuning, and inference for gigantic models.
To use DeepSpeed, we need to specify --deepspeed and the path of the DeepSpeed configuration file (--deepspeed_config). This section takes the gigantic models in Megatron-LM as examples to demonstrate how to use DeepSpeed in UER-py. Note that pre-layernorm is used in Megatron BERT and Megatron GPT-2.
Examples of pre-training Megatron BERT with DeepSpeed.
Example of pre-training on a single machine with 8 GPUs:
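As a rough illustration of what a DeepSpeed configuration file can contain, the sketch below builds a small configuration dictionary and serializes it to JSON. The keys are standard DeepSpeed options, but the values and the helper name build_deepspeed_config are assumptions made for illustration; this is not the models/deepspeed_config.json file shipped with UER-py.

```python
import json


def build_deepspeed_config(micro_batch_size=1, zero_stage=3):
    """Return an illustrative DeepSpeed config (placeholder values, not UER-py's)."""
    return {
        # Per-GPU batch size for one forward/backward pass.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        # Mixed-precision training.
        "fp16": {"enabled": True},
        # ZeRO partitions optimizer state, gradients, and (at stage 3) parameters.
        "zero_optimization": {"stage": zero_stage},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    }


# Print the JSON text that would be saved to a file and passed to --deepspeed_config.
print(json.dumps(build_deepspeed_config(), indent=2))
```

The values that work in practice depend on the model size and the available GPU memory, so the configuration file in the repository should be treated as the reference.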
python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --dynamic_masking \
--data_processor mlm
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
--dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/bert_3.9B_config.json \
--output_model_path models/output_model \
--world_size 8 --batch_size 16 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init
Example of loading a pre-trained PyTorch model and doing incremental training:
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
--pretrained_model_path models/input_model.bin \
--dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/bert_3.9B_config.json \
--output_model_path models/output_model \
--world_size 8 --batch_size 16 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init
Example of pre-training on two machines, each with 8 GPUs (16 GPUs in total).
A hostfile, hostfile.txt, is required. Each line has the format: ip slots=the number of GPUs. For example:
1.1.1.1 slots=8
2.2.2.2 slots=8
When training on multiple machines, we only need to run the scripts on the master node.
python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --dynamic_masking \
--data_processor mlm
deepspeed --hostfile=hostfile.txt pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
--dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/bert_3.9B_config.json \
--output_model_path models/output_model \
--world_size 16 --batch_size 16 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init
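In the multi-machine command above, --world_size equals the total number of slots listed in hostfile.txt. The following is a minimal sketch of that check; the helper name parse_hostfile is hypothetical and it only handles the simple "ip slots=N" format shown in this section.

```python
def parse_hostfile(text):
    """Parse hostfile text with lines of the form 'ip slots=N' into {ip: N}."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, slots_field = line.split()
        key, value = slots_field.split("=")
        if key != "slots":
            raise ValueError(f"unexpected field in hostfile line: {line!r}")
        hosts[host] = int(value)
    return hosts


hostfile_text = "1.1.1.1 slots=8\n2.2.2.2 slots=8\n"
hosts = parse_hostfile(hostfile_text)

# The total number of slots is the value to pass as --world_size.
world_size = sum(hosts.values())
print(world_size)  # 16
```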
Examples of pre-training Megatron GPT-2 with DeepSpeed.
Example of pre-training on a single machine with 8 GPUs:
python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --data_processor lm
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
--dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/gpt2_8.3B_config.json \
--output_model_path models/output_model \
--world_size 8 --batch_size 4 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init
Example of pre-training on two machines, each with 8 GPUs (16 GPUs in total):
python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_5000_lines.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --data_processor lm
deepspeed --hostfile=hostfile.txt pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
--dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/gpt2_8.3B_config.json \
--output_model_path models/output_model \
--world_size 16 --batch_size 4 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 100 --deep_init
After pre-training, the pre-trained model and the conversion script zero_to_fp32.py are saved in the models/output_model folder. zero_to_fp32.py converts the pre-trained model from DeepSpeed format to PyTorch format.
usage: zero_to_fp32.py [-h] checkpoint_dir output_file
positional arguments:
checkpoint_dir path to the deepspeed checkpoint folder, e.g., path/checkpoint-1/global_step1
output_file path to the pytorch fp32 state_dict output file (e.g.,path/checkpoint-1/pytorch_model.bin)
optional arguments:
-h, --help show this help message and exit
Example of converting Megatron BERT:
python3 models/output_model/zero_to_fp32.py models/output_model/10000 models/output_model/megatron_bert.bin-10000
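The conversion command above points at the checkpoint folder named after the training step (10000). The sketch below assembles the same command for any saved step. It assumes, based only on the examples in this section, that a checkpoint is written at every multiple of --save_checkpoint_steps and that its folder is named after the step; the helper names checkpoint_steps and conversion_command are hypothetical.

```python
def checkpoint_steps(total_steps, save_checkpoint_steps):
    """List the steps at which a checkpoint is assumed to be saved."""
    return list(range(save_checkpoint_steps, total_steps + 1, save_checkpoint_steps))


def conversion_command(output_model_path, tag, output_file):
    """Build the zero_to_fp32.py command line as an argument list."""
    script = f"{output_model_path}/zero_to_fp32.py"
    checkpoint_dir = f"{output_model_path}/{tag}"
    return ["python3", script, checkpoint_dir, output_file]


# With --total_steps 10000 and --save_checkpoint_steps 5000.
for step in checkpoint_steps(10000, 5000):
    cmd = conversion_command(
        "models/output_model", step, f"models/output_model/megatron_bert.bin-{step}"
    )
    print(" ".join(cmd))
```

The list returned by conversion_command can be passed to subprocess.run once the checkpoint folder is confirmed to exist.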
finetune/run_classifier_deepspeed.py is used to fine-tune gigantic models with DeepSpeed. It differs from the regular classification script run_classifier.py in the following ways:
- In run_classifier_deepspeed.py, --world_size specifies the number of GPUs.
- In run_classifier_deepspeed.py, the actual batch size is --batch_size times --world_size. In run_classifier.py, the actual batch size is --batch_size.
- run_classifier_deepspeed.py saves the fine-tuned model after every epoch and places it in the path specified by --output_model_path.
Examples of fine-tuning Megatron BERT with DeepSpeed.
Example of fine-tuning on a single machine with 8 GPUs:
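The batch size rule above can be checked with a little arithmetic. The fine-tuning examples in this section use --batch_size 8 with --world_size 8 on one machine and --batch_size 4 with --world_size 16 on two machines, so both run with the same actual batch size. The function name effective_batch_size is only illustrative.

```python
def effective_batch_size(batch_size, world_size):
    """Actual batch size in run_classifier_deepspeed.py: per-GPU batch size times GPU count."""
    return batch_size * world_size


single_machine = effective_batch_size(batch_size=8, world_size=8)
two_machines = effective_batch_size(batch_size=4, world_size=16)

print(single_machine)  # 64
print(two_machines)    # 64
```

Halving --batch_size when doubling the number of GPUs keeps the actual batch size unchanged.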
deepspeed finetune/run_classifier_deepspeed.py --pretrained_model_path models/output_model/megatron_bert.bin-10000 \
--deepspeed_config models/deepspeed_config.json \
--vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/bert_3.9B_config.json \
--output_model_path models/classifier_model \
--train_path datasets/chnsenticorp/train.tsv \
--dev_path datasets/chnsenticorp/dev.tsv \
--test_path datasets/chnsenticorp/test.tsv \
--epochs_num 3 --batch_size 8 --world_size 8
Example of fine-tuning on two machines, each with 8 GPUs (16 GPUs in total):
deepspeed --hostfile=hostfile.txt finetune/run_classifier_deepspeed.py --pretrained_model_path models/output_model/megatron_bert.bin-10000 \
--deepspeed_config models/deepspeed_config.json \
--vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/bert_3.9B_config.json \
--output_model_path models/classifier_model \
--train_path datasets/chnsenticorp/train.tsv \
--dev_path datasets/chnsenticorp/dev.tsv \
--test_path datasets/chnsenticorp/test.tsv \
--epochs_num 3 --batch_size 4 --world_size 16
Then we convert the fine-tuned model from DeepSpeed format to PyTorch format:
python3 models/classifier_model/zero_to_fp32.py models/classifier_model/3 models/classifier_model/megatron_bert_classifier.bin
run_classifier_deepspeed_infer.py is used to do inference on gigantic models with DeepSpeed. --mp_size specifies the number of GPUs used for model parallelism.
Examples of doing inference with Megatron BERT using DeepSpeed.
Example of doing inference on a single machine with 8 GPUs:
deepspeed finetune/run_classifier_deepspeed_infer.py --load_model_path models/classifier_model/megatron_bert_classifier.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/bert_3.9B_config.json \
--test_path datasets/chnsenticorp/test_nolabel.tsv \
--prediction_path prediction.txt --labels_num 2 \
--mp_size 8
Example of doing inference on two machines, each with 8 GPUs (16 GPUs in total):
deepspeed --hostfile=hostfile.txt finetune/run_classifier_deepspeed_infer.py --load_model_path models/classifier_model/megatron_bert_classifier.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/bert_3.9B_config.json \
--test_path datasets/chnsenticorp/test_nolabel.tsv \
--prediction_path prediction.txt --labels_num 2 \
--mp_size 16
generate_lm_deepspeed.py is used to generate text with gigantic language models. The model generates text that continues the given beginning. --mp_size specifies the number of GPUs used for model parallelism.
Examples of generating text with Megatron GPT-2 using DeepSpeed.
Example of generating text on a single machine with 8 GPUs:
deepspeed scripts/generate_lm_deepspeed.py --load_model_path models/megatron_gpt2.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/gpt2_8.3B_config.json \
--test_path beginning.txt --prediction_path generated_sentence.txt \
--mp_size 8
Example of generating text on two machines, each with 8 GPUs (16 GPUs in total):
deepspeed --hostfile=hostfile.txt scripts/generate_lm_deepspeed.py --load_model_path models/megatron_gpt2.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/megatron/gpt2_8.3B_config.json \
--test_path beginning.txt --prediction_path generated_sentence.txt \
--mp_size 16
generate_seq2seq_deepspeed.py is used to generate text with gigantic seq2seq models.
Example of generating text on a single machine with 8 GPUs:
deepspeed scripts/generate_seq2seq_deepspeed.py --load_model_path models/input_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/encoder_decoder_config.json \
--test_path input.txt --prediction_path output.txt \
--mp_size 8
Example of generating text on two machines, each with 8 GPUs (16 GPUs in total):
deepspeed --hostfile=hostfile.txt scripts/generate_seq2seq_deepspeed.py --load_model_path models/input_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/encoder_decoder_config.json \
--test_path input.txt --prediction_path output.txt \
--mp_size 16