Pretraining

usage: pretrain.py [-h] [--dataset_path DATASET_PATH]
                   [--pretrained_model_path PRETRAINED_MODEL_PATH]
                   --output_model_path OUTPUT_MODEL_PATH
                   [--config_path CONFIG_PATH] [--total_steps TOTAL_STEPS]
                   [--save_checkpoint_steps SAVE_CHECKPOINT_STEPS]
                   [--report_steps REPORT_STEPS]
                   [--accumulation_steps ACCUMULATION_STEPS]
                   [--batch_size BATCH_SIZE]
                   [--instances_buffer_size INSTANCES_BUFFER_SIZE]
                   [--labels_num LABELS_NUM] [--dropout DROPOUT] [--seed SEED]
                   [--tokenizer {bert,bpe,char,space,xlmroberta,image,text_image}]
                   [--vocab_path VOCAB_PATH] [--merges_path MERGES_PATH]
                   [--spm_model_path SPM_MODEL_PATH]
                   [--do_lower_case {true,false}]
                   [--vqgan_model_path VQGAN_MODEL_PATH]
                   [--vqgan_config_path VQGAN_CONFIG_PATH]
                   [--tgt_tokenizer {bert,bpe,char,space,xlmroberta}]
                   [--tgt_vocab_path TGT_VOCAB_PATH]
                   [--tgt_merges_path TGT_MERGES_PATH]
                   [--tgt_spm_model_path TGT_SPM_MODEL_PATH]
                   [--tgt_do_lower_case {true,false}]
                   [--embedding {word,pos,seg,sinusoidalpos,patch,speech,word_patch,dual} [{word,pos,seg,sinusoidalpos,patch,speech,word_patch,dual} ...]]
                   [--tgt_embedding {word,pos,seg,sinusoidalpos,patch,speech,word_patch,dual} [{word,pos,seg,sinusoidalpos,patch,speech,word_patch,dual} ...]]
                   [--max_seq_length MAX_SEQ_LENGTH]
                   [--relative_position_embedding] [--share_embedding]
                   [--remove_embedding_layernorm]
                   [--factorized_embedding_parameterization]
                   [--encoder {transformer,rnn,lstm,gru,birnn,bilstm,bigru,gatedcnn,dual}]
                   [--decoder {None,transformer}]
                   [--mask {fully_visible,causal,causal_with_prefix}]
                   [--layernorm_positioning {pre,post}]
                   [--feed_forward {dense,gated}]
                   [--relative_attention_buckets_num RELATIVE_ATTENTION_BUCKETS_NUM]
                   [--remove_attention_scale] [--remove_transformer_bias]
                   [--layernorm {normal,t5}] [--bidirectional]
                   [--parameter_sharing] [--has_residual_attention]
                   [--has_lmtarget_bias]
                   [--target {sp,lm,mlm,bilm,cls,clr} [{sp,lm,mlm,bilm,cls,clr} ...]]
                   [--tie_weights] [--pooling {mean,max,first,last}]
                   [--image_height IMAGE_HEIGHT] [--image_width IMAGE_WIDTH]
                   [--patch_size PATCH_SIZE] [--channels_num CHANNELS_NUM]
                   [--image_preprocess IMAGE_PREPROCESS [IMAGE_PREPROCESS ...]]
                   [--sampling_rate SAMPLING_RATE]
                   [--audio_preprocess AUDIO_PREPROCESS [AUDIO_PREPROCESS ...]]
                   [--max_audio_frames MAX_AUDIO_FRAMES]
                   [--conv_layers_num CONV_LAYERS_NUM]
                   [--audio_feature_size AUDIO_FEATURE_SIZE]
                   [--conv_channels CONV_CHANNELS]
                   [--conv_kernel_sizes CONV_KERNEL_SIZES [CONV_KERNEL_SIZES ...]]
                   [--data_processor {bert,lm,mlm,bilm,albert,mt,t5,cls,prefixlm,gsg,bart,cls_mlm,vit,vilt,clip,s2t,beit,dalle}]
                   [--deep_init] [--whole_word_masking] [--span_masking]
                   [--span_geo_prob SPAN_GEO_PROB]
                   [--span_max_length SPAN_MAX_LENGTH]
                   [--learning_rate LEARNING_RATE] [--warmup WARMUP] [--fp16]
                   [--fp16_opt_level {O0,O1,O2,O3}]
                   [--optimizer {adamw,adafactor}]
                   [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
                   [--world_size WORLD_SIZE]
                   [--gpu_ranks GPU_RANKS [GPU_RANKS ...]]
                   [--master_ip MASTER_IP] [--backend {nccl,gloo}]
                   [--deepspeed] [--deepspeed_config DEEPSPEED_CONFIG]
                   [--deepspeed_checkpoint_activations]
                   [--deepspeed_checkpoint_layers_num DEEPSPEED_CHECKPOINT_LAYERS_NUM]
                   [--local_rank LOCAL_RANK] [--log_path LOG_PATH]
                   [--log_level {ERROR,INFO,DEBUG,NOTSET}]
                   [--log_file_level {ERROR,INFO,DEBUG,NOTSET}]

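Most pretraining models can be decomposed into five parts: embedding, encoder, target-side embedding, decoder, and pretraining target. TencentPretrain implements these five parts (--embedding, --encoder, --tgt_embedding, --decoder, --target) and provides a rich set of modules for each, so users can combine modules freely and build the pretraining model they need efficiently. More examples can be found in the chapter on pretraining model examples. Taking the encoder as an example, TencentPretrain offers many modules, including: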

lstm: LSTM
gru: GRU
bilstm: bidirectional LSTM (not the same as --encoder lstm --bidirectional; more information can be found in the linked reference)
gatedcnn: GatedCNN
transformer: supports BERT (--encoder transformer --mask fully_visible), GPT-2 (--encoder transformer --mask causal --layernorm_positioning pre), and others
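
As a rough illustration of how modules combine into a model, the sketch below flattens a per-model "recipe" into command-line flags. The names MODEL_RECIPES and module_flags are hypothetical and exist only in this example; they are not part of TencentPretrain. The BERT entry copies the flags used in the BERT commands on this page, and the GPT-2 entry holds only the flags named in the list above, so it is not a complete GPT-2 command line.

```python
# Hypothetical recipes: each model is a choice of modules, written as CLI flags.
MODEL_RECIPES = {
    "bert": {
        "--embedding": ["word", "pos", "seg"],
        "--encoder": ["transformer"],
        "--mask": ["fully_visible"],
        "--target": ["mlm", "sp"],
    },
    # Only the flags mentioned in the module list; other GPT-2 flags are omitted.
    "gpt2": {
        "--encoder": ["transformer"],
        "--mask": ["causal"],
        "--layernorm_positioning": ["pre"],
    },
}


def module_flags(model_name):
    """Flatten one recipe into an argument list, keeping the recipe's order."""
    args = []
    for flag, values in MODEL_RECIPES[model_name].items():
        args.append(flag)
        args.extend(values)
    return args


print(" ".join(module_flags("bert")))
print(" ".join(module_flags("gpt2")))
```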

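The data format (--data_processor) specified at the pretraining stage must match the format that was specified when dataset.pt was produced at the preprocessing stage.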

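The pretraining stage needs path, model, and training-environment information. For paths, you usually provide the path of the input data preprocessed in the required format (--dataset_path), the configuration file path (--config_path), and the output path of the pretrained model (--output_model_path). Model information is usually kept in the configuration file (--config_path) and does not have to be given explicitly on the command line. Values given on the command line override both the configuration file and the defaults. The training environment is usually specified through --world_size and --gpu_ranks, which are described in detail below.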
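
The precedence described above (defaults, then configuration file, then command line) can be sketched as follows. This is a minimal illustration and not the code pretrain.py actually uses; resolve_settings, DEFAULTS, and the default values in it are assumptions made for the example.

```python
import argparse
import json

# Illustrative defaults only; these are not the real defaults of pretrain.py.
DEFAULTS = {"encoder": "transformer", "mask": "fully_visible", "dropout": 0.1}


def resolve_settings(config_text, argv):
    """Merge settings with precedence: defaults < config file < command line."""
    parser = argparse.ArgumentParser()
    # SUPPRESS keeps an option out of the namespace unless the user passed it,
    # so "not given" can be told apart from "given with the default value".
    parser.add_argument("--encoder", default=argparse.SUPPRESS)
    parser.add_argument("--mask", default=argparse.SUPPRESS)
    parser.add_argument("--dropout", type=float, default=argparse.SUPPRESS)
    cli = vars(parser.parse_args(argv))

    settings = dict(DEFAULTS)
    settings.update(json.loads(config_text))  # config file overrides defaults
    settings.update(cli)                      # command line overrides both
    return settings


config_text = '{"mask": "causal", "dropout": 0.2}'
settings = resolve_settings(config_text, ["--dropout", "0.0"])
print(settings)
```

In this run, encoder comes from the defaults, mask from the configuration text, and dropout from the command line.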

There are two strategies for initializing parameters in pretraining: 1) random initialization; 2) loading a pretrained model.

Random initialization

Example of pretraining on a single machine with CPU:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/output_model.bin \
                    --data_processor bert \
                    --embedding word pos seg --encoder transformer --mask fully_visible --target mlm sp

The pretraining input is specified by --dataset_path. Example of pretraining on a single machine with a single GPU (the GPU ID is 3):

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/output_model.bin --gpu_ranks 3 \
                    --data_processor bert \
                    --embedding word pos seg --encoder transformer --mask fully_visible --target mlm sp

Example of pretraining on a single machine with 8 GPUs:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --data_processor bert \
                    --embedding word pos seg --encoder transformer --mask fully_visible --target mlm sp

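--world_size specifies the total number of processes (and GPUs) launched. --gpu_ranks assigns a unique ID to each process/GPU; the IDs must run from 0 to n-1, where n is the number of pretraining processes. To use particular GPUs, set CUDA_VISIBLE_DEVICES to choose which GPUs are visible to the program: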

CUDA_VISIBLE_DEVICES=1,2,3,5 python3 pretrain.py --dataset_path dataset.pt \
                                                 --vocab_path models/google_zh_vocab.txt \
                                                 --config_path models/bert/base_config.json \
                                                 --output_model_path models/output_model.bin \
                                                 --world_size 4 --gpu_ranks 0 1 2 3 \
                                                 --data_processor bert \
                                                 --embedding word pos seg \
                                                 --encoder transformer --mask fully_visible \
                                                 --target mlm sp

Because only 4 GPUs are used, --world_size is set to 4, and the IDs of these 4 processes/GPUs run from 0 to 3, as given by --gpu_ranks.
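
To make the relationship between ranks and physical GPUs concrete, here is a small sketch. CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0; the sketch additionally assumes that, on a single machine, the process with rank i uses visible device i. The helper physical_gpus is hypothetical and only illustrates the mapping.

```python
def physical_gpus(cuda_visible_devices, gpu_ranks):
    """Map each rank to the physical GPU it would use on a single machine."""
    visible = [int(device) for device in cuda_visible_devices.split(",")]
    # On one machine the ranks are expected to be exactly 0..n-1.
    if sorted(gpu_ranks) != list(range(len(gpu_ranks))):
        raise ValueError("gpu_ranks must be 0..n-1 on a single machine")
    if len(gpu_ranks) > len(visible):
        raise ValueError("more ranks than visible GPUs")
    # Assumption: rank i runs on visible device i.
    return {rank: visible[rank] for rank in gpu_ranks}


mapping = physical_gpus("1,2,3,5", [0, 1, 2, 3])
print(mapping)
```

For the command above, rank 3 would therefore run on physical GPU 5, not on GPU 3.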

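Example of pretraining on 2 machines with 8 GPUs each, 16 processes in total. Launch the script on the two machines (Node-0 and Node-1) in turn. --master_ip is set to the ip:port of the machine whose --gpu_ranks contains 0. Launch example: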

Node-0 : python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                             --config_path models/bert/base_config.json \
                             --output_model_path models/output_model.bin \
                             --world_size 16 --gpu_ranks 0 1 2 3 4 5 6 7 \
                             --total_steps 100000 --save_checkpoint_steps 10000 --report_steps 100 \
                             --master_ip tcp://9.73.138.133:12345 \
                             --data_processor bert \
                             --embedding word pos seg --encoder transformer --mask fully_visible --target mlm sp

Node-1 : python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                             --config_path models/bert/base_config.json \
                             --output_model_path models/output_model.bin \
                             --world_size 16 --gpu_ranks 8 9 10 11 12 13 14 15 \
                             --total_steps 100000 \
                             --master_ip tcp://9.73.138.133:12345 \
                             --data_processor bert \
                             --embedding word pos seg --encoder transformer --mask fully_visible --target mlm sp

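The IP address of Node-0 is 9.73.138.133.
--total_steps specifies the number of training steps, which must be the same on both machines. --save_checkpoint_steps specifies how many steps pass between saves of the pretraining model; it only needs to be given on Node-0, because the model is saved only on Node-0. --report_steps specifies how many steps pass between progress reports; it also only needs to be given on Node-0, because progress is printed only on Node-0. The port given in --master_ip must not be one that is already in use by another program. In general, pretraining from randomly initialized parameters needs a larger learning rate; --learning_rate 1e-4 is recommended (the default is 2e-5).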
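
As a sketch of what these options imply for the files written on Node-0, the helper below lists the expected checkpoint names. It rests on two assumptions: a checkpoint is written whenever the step count is a multiple of --save_checkpoint_steps, and the step number is appended to the output path with a hyphen, as the file name cluecorpussmall_word_bert_base_seq128_model.bin-1000000 used later on this page suggests. checkpoint_names is a hypothetical helper, not a TencentPretrain function.

```python
def checkpoint_names(output_model_path, total_steps, save_checkpoint_steps):
    """List the checkpoint files expected under the assumptions stated above."""
    return [
        f"{output_model_path}-{step}"
        for step in range(save_checkpoint_steps, total_steps + 1, save_checkpoint_steps)
    ]


names = checkpoint_names("models/output_model.bin", 100000, 10000)
print(len(names), names[0], names[-1])
```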

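Note that the command lines above specify the type of pretraining model explicitly. The configuration file (--config_path models/bert/base_config.json) already contains this information, so these options can be omitted unless you need to override the configuration file from the command line.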

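Loading a pretrained model

We recommend this strategy because it makes use of an existing pretrained model. The model to load is specified with --pretrained_model_path. Examples of pretraining on a single machine with CPU and on a single machine with a single GPU: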

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/google_zh_model.bin \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/output_model.bin

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/google_zh_model.bin \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/output_model.bin --gpu_ranks 3

Example of pretraining on a single machine with 8 GPUs:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/google_zh_model.bin \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7

Example of pretraining on 2 machines with 8 GPUs each:

Node-0: python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                            --pretrained_model_path models/google_zh_model.bin \
                            --config_path models/bert/base_config.json \
                            --output_model_path models/output_model.bin \
                            --world_size 16 --gpu_ranks 0 1 2 3 4 5 6 7 \
                            --master_ip tcp://9.73.138.133:12345

Node-1: python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                            --pretrained_model_path models/google_zh_model.bin \
                            --config_path models/bert/base_config.json \
                            --output_model_path models/output_model.bin \
                            --world_size 16 --gpu_ranks 8 9 10 11 12 13 14 15 \
                            --master_ip tcp://9.73.138.133:12345

Example of pretraining on 3 machines with 8 GPUs each:

Node-0: python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                            --pretrained_model_path models/google_zh_model.bin \
                            --config_path models/bert/base_config.json \
                            --output_model_path models/output_model.bin \
                            --world_size 24 --gpu_ranks 0 1 2 3 4 5 6 7 \
                            --master_ip tcp://9.73.138.133:12345

Node-1: python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                            --pretrained_model_path models/google_zh_model.bin \
                            --config_path models/bert/base_config.json \
                            --output_model_path models/output_model.bin \
                            --world_size 24 --gpu_ranks 8 9 10 11 12 13 14 15 \
                            --master_ip tcp://9.73.138.133:12345

Node-2: python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                            --pretrained_model_path models/google_zh_model.bin \
                            --config_path models/bert/base_config.json \
                            --output_model_path models/output_model.bin \
                            --world_size 24 --gpu_ranks 16 17 18 19 20 21 22 23 \
                            --master_ip tcp://9.73.138.133:12345
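
The rank layout in the multi-machine examples follows a simple pattern: each machine takes a contiguous block of IDs and --world_size is the total over all machines. The sketch below reproduces that pattern for machines with the same number of GPUs; node_gpu_ranks is a hypothetical helper written for this illustration.

```python
def node_gpu_ranks(nodes_num, gpus_per_node):
    """Return the world size and the contiguous rank block of every node."""
    world_size = nodes_num * gpus_per_node
    ranks = [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(nodes_num)
    ]
    return world_size, ranks


world_size, ranks = node_gpu_ranks(3, 8)
for node, node_ranks in enumerate(ranks):
    flags = " ".join(str(rank) for rank in node_ranks)
    print(f"Node-{node}: --world_size {world_size} --gpu_ranks {flags}")
```

The block containing rank 0 belongs to the machine whose address goes into --master_ip.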

Adjusting the pretraining model size

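In general, larger models consume more computing resources but perform better. The configuration file of the pretraining model is specified with --config_path at the pretraining stage. For BERT (and RoBERTa), the project provides 7 configuration files in the models/bert/ folder: xlarge_config.json, large_config.json, base_config.json, medium_config.json, small_config.json, mini_config.json, and tiny_config.json. Chinese pretrained weights of different sizes are available; see the pretrained model zoo for details. The project uses models/bert/base_config.json as the default configuration file. Configuration files for other pretraining models are provided in the corresponding folders, for example models/albert/, models/gpt2/, and models/t5/.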

Example of loading the Chinese medium-size pretrained model for incremental pretraining:

python3 pretrain.py --dataset_path dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/cluecorpussmall_roberta_medium_seq512_model.bin \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --data_processor bert \
                    --embedding word pos seg --encoder transformer --mask fully_visible --target mlm sp

Word-based pretraining models

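TencentPretrain supports word-based pretraining models. We download cluecorpussmall_word_bert_base_seq512_model.bin and its vocabulary cluecorpussmall_word_vocab.txt. The model's training corpus is CLUECorpusSmall (segmented with jieba, with words separated by spaces):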

python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_word_jieba_bert.txt \
                      --tokenizer space --vocab_path models/cluecorpussmall_word_vocab.txt \
                      --dataset_path dataset.pt \
                      --processes_num 8 --dynamic_masking \
                      --data_processor bert

python3 pretrain.py --dataset_path dataset.pt \
                    --tokenizer space --vocab_path models/cluecorpussmall_word_vocab.txt \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/cluecorpussmall_word_bert_base_seq128_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64 \
                    --data_processor bert

python3 preprocess.py --corpus_path corpora/CLUECorpusSmall_word_jieba_bert.txt \
                      --tokenizer space --vocab_path models/cluecorpussmall_word_vocab.txt \
                      --dataset_path dataset.pt \
                      --processes_num 8 --dynamic_masking \
                      --data_processor bert --seq_length 512

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_word_bert_base_seq128_model.bin-1000000 \
                    --tokenizer space --vocab_path models/cluecorpussmall_word_vocab.txt \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/cluecorpussmall_word_bert_base_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                    --learning_rate 5e-5 --batch_size 16 \
                    --data_processor bert
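
A quick piece of arithmetic shows how the two training stages above relate. The sketch assumes that --batch_size counts sequences per process, so the global batch is batch_size times world_size, and that the first stage uses sequence length 128, as its output file name indicates. tokens_per_step is a hypothetical helper.

```python
def tokens_per_step(batch_size, world_size, seq_length):
    """Token positions processed per optimization step (per-process batch assumed)."""
    return batch_size * world_size * seq_length


stage1 = tokens_per_step(batch_size=64, world_size=8, seq_length=128)
stage2 = tokens_per_step(batch_size=16, world_size=8, seq_length=512)
print(stage1, stage2)
```

Under these assumptions both stages process 65536 token positions per step: the second stage quadruples the sequence length and cuts the batch size to a quarter.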

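Example of loading cluecorpussmall_word_bert_base_seq512_model.bin for incremental pretraining: suppose the training corpus is corpora/book_review.txt. First, we segment the corpus into words to obtain book_review_seg.txt, in which words are separated by spaces. Then we build a vocabulary from the corpus: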

python3 scripts/build_vocab.py --corpus_path corpora/book_review_seg.txt \
                               --output_path models/book_review_word_vocab.txt \
                               --delimiter space --workers_num 8 --min_count 5

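Because we use a new vocabulary, we adapt the pretrained model cluecorpussmall_word_bert_base_seq512_model.bin. The embedding layer and the layer before softmax are changed according to the differences between the old and new vocabularies, and the vectors of new words are randomly initialized. The adapted model matches the new vocabulary: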

python3 scripts/dynamic_vocab_adapter.py --old_model_path models/cluecorpussmall_word_bert_base_seq512_model.bin \
                                         --old_vocab_path models/cluecorpussmall_word_vocab.txt \
                                         --new_vocab_path models/book_review_word_vocab.txt \
                                         --new_model_path models/book_review_word_model.bin
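
The idea behind the adaptation can be sketched on a toy embedding table: words shared by both vocabularies keep their old vectors, and words that appear only in the new vocabulary get random vectors. This is not the implementation of scripts/dynamic_vocab_adapter.py; adapt_embedding is a hypothetical helper, and the Gaussian initialization with standard deviation 0.02 is an assumption made for the example.

```python
import random


def adapt_embedding(old_vocab, old_embedding, new_vocab, seed=0):
    """Build an embedding table for new_vocab from the rows of old_embedding."""
    rng = random.Random(seed)
    old_index = {word: i for i, word in enumerate(old_vocab)}
    hidden_size = len(old_embedding[0])
    new_embedding = []
    for word in new_vocab:
        if word in old_index:
            # Shared word: copy the vector learned with the old vocabulary.
            new_embedding.append(list(old_embedding[old_index[word]]))
        else:
            # New word: random initialization (std 0.02 is an assumed value).
            new_embedding.append([rng.gauss(0.0, 0.02) for _ in range(hidden_size)])
    return new_embedding


old_vocab = ["[PAD]", "book", "good"]
old_embedding = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
new_vocab = ["[PAD]", "good", "review"]
new_embedding = adapt_embedding(old_vocab, old_embedding, new_vocab)
print(new_embedding)
```

The same row remapping would apply to the output layer before softmax, whose rows are also indexed by vocabulary ID.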

Finally, we incrementally pretrain the adapted model book_review_word_model.bin with MLM as the pretraining target:

python3 preprocess.py --corpus_path corpora/book_review_seg.txt \
                      --vocab_path models/book_review_word_vocab.txt --tokenizer space \
                      --dataset_path book_review_word_dataset.pt \
                      --processes_num 8 --seq_length 128 --dynamic_masking \
                      --data_processor mlm

python3 pretrain.py --dataset_path book_review_word_dataset.pt \
                    --vocab_path models/book_review_word_vocab.txt \
                    --pretrained_model_path models/book_review_word_model.bin \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 20000 --save_checkpoint_steps 10000 --report_steps 1000 \
                    --data_processor mlm \
                    --embedding word pos seg --encoder transformer --mask fully_visible --target mlm

Alternatively, a word-based pretraining model can be obtained through SentencePiece tokenization:

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path book_review_word_dataset.pt \
                      --processes_num 8 --seq_length 128 --dynamic_masking \
                      --data_processor mlm

python3 pretrain.py --dataset_path book_review_word_dataset.pt \
                    --spm_model_path models/cluecorpussmall_spm.model \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 20000 --save_checkpoint_steps 10000 --report_steps 1000 \
                    --learning_rate 1e-4 \
                    --data_processor mlm \
                    --embedding word pos seg --encoder transformer --mask fully_visible --target mlm

--spm_model_path specifies the path of the SentencePiece model to load. Here we use models/cluecorpussmall_spm.model, a SentencePiece model trained on CLUECorpusSmall.
