Scripts

TencentPretrain provides abundant tool scripts for pre-training models. This section first summarizes the tool scripts and their functions, and then provides usage examples for some of them.

Script Function description
average_models.py Take the average of pre-trained models
build_vocab.py Build vocabulary given corpus and tokenizer
cloze_test.py Randomly mask a word and predict it; the top n words are returned
convert_bart_from_huggingface_to_tencentpretrain.py Convert BART of Huggingface format (PyTorch) to TencentPretrain format
convert_bart_from_tencentpretrain_to_huggingface.py Convert BART of TencentPretrain format to Huggingface format (PyTorch)
convert_albert_from_huggingface_to_tencentpretrain.py Convert ALBERT of Huggingface format (PyTorch) to TencentPretrain format
convert_albert_from_tencentpretrain_to_huggingface.py Convert ALBERT of TencentPretrain format to Huggingface format (PyTorch)
convert_bert_extractive_qa_from_huggingface_to_tencentpretrain.py Convert extractive QA BERT of Huggingface format (PyTorch) to TencentPretrain format
convert_bert_extractive_qa_from_tencentpretrain_to_huggingface.py Convert extractive QA BERT of TencentPretrain format to Huggingface format (PyTorch)
convert_bert_from_google_to_tencentpretrain.py Convert BERT of Google format (TF) to TencentPretrain format
convert_bert_from_huggingface_to_tencentpretrain.py Convert BERT of Huggingface format (PyTorch) to TencentPretrain format
convert_bert_from_tencentpretrain_to_google.py Convert BERT of TencentPretrain format to Google format (TF)
convert_bert_from_tencentpretrain_to_huggingface.py Convert BERT of TencentPretrain format to Huggingface format (PyTorch)
convert_bert_text_classification_from_huggingface_to_tencentpretrain.py Convert text classification BERT of Huggingface format (PyTorch) to TencentPretrain format
convert_bert_text_classification_from_tencentpretrain_to_huggingface.py Convert text classification BERT of TencentPretrain format to Huggingface format (PyTorch)
convert_bert_token_classification_from_huggingface_to_tencentpretrain.py Convert sequence labeling BERT of Huggingface format (PyTorch) to TencentPretrain format
convert_bert_token_classification_from_tencentpretrain_to_huggingface.py Convert sequence labeling BERT of TencentPretrain format to Huggingface format (PyTorch)
convert_gpt2_from_huggingface_to_tencentpretrain.py Convert GPT-2 of Huggingface format (PyTorch) to TencentPretrain format
convert_gpt2_from_tencentpretrain_to_huggingface.py Convert GPT-2 of TencentPretrain format to Huggingface format (PyTorch)
convert_pegasus_from_huggingface_to_tencentpretrain.py Convert Pegasus of Huggingface format (PyTorch) to TencentPretrain format
convert_pegasus_from_tencentpretrain_to_huggingface.py Convert Pegasus of TencentPretrain format to Huggingface format (PyTorch)
convert_t5_from_huggingface_to_tencentpretrain.py Convert T5 of Huggingface format (PyTorch) to TencentPretrain format
convert_t5_from_tencentpretrain_to_huggingface.py Convert T5 of TencentPretrain format to Huggingface format (PyTorch)
convert_xlmroberta_from_huggingface_to_tencentpretrain.py Convert XLM-RoBERTa of Huggingface format (PyTorch) to TencentPretrain format
convert_xlmroberta_from_tencentpretrain_to_huggingface.py Convert XLM-RoBERTa of TencentPretrain format to Huggingface format (PyTorch)
diff_vocab.py Compare two vocabularies
dynamic_vocab_adapter.py Adapt the pre-trained model according to the vocabulary
extract_embeddings.py Extract the embedding of the pre-trained model
extract_features.py Obtain text representation
generate_lm.py Generate text with language model
generate_seq2seq.py Generate text with seq2seq model
run_bayesopt.py Search hyper-parameters for LightGBM by bayesian optimization
run_lgb.py Model ensemble with LightGBM (classification)
topn_words_dep.py Find nearest neighbors with context-dependent word embedding
topn_words_indep.py Find nearest neighbors with context-independent word embedding

Cloze test

cloze_test.py uses the MLM target to predict the masked word; the top n candidate words are returned. Cloze test can be used for operations such as data augmentation. An example of using cloze_test.py:

python3 scripts/cloze_test.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                              --config_path models/bert/base_config.json \
                              --test_path datasets/tencent_profile.txt --prediction_path output.txt

Notice that cloze_test.py only supports pre-trained models trained with the MLM target.
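
For intuition, the cloze idea can also be illustrated with the Huggingface transformers fill-mask pipeline (this is not cloze_test.py itself, and the model name is only an example):

from transformers import pipeline

# Predict the masked character and return the top 5 candidates with their scores.
fill_mask = pipeline("fill-mask", model="bert-base-chinese")
for prediction in fill_mask("巴黎是法[MASK]的首都。", top_k=5):
    print(prediction["token_str"], prediction["score"])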

Feature extractor

extract_features.py encodes text into a fixed-length embedding (through the embedding, encoder, and pooling layers). An example of using extract_features.py:

python3 scripts/extract_features.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --config_path models/bert/base_config.json \
                                    --test_path datasets/tencent_profile.txt --prediction_path features.pt \
                                    --pooling first
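
The output features.pt can then be loaded in Python. A minimal sketch, assuming the file is saved with torch.save (the exact structure of the saved object may differ):

import torch

features = torch.load("features.pt")
print(type(features))
# If the object is a tensor of shape (number of texts, hidden size),
# each row is the embedding of one input text.
if torch.is_tensor(features):
    print(features.shape)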

The CLS embedding (--pooling first) is commonly used as the text embedding. However, when cosine similarity is used to measure the relationship between two texts, the raw CLS embedding is not a proper choice. According to recent work, it is necessary to apply a whitening operation first:

python3 scripts/extract_features.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --config_path models/bert/base_config.json \
                                    --test_path datasets/tencent_profile.txt --prediction_path features.pt \
                                    --pooling first --whitening_size 64

--whitening_size 64 indicates that the whitening operation is applied and that the dimension of the resulting text embedding is 64.
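
The whitening idea itself is straightforward: center the embeddings, compute a whitening matrix from their covariance via SVD, and keep only the first k dimensions. A minimal sketch (not the script's implementation), assuming a matrix of CLS embeddings:

import numpy as np

def whiten(embeddings, k=64):
    # Center the vectors and estimate the covariance matrix.
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)
    # SVD of the covariance gives the whitening matrix W = U * S^(-1/2).
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))
    # Whiten and keep the first k dimensions.
    return (embeddings - mu) @ w[:, :k]

vectors = np.random.randn(100, 768)   # stand-in for 100 CLS embeddings
print(whiten(vectors, 64).shape)      # (100, 64)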

Embedding extractor

extract_embeddings.py extracts the embedding layer from the pre-trained model. The extracted context-independent embeddings can be used to initialize the embedding layer of other models (e.g. CNN). An example of using extract_embeddings.py:

python3 scripts/extract_embeddings.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                      --word_embedding_path embeddings.txt

--word_embedding_path specifies the path of the output word embedding file. The format of the word embedding file follows the one described here, so it can be loaded directly by mainstream projects.
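
For example, assuming the exported file follows the standard word2vec text format (first line: vocabulary size and dimension), it can be loaded with gensim:

from gensim.models import KeyedVectors

# binary=False because the exported file is a plain-text embedding file.
vectors = KeyedVectors.load_word2vec_format("embeddings.txt", binary=False)
# Query the nearest neighbours of any token that exists in the exported vocabulary.
print(vectors.most_similar("中国", topn=10))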

Finding nearest neighbours

The pre-trained model contains word embeddings. Traditional word embeddings such as word2vec and GloVe assign each word a fixed vector (context-independent word embedding). However, polysemy is a pervasive phenomenon in human language, and the meaning of a polysemous word depends on its context. To this end, we use the hidden states of the pre-trained model to represent a word. Note that most Chinese pre-trained models are character-based. To obtain real word embeddings (not character embeddings), users can download the word-based BERT model and its vocabulary. An example of using scripts/topn_words_indep.py to find nearest neighbours with context-independent word embeddings (character-based and word-based models):

python3 scripts/topn_words_indep.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --test_path target_words.txt

python3 scripts/topn_words_indep.py --load_model_path models/wiki_bert_word_model.bin --vocab_path models/wiki_word_vocab.txt \
                                    --test_path target_words.txt

The context-independent word embedding comes from the embedding layer. The format of target_words.txt is as follows:

word-1
word-2
...
word-n

An example of using scripts/topn_words_dep.py to find nearest neighbours with context-dependent word embeddings (character-based and word-based models):

python3 scripts/topn_words_dep.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                  --cand_vocab_path models/google_zh_vocab.txt --test_path target_words_with_sentences.txt --config_path models/bert/base_config.json \
                                  --batch_size 256 --seq_length 32 --tokenizer bert

python3 scripts/topn_words_dep.py --load_model_path models/bert_wiki_word_model.bin --vocab_path models/wiki_word_vocab.txt \
                                  --cand_vocab_path models/wiki_word_vocab.txt --test_path target_words_with_sentences.txt --config_path models/bert/base_config.json \
                                  --batch_size 256 --seq_length 32 --tokenizer space

We substitute the target word with other words in the vocabulary and feed the sentences into the pre-trained model. The hidden state is used as the context-dependent embedding of a word (a sketch of this substitution-and-ranking procedure is given after the format description below). --cand_vocab_path specifies the path of the candidate word file. For faster speed, one can use a smaller candidate vocabulary. Users should do word segmentation manually and use the space tokenizer if a word-based model is used. The format of target_words_with_sentences.txt is as follows:

word1 sent1
word2 sent2 
...
wordn sentn

The word and the sentence are separated by \t.
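
A minimal sketch of the substitution-and-ranking procedure (not the script's code; the stand-in encoder only makes the sketch runnable, in practice the pre-trained model's hidden states are used):

import numpy as np

def encode(tokens):
    # Stand-in encoder: returns one 768-dimensional vector per token.
    # In practice these are the hidden states of the pre-trained model.
    rng = np.random.default_rng(abs(hash(tuple(tokens))) % (2 ** 32))
    return rng.standard_normal((len(tokens), 768))

def topn_neighbors(sentence_tokens, target_pos, candidates, n=10):
    target_vec = encode(sentence_tokens)[target_pos]
    scores = []
    for cand in candidates:
        tokens = list(sentence_tokens)
        tokens[target_pos] = cand                 # substitute the target word
        vec = encode(tokens)[target_pos]          # context-dependent embedding
        cos = vec @ target_vec / (np.linalg.norm(vec) * np.linalg.norm(target_vec))
        scores.append((cand, cos))
    return sorted(scores, key=lambda x: -x[1])[:n]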

Model average

average_models.py takes the average of multiple model weights, which may lead to more robust performance. An example of using average_models.py:

python3 scripts/average_models.py --model_list_path models/book_review_model.bin-4000 models/book_review_model.bin-5000 \
                                  --output_model_path models/book_review_model.bin
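
Conceptually, the script averages the parameters element-wise. A minimal sketch of parameter averaging, assuming each checkpoint is a plain state dict saved with torch.save (paths mirror the command above):

import torch

paths = ["models/book_review_model.bin-4000", "models/book_review_model.bin-5000"]
state_dicts = [torch.load(p, map_location="cpu") for p in paths]

# Average every parameter tensor element-wise across the checkpoints.
averaged = {}
for key in state_dicts[0]:
    averaged[key] = sum(sd[key] for sd in state_dicts) / len(state_dicts)

torch.save(averaged, "models/book_review_model.bin")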

Text generator (language model)

We can use generate_lm.py to generate text with a language model. Given a few words as the beginning, generate_lm.py continues writing the text. An example of using generate_lm.py to load GPT-2-distil and continue writing:

python3 scripts/generate_lm.py --load_model_path models/gpt_model.bin --vocab_path models/google_zh_vocab.txt \
                               --test_path beginning.txt --prediction_path generated_text.txt \
                               --config_path models/gpt2/distil_config.json --seq_length 128 \
                               --embedding word_pos --remove_embedding_layernorm \
                               --encoder transformer --mask causal --layernorm_positioning pre \
                               --target lm --tie_weights

where beginning.txt contains the beginning of a text and generated_text.txt contains the text that the model writes.
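
For intuition, the continuation is produced autoregressively: at each step the language model scores the next token given the current prefix, and the chosen token is appended to the prefix. A toy sketch of this loop with a stand-in scoring function (not generate_lm.py itself):

import random

def next_token_scores(prefix, vocab):
    # Stand-in for the language model: returns arbitrary but repeatable scores.
    random.seed(" ".join(prefix))
    return {token: random.random() for token in vocab}

vocab = ["the", "model", "writes", "text", "."]
prefix = ["the", "model"]
for _ in range(5):
    scores = next_token_scores(prefix, vocab)
    prefix.append(max(scores, key=scores.get))   # greedy decoding
print(" ".join(prefix))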

Text generator (seq2seq model)

We can use generate_seq2seq.py to generate text with a seq2seq model. An example of using generate_seq2seq.py to load the Transformer translation model (zh_en) and translate from Chinese to English:

python3 scripts/generate_seq2seq.py --load_model_path models/iwslt_zh_en_model.bin-50000 \
                                    --vocab_path models/google_zh_vocab.txt --tgt_vocab_path models/google_uncased_en_vocab.txt --tgt_tokenizer bert \
                                    --test_path input.txt --prediction_path output.txt \
                                    --config_path models/encoder_decoder_config.json --seq_length 64 --tgt_seq_length 64 \
                                    --embedding word_sinusoidalpos --tgt_embedding word_sinusoidalpos \
                                    --encoder transformer --mask fully_visible --decoder transformer

where --test_path specifies the path of text to be translated and --prediction_path specifies the path of the translated text.

Compare vocabularies

We provide diff_vocab.py to compare two vocabularies. Here is an example of comparing a poem vocabulary with an ancient Chinese vocabulary using diff_vocab.py:

python scripts/diff_vocab.py --vocab_1 models/google_zh_poem_vocab.txt --vocab_2 models/google_zh_ancient_vocab.txt

--vocab_1 and --vocab_2 specify the paths of the two vocabularies to be compared.

You may get the following output:

vocab_1: models/google_zh_poem_vocab.txt, size: 22556
vocab_2: models/google_zh_ancient_vocab.txt, size: 25369
vocab_1 - vocab_2 = 114
vocab_2 - vocab_1 = 2927
vocab_1 & vocab_2 = 22442

which shows the sizes of the two set differences and of the intersection.
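
The report is plain set arithmetic over the two token lists. A minimal sketch of the same computation (assuming one token per line):

def read_vocab(path):
    with open(path, encoding="utf-8") as f:
        return set(line.strip() for line in f if line.strip())

v1 = read_vocab("models/google_zh_poem_vocab.txt")
v2 = read_vocab("models/google_zh_ancient_vocab.txt")
print("vocab_1 - vocab_2 =", len(v1 - v2))
print("vocab_2 - vocab_1 =", len(v2 - v1))
print("vocab_1 & vocab_2 =", len(v1 & v2))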

Adapt old model to new vocabulary

We can use dynamic_vocab_adapter.py to produce a new model according to a new vocabulary by adapting the embedding layer and the softmax layer of the old model. If a token from the new vocabulary exists in the old vocabulary, its parameters are copied directly; otherwise, they are randomly initialized. Specifically, the following command uses dynamic_vocab_adapter.py to adapt the old model models/google_zh_model.bin with the old vocabulary models/google_zh_vocab.txt to the new vocabulary models/google_zh_poem_vocab.txt, producing a new model models/google_zh_poem_model.bin.

python scripts/dynamic_vocab_adapter.py --old_model_path models/google_zh_model.bin \
                                        --old_vocab_path models/google_zh_vocab.txt \
                                        --new_vocab_path models/google_zh_poem_vocab.txt \
                                        --new_model_path models/google_zh_poem_model.bin

--old_model_path and --old_vocab_path specify the paths of the old model and vocabulary, while --new_model_path and --new_vocab_path specify the paths of the new model and vocabulary.
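
Conceptually, the adaptation copies the parameter rows of tokens shared with the old vocabulary and randomly initializes the rest. A minimal sketch for the embedding matrix (not the script's code; the 0.02 standard deviation is an assumption):

import torch

def adapt_embedding(old_embedding, old_vocab, new_vocab):
    # old_embedding: tensor of shape (len(old_vocab), hidden_size)
    hidden_size = old_embedding.size(1)
    new_embedding = torch.randn(len(new_vocab), hidden_size) * 0.02
    old_index = {token: i for i, token in enumerate(old_vocab)}
    for i, token in enumerate(new_vocab):
        if token in old_index:
            # Token exists in the old vocabulary: copy its parameters.
            new_embedding[i] = old_embedding[old_index[token]]
    return new_embedding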

Model conversion

Converting models from TencentPretrain format to Huggingface format (PyTorch):

We provide usage instructions for the TencentPretrain-to-Huggingface conversion scripts in the Huggingface model repository (uer).

Converting models from Huggingface format (PyTorch) to TencentPretrain format:

BART: Taking the bart-base-chinese-cluecorpussmall model in Huggingface as an example:

python3 scripts/convert_bart_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                        --output_model_path tencentpretrain_pytorch_model.bin \
                                                        --layers_num 6

ALBERT: Taking the albert-base-chinese-cluecorpussmall model in Huggingface as an example:

python3 scripts/convert_albert_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                          --output_model_path tencentpretrain_pytorch_model.bin 

RoBERTa: Taking the chinese_roberta_L-2_H-128 model in Huggingface as an example:

python3 scripts/convert_bert_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                        --output_model_path tencentpretrain_pytorch_model.bin \
                                                        --layers_num 2 --type mlm

GPT-2: Taking the gpt2-chinese-cluecorpussmall model in Huggingface as an example:

python3 scripts/convert_gpt2_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                        --output_model_path tencentpretrain_pytorch_model.bin \
                                                        --layers_num 12

RoBERTa (BERT) for extractive QA: Taking the roberta-base-chinese-extractive-qa model in Huggingface as an example:

python3 scripts/convert_bert_extractive_qa_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                      --output_model_path tencentpretrain_pytorch_model.bin \
                                                                      --layers_num 12

RoBERTa (BERT) for text classification: Taking the roberta-base-finetuned-dianping-chinese model in Huggingface as an example:

python3 scripts/convert_bert_text_classification_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                            --output_model_path tencentpretrain_pytorch_model.bin \
                                                                            --layers_num 12

RoBERTa (BERT) for token classification (sequence labeling): Taking the roberta-base-finetuned-cluener2020-chinese model in Huggingface as an example:

python3 scripts/convert_bert_token_classification_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                             --output_model_path tencentpretrain_pytorch_model.bin \
                                                                             --layers_num 12

T5: Taking the t5-base-chinese-cluecorpussmall model in Huggingface as an example:

python3 scripts/convert_t5_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                      --output_model_path tencentpretrain_pytorch_model.bin \
                                                      --layers_num 12 \
                                                      --type t5

T5-v1_1: Taking the t5-v1_1-small-chinese-cluecorpussmall model in Huggingface as an example:

python3 scripts/convert_t5_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                      --output_model_path tencentpretrain_pytorch_model.bin \
                                                      --layers_num 8 \
                                                      --type t5-v1_1

Pegasus: Taking the pegasus-base-chinese-cluecorpussmall model in Huggingface as an example:

python3 scripts/convert_pegasus_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                           --output_model_path tencentpretrain_pytorch_model.bin \
                                                           --layers_num 12

XLM-RoBERTa: Taking the xlm-roberta-base model in Huggingface as an example:

python3 scripts/convert_xlmroberta_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                              --output_model_path tencentpretrain_pytorch_model.bin \
                                                              --layers_num 12
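
--layers_num must match the number of Transformer layers in the checkpoint. A quick way to check is to list the parameter names and shapes of the Huggingface checkpoint (assuming it is a state dict saved with torch.save; key names differ between model families):

import torch

# Print every parameter name and its shape; counting the repeated
# per-layer blocks among the names gives the value to pass to --layers_num.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))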