Experiments

Speed

GPU: Tesla P40

CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

We use BERT to test the speed of the distributed training mode. Google BERT is trained for 1 million steps, and each step processes 128,000 tokens. Reproducing this with UER-py takes around 18 days on 3 GPU machines (24 GPUs in total); a quick check of this estimate follows the table below.

| #(machine) | #(GPU)/machine | tokens/second |
| ---------- | -------------- | ------------- |
| 1          | 0              | 276           |
| 1          | 1              | 7050          |
| 1          | 2              | 13071         |
| 1          | 4              | 24695         |
| 1          | 8              | 44300         |
| 3          | 8              | 84386         |
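
As a quick sanity check on the 18-day figure (an illustrative calculation, not part of the original benchmark), 1 million steps at 128,000 tokens per step, divided by the measured throughput on 3 machines, gives roughly 17.6 days:

```python
# Rough check of the pre-training time estimate (illustrative only).
total_steps = 1_000_000      # Google BERT pre-training steps
tokens_per_step = 128_000    # tokens processed per step
tokens_per_second = 84_386   # measured throughput on 3 machines x 8 GPUs (table above)

seconds = total_steps * tokens_per_step / tokens_per_second
print(f"{seconds / 86_400:.1f} days")  # -> 17.6 days, i.e. around 18 days
```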

Qualitative evaluation

We qualitatively evaluate pre-trained models by finding the nearest neighbors of words in the embedding space.
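
A minimal sketch of this nearest-neighbor lookup, assuming the context-independent vectors are rows of the model's embedding matrix; `vocab` and `embedding` below are placeholders, not part of UER-py's API:

```python
import torch
import torch.nn.functional as F

def nearest_neighbors(word, vocab, embedding, top_k=5):
    """Return the top_k words whose vectors have the highest cosine
    similarity with the vector of `word`.

    vocab:     dict mapping word -> row index of the embedding matrix
    embedding: FloatTensor of shape [vocab_size, hidden_size]
    """
    target = embedding[vocab[word]]                       # [hidden_size]
    sims = F.cosine_similarity(target.unsqueeze(0), embedding, dim=-1)
    ids = sims.topk(top_k + 1).indices.tolist()           # +1 so we can drop the word itself
    id2word = {i: w for w, i in vocab.items()}
    return [(id2word[i], round(sims[i].item(), 3))
            for i in ids if id2word[i] != word][:top_k]
```

The tables below list exactly this kind of output: each neighbor together with its cosine similarity to the target word.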

Character-based model

Evaluation of context-independent word embedding:

| Target word: 苹 |       | Target word: 吃 (eat) |       | Target word: 水 (water) |       |
| --------------- | ----- | --------------------- | ----- | ----------------------- | ----- |
|                 | 0.762 |                       | 0.539 |                         | 0.286 |
| apple           | 0.447 |                       | 0.475 |                         | 0.278 |
| iphone          | 0.400 |                       | 0.340 | water                   | 0.276 |
|                 | 0.347 |                       | 0.324 |                         | 0.266 |
| ios             | 0.317 |                       | 0.322 |                         | 0.259 |

Evaluation of context-dependent word embedding:

Target sentence: 其冲积而形成小平原沙土层厚而肥沃,盛产苹果、大樱桃、梨和葡萄。 (The small alluvial plain has a thick, fertile layer of sandy soil and abounds in apples, sweet cherries, pears, and grapes.)

| Target word: 苹 |       |
| --------------- | ----- |
|                 | 0.822 |
|                 | 0.714 |
|                 | 0.706 |
|                 | 0.704 |
|                 | 0.696 |

Target sentence: 苹果削减了台式Mac产品线上的众多产品。 (Apple cut many products from its desktop Mac line.)

| Target word: 苹 |       |
| --------------- | ----- |
|                 | 0.892 |
| apple           | 0.788 |
| iphone          | 0.743 |
| ios             | 0.720 |
| ipad            | 0.706 |
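
The context-dependent neighbors above are not taken from the embedding table but from the encoder's hidden states, which is why the same character gets different neighbors in different sentences. A minimal sketch of the idea, assuming a BERT-style encoder that returns one hidden vector per token (the function and variable names are illustrative, not UER-py's API):

```python
import torch

def contextual_vector(model, tokenizer, sentence, target):
    """Return the hidden state of the first occurrence of `target`
    inside `sentence`.

    model:     BERT-style encoder returning [1, seq_length, hidden_size]
    tokenizer: provides tokenize() and convert_tokens_to_ids()
    """
    tokens = tokenizer.tokenize(sentence)
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        hidden = model(ids)              # [1, seq_length, hidden_size]
    position = tokens.index(target)      # position of the target token
    return hidden[0, position]           # [hidden_size]
```

Comparing this vector against other word vectors with the same cosine similarity as above is what makes 苹 close to fruit words in the orchard sentence and close to apple/iphone/ios in the Mac sentence.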

Word-based model

Evaluation of context-independent word embedding:

| Target word: 苹果 (apple) |       | Target word: 腾讯 (Tencent) |       | Target word: 吉利 (Geely / "auspicious") |       |
| ------------------------- | ----- | --------------------------- | ----- | ---------------------------------------- | ----- |
| 苹果公司 (Apple Inc.)     | 0.419 | 新浪 (Sina)                 | 0.357 | 沃尔沃 (Volvo)                           | 0.277 |
| apple                     | 0.415 | 网易 (NetEase)              | 0.356 | 伊利 (Yili)                              | 0.243 |
| 苹果电脑 (Apple computer) | 0.349 | 搜狐 (Sohu)                 | 0.356 | 长荣 (Evergreen)                         | 0.235 |
| 微软 (Microsoft)          | 0.320 | 百度 (Baidu)                | 0.341 | 天安                                     | 0.224 |
| mac                       | 0.298 | 乐视 (LeTV)                 | 0.332 | 哈达                                     | 0.220 |

Evaluation of context-dependent word embedding:

Target sentence: 其冲积而形成小平原沙土层厚而肥沃,盛产苹果、大樱桃、梨和葡萄。 (The small alluvial plain has a thick, fertile layer of sandy soil and abounds in apples, sweet cherries, pears, and grapes.)

| Target word: 苹果 |       |
| ----------------- | ----- |
| 柠檬 (lemon)      | 0.734 |
| 草莓 (strawberry) | 0.725 |
| 荔枝 (lychee)     | 0.719 |
| 树林 (forest)     | 0.697 |
| 牡丹 (peony)      | 0.686 |

Target sentence: 苹果削减了台式Mac产品线上的众多产品。 (Apple cut many products from its desktop Mac line.)

| Target word: 苹果         |       |
| ------------------------- | ----- |
| 苹果公司 (Apple Inc.)     | 0.836 |
| apple                     | 0.829 |
| 福特 (Ford)               | 0.796 |
| 微软 (Microsoft)          | 0.777 |
| 苹果电脑 (Apple computer) | 0.773 |

Target sentence: 讨吉利是通过做民间习俗的吉祥事,或重现过去曾经得到好结果的行为,以求得好兆头。 (Seeking good luck means doing auspicious things from folk custom, or repeating behavior that once brought good results, in order to obtain a good omen.)

| Target word: 吉利  |       |
| ------------------ | ----- |
| 仁德 (benevolence) | 0.749 |
| 光彩 (honor)       | 0.743 |
| 愉快 (joyful)      | 0.736 |
| 永元               | 0.736 |
| 仁和               | 0.732 |

Target sentence: 2010年6月2日福特汽车公司宣布出售旗下高端汽车沃尔沃予中国浙江省的吉利汽车,同时将于2010年第四季停止旗下中阶房车品牌所有业务 (On June 2, 2010, Ford Motor Company announced the sale of its premium car brand Volvo to Geely Automobile of Zhejiang Province, China, and that it would cease all business of its mid-range sedan brand in the fourth quarter of 2010.)

| Target word: 吉利 |       |
| ----------------- | ----- |
| 沃尔沃 (Volvo)    | 0.771 |
| 卡比              | 0.751 |
| 永利              | 0.745 |
| 天安              | 0.741 |
| 仁和              | 0.741 |

Target sentence: 主要演员有扎克·布拉夫、萨拉·朝克、唐纳德·费森、尼尔·弗林、肯·詹金斯、约翰·麦吉利、朱迪·雷耶斯、迈克尔·莫斯利等。 (The main cast includes Zach Braff, Sarah Chalke, Donald Faison, Neil Flynn, Ken Jenkins, John C. McGinley, Judy Reyes, Michael Mosley, and others.)

| Target word: 吉利 |       |
| ----------------- | ----- |
| 玛利              | 0.791 |
| 米格              | 0.768 |
| 韦利              | 0.767 |
| 马力              | 0.764 |
| 安吉              | 0.761 |

Quantitative evaluation

We use a range of Chinese datasets to evaluate the performance of UER-py. Douban book review, ChnSentiCorp, Shopping, and Tencentnews are small-scale, sentence-level sentiment classification datasets; they are included in this project. MSRA-NER is a sequence labeling dataset. Dianping, JDfull, JDbinary, Ifeng, and Chinanews are large-scale classification datasets; they were collected by the Glyph project and can be downloaded from Glyph's GitHub repository. These five datasets do not contain validation sets, so we use 10% of the training instances for validation, as sketched below.
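
The 10% validation split for the five Glyph datasets can be reproduced with something like the following (a sketch assuming one tab-separated example per line; the file names are placeholders):

```python
import random

random.seed(7)  # fix the split for reproducibility
with open("train_full.tsv", encoding="utf-8") as f:
    lines = f.readlines()

random.shuffle(lines)
cut = len(lines) // 10  # 10% of the training instances for validation
with open("dev.tsv", "w", encoding="utf-8") as f:
    f.writelines(lines[:cut])
with open("train.tsv", "w", encoding="utf-8") as f:
    f.writelines(lines[cut:])
```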

Most pre-training models involve two stages: pre-training on a general-domain corpus and fine-tuning on the downstream dataset. We recommend a three-stage mode: 1) pre-training on a general-domain corpus; 2) pre-training on the downstream dataset; 3) fine-tuning on the downstream dataset. Stage 2 lets the model adapt to the distribution of the downstream task, which is sometimes known as semi-supervised fine-tuning.

Hyper-parameter settings are as follows:

  • Stage 1: We train with a batch size of 256 sequences, each containing 256 tokens. We load Google's pre-trained models and continue training on them for 500,000 steps. The learning rate is 2e-5 and the other optimizer settings are identical to Google BERT's (see the sketch after this list). The BERT tokenizer is used.
  • Stage 2: We train with a batch size of 256 sequences. For classification datasets the sequence length is 128; for sequence labeling datasets it is 256. We continue training on Google's pre-trained model for 20,000 steps. Optimizer settings and tokenizer are identical to stage 1.
  • Stage 3: For classification datasets, the training batch size is 64 and the number of epochs is 3. For sequence labeling datasets, the batch size is 32 and the number of epochs is 5. Optimizer settings and tokenizer are identical to stage 1.
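
For reference, "optimizer settings identical to Google BERT" means Adam with weight decay plus a linear warmup followed by linear decay. The helper below is a plain-PyTorch sketch of that schedule, not UER-py's exact implementation; the warmup proportion of 0.1 is an assumption taken from BERT's common fine-tuning recipe:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, total_steps, lr=2e-5, warmup=0.1):
    """Adam with weight decay and a linear warmup/decay schedule,
    in the style of Google BERT's recipe (illustrative sketch)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    warmup_steps = int(total_steps * warmup)

    def schedule(step):
        if step < warmup_steps:                      # linear warmup
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) /       # linear decay to zero
                        max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, schedule)
```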

We provide the pre-trained models (using the BERT target) on different downstream datasets: book_review_model.bin, chnsenticorp_model.bin, shopping_model.bin, and msra_model.bin. The Tencentnews dataset and its pre-trained model will be made publicly available after data desensitization.

| Model/Dataset        | Douban book review | ChnSentiCorp | Shopping | MSRA-NER       | Tencentnews review |
| -------------------- | ------------------ | ------------ | -------- | -------------- | ------------------ |
| BERT                 | 87.5               | 94.3         | 96.3     | 93.0/92.4/92.7 | 84.2               |
| BERT+semi_BertTarget | 88.1               | 95.6         | 97.0     | 94.3/92.6/93.4 | 85.1               |
| BERT+semi_MlmTarget  | 87.9               | 95.5         | 97.1     |                | 85.1               |

Pre-training is also important for other encoders and targets. We pre-train a 2-layer LSTM on a 1.9 GB review corpus with the language model target. The embedding size and hidden size are both 512. The model is much more efficient than BERT in both the pre-training and fine-tuning stages. We show that pre-training brings significant improvements and achieves competitive results (the gap to BERT is small); a sketch of the LSTM language model follows the table below.

| Model/Dataset     | Douban book review | ChnSentiCorp | Shopping   |
| ----------------- | ------------------ | ------------ | ---------- |
| BERT              | 87.5               | 94.3         | 96.3       |
| LSTM              | 80.2               | 88.3         | 94.4       |
| LSTM+pre-training | 86.6(+6.4)         | 94.5(+6.2)   | 96.5(+2.1) |
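
A minimal sketch of such an LSTM language model (2 layers, embedding and hidden size 512), written in plain PyTorch as an illustration rather than UER-py's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LstmLanguageModel(nn.Module):
    """2-layer LSTM trained with the language model target:
    position t predicts token t+1."""

    def __init__(self, vocab_size, emb_size=512, hidden_size=512, layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.encoder = nn.LSTM(emb_size, hidden_size,
                               num_layers=layers, batch_first=True)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        # tokens: [batch_size, seq_length]
        hidden, _ = self.encoder(self.embedding(tokens))
        logits = self.output(hidden)                 # [batch, seq, vocab]
        # shift by one so that each position predicts the next token
        return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
```

For fine-tuning, the language-model projection is dropped and a task-specific classification layer is placed on top of the same pre-trained encoder.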

Fine-tuning on large-scale datasets requires tremendous computational resources. For the Ifeng, Chinanews, Dianping, JDbinary, and JDfull datasets, we provide the corresponding classification models (see the Chinese model zoo). These models allow users to reproduce the results without training, and they can also be used to improve other related tasks. More experimental results will come soon.

The Ifeng and Chinanews datasets contain news titles and abstracts. In stage 2, we use the title to predict the abstract (see the sketch after the table below).

| Model/Dataset            | Ifeng | Chinanews | Dianping | JDbinary | JDfull |
| ------------------------ | ----- | --------- | -------- | -------- | ------ |
| pre-SOTA (Glyph & Glyce) | 85.76 | 91.88     | 78.46    | 91.76    | 54.24  |
| BERT                     | 87.50 | 93.37     |          | 92.37    | 54.79  |
| BERT+semi+BertTarget     | 87.65 |           |          |          |        |
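
A sketch of how the stage-2 corpus mentioned above (title predicts abstract) can be laid out, assuming a tab-separated source file with title and abstract columns; the file names, column order, and one-sentence-per-line-with-blank-line-between-documents layout are assumptions to be checked against the project's corpus format:

```python
# Turn (title, abstract) pairs into a stage-2 pre-training corpus
# (illustrative; verify the corpus format expected by the pre-training script).
with open("chinanews_train.tsv", encoding="utf-8") as src, \
     open("stage2_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        title, abstract = line.rstrip("\n").split("\t")[:2]
        dst.write(title + "\n")     # sentence A
        dst.write(abstract + "\n")  # sentence B
        dst.write("\n")             # blank line separates documents
```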

We also provide pre-trained models built on different corpora, encoders, and targets (see the Chinese model zoo). Selecting a proper pre-training model is beneficial to the performance of downstream tasks.

| Model/Dataset          | MSRA-NER       |
| ---------------------- | -------------- |
| Wikizh corpus (Google) | 93.0/92.4/92.7 |
| Renminribao corpus     | 94.4/94.4/94.4 |
