Experiments

Speed

GPU: Tesla P40

CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

We use BERT to test the speed of the distributed training mode. Google BERT is trained for 1 million steps, and each step processes 128,000 tokens. Reproducing this with UER-py takes around 18 days on 3 GPU machines (24 GPUs in total); a quick check of this estimate follows the table below.

| #(machine) | #(GPU)/machine | tokens/second |
| ---------- | -------------- | ------------- |
| 1          | 0              | 276           |
| 1          | 1              | 7050          |
| 1          | 2              | 13071         |
| 1          | 4              | 24695         |
| 1          | 8              | 44300         |
| 3          | 8              | 84386         |
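
As a quick sanity check on the 18-day figure (an illustrative calculation, not part of the original benchmark), 1 million steps at 128,000 tokens per step, divided by the measured throughput on 3 machines, gives roughly 17.6 days:

```python
# Rough check of the pre-training time estimate (illustrative only).
total_steps = 1_000_000      # Google BERT pre-training steps
tokens_per_step = 128_000    # tokens processed per step
tokens_per_second = 84_386   # measured throughput on 3 machines x 8 GPUs (table above)

seconds = total_steps * tokens_per_step / tokens_per_second
print(f"{seconds / 86_400:.1f} days")  # -> 17.6 days, i.e. around 18 days
```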

Qualitative evaluation

We qualitatively evaluate pre-trained models by finding the nearest neighbors of words in the embedding space.
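
A minimal sketch of this nearest-neighbor lookup, assuming the context-independent vectors are rows of the model's embedding matrix; `vocab` and `embedding` below are placeholders, not part of UER-py's API:

```python
import torch
import torch.nn.functional as F

def nearest_neighbors(word, vocab, embedding, top_k=5):
    """Return the top_k words whose vectors have the highest cosine
    similarity with the vector of `word`.

    vocab:     dict mapping word -> row index of the embedding matrix
    embedding: FloatTensor of shape [vocab_size, hidden_size]
    """
    target = embedding[vocab[word]]                       # [hidden_size]
    sims = F.cosine_similarity(target.unsqueeze(0), embedding, dim=-1)
    ids = sims.topk(top_k + 1).indices.tolist()           # +1 so we can drop the word itself
    id2word = {i: w for w, i in vocab.items()}
    return [(id2word[i], round(sims[i].item(), 3))
            for i in ids if id2word[i] != word][:top_k]
```

The tables below list exactly this kind of output: each neighbor together with its cosine similarity to the target word.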

Character-based model

Evaluation of context-independent word embedding:

| Target word: 苹 |       | Target word: 吃 (eat) |       | Target word: 水 (water) |       |
| --------------- | ----- | --------------------- | ----- | ----------------------- | ----- |
|                 | 0.762 |                       | 0.539 |                         | 0.286 |
| apple           | 0.447 |                       | 0.475 |                         | 0.278 |
| iphone          | 0.400 |                       | 0.340 | water                   | 0.276 |
|                 | 0.347 |                       | 0.324 |                         | 0.266 |
| ios             | 0.317 |                       | 0.322 |                         | 0.259 |

Evaluation of context-dependent word embedding:

Target sentence: 其冲积而形成小平原沙土层厚而肥沃,盛产苹果、大樱桃、梨和葡萄。 (The small alluvial plain has a thick, fertile layer of sandy soil and abounds in apples, sweet cherries, pears, and grapes.)

| Target word: 苹 |       |
| --------------- | ----- |
|                 | 0.822 |
|                 | 0.714 |
|                 | 0.706 |
|                 | 0.704 |
|                 | 0.696 |

Target sentence: 苹果削减了台式Mac产品线上的众多产品。 (Apple cut many products from its desktop Mac line.)

| Target word: 苹 |       |
| --------------- | ----- |
|                 | 0.892 |
| apple           | 0.788 |
| iphone          | 0.743 |
| ios             | 0.720 |
| ipad            | 0.706 |
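
The context-dependent neighbors above are not taken from the embedding table but from the encoder's hidden states, which is why the same character gets different neighbors in different sentences. A minimal sketch of the idea, assuming a BERT-style encoder that returns one hidden vector per token (the function and variable names are illustrative, not UER-py's API):

```python
import torch

def contextual_vector(model, tokenizer, sentence, target):
    """Return the hidden state of the first occurrence of `target`
    inside `sentence`.

    model:     BERT-style encoder returning [1, seq_length, hidden_size]
    tokenizer: provides tokenize() and convert_tokens_to_ids()
    """
    tokens = tokenizer.tokenize(sentence)
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        hidden = model(ids)              # [1, seq_length, hidden_size]
    position = tokens.index(target)      # position of the target token
    return hidden[0, position]           # [hidden_size]
```

Comparing this vector against other word vectors with the same cosine similarity as above is what makes 苹 close to fruit words in the orchard sentence and close to apple/iphone/ios in the Mac sentence.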

Word-based model

Evaluation of context-independent word embedding:

| Target word: 苹果 (apple) |       | Target word: 腾讯 (Tencent) |       | Target word: 吉利 (Geely / "auspicious") |       |
| ------------------------- | ----- | --------------------------- | ----- | ---------------------------------------- | ----- |
| 苹果公司 (Apple Inc.)     | 0.419 | 新浪 (Sina)                 | 0.357 | 沃尔沃 (Volvo)                           | 0.277 |
| apple                     | 0.415 | 网易 (NetEase)              | 0.356 | 伊利 (Yili)                              | 0.243 |
| 苹果电脑 (Apple computer) | 0.349 | 搜狐 (Sohu)                 | 0.356 | 长荣 (Evergreen)                         | 0.235 |
| 微软 (Microsoft)          | 0.320 | 百度 (Baidu)                | 0.341 | 天安                                     | 0.224 |
| mac                       | 0.298 | 乐视 (LeTV)                 | 0.332 | 哈达                                     | 0.220 |

Evaluation of context-dependent word embedding:

Target sentence: 其冲积而形成小平原沙土层厚而肥沃,盛产苹果、大樱桃、梨和葡萄。 (The small alluvial plain has a thick, fertile layer of sandy soil and abounds in apples, sweet cherries, pears, and grapes.)

| Target word: 苹果 |       |
| ----------------- | ----- |
| 柠檬 (lemon)      | 0.734 |
| 草莓 (strawberry) | 0.725 |
| 荔枝 (lychee)     | 0.719 |
| 树林 (forest)     | 0.697 |
| 牡丹 (peony)      | 0.686 |

Target sentence: 苹果削减了台式Mac产品线上的众多产品。 (Apple cut many products from its desktop Mac line.)

| Target word: 苹果         |       |
| ------------------------- | ----- |
| 苹果公司 (Apple Inc.)     | 0.836 |
| apple                     | 0.829 |
| 福特 (Ford)               | 0.796 |
| 微软 (Microsoft)          | 0.777 |
| 苹果电脑 (Apple computer) | 0.773 |

Target sentence: 讨吉利是通过做民间习俗的吉祥事,或重现过去曾经得到好结果的行为,以求得好兆头。 (Seeking good luck means doing auspicious things from folk custom, or repeating behavior that once brought good results, in order to obtain a good omen.)

| Target word: 吉利  |       |
| ------------------ | ----- |
| 仁德 (benevolence) | 0.749 |
| 光彩 (honor)       | 0.743 |
| 愉快 (joyful)      | 0.736 |
| 永元               | 0.736 |
| 仁和               | 0.732 |

Target sentence: 2010年6月2日福特汽车公司宣布出售旗下高端汽车沃尔沃予中国浙江省的吉利汽车,同时将于2010年第四季停止旗下中阶房车品牌所有业务 (On June 2, 2010, Ford Motor Company announced the sale of its premium car brand Volvo to Geely Automobile of Zhejiang Province, China, and that it would cease all business of its mid-range sedan brand in the fourth quarter of 2010.)

| Target word: 吉利 |       |
| ----------------- | ----- |
| 沃尔沃 (Volvo)    | 0.771 |
| 卡比              | 0.751 |
| 永利              | 0.745 |
| 天安              | 0.741 |
| 仁和              | 0.741 |

Target sentence: 主要演员有扎克·布拉夫、萨拉·朝克、唐纳德·费森、尼尔·弗林、肯·詹金斯、约翰·麦吉利、朱迪·雷耶斯、迈克尔·莫斯利等。 (The main cast includes Zach Braff, Sarah Chalke, Donald Faison, Neil Flynn, Ken Jenkins, John C. McGinley, Judy Reyes, Michael Mosley, and others.)

| Target word: 吉利 |       |
| ----------------- | ----- |
| 玛利              | 0.791 |
| 米格              | 0.768 |
| 韦利              | 0.767 |
| 马力              | 0.764 |
| 安吉              | 0.761 |

Quantitative evaluation

We use a range of Chinese datasets to evaluate the performance of UER-py. Douban book review, ChnSentiCorp, Shopping, and Tencentnews are small-scale, sentence-level sentiment classification datasets; they are included in this project. MSRA-NER is a sequence labeling dataset. Dianping, JDfull, JDbinary, Ifeng, and Chinanews are large-scale classification datasets; they were collected by the Glyph project and can be downloaded from Glyph's GitHub repository. These five datasets do not contain validation sets, so we use 10% of the training instances for validation, as sketched below.
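
The 10% validation split for the five Glyph datasets can be reproduced with something like the following (a sketch assuming one tab-separated example per line; the file names are placeholders):

```python
import random

random.seed(7)  # fix the split for reproducibility
with open("train_full.tsv", encoding="utf-8") as f:
    lines = f.readlines()

random.shuffle(lines)
cut = len(lines) // 10  # 10% of the training instances for validation
with open("dev.tsv", "w", encoding="utf-8") as f:
    f.writelines(lines[:cut])
with open("train.tsv", "w", encoding="utf-8") as f:
    f.writelines(lines[cut:])
```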

Most pre-training models involve two stages: pre-training on a general-domain corpus and fine-tuning on the downstream dataset. We recommend a three-stage mode: 1) pre-training on a general-domain corpus; 2) pre-training on the downstream dataset; 3) fine-tuning on the downstream dataset. Stage 2 lets the model adapt to the distribution of the downstream task, which is sometimes known as semi-supervised fine-tuning.

Hyper-parameter settings are as follows:

  • Stage 1: We train with a batch size of 256 sequences, each containing 256 tokens. We load Google's pre-trained models and continue training on them for 500,000 steps. The learning rate is 2e-5 and the other optimizer settings are identical to Google BERT's (see the sketch after this list). The BERT tokenizer is used.
  • Stage 2: We train with a batch size of 256 sequences. For classification datasets the sequence length is 128; for sequence labeling datasets it is 256. We continue training on Google's pre-trained model for 20,000 steps. Optimizer settings and tokenizer are identical to stage 1.
  • Stage 3: For classification datasets, the training batch size is 64 and the number of epochs is 3. For sequence labeling datasets, the batch size is 32 and the number of epochs is 5. Optimizer settings and tokenizer are identical to stage 1.
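
For reference, "optimizer settings identical to Google BERT" means Adam with weight decay plus a linear warmup followed by linear decay. The helper below is a plain-PyTorch sketch of that schedule, not UER-py's exact implementation; the warmup proportion of 0.1 is an assumption taken from BERT's common fine-tuning recipe:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, total_steps, lr=2e-5, warmup=0.1):
    """Adam with weight decay and a linear warmup/decay schedule,
    in the style of Google BERT's recipe (illustrative sketch)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    warmup_steps = int(total_steps * warmup)

    def schedule(step):
        if step < warmup_steps:                      # linear warmup
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) /       # linear decay to zero
                        max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, schedule)
```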

We provide the pre-trained models (using the BERT target) on different downstream datasets: book_review_model.bin, chnsenticorp_model.bin, shopping_model.bin, and msra_model.bin. The Tencentnews dataset and its pre-trained model will be made publicly available after data desensitization.

| Model/Dataset        | Douban book review | ChnSentiCorp | Shopping | MSRA-NER       | Tencentnews review |
| -------------------- | ------------------ | ------------ | -------- | -------------- | ------------------ |
| BERT                 | 87.5               | 94.3         | 96.3     | 93.0/92.4/92.7 | 84.2               |
| BERT+semi_BertTarget | 88.1               | 95.6         | 97.0     | 94.3/92.6/93.4 | 85.1               |
| BERT+semi_MlmTarget  | 87.9               | 95.5         | 97.1     |                | 85.1               |

Pre-training is also important for other encoders and targets. We pre-train a 2-layer LSTM on a 1.9 GB review corpus with the language model target. The embedding size and hidden size are both 512. The model is much more efficient than BERT in both the pre-training and fine-tuning stages. We show that pre-training brings significant improvements and achieves competitive results (the gap to BERT is small); a sketch of the LSTM language model follows the table below.

| Model/Dataset     | Douban book review | ChnSentiCorp | Shopping   |
| ----------------- | ------------------ | ------------ | ---------- |
| BERT              | 87.5               | 94.3         | 96.3       |
| LSTM              | 80.2               | 88.3         | 94.4       |
| LSTM+pre-training | 86.6(+6.4)         | 94.5(+6.2)   | 96.5(+2.1) |
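
A minimal sketch of such an LSTM language model (2 layers, embedding and hidden size 512), written in plain PyTorch as an illustration rather than UER-py's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LstmLanguageModel(nn.Module):
    """2-layer LSTM trained with the language model target:
    position t predicts token t+1."""

    def __init__(self, vocab_size, emb_size=512, hidden_size=512, layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.encoder = nn.LSTM(emb_size, hidden_size,
                               num_layers=layers, batch_first=True)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        # tokens: [batch_size, seq_length]
        hidden, _ = self.encoder(self.embedding(tokens))
        logits = self.output(hidden)                 # [batch, seq, vocab]
        # shift by one so that each position predicts the next token
        return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
```

For fine-tuning, the language-model projection is dropped and a task-specific classification layer is placed on top of the same pre-trained encoder.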

Fine-tuning on large-scale datasets requires tremendous computational resources. For the Ifeng, Chinanews, Dianping, JDbinary, and JDfull datasets, we provide the corresponding classification models (see the Chinese model zoo). These models allow users to reproduce the results without training, and they can also be used to improve other related tasks. More experimental results will come soon.

The Ifeng and Chinanews datasets contain news titles and abstracts. In stage 2, we use the title to predict the abstract (see the sketch after the table below).

| Model/Dataset            | Ifeng | Chinanews | Dianping | JDbinary | JDfull |
| ------------------------ | ----- | --------- | -------- | -------- | ------ |
| pre-SOTA (Glyph & Glyce) | 85.76 | 91.88     | 78.46    | 91.76    | 54.24  |
| BERT                     | 87.50 | 93.37     |          | 92.37    | 54.79  |
| BERT+semi+BertTarget     | 87.65 |           |          |          |        |
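
A sketch of how the stage-2 corpus mentioned above (title predicts abstract) can be laid out, assuming a tab-separated source file with title and abstract columns; the file names, column order, and one-sentence-per-line-with-blank-line-between-documents layout are assumptions to be checked against the project's corpus format:

```python
# Turn (title, abstract) pairs into a stage-2 pre-training corpus
# (illustrative; verify the corpus format expected by the pre-training script).
with open("chinanews_train.tsv", encoding="utf-8") as src, \
     open("stage2_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        title, abstract = line.rstrip("\n").split("\t")[:2]
        dst.write(title + "\n")     # sentence A
        dst.write(abstract + "\n")  # sentence B
        dst.write("\n")             # blank line separates documents
```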

We also provide pre-trained models built on different corpora, encoders, and targets (see the Chinese model zoo). Selecting a proper pre-training model is beneficial to the performance of downstream tasks.

| Model/Dataset          | MSRA-NER       |
| ---------------------- | -------------- |
| Wikizh corpus (Google) | 93.0/92.4/92.7 |
| Renminribao corpus     | 94.4/94.4/94.4 |
