Chinese README | English




Pre-trained language models (PLMs) have become an important technique in recent natural language processing, including multilingual NLP research. To promote NLP research on Chinese minority languages, the Joint Laboratory of HIT and iFLYTEK Research (HFL) has released the first specialized pre-trained language model **CINO** (**C**hinese m**INO**rity PLM).

Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | TextBrewer | TextPruner

More resources by HFL: https://github.com/ymcui/HFL-Anthology

News

Oct 29, 2022 We released a new pre-trained model called LERT. Check https://github.com/ymcui/LERT/

Aug 23, 2022 CINO has been accepted as a long paper at COLING 2022. We will update the final paper and release the corresponding resources after the camera-ready deadline.

Feb 21, 2022 CINO-small (6-layer, 148M parameters) has been released.

Jan 25, 2022 CINO-base-v2, CINO-large-v2, and WCM-v2 have been released.

Dec 17, 2021 We have released a model pruning toolkit, TextPruner. Check https://github.com/airaria/TextPruner

Oct 25, 2021 CINO-large and the Wiki-Chinese-Minority (WCM) dataset have been released.

Guide

| Section | Description |
| :------ | :---------- |
| Introduction | Introduction to CINO |
| Download | Download links and how-to-use |
| Quick Load | Learn how to quickly load our models through 🤗Transformers |
| Dataset for Chinese Minority Languages | Introduce Wiki-Chinese-Minority (WCM) and other datasets |
| Results | Results on several datasets |
| Citation | Citation and technical report |

Introduction

Multilingual pre-trained language models, such as mBERT and XLM-R, adopt masked language modeling (MLM) and other self-supervised objectives on corpora in various languages to support multilingual and cross-lingual abilities in NLP systems.

However, due to the scarcity of corpora in Chinese minority languages and the neglect of relevant research, current multilingual PLMs are not capable of dealing with these languages well.

We make the following contributions:

  • We propose CINO (Chinese mINOrity PLM), built on XLM-R and further pre-trained on corpora in Chinese minority languages.

  • To evaluate CINO as well as other multilingual PLMs, we also propose a new classification dataset, Wiki-Chinese-Minority (WCM), built from Wikipedia.

  • Experimental results on WCM, the Tibetan News Classification Corpus (TNCC), and KLUE-TC (YNAT) show that CINO achieves state-of-the-art performance.

CINO supports the following languages:

  • Chinese (zh)
  • Tibetan (bo)
  • Mongolian (Uighur form, mn)
  • Uyghur (ug)
  • Kazakh (Arabic form, kk)
  • Korean (ko)
  • Zhuang
  • Cantonese (yue)



Download

Direct Download

We provide PyTorch versions of CINO-small, CINO-base, and CINO-large (the v2 models are preferred). More models will be released in the future.

  • CINO-large-v2: 24-layer, 1024-hidden, 16-heads, vocabulary size 136K, 442M parameters
  • CINO-base-v2: 12-layer, 768-hidden, 12-heads, vocabulary size 136K, 190M parameters
  • CINO-small-v2: 6-layer, 768-hidden, 12-heads, vocabulary size 136K, 148M parameters
  • CINO-large: 24-layer, 1024-hidden, 16-heads, vocabulary size 275K, 585M parameters

Notice:

  • The v1 model (CINO-large) supports all the languages in XLM-R plus the minority languages listed above.
  • The v2 models (CINO-large-v2, CINO-base-v2, and CINO-small-v2) have pruned vocabularies and only support Chinese and the minority languages.

| Model | Size | Google Drive | Baidu Disk |
| :---- | :--- | :----------- | :--------- |
| CINO-large-v2 | 1.6GB | PyTorch | PyTorch (pw: 3fjt) |
| CINO-base-v2 | 705MB | PyTorch | PyTorch (pw: qnvc) |
| CINO-small-v2 | 564MB | PyTorch | PyTorch (pw: 9mc8) |
| CINO-large | 2.2GB | PyTorch | PyTorch (pw: wpyh) |

Download from 🤗transformers

You can also download our models from the 🤗Transformers Model Hub, including both PyTorch and TensorFlow 2 models.

| Model | Size | transformers model hub URL |
| :---- | :--- | :------------------------- |
| CINO-large-v2 | 1.6GB | https://huggingface.co/hfl/cino-large-v2 |
| CINO-base-v2 | 705MB | https://huggingface.co/hfl/cino-base-v2 |
| CINO-small-v2 | 564MB | https://huggingface.co/hfl/cino-small-v2 |
| CINO-large | 2.2GB | https://huggingface.co/hfl/cino-large |

How-to: click the model link that you wish to download (e.g., https://huggingface.co/hfl/cino-large) → Select "Files and versions" tab → Download!
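
The files can also be fetched programmatically. A minimal sketch, assuming the huggingface_hub package is installed (this package is an addition here, not part of the instructions above):

```python
from huggingface_hub import snapshot_download

# Download every file of the model repository into the local Hugging Face cache
# and return the path of the downloaded snapshot.
local_path = snapshot_download(repo_id="hfl/cino-base-v2")
print(local_path)
```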

How-To-Use

There are three files in the PyTorch model:

```
pytorch_model.bin        # Model Weight
config.json              # Model Config
sentencepiece.bpe.model  # Vocabulary
```

CINO uses exactly the same neural architecture as XLM-R, so it can be directly loaded with the XLMRobertaModel class in Transformers.

```python
from transformers import XLMRobertaTokenizer, XLMRobertaModel
tokenizer = XLMRobertaTokenizer.from_pretrained("PATH_TO_MODEL_DIR")
model = XLMRobertaModel.from_pretrained("PATH_TO_MODEL_DIR")
```
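
As a quick sanity check, the loaded model can encode text in any of the supported languages. A minimal sketch continuing from the snippet above; the sample sentence is only illustrative:

```python
import torch

# Encode a sample Chinese sentence ("哈尔滨工业大学" = "Harbin Institute of Technology");
# any supported language works the same way.
inputs = tokenizer("哈尔滨工业大学", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape [batch_size, sequence_length, hidden_size]
print(outputs.last_hidden_state.shape)
```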

Quick Load

With 🤗Transformers, the models above can be easily accessed and loaded with the following code.

```python
from transformers import XLMRobertaTokenizer, XLMRobertaModel
tokenizer = XLMRobertaTokenizer.from_pretrained("MODEL_NAME")
model = XLMRobertaModel.from_pretrained("MODEL_NAME")
```

The actual models and their corresponding MODEL_NAME values are listed below.

| Actual Model | MODEL_NAME |
| :----------- | :--------- |
| CINO-large-v2 | hfl/cino-large-v2 |
| CINO-base-v2 | hfl/cino-base-v2 |
| CINO-small-v2 | hfl/cino-small-v2 |
| CINO-large | hfl/cino-large |

Dataset for Chinese Minority Languages

Wiki-Chinese-Minority(WCM)

We built a new classification dataset, Wiki-Chinese-Minority (WCM). The dataset covers Mongolian, Tibetan, Uyghur, Cantonese, Korean, Kazakh, and Chinese, with ten categories: art, geography, history, nature, natural science, people, technology, education, economy, and health.

We use weighted-F1 for evaluation.
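
For reference, weighted-F1 can be computed with scikit-learn; a minimal sketch with made-up labels (not taken from WCM):

```python
from sklearn.metrics import f1_score

# Toy gold labels and predictions over the ten WCM categories, encoded as integers.
y_true = [0, 2, 2, 5, 9, 1]
y_pred = [0, 2, 3, 5, 9, 1]

# "weighted" averages per-class F1 scores, weighted by the support of each class.
print(f1_score(y_true, y_pred, average="weighted"))
```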

| Name | Google Drive | Baidu Disk |
| :--- | :----------- | :--------- |
| Wiki-Chinese-Minority-v2 (WCM-v2) | Google Drive | - |
| Wiki-Chinese-Minority (WCM) | Google Drive | - |

WCM-v2 has a more balanced data distribution across categories and languages.

Dataset Statistics of WCM-v2:

| Category | mn | bo | ug | yue | ko | kk | zh-Train | zh-Dev | zh-Test |
| :------- | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| Art | 135 | 141 | 3 | 387 | 806 | 348 | 2657 | 331 | 335 |
| Geography | 76 | 339 | 256 | 1550 | 1197 | 572 | 12854 | 1589 | 1644 |
| History | 66 | 111 | 0 | 499 | 776 | 491 | 1771 | 227 | 248 |
| Nature | 7 | 0 | 7 | 606 | 442 | 361 | 1105 | 134 | 110 |
| Natural Science | 779 | 133 | 20 | 336 | 532 | 880 | 2314 | 317 | 287 |
| People | 1402 | 111 | 0 | 1230 | 684 | 169 | 7706 | 953 | 924 |
| Technology | 191 | 163 | 8 | 329 | 808 | 515 | 1184 | 134 | 152 |
| Education | 6 | 1 | 0 | 289 | 439 | 1392 | 936 | 130 | 118 |
| Economy | 205 | 0 | 0 | 445 | 575 | 637 | 922 | 113 | 109 |
| Health | 106 | 111 | 6 | 272 | 299 | 893 | 551 | 67 | 73 |
| Total | 2973 | 1110 | 300 | 5943 | 6558 | 6258 | 32000 | 3995 | 4000 |

Note:

  • The dataset includes two folders: zh and minority
  • zh: train/dev/test in Chinese
  • minority: test set for all languages

The dataset is still in its alpha stage, with possible modifications in the future.

Results

We evaluate on YNAT, TNCC, and Wiki-Chinese-Minority. For each dataset, we use the same hyper-parameters for all models.

Korean Text Classification (YNAT)

| #Train | #Dev | #Test | #Classes | Metric |
| -----: | ---: | ----: | -------: | :----- |
| 45,678 | 9,107 | 9,107 | 7 | macro-F1 |

Hyper-parameters: initial LR 1e-5, batch size 16.

Results:

| Model | Dev |
| :---- | --: |
| XLM-R-large[1] | 87.3 |
| XLM-R-large[2] | 86.3 |
| CINO-small-v2 | 84.1 |
| CINO-base-v2 | 85.5 |
| CINO-large-v2 | 87.2 |
| CINO-large | 87.4 |

[1] Results reported in the original paper.
[2] Reproduced result using the same initial LR as CINO-large.

Tibetan News Classification Corpus (TNCC)

| #Train[1] | #Dev | #Test | #Classes | Metric |
| --------: | ---: | ----: | -------: | :----- |
| 7,363 | 920 | 920 | 12 | macro-F1 |

Hyper-parameters: initial LR 5e-6, batch size 16.

Results:

| Model | Dev | Test |
| :---- | --: | ---: |
| TextCNN | 65.1 | 63.4 |
| XLM-R-large | 14.3 | 13.3 |
| CINO-small-v2 | 72.1 | 66.7 |
| CINO-base-v2 | 70.3 | 68.4 |
| CINO-large-v2 | 72.9 | 71.0 |
| CINO-large | 71.3 | 68.6 |

Note: there is no official train/dev/test split in this dataset, so we split it with a ratio of 8:1:1. Our splits are available at data/TNCC. The "with_space_separated" version preserves the spaces provided by the original author; in our paper, we use the "without_space_separated" version, where the separating spaces have been removed.

Wiki-Chinese-Minority

We train the model on the Chinese training set and test it on the other languages (zero-shot). We use weighted-F1 for evaluation.

Hyper-parameters: initial LR 7e-6, batch size 32.
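
A minimal fine-tuning sketch of this zero-shot setup with 🤗Transformers. It assumes the WCM files have already been loaded into 🤗Datasets objects (zh_train, zh_dev, minority_test) with text and label columns; those names, the number of epochs, and the output directory are illustrative assumptions, not released code:

```python
from transformers import (XLMRobertaTokenizer, XLMRobertaForSequenceClassification,
                          TrainingArguments, Trainer)
from sklearn.metrics import f1_score

tokenizer = XLMRobertaTokenizer.from_pretrained("hfl/cino-base-v2")
model = XLMRobertaForSequenceClassification.from_pretrained(
    "hfl/cino-base-v2", num_labels=10)  # ten WCM categories

def tokenize(batch):
    # "text" is an assumed column name in the loaded datasets
    return tokenizer(batch["text"], truncation=True, max_length=512)

# zh_train / zh_dev / minority_test are assumed to be datasets.Dataset objects
# built from the zh and minority folders of WCM.
zh_train = zh_train.map(tokenize, batched=True)
zh_dev = zh_dev.map(tokenize, batched=True)
minority_test = minority_test.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"weighted_f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="cino-wcm",           # illustrative output path
    learning_rate=7e-6,              # hyper-parameters listed above
    per_device_train_batch_size=32,
    num_train_epochs=3,              # assumed; not specified above
)

trainer = Trainer(model=model, args=args,
                  train_dataset=zh_train, eval_dataset=zh_dev,
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()                                      # fine-tune on the Chinese training set only
print(trainer.evaluate(eval_dataset=minority_test))  # zero-shot test on a minority-language set
```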

Results on WCM-v2:

| Model | MN | BO | UG | YUE | KO | KK | ZH | Average |
| :---- | --: | --: | --: | --: | --: | --: | --: | --: |
| XLM-R-base | 41.2 | 25.7 | 84.5 | 66.1 | 43.1 | 23.0 | 88.3 | 53.1 |
| XLM-R-large | 53.8 | 24.5 | 89.4 | 67.3 | 45.4 | 30.0 | 88.3 | 57.0 |
| CINO-small-v2 | 60.3 | 47.9 | 86.5 | 64.6 | 43.2 | 33.2 | 87.9 | 60.5 |
| CINO-base-v2 | 62.1 | 52.7 | 87.8 | 68.1 | 45.6 | 38.3 | 89.0 | 63.4 |
| CINO-large-v2 | 73.1 | 58.9 | 90.1 | 66.9 | 45.1 | 42.0 | 88.9 | 66.4 |

Demo Code

See the examples directory.

Citation

If you find the technical report or resources useful, please cite our work in your paper.

@inproceedings{yang-etal-2022-cino,
    title = "{CINO}: A {C}hinese Minority Pre-trained Language Model",
    author = "Yang, Ziqing  and
      Xu, Zihang  and
      Cui, Yiming  and
      Wang, Baoxin  and
      Lin, Min  and
      Wu, Dayong  and
      Chen, Zhigang",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.346",
    pages = "3937--3949"
}

Follow Us

Follow our official WeChat account to keep updated with our latest technologies!

qrcode.jpg

Issues

If you have any questions, please submit an issue.