Our Japanese BERT model was pre-trained on our own web corpus, building on the original BERT and this Japanese BERT. So far, both a base model (12-layer, 768-hidden, 12-heads, 110M parameters) and a large model (24-layer, 1024-hidden, 16-heads, 340M parameters), pre-trained on the same web corpus, have been released.
Download
base model with unigram tokenizer
large model with unigram tokenizer
base model with BPE tokenizer
large model with BPE tokenizer
The models have been evaluated on two tasks: the Livedoor news classification task and the driving-domain question answering (DDQA) task. In Livedoor news classification, each news article is classified into one of nine categories. In the DDQA task, given question-article pairs, the answers to the questions must be found in the articles. The evaluation results are shown below, in comparison with a baseline model pre-trained on a Japanese Wikipedia corpus and released by this Japanese BERT repository. Note that the results are averages over multiple measurements. Because the evaluation datasets are small, the results may vary slightly from run to run.
For the Livedoor news classification task:
model size | corpus | corpus size | eval environment | batch size | epoch | learning rate | measurement times | mean accuracy (%) | standard deviation |
---|---|---|---|---|---|---|---|---|---|
Base | JA-Wikipedia | 2.9G | GPU | 4 | 10 | 2e-5 | 5 | 97.23 | 2.38e-1 |
Base | Web Corpus | 12G | GPU | 4 | 10 | 2e-5 | 5 | 97.72 | 2.27e-1 |
Large | Web Corpus | 12G | TPU | 32 | 7 | 2e-5 | 30 | 98.07 | 2.45e-3 |
For the driving-domain QA task:
model size | corpus | corpus size | eval environment | batch size | epoch | learning rate | measurement times | mean EM (%) | standard deviation |
---|---|---|---|---|---|---|---|---|---|
Base | JA-Wikipedia | 2.9G | TPU | 32 | 3 | 5e-5 | 100 | 76.3 | 5.16e-3 |
Base | Web Corpus | 12G | TPU | 32 | 3 | 5e-5 | 100 | 75.5 | 5.06e-3 |
Large | Web Corpus | 12G | TPU | 32 | 3 | 5e-5 | 30 | 77.3 | 4.96e-3 |
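The means and standard deviations in the tables above summarize repeated fine-tuning runs. The following is a minimal sketch of such a summary, assuming the sample standard deviation is wanted (the repository does not state which estimator was used). The function name `summarize_runs` and the scores are hypothetical and are not measured results.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) for a list of per-run scores."""
    mean = statistics.mean(scores)
    # stdev() needs at least two values; report 0.0 for a single run.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Hypothetical accuracies (%) from five fine-tuning runs; NOT real results.
example_scores = [97.0, 97.5, 97.2, 97.8, 97.5]
mean, std = summarize_runs(example_scores)
print(f"mean = {mean:.2f}, std = {std:.2e}")
```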
We haven't published any paper on this work. Please cite this repository:
@misc{LaboroBERTJapanese,
title = {Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus},
author = {Zhao, Xinyi and Hamamoto, Masafumi and Fujihara, Hiromasa},
year = {2020},
howpublished = {\url{https://github.com/laboroai/Laboro-BERT-Japanese}}
}
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
For commercial use, please contact Laboro.AI Inc.
Text classification means assigning labels to text. Because the labels can be defined to describe any aspect of the text, text classification has a wide range of applications. The most straightforward ones are categorizing the topic or sentiment of a text. Other examples include recognizing spam email and judging whether two sentences have the same or a similar meaning.
When evaluating English BERT models on classification tasks, several datasets (e.g. SST-2, MRPC) serve as common benchmarks. For Japanese BERT models, the Livedoor news corpus can be used in the same fashion. Each news article in this corpus belongs to one of nine categories.
The original corpus is not divided into training, evaluation, and testing data. The dataset provided in this repository was pre-processed from the Livedoor News Corpus in the following steps:
- concatenating all of the data
- shuffling randomly
- dividing into train:dev:test = 6:2:2
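The three pre-processing steps above can be sketched as follows. This is an illustration only, not the script used to build the released dataset: the function name `split_dataset`, the fixed seed, and the placeholder examples are all assumptions.

```python
import random

def split_dataset(examples, ratios=(6, 2, 2), seed=0):
    """Shuffle examples and split them into train/dev/test by the given ratios."""
    shuffled = list(examples)              # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    total = sum(ratios)
    n = len(shuffled)
    train_end = n * ratios[0] // total
    dev_end = train_end + n * ratios[1] // total
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]

# Hypothetical (label, text) pairs standing in for the concatenated corpus.
examples = [(f"category_{i % 9}", f"article {i}") for i in range(100)]
train, dev, test = split_dataset(examples)
print(len(train), len(dev), len(test))
```

Any remainder from the integer division ends up in the test split, so no example is dropped.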
- Python 3.6.9
- tensorflow==1.13.0
- sentencepiece==0.1.85
- GPU is recommended
Before running the code, make sure
- the livedoor dataset is in the data folder
- the pre-trained BERT model is in the model folder, including model.ckpt.data, model.ckpt.meta, model.ckpt.index, bert_config.json
- the sentencepiece model is also in the model folder, including webcorpus.model, webcorpus.vocab
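The checklist above can be verified with a short script once the repository has been cloned. The sketch below takes the file names from the list; the `../data` and `../model` locations (relative to `src`) and the helper name `find_missing` are assumptions. The checkpoint data file is matched by prefix because TensorFlow may append a shard suffix to it.

```python
import glob
import os

# File names follow the checklist above; folder locations are assumptions.
REQUIRED_MODEL_FILES = [
    "model.ckpt.data*",   # prefix match: a shard suffix may follow
    "model.ckpt.meta",
    "model.ckpt.index",
    "bert_config.json",
    "webcorpus.model",
    "webcorpus.vocab",
]

def find_missing(model_dir, data_dir):
    """Return the required paths or patterns that could not be found."""
    missing = []
    # The data folder must exist and contain at least one entry.
    if not os.path.isdir(data_dir) or not os.listdir(data_dir):
        missing.append(data_dir)
    for pattern in REQUIRED_MODEL_FILES:
        if not glob.glob(os.path.join(model_dir, pattern)):
            missing.append(os.path.join(model_dir, pattern))
    return missing

if __name__ == "__main__":
    problems = find_missing(model_dir="../model", data_dir="../data")
    for path in problems:
        print("missing:", path)
    if not problems:
        print("all required files found")
```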
git clone https://github.com/laboroai/Laboro-BERT-Japanese.git
cd ./Laboro-BERT-Japanese/src
./run_classifier.sh
The question answering task is another way to evaluate and apply a BERT model. In English NLP, SQuAD is one of the most widely used datasets for this task. In SQuAD, questions and the corresponding Wikipedia pages are given, and the answers to the questions must be found in those pages.
For the QA task, we used the Driving Domain QA dataset for evaluation. This dataset consists of the PAS-QA dataset and the RC-QA dataset. So far, we have only evaluated our model on the RC-QA dataset. The dataset is already in SQuAD 2.0 format, so no pre-processing is needed before use.
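For readers unfamiliar with the SQuAD 2.0 layout, the sketch below walks the nested `data` / `paragraphs` / `qas` structure and yields one record per question. The helper name `iter_squad_examples` is hypothetical, and the embedded sample is an invented English miniature for illustration; it is not taken from the RC-QA dataset.

```python
import json

def iter_squad_examples(squad):
    """Yield (question, context, answer_texts, is_impossible) from SQuAD 2.0 data."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa.get("answers", [])]
                yield qa["question"], context, answers, qa.get("is_impossible", False)

# Invented miniature sample in SQuAD 2.0 layout (NOT from the RC-QA dataset).
sample = json.loads("""
{
  "version": "v2.0",
  "data": [{
    "title": "example",
    "paragraphs": [{
      "context": "The car stopped at the red light.",
      "qas": [
        {"id": "q1", "question": "Where did the car stop?",
         "answers": [{"text": "at the red light", "answer_start": 16}],
         "is_impossible": false},
        {"id": "q2", "question": "Who was driving?",
         "answers": [], "is_impossible": true}
      ]
    }]
  }]
}
""")

records = list(iter_squad_examples(sample))
for question, context, answers, impossible in records:
    print(question, answers, impossible)
```

In practice the dictionary would come from `json.load` on the dataset file rather than from an inline string.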
- Python 3.6.9
- tensorflow==1.13.0
- sentencepiece==0.1.85
- TPU is recommended (in our experiments, out-of-memory error occurs when using GPU)
- Google Cloud Storage if TPU is used
A TPU is recommended for this evaluation. Because a TPU can only read from and write to Google Cloud Storage, we recommend placing the BERT model and the output in a Cloud Storage bucket. Before running the code, make sure
- the Driving Domain QA dataset is in the data folder
- the pre-trained BERT model is in the model folder in cloud storage bucket, including model.ckpt.data, model.ckpt.meta, model.ckpt.index, bert_config.json
- the sentencepiece model is in the local model folder, including webcorpus.model, webcorpus.vocab
git clone https://github.com/laboroai/Laboro-BERT-Japanese.git
cd ./Laboro-BERT-Japanese/src
./run_squad.sh
Our Japanese BERT model is pre-trained on a web-based corpus built especially for this project. The corpus was collected with a web crawler, which crawled 2,605,280 webpages from 4,307 websites in total. The source websites range from news websites and parts of Wikipedia to personal blogs, covering both formal and informal written Japanese.
The original English BERT model was trained on a 13GB corpus consisting of English Wikipedia and BooksCorpus. The raw text in our web-based corpus amounts to 12GB, which is comparable in size.
SentencePiece is used as the tokenizer. The parameters used to train the SentencePiece model are as follows:
vocab_size = 32000
shuffle_input_sentence = True
input_sentence_size = 18000000
character_coverage = 0.9995 #default
model_type = 'unigram' #default
ctlsymbols = '[CLS],[SEP],[MASK]'
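As a rough illustration, the settings above could be passed to the SentencePiece Python trainer as a flag string. This is a sketch under assumptions, not the command actually used: `corpus.txt` is a hypothetical input path, the `webcorpus` prefix is inferred from the released file names, and `ctlsymbols` is assumed to correspond to SentencePiece's `--control_symbols` flag.

```python
def build_spm_train_args(input_path, model_prefix):
    """Build the flag string for SentencePieceTrainer from the settings above."""
    flags = {
        "input": input_path,                      # hypothetical corpus file
        "model_prefix": model_prefix,             # yields <prefix>.model / <prefix>.vocab
        "vocab_size": 32000,
        "shuffle_input_sentence": "true",
        "input_sentence_size": 18000000,
        "character_coverage": 0.9995,
        "model_type": "unigram",
        "control_symbols": "[CLS],[SEP],[MASK]",  # assumed mapping of `ctlsymbols`
    }
    return " ".join(f"--{key}={value}" for key, value in flags.items())

def train_tokenizer(input_path, model_prefix):
    """Run training; needs the sentencepiece package and a real corpus file."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependency
    spm.SentencePieceTrainer.Train(build_spm_train_args(input_path, model_prefix))

args = build_spm_train_args("corpus.txt", "webcorpus")
print(args)
```

Only the argument string is built here; calling `train_tokenizer` starts the actual training.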
The pre-training consists of two phases, in which train_batch_size and max_seq_length are changed.
Phase 1
train_batch_size = 256
max_seq_length = 128
num_train_steps = 2900000
num_warmup_steps = 10000
learning_rate = 1e-4
Phase 2
train_batch_size = 64
max_seq_length = 512
num_train_steps = 3900000
num_warmup_steps = 10000
learning_rate = 1e-4
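The two phases can be written down as plain configuration records, which makes them easy to compare. The sketch below copies the values listed above; the `Phase` tuple and the helper `max_tokens_per_step` are hypothetical names. The helper gives only an upper bound on tokens per step, assuming every sequence is filled to `max_seq_length`.

```python
from collections import namedtuple

Phase = namedtuple(
    "Phase",
    "name train_batch_size max_seq_length num_train_steps num_warmup_steps learning_rate",
)

# Values copied from the two phases listed above.
PHASES = [
    Phase("Phase 1", 256, 128, 2900000, 10000, 1e-4),
    Phase("Phase 2", 64, 512, 3900000, 10000, 1e-4),
]

def max_tokens_per_step(phase):
    """Upper bound on tokens per step: every sequence filled to max_seq_length."""
    return phase.train_batch_size * phase.max_seq_length

for phase in PHASES:
    print(phase.name, max_tokens_per_step(phase))
```

Both phases have the same per-step bound of 32,768 tokens, so Phase 2 trades batch size for sequence length. The sketch does not multiply by `num_train_steps`, because the settings do not say whether the Phase 2 step count is cumulative.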
- Cloud TPU v3-8 on Google Cloud Platform
- tensorflow==1.13.0