This is a repository of Japanese BERT models trained on Aozora Bunko and Wikipedia.
- We provide models trained on Aozora Bunko. We used works written both in contemporary Japanese kana spelling and in classical Japanese kana spelling.
- Models trained on Aozora Bunko and Wikipedia are also available.
- We trained models by applying different pre-tokenization methods (MeCab with UniDic and SudachiPy).
- All models are trained with the same configuration as bert-japanese, except for tokenization (bert-japanese uses a SentencePiece unigram language model without pre-tokenization).
- We provide models at both 1.4M and 2M training steps.
If you want to use the models with 🤗 Transformers, see Converting Tensorflow Checkpoints in the Transformers documentation.
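For example, once 🤗 Transformers (and TensorFlow, which the conversion step needs) is installed, a conversion along the following lines should work; the archive layout and file names below are assumptions, so adjust them to the extracted checkpoint.

```python
# Rough sketch of converting a TF checkpoint to PyTorch with 🤗 Transformers.
# Paths and file names are assumptions -- adjust them to the extracted archive.
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

ckpt_dir = "BERT-base_aozora_unidic_bpe-32k_2m"  # extracted tar.xz (assumed layout)
config = BertConfig.from_json_file(f"{ckpt_dir}/bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, f"{ckpt_dir}/model.ckpt")  # needs TensorFlow
model.save_pretrained("bert-base-aozora-pytorch")  # saves weights and config only
```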
When you use the models, you have to pre-tokenize your datasets with the same morphological analyzer and dictionary that were used for pretraining.
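As a minimal sketch (not an official snippet), pre-tokenization with MeCab via mecab-python3 could look like the following; the dictionary path is a placeholder and must point at the same UniDic build used for pretraining (unidic-cwj-2.3.0 or UniDic-qkana_1603). A SudachiPy sketch appears later in this README.

```python
# Sketch of MeCab pre-tokenization with mecab-python3.
# The dictionary directory is a placeholder -- use the same UniDic as pretraining.
import MeCab

tagger = MeCab.Tagger("-Owakati -d /path/to/unidic-cwj-2.3.0")

def pretokenize(text: str) -> str:
    # Returns space-separated surface forms (wakati output).
    return tagger.parse(text).strip()

print(pretokenize("吾輩は猫である。"))
```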
For fine-tuning tasks, you may need to modify the official BERT code or the Transformers code; BERT日本語Pretrainedモデル - KUROHASHI-KAWAHARA LAB is a helpful reference.
After pre-tokenization, texts are segmented into subwords with subword-nmt (BPE). The final vocabulary size is 32k.
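The pre-tokenized text then has to be segmented with the same BPE merge operations. A sketch using the subword-nmt Python module is below; the name and location of the BPE codes file are assumptions, so check what ships with each archive.

```python
# Sketch of applying subword-nmt BPE to pre-tokenized (space-separated) text.
# "bpe-32k.codes" is an assumed file name for the model's BPE merge operations.
import codecs
from subword_nmt.apply_bpe import BPE

with codecs.open("bpe-32k.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

print(bpe.process_line("吾輩 は 猫 で ある 。"))
```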
Models trained on Aozora Bunko
Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603
- 1.4M steps:
BERT-base_aozora_unidic_bpe-32k_1.4m.tar.xz
- 2M steps:
BERT-base_aozora_unidic_bpe-32k_2m.tar.xz
Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603
- 1.4M steps:
BERT-base_aozora_sudachipy-unidic_bpe-32k_1.4m.tar.xz
- 2M steps:
BERT-base_aozora_sudachipy-unidic_bpe-32k_2m.tar.xz
Models trained on Aozora Bunko and Japanese Wikipedia (jawiki1.5m)
Pre-tokenized by MeCab with UniDic
- 1.4M steps:
BERT-base_aozora-jawiki1.5m_unidic_bpe-32k_1.4m.tar.xz
- 2M steps:
BERT-base_aozora-jawiki1.5m_unidic_bpe-32k_2m.tar.xz
Pre-tokenized by SudachiPy and MeCab with UniDic
- 1.4M steps:
BERT-base_aozora-jawiki1.5m_sudachipy-unidic_bpe-32k_1.4m.tar.xz
- 2M steps:
BERT-base_aozora-jawiki1.5m_sudachipy-unidic_bpe-32k_2m.tar.xz
Models trained on Aozora Bunko and Japanese Wikipedia (jawiki3m)
Pre-tokenized by MeCab with UniDic
- 1.4M steps:
BERT-base_aozora-jawiki3m_unidic_bpe-32k_1.4m.tar.xz
- 2M steps:
BERT-base_aozora-jawiki3m_unidic_bpe-32k_2m.tar.xz
Pre-tokenized by SudachiPy and MeCab with UniDic
- 1.4M steps:
BERT-base_aozora-jawiki3m_sudachipy-unidic_bpe-32k_1.4m.tar.xz
- 2M steps:
BERT-base_aozora-jawiki3m_sudachipy-unidic_bpe-32k_2m.tar.xz
- Aozora Bunko: Git repository as of 2019-04-21
  (`git clone https://github.com/aozorabunko/aozorabunko` and `git checkout 1e3295f447ff9b82f60f4133636a73cf8998aeee`).
  - We removed text files whose 作品著作権フラグ (work copyright flag) is あり ("copyrighted") in index_pages/list_person_all_extended_utf8.zip; see the sketch after this list.
- Wikipedia (Japanese): XML dump as of 2018-12-20
- You can get the archive from the download page of bert-japanese.
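The copyright filtering mentioned for Aozora Bunko above could be reproduced roughly as follows (a sketch, not the script we used); the CSV file name inside the zip is an assumption.

```python
# Sketch: list works flagged 作品著作権フラグ = あり (copyrighted) in the
# Aozora Bunko index, so their text files can be excluded.
import csv, io, zipfile

ZIP = "aozorabunko/index_pages/list_person_all_extended_utf8.zip"

with zipfile.ZipFile(ZIP) as zf:
    name = zf.namelist()[0]  # e.g. list_person_all_extended_utf8.csv (assumed)
    with zf.open(name) as f:
        reader = csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig"))
        copyrighted = [row for row in reader if row["作品著作権フラグ"] == "あり"]

print(f"{len(copyrighted)} entries are flagged as copyrighted")
```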
For each document, we identify the kana spelling method and then pre-tokenize it with the morphological analyzer and the dictionary associated with that spelling: unidic-cwj or SudachiDict-core for contemporary kana spelling, and unidic-qkana for classical kana spelling.
In SudachiPy, we use split mode A (`sudachipy -m A -a file`) because it is equivalent to short unit words (SUW) in UniDic, and unidic-cwj and unidic-qkana have only a SUW mode.
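A minimal sketch of SudachiPy pre-tokenization in split mode A is shown below (assuming sudachipy and a SudachiDict core dictionary are installed; the installed dictionary version may differ from the SudachiDict_core-20191224 we used).

```python
# Sketch of SudachiPy pre-tokenization in split mode A (short unit words).
# Assumes sudachipy and sudachidict_core are installed; ideally the dictionary
# version matches the one used for pretraining (SudachiDict_core-20191224).
from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.A

def pretokenize(text: str) -> str:
    return " ".join(m.surface() for m in tokenizer_obj.tokenize(text, mode))

print(pretokenize("吾輩は猫である。"))
```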
After pre-tokenization, we concatenate the texts of Aozora Bunko and the randomly sampled Wikipedia text (or use Aozora Bunko alone), and build the vocabulary with subword-nmt.
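Learning the BPE merges over the concatenated, pre-tokenized corpus could look like the sketch below; the file names are placeholders, and the exact relation between the number of merge operations and the final 32k vocabulary is an assumption.

```python
# Sketch: learn BPE merge operations with subword-nmt over the pre-tokenized corpus.
# File names are placeholders, not the actual paths used for these models.
import codecs
from subword_nmt.learn_bpe import learn_bpe

with codecs.open("pretokenized_corpus.txt", encoding="utf-8") as infile, \
     codecs.open("bpe-32k.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)
```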
For Wikipedia, we assume that contemporary kana spelling is used. For Aozora Bunko, index_pages/list_person_all_extended_utf8.zip has a 文字遣い種別 (character usage) column that records both the kanji form (旧字 "old form" or 新字 "new form") and the kana spelling (旧仮名 "classical" or 新仮名 "contemporary"); we use only the kana spelling information.
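As a sketch of how this information drives the dictionary choice (the example values below are assumed combinations of the kanji-form and kana-spelling labels; only the 旧仮名/新仮名 part is checked):

```python
# Sketch: choose the pre-tokenization dictionary from a 文字遣い種別 value.
# Only the kana-spelling part (旧仮名 / 新仮名) matters; the kanji form is ignored.
def dictionary_for(moji_zukai: str) -> str:
    if "旧仮名" in moji_zukai:                  # classical kana spelling
        return "unidic-qkana"
    return "unidic-cwj or SudachiDict-core"     # contemporary kana spelling

print(dictionary_for("新字旧仮名"))  # -> unidic-qkana
print(dictionary_for("新字新仮名"))  # -> unidic-cwj or SudachiDict-core
```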