They are designed to be adapted to the Japanese medical domain.
The medical corpora were scraped, for academic use, from Today's diagnosis and treatment: premium, a set of 15 digital references for clinicians in Japanese published by IGAKU-SHOIN Ltd.
The general corpora were extracted from a Wikipedia dump file (jawiki-20190901) on https://dumps.wikimedia.org/jawiki/.
- medBERTjp - MeCab-IPAdic
  - pre-trained model following the MeCab-IPAdic-tokenized Japanese BERT model
  - Japanese tokenizer: MeCab + Byte Pair Encoding (BPE)
  - ipadic-py or a manual install of IPAdic is required
  - max_seq_length=128
- medBERTjp - Unidic-2.3.0
- medBERTjp - MeCab-IPAdic-NEologd-JMeDic
  - Japanese tokenizer: MeCab + BPE
  - installs of both mecab-ipadic-NEologd and J-MeDic (MANBYO_201907_Dic-utf8.dic) are required
  - max_seq_length=128
- medBERTjp - SentencePiece (Old: v0.1-sp)
  - Japanese tokenizer: SentencePiece, following the SentencePiece Japanese BERT model
  - uses tokenization customized for the medical domain by SentencePiece
  - max_seq_length=128
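The MeCab + BPE variants above first segment text into morphemes with MeCab, then split each morpheme into subwords using learned BPE merge operations. A minimal sketch of the BPE step, using a toy merge table for illustration rather than the models' actual vocabularies:

```python
def bpe_split(morpheme, ranks):
    """Split one MeCab morpheme into subwords by repeatedly applying
    the highest-priority (lowest-rank) learned merge, as in BPE."""
    symbols = list(morpheme)
    while len(symbols) > 1:
        # rank every adjacent symbol pair that has a learned merge
        candidates = [
            (ranks[(symbols[i], symbols[i + 1])], i)
            for i in range(len(symbols) - 1)
            if (symbols[i], symbols[i + 1]) in ranks
        ]
        if not candidates:
            break  # no applicable merge left
        _, i = min(candidates)  # best-ranked merge, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table (illustrative only): rank 0 is applied first.
toy_ranks = {("糖", "尿"): 0, ("糖尿", "病"): 1}
print(bpe_split("糖尿病", toy_ranks))  # → ['糖尿病']
print(bpe_split("腎症", toy_ranks))    # → ['腎', '症']
```

In the released models this subword step runs on the output of the MeCab dictionary named in each variant; the merge table and vocabulary are learned from the pre-training corpora.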
For just using the models, the following are required:
- Transformers (>=2.11.0)
- fugashi, a Cython wrapper for MeCab
- ipadic, unidic-py, mecab-ipadic-NEologd, and J-MeDic, as required by the chosen model
- SentencePiece, which is installed automatically with Transformers
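A possible environment setup covering the pip-installable dependencies above (the mecab-ipadic-NEologd and J-MeDic dictionaries still require the manual installs noted for their model variants; the `unidic` package name for unidic-py is an assumption based on PyPI):

```shell
# Transformers (which also pulls in SentencePiece) and the MeCab wrapper
pip install "transformers>=2.11.0" fugashi

# MeCab dictionaries for the dictionary-based variants, as required
pip install ipadic unidic
```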
Please check the code examples in tokenization_example.ipynb, or try example_google_colab.ipynb on Google Colab.
This work was supported by the Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (SIP), "Innovative AI Hospital System" (Funding Agency: National Institute of Biomedical Innovation, Health and Nutrition (NIBIOHN)).
The pretrained models are distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
They are freely available for academic purposes or individual research, but restricted for commercial use.
The code in this repository is licensed under the Apache License, Version 2.0.