They are designed to be adapted to the Japanese medical domain.
The medical corpora were scraped, for academic use, from Today's diagnosis and treatment: premium, a set of 15 digital references for clinicians in Japanese published by IGAKU-SHOIN Ltd.
The general corpora were extracted from a Wikipedia dump file (jawiki-20190901) on https://dumps.wikimedia.org/jawiki/.
- medBERTjp - MeCab-IPAdic
  - pre-trained model following the MeCab-IPAdic-tokenized Japanese BERT model
  - Japanese tokenizer: MeCab + Byte Pair Encoding (BPE)
  - ipadic-py or a manual install of IPAdic is required
  - max_seq_length=128
- medBERTjp - Unidic-2.3.0
- medBERTjp - MeCab-IPAdic-NEologd-JMeDic
  - Japanese tokenizer: MeCab + BPE
  - installs of both mecab-ipadic-NEologd and J-MeDic (MANBYO_201907_Dic-utf8.dic) are required
  - max_seq_length=128
- medBERTjp - SentencePiece (Old: v0.1-sp)
  - Japanese tokenizer: SentencePiece, following the SentencePiece Japanese BERT model
  - uses tokenization customized for the medical domain by SentencePiece
  - max_seq_length=128
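The MeCab + BPE variants above first segment text into morphemes with MeCab, then split each morpheme into subwords using learned BPE merge operations. A minimal sketch of the BPE step, using a toy merge table for illustration rather than the models' actual vocabularies:

```python
def bpe_split(morpheme, ranks):
    """Split one MeCab morpheme into subwords by repeatedly applying
    the highest-priority (lowest-rank) learned merge, as in BPE."""
    symbols = list(morpheme)
    while len(symbols) > 1:
        # rank every adjacent symbol pair that has a learned merge
        candidates = [
            (ranks[(symbols[i], symbols[i + 1])], i)
            for i in range(len(symbols) - 1)
            if (symbols[i], symbols[i + 1]) in ranks
        ]
        if not candidates:
            break  # no applicable merge left
        _, i = min(candidates)  # best-ranked merge, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table (illustrative only): rank 0 is applied first.
toy_ranks = {("糖", "尿"): 0, ("糖尿", "病"): 1}
print(bpe_split("糖尿病", toy_ranks))  # → ['糖尿病']
print(bpe_split("腎症", toy_ranks))    # → ['腎', '症']
```

In the released models this subword step runs on the output of the MeCab dictionary named in each variant; the merge table and vocabulary are learned from the pre-training corpora.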
For just using the models, the following are required:
- Transformers (>=2.11.0)
- fugashi, a Cython wrapper for MeCab
- ipadic, unidic-py, mecab-ipadic-NEologd, and J-MeDic, as required by the chosen model
- SentencePiece, which is installed automatically with Transformers
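A possible environment setup covering the pip-installable dependencies above (the mecab-ipadic-NEologd and J-MeDic dictionaries still require the manual installs noted for their model variants; the `unidic` package name for unidic-py is an assumption based on PyPI):

```shell
# Transformers (which also pulls in SentencePiece) and the MeCab wrapper
pip install "transformers>=2.11.0" fugashi

# MeCab dictionaries for the dictionary-based variants, as required
pip install ipadic unidic
```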
Please check the code examples in tokenization_example.ipynb, or try example_google_colab.ipynb on Google Colab.
This work was supported by the Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (SIP), "Innovative AI Hospital System" (Funding Agency: National Institute of Biomedical Innovation, Health and Nutrition (NIBIOHN)).
The pretrained models are distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
They are freely available for academic purposes or individual research, but restricted for commercial use.
The code in this repository is licensed under the Apache License, Version 2.0.