Improving Biomedical Pretrained Language Models with Knowledge. Accepted at BioNLP 2021.
KeBioLM: Knowledge-enhanced Biomedical pretrained Language Model
KeBioLM first applies a text-only encoding layer to learn entity representations, then applies a text-entity fusion encoding layer to aggregate them. KeBioLM is pretrained with three tasks:
- Masked Language Model: extends whole-word masking to whole-entity masking.
- Entity Detection: predicts B/I/O tags for NER.
- Entity Linking: links predicted entities to UMLS.
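The whole-entity masking idea can be sketched as follows. This is an illustrative simplification, not the repo's pretraining code; `whole_entity_mask` and its span format are assumptions:

```python
import random

def whole_entity_mask(tokens, entity_spans, mask_prob=0.15, mask_token="[MASK]"):
    """Mask every token of a sampled entity, rather than independent subwords."""
    tokens = list(tokens)
    for start, end in entity_spans:      # spans are [start, end) token indices
        if random.random() < mask_prob:
            for i in range(start, end):  # mask the whole entity at once
                tokens[i] = mask_token
    return tokens
```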
You can download our model from Google Drive.
Our model contains the pre-trained weights pytorch_model.bin, a tokenizer vocab.txt (same as PubMedBERT), and an entity dictionary entity.jsonl used by the entity linking task in the pretraining phase.
All code is tested under Python 3.7, PyTorch 1.7.0, and Transformers 3.4.0.
Download the BLURB dataset from here.
For example, to fine-tune on the BC5CDR-disease dataset:
cd ner
CUDA_VISIBLE_DEVICES=0 python \
run_ner.py \
--data_dir $BC5CDR_DATASET \
--model_name_or_path $KEBIOLM_CHECKPOINT_PATH \
--output_dir $OUTPUT_DIR \
--num_train_epochs 60 \
--do_train --do_eval --do_predict --overwrite_output_dir \
--gradient_accumulation_steps 2 \
--learning_rate 3e-5 \
--warmup_steps 1710 \
--evaluation_strategy epoch \
--max_seq_length 512 \
--per_device_train_batch_size 8 \
--eval_accumulation_steps 1 \
--load_best_model_at_end --metric_for_best_model f1
To fine-tune on your own task, prepare train.tsv, dev.tsv, and test.tsv in the same folder.
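For reference, here is a minimal script that writes such a file in the two-column, tab-separated layout we assume run_ner.py expects (one token and its tag per line, with a blank line between sentences; the exact layout is an assumption):

```python
# Assumed CoNLL-style layout: token<TAB>tag per line, blank line between sentences.
rows = [("Aspirin", "O"), ("induced", "B-disease"), ("asthma", "I-disease"), None]
with open("train.tsv", "w") as f:
    for row in rows:
        f.write(f"{row[0]}\t{row[1]}\n" if row else "\n")
```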
If your task contains tags beyond B, I, and O (e.g. B-disease, I-disease), also provide a label file that lists one label per line:
O
B-disease
I-disease
B-symptom
I-symptom
Pass this label file with --labels $label_file.
To fine-tune on the DDI dataset:
cd re
CUDA_VISIBLE_DEVICES=0 python \
run.py \
--task_name ddi \
--data_dir $DDI_DATASET \
--model_name_or_path $KEBIOLM_CHECKPOINT_PATH \
--output_dir $OUTPUT_DIR \
--num_train_epochs 60 \
--do_train --do_eval --do_predict --overwrite_output_dir \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--warmup_steps 9486 \
--evaluation_strategy epoch \
--max_seq_length 256 \
--per_device_train_batch_size 16 \
--eval_accumulation_steps 1 \
--load_best_model_at_end --metric_for_best_model f1
Since the DDI, ChemProt, and GAD datasets have different formats and labels, you must specify the task with --task_name ddi/chemprot/gad.
To reproduce the results of our paper, train models with the following hyperparameters. All models are trained for 60 epochs with a linear warmup over the first 10% of steps.
Dataset | Learning rate | Sequence Length | Batch size | Gradient accumulation |
---|---|---|---|---|
BC5chem | 3e-5 | 512 | 8 | 2 |
BC5dis | 1e-5 | 512 | 8 | 2 |
NCBI | 1e-5 | 512 | 8 | 2 |
BC2GM | 3e-5 | 512 | 8 | 2 |
JNLPBA | 1e-5 | 512 | 8 | 2 |
ChemProt | 1e-5 | 256 | 16 | 1 |
DDI | 1e-5 | 256 | 16 | 1 |
GAD | 1e-5 | 128 | 16 | 1 |
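The warmup_steps values in the commands above follow from this 10%-of-steps rule. A quick sketch of the arithmetic (the 4,560-sentence training-set size for BC5CDR-disease is an assumption):

```python
import math

def warmup_steps(n_train, epochs, per_device_bs, grad_accum, warmup_ratio=0.1):
    """Warmup steps as a fraction of total optimizer steps."""
    steps_per_epoch = math.ceil(n_train / (per_device_bs * grad_accum))
    return int(epochs * steps_per_epoch * warmup_ratio)

# BC5CDR-disease: 60 epochs, batch size 8, gradient accumulation 2
print(warmup_steps(4560, 60, 8, 2))  # → 1710, matching the command above
```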
The BC2GM, GAD, and DDI datasets have relatively high variance; you can try different seeds by setting --seed $seed_number.
For a relation triplet (s, r, o) in UMLS, we generate two queries: [CLS] [MASK] r o [SEP] and [CLS] s r [MASK] [SEP]. We ask language models to restore the masked entities. We collect 143,771 queries covering 922 relation types.
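Constructing the two queries from a triplet can be sketched as follows (`build_queries` is an illustrative name, not a function from the repo):

```python
def build_queries(s, r, o):
    """Build the two probing queries for a UMLS triplet (s, r, o)."""
    head_query = f"[CLS] [MASK] {r} {o} [SEP]"  # restore the subject entity
    tail_query = f"[CLS] {s} {r} [MASK] [SEP]"  # restore the object entity
    return head_query, tail_query
```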
To rebuild our dataset for UMLS knowledge probing, you need the UMLS 2020AA release. After installing UMLS 2020AA, you should have a folder $UMLS_DIR containing MRCONSO.RRF, MRREL.RRF, and MRSTY.RRF.
Use probe/build.py to rebuild the probing dataset dataset.txt based on UMLS LUIs and relations.
cd probe
python build.py $UMLS_DIR
The rebuilding process takes about 10 minutes.
The default probing settings are max_length_of_[MASK] = 10 and beam_width = 5.
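As a rough sketch of what beam decoding over the [MASK] slots looks like (a generic toy version; the actual probe scripts score candidates with the language model's logits):

```python
def beam_decode(next_token_scores, num_slots, beam_width=5):
    """Keep the beam_width best partial sequences at each [MASK] slot.

    next_token_scores(prefix) -> {token: log_prob} is a stand-in for the
    language model's per-slot predictions.
    """
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    for _ in range(num_slots):
        candidates = [
            (tokens + [tok], score + logp)
            for tokens, score in beams
            for tok, logp in next_token_scores(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)  # best first
        beams = candidates[:beam_width]
    return beams
```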
To probe dataset.txt with SciBERT (or another BERT-based language model):
cd probe
python beam_batch_decode.py $SCIBERT_PATH dataset.txt
To probe dataset.txt with KeBioLM:
cd probe
python beam_batch_decode.py $KEBIOLM_CHECKPOINT_PATH dataset.txt
Probing with beam_width = 5 takes a long time (over one day on a V100); you may split dataset.txt to decode on multiple GPUs.
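One simple way to shard the file for multiple GPUs (a hypothetical helper, not part of the repo):

```python
def split_file(path, n_shards):
    """Write round-robin shards path.shard0 .. path.shard{n-1}."""
    with open(path) as f:
        lines = f.readlines()
    for i in range(n_shards):
        with open(f"{path}.shard{i}", "w") as out:
            out.writelines(lines[i::n_shards])
```

Each shard can then be decoded on its own GPU with beam_batch_decode.py, and the prediction files concatenated before evaluation.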
To evaluate the probing results, use:
cd probe
python metric.py $predict_file dataset.txt $UMLS_DIR
@inproceedings{yuan-etal-2021-improving,
title = "Improving Biomedical Pretrained Language Models with Knowledge",
author = "Yuan, Zheng and
Liu, Yijia and
Tan, Chuanqi and
Huang, Songfang and
Huang, Fei",
booktitle = "Proceedings of the 20th Workshop on Biomedical Language Processing",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.bionlp-1.20",
doi = "10.18653/v1/2021.bionlp-1.20",
pages = "180--190",
abstract = "Pretrained language models have shown success in many natural language processing tasks. Many works explore to incorporate the knowledge into the language models. In the biomedical domain, experts have taken decades of effort on building large-scale knowledge bases. For example, UMLS contains millions of entities with their synonyms and defines hundreds of relations among entities. Leveraging this knowledge can benefit a variety of downstream tasks such as named entity recognition and relation extraction. To this end, we propose KeBioLM, a biomedical pretrained language model that explicitly leverages knowledge from the UMLS knowledge bases. Specifically, we extract entities from PubMed abstracts and link them to UMLS. We then train a knowledge-aware language model that firstly applies a text-only encoding layer to learn entity representation and then applies a text-entity fusion encoding to aggregate entity representation. In addition, we add two training objectives as entity detection and entity linking. Experiments on the named entity recognition and relation extraction tasks from the BLURB benchmark demonstrate the effectiveness of our approach. Further analysis on a collected probing dataset shows that our model has better ability to model medical knowledge.",
}