Authors: Zaiqiao Meng, Fangyu Liu, Thomas Hikaru Clark, Ehsan Shareghi, Nigel Collier.
Code for our paper Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT (EMNLP 2021).
[26 August 2021] - Our paper has been accepted to appear at EMNLP 2021 as a short paper.
Infusing factual knowledge into pre-trained models is fundamental for many knowledge-intensive tasks. In this paper, we propose Mixture-of-Partitions (MoP), an infusion approach that can handle a very large knowledge graph (KG) by partitioning it into smaller sub-graphs and infusing their specific knowledge into various BERT models using lightweight adapters. To leverage the overall factual knowledge for a target task, these sub-graph adapters are further fine-tuned along with the underlying BERT through a mixture layer. We evaluate MoP with three biomedical BERTs (SciBERT, BioBERT, PubMedBERT) on six downstream tasks (incl. NLI, QA, and classification). The results show that MoP consistently improves the task performance of the underlying BERTs and achieves new state-of-the-art (SOTA) results on five of the evaluated datasets.
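To give a rough picture of what the mixture layer does, below is a minimal PyTorch sketch of a temperature-scaled softmax gate over the outputs of K sub-graph adapters. The class, argument names, and shapes are illustrative assumptions for exposition, not the repository's actual implementation (which lives inside the modified adapter-transformers package).

```python
import torch
import torch.nn as nn

class AdapterMixture(nn.Module):
    """Illustrative mixture layer: combines the outputs of K sub-graph adapters
    with a learned, temperature-scaled softmax gate.
    (Hypothetical sketch -- not the repository's actual implementation.)"""

    def __init__(self, hidden_size: int, n_adapters: int, temperature: float = 1.0):
        super().__init__()
        self.gate = nn.Linear(hidden_size, n_adapters)  # one gating score per adapter
        self.temperature = temperature

    def forward(self, hidden_states: torch.Tensor, adapter_outputs: list) -> torch.Tensor:
        # adapter_outputs: list of K tensors, each of shape (batch, seq_len, hidden_size)
        stacked = torch.stack(adapter_outputs, dim=-2)                        # (batch, seq, K, hidden)
        weights = torch.softmax(self.gate(hidden_states) / self.temperature, dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=-2)                  # weighted sum over K

if __name__ == "__main__":
    mix = AdapterMixture(hidden_size=768, n_adapters=20)
    h = torch.randn(2, 16, 768)
    outs = [torch.randn(2, 16, 768) for _ in range(20)]
    print(mix(h, outs).shape)  # torch.Size([2, 16, 768])
```

The `temperature` entries in the hyperparameter tables below control how peaked this gating distribution is; the sketch assumes a simple division by the temperature.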
- `data_dir`: downstream task datasets used in the experiments.
- `kg_dir`: folder to save the knowledge graphs as well as the partitioned files.
- `model_dir`: folder to save the pre-trained models.
- `src`: source code.
  - `adapter-transformers`: adapter-transformers v1.1.1, forked from adapter-transformers and modified to support different mixture approaches.
  - `evaluate_tasks`: code for the downstream tasks.
  - `knowledge_infusion`: main code for knowledge infusion.
`kg_dir` and `model_dir` can be downloaded at this link.
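The released `kg_dir` already contains the partitioned files, but for intuition, here is a minimal sketch of how a triple KG can be split into balanced sub-graphs with METIS, the partitioner used in the paper. The networkx/`metis` tooling and the file name are assumptions for illustration, not the repository's actual preprocessing script.

```python
# Minimal sketch of KG partitioning with METIS (illustrative only; assumes
# `pip install networkx metis` and an installed METIS library).
import networkx as nx
import metis

# Load triples as (head, relation, tail); the file name is hypothetical.
triples = []
with open("kg_dir/triples.tsv") as f:
    for line in f:
        h, r, t = line.rstrip("\n").split("\t")
        triples.append((h, r, t))

# Build an undirected entity graph and split it into 20 balanced parts.
g = nx.Graph()
g.add_edges_from((h, t) for h, _, t in triples)
nodes = list(g.nodes())
_, parts = metis.part_graph(g, nparts=20)
node2part = dict(zip(nodes, parts))

# Assign each triple to the partition of its head entity.
sub_graphs = [[] for _ in range(20)]
for h, r, t in triples:
    sub_graphs[node2part[h]].append((h, r, t))
```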
The code is tested with Python 3.8.5, torch 1.7.0, and Hugging Face transformers 3.5.0. Please see requirements.txt for more details.
Our models use a modified adapter-transformers. To use this package, go to the `./src/adapter-transformers` folder of this project and run `pip install .` to install the modified `adapter-transformers` package.
- The BioASQ7b, PubMedQA, and HoC datasets can be downloaded from BLURB.
- The MedQA dataset can be downloaded from: https://github.com/jind11/MedQA
- The BioASQ8b datasets can be downloaded from: http://bioasq.org/
To train the knowledge-infusion adapters, run the following command in the `src/knowledge_infusion/entity_prediction` folder; a rough sketch of the entity-prediction objective follows the command.
Click to expand!
MODEL="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
TOKENIZER="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
INPUT_DIR="kg_dir"
OUTPUT_DIR="model_dir"
DATASET_NAME="S20Rel"
ADAPTER_NAMES="entity_predict"
PARTITION=20
python run_pretrain.py \
--model $MODEL \
--tokenizer $TOKENIZER \
--input_dir $INPUT_DIR \
--data_name $DATASET_NAME \
--output_dir $OUTPUT_DIR \
--n_partition $PARTITION \
--use_adapter \
--non_sequential \
--adapter_names $ADAPTER_NAMES \
--amp \
--cuda \
--num_workers 32 \
--max_seq_length 64 \
--batch_size 256 \
--lr 1e-04 \
--epochs 1 \
--save_step 2000
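Each sub-graph adapter (adapter name `entity_predict` above) is trained with an entity-prediction objective over its own partition: the tail entity of a triple is masked and predicted from the entities of that sub-graph. The snippet below is a minimal sketch of how such training examples could be constructed; the verbalisation template and function names are hypothetical, not the repository's actual preprocessing.

```python
# Hypothetical sketch of building masked-entity-prediction examples for one
# sub-graph; the text template and label space are assumptions.
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail), all as surface names

def build_examples(sub_graph: List[Triple]) -> Tuple[List[str], List[int], Dict[str, int]]:
    """Turn each triple into a text query whose answer is the masked tail
    entity, with the label space restricted to this sub-graph's entities."""
    entity2id = {e: i for i, e in enumerate(sorted({t for _, _, t in sub_graph}))}
    texts, labels = [], []
    for head, relation, tail in sub_graph:
        # Verbalise (head, relation) and ask the model to recover the tail.
        texts.append(f"{head} [SEP] {relation} [SEP] [MASK]")
        labels.append(entity2id[tail])
    return texts, labels, entity2id

# Toy usage with two made-up triples from one partition.
texts, labels, entity2id = build_examples([
    ("aspirin", "may_treat", "headache"),
    ("paracetamol", "may_treat", "fever"),
])
```

One adapter is trained per partition (PARTITION=20 above); the resulting adapters are what the mixture layer combines during downstream fine-tuning.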
To evaluate the model on a downstream task, go to the corresponding task folder and see the *.sh file for an example. For instance, the following command trains a model on the PubMedQA dataset over different shuffle_rates.
Click to expand!
MODEL="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
TOKENIZER="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
ADAPTER_NAMES="entity_predict"
PARTITION=20
python run_pretrain.py \
--model $MODEL \
--tokenizer $TOKENIZER \
--input_dir $INPUT_DIR \
--output_dir $OUTPUT_DIR \
--n_partition $PARTITION \
--use_adapter \
--non_sequential \
--adapter_names $ADAPTER_NAMES \
--amp \
--cuda \
--num_workers 32 \
--max_seq_length 64 \
--batch_size 256 \
--bi_direction \
--lr 1e-04 \
--epochs 2 \
--save_step 2000
done
Click to expand!
Parameter | Value |
---|---|
lr | 1e-04 |
epoch | 1-2 |
batch_size | 256 |
max_seq_length | 64 |
Click to expand!
Parameter | Value |
---|---|
lr | 1e-05 |
epoch | 25 |
patient | 5 |
batch_size | 8 |
max_seq_length | 512 |
repeat_run | 10 |
Click to expand!
Parameter | Value |
---|---|
lr | 1e-05,2e-05 |
epoch | 25 |
patient | 5 |
batch_size | 12 |
max_seq_length | 512 |
repeat_run | 3 |
temperature | 1 |
Click to expand!
Parameter | Value |
---|---|
lr | 1e-05 |
epoch | 25 |
patient | 5 |
batch_size | 16 |
max_seq_length | 256 |
repeat_run | 3 |
temperature | -15,-10,1 |
Click to expand!
Parameter | Value |
---|---|
lr | 1e-05,3e-05 |
epoch | 25 |
patient | 5 |
batch_size | 16,32 |
max_seq_length | 256 |
repeat_run | 5 |
temperature | 1 |
@inproceedings{meng2021mixture,
title={Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT},
author={Meng, Zaiqiao and Liu, Fangyu and Clark, Thomas and Shareghi, Ehsan and Collier, Nigel},
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
pages={4672--4681},
year={2021}
}
If you have any questions, feel free to contact me at zm324@cam.ac.uk.