***** New May 14th, 2020: ouBioBERT (full) is released *****
***** New April 15th, 2020: released *****
Thank you for your interest in our research!
The biomedical language understanding evaluation (BLUE) benchmark is a collection of resources for evaluating and analyzing biomedical
natural language representation models (Peng et al., 2019).
This repository provides our implementation of fine-tuning for the BLUE benchmark with 🤗/Transformers.
Our demonstration models are available now.
- Download the benchmark dataset from https://github.com/ncbi-nlp/BLUE_Benchmark
- Save pre-trained models (e.g., BioBERT, clinicalBERT, SciBERT, BlueBERT) to your directory.
- Use our code in utils; example commands can be found in scripts.
If you download TensorFlow checkpoints, convert them into PyTorch models to make fine-tuning easier.
Converting Tensorflow Checkpoints
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
transformers-cli convert --model_type bert \
--tf_checkpoint $BERT_BASE_DIR/bert_model.ckpt \
--config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
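Once converted, the checkpoint can be loaded with 🤗/Transformers. Below is a minimal sketch (our own example, not code from this repository): from_pretrained() looks for config.json, so copy or rename bert_config.json accordingly, and the path is a placeholder.

from transformers import BertModel, BertTokenizer

model_dir = "/path/to/bert/uncased_L-12_H-768_A-12"   # placeholder path

tokenizer = BertTokenizer.from_pretrained(model_dir)  # reads vocab.txt
model = BertModel.from_pretrained(model_dir)          # reads config.json + pytorch_model.bin

# Calling the tokenizer directly requires a recent version of transformers (>= 3.0).
inputs = tokenizer("The left atrium is moderately dilated.", return_tensors="pt")
last_hidden_state = model(**inputs)[0]                # (batch, seq_len, hidden_size)
print(last_hidden_state.shape)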
- Demonstration models for our research
- ouBioBERT-Base, Uncased # 20200514 (recommended)
- the best score on the BLUE benchmark
- trained on Focused PubMed abstracts with the Other PubMed abstracts.
- max_seq_length=512
- also available as a community-uploaded model on Hugging Face: https://huggingface.co/models.
- a sample fine-tuning command is available here.
- ouBioBERT-Base, Uncased (demo) # 20200415
- trained on Focused PubMed abstracts with the Other PubMed abstracts.
- max_seq_length=128
- BERT (sP + B + enW) # 20200512
- a validation model for our method.
- trained on a small biomedical corpus with BooksCorpus and Wikipedia.
Abbr. | Corpus | Words | Size | Domain |
---|---|---|---|---|
enW | English Wikipedia | 2,200M | 13GB | General |
B | BooksCorpus | 850M | 5GB | General |
sP | Small PubMed abstracts | 30M | 0.2GB | BioMedical |
fP | Focused PubMed abstracts | 280M | 1.8GB | BioMedical |
oP | Other PubMed abstracts | 2,800M | 18GB | BioMedical |
Table: List of the text corpora used for our models.
- Small PubMed abstracts (sP): extracted from the PubMed baseline using MeSH IDs, selecting abstracts associated with clinical research and translational research on human disease.
- Focused PubMed abstracts (fP): abstracts more closely related to human beings.
- Other PubMed abstracts (oP): articles other than Focused PubMed abstracts.
Model | Total | MedSTS | BIOSSES | BC5CDR disease | BC5CDR chemical | ShARe CLEFE | DDI | ChemProt | i2b2 | HoC | MedNLI |
---|---|---|---|---|---|---|---|---|---|---|---|
BERT (sP+B+enW) | 81.4 | 83.2 | 89.7 | 85.7 | 91.8 | 79.1 | 78.4 | 67.5 | 73.1 | 85.3 | 80.1 |
BERT-BASE | 54.8 | 52.1 | 34.9 | 66.5 | 76.7 | 56.1 | 35.3 | 29.8 | 51.1 | 78.2 | 67.0 |
BioBERT (v1.1) | 82.9 | 85.0 | 90.9 | 85.8 | 93.2 | 76.9 | 80.9 | 73.2 | 74.2 | 85.9 | 83.1 |
clinicalBERT | 81.2 | 82.7 | 88.0 | 84.6 | 92.5 | 78.0 | 76.9 | 67.6 | 74.3 | 86.1 | 81.4 |
SciBERT | 82.0 | 84.0 | 85.5 | 85.9 | 92.7 | 77.7 | 80.1 | 71.9 | 73.3 | 85.9 | 83.2 |
BlueBERT (P) | 82.9 | 85.3 | 88.5 | 86.2 | 93.5 | 77.7 | 81.2 | 73.5 | 74.2 | 86.2 | 82.7 |
BlueBERT (P+M) | 81.8 | 84.4 | 85.2 | 84.6 | 92.2 | 79.5 | 79.3 | 68.8 | 75.7 | 85.2 | 82.8 |
Table: BLUE scores of BERT (sP + B + enW) compared with those of all the BERT-Base variants for the biomedical domain as of April 2020.
Bold indicates the best result of all.
Model | Total | MedSTS | BIOSSES | BC5CDR disease | BC5CDR chemical | ShARe CLEFE | DDI | ChemProt | i2b2 | HoC | MedNLI |
---|---|---|---|---|---|---|---|---|---|---|---|
ouBioBERT | 83.8 (0.3) | 84.9 (0.6) | 92.3 (0.8) | 87.4 (0.1) | 93.7 (0.2) | 80.1 (0.4) | 81.1 (1.5) | 75.0 (0.3) | 74.0 (0.8) | 86.4 (0.5) | 83.6 (0.7) |
BioBERT (v1.1) | 82.8 (0.1) | 84.9 (0.5) | 89.3 (1.7) | 85.7 (0.4) | 93.3 (0.1) | 78.0 (0.8) | 80.4 (0.4) | 73.3 (0.4) | 74.5 (0.6) | 85.8 (0.6) | 82.9 (0.7) |
BlueBERT (P) | 82.9 (0.1) | 84.8 (0.5) | 90.3 (2.0) | 86.2 (0.4) | 93.3 (0.3) | 78.3 (0.4) | 80.7 (0.6) | 73.5 (0.5) | 73.9 (0.8) | 86.3 (0.7) | 82.1 (0.8) |
BlueBERT (P+M) | 81.6 (0.5) | 84.6 (0.8) | 82.0 (5.1) | 84.7 (0.3) | 92.3 (0.1) | 79.9 (0.4) | 78.8 (0.8) | 68.6 (0.5) | 75.8 (0.3) | 85.0 (0.4) | 83.9 (0.8) |
Table: Performance of ouBioBERT on the BLUE tasks.
The numbers are the mean (standard deviation) over five different random seeds.
The best scores are in bold.
- Preparations
- Our models
- Results
- BLUE Tasks
- Sentence similarity
- Named-entity recognition
- Relation extraction
- Document multilabel classification
- Inference task
- Total score
- Citing
- Funding
- Acknowledgments
- References
Corpus | Train | Dev | Test | Task | Metrics | Domain |
---|---|---|---|---|---|---|
MedSTS | 675 | 75 | 318 | Sentence similarity | Pearson | Clinical |
BIOSSES | 64 | 16 | 20 | Sentence similarity | Pearson | Biomedical |
BC5CDR-disease | 4182 | 4244 | 4424 | Named-entity recognition | F1 | Biomedical |
BC5CDR-chemical | 5203 | 5347 | 5385 | Named-entity recognition | F1 | Biomedical |
ShARe/CLEFE | 4628 | 1065 | 5195 | Named-entity recognition | F1 | Clinical |
DDI | 2937 | 1004 | 979 | Relation extraction | micro F1 | Biomedical |
ChemProt | 4154 | 2416 | 3458 | Relation extraction | micro F1 | Biomedical |
i2b2-2010 | 3110 | 10 | 6293 | Relation extraction | micro F1 | Clinical |
HoC | 1108 | 157 | 315 | Document classification | F1 | Biomedical |
MedNLI | 11232 | 1395 | 1422 | Inference | accuracy | Clinical |
- The sentence-similarity task is to predict similarity scores on the basis of sentence pairs.
- Metrics: Pearson correlation coefficients
- We use scipy.stats.pearsonr().
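For illustration, a minimal sketch of this computation with toy scores (the variable names and values are ours):

from scipy.stats import pearsonr

gold = [4.5, 2.0, 0.5, 3.75]   # annotated similarity scores (toy values)
pred = [4.1, 2.3, 0.9, 3.60]   # model predictions (toy values)

pearson, p_value = pearsonr(gold, pred)
print(f"Pearson correlation: {pearson:.4f}")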
MedSTS is a corpus of sentence pairs selected from the clinical data warehouse of Mayo Clinic and was used in the BioCreative/OHNLP Challenge 2018 Task 2 as ClinicalSTS (Wang et al., 2018).
Please visit the website or contact the first author to obtain a copy of the dataset.
BIOSSES is a corpus of sentence pairs selected from the Biomedical Summarization Track Training Dataset in the biomedical domain (Soğancıoğlu et al., 2017).
The BIOSSES dataset is very small, so fine-tuning on it yields unstable performance.
- The aim of the Named-entity recognition task is to predict mention spans given in a text.
- Metrics: strict version of F1-score (exact phrase matching).
- We use the primitive approach described below to deal with disjoint mentions.
There are some irregular patterns:
- starting with I: caused by long phrases split in the middle (example).
- I next to O: due to discontinuous mentions. It is often observed in ShARe/CLEFE (example).
conlleval.py appears to count them as different phrases.
We therefore handle this problem as follows during evaluation:
- Example:
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
y_true | O | O | B | I | O | | O | B | I | O | I | O | | B | I | I | | I | I | O |
y_pred | O | O | B | I | O | | O | B | I | O | O | O | | B | I | I | | I | O | I |
- skip blank lines and concatenate all the tags into a one-dimensional array.
index | 0 | 1 | 2 | 3 | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 13 | 14 | 15 | 17 | 18 | 19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
y_true | O | O | B | I | O | O | B | I | O | I | O | B | I | I | I | I | O |
y_pred | O | O | B | I | O | O | B | I | O | O | O | B | I | I | I | O | I |
- get the token index of phrases that start with B.
- y_true: 2_3, 7_8_10, 13_14_15_17_18
- y_pred: 2_3, 7_8, 13_14_15_17_19
- calculate metrics: utils/metrics/ner.py

# Phrase keys derived in the previous step (toy example from above)
y_true = ['2_3', '7_8_10', '13_14_15_17_18']
y_pred = ['2_3', '7_8', '13_14_15_17_19']

y_true = set(y_true)
y_pred = set(y_pred)
TP = len(y_true & y_pred)             # 1: {2_3}
FN = len(y_true) - TP                 # 2: {7_8_10, 13_14_15_17_18}
FP = len(y_pred) - TP                 # 2: {7_8, 13_14_15_17_19}
prec = TP / (TP + FP)                 # 1 / (1 + 2) = 0.33
rec = TP / (TP + FN)                  # 1 / (1 + 2) = 0.33
fb1 = 2 * rec * prec / (rec + prec)   # = 0.33
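For illustration, a minimal sketch of the phrase-grouping step described above; the helper name and structure are ours, not the actual implementation in utils/metrics/ner.py:

def phrases_from_tags(indices, tags):
    """Group token indices into phrase keys such as '7_8_10'.

    A phrase starts at a 'B' tag; every later 'I' before the next 'B' is
    attached to the same phrase, even across 'O' tokens, so discontinuous
    mentions (common in ShARe/CLEFE) stay in one phrase.
    """
    phrases, current = [], []
    for idx, tag in zip(indices, tags):
        if tag == "B":
            if current:
                phrases.append("_".join(map(str, current)))
            current = [idx]
        elif tag == "I":
            if not current:          # irregular phrase starting with 'I'
                current = [idx]
            else:
                current.append(idx)
    if current:
        phrases.append("_".join(map(str, current)))
    return phrases

indices = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 18, 19]
y_true_tags = list("OOBIOOBIOIOBIIIIO")
print(phrases_from_tags(indices, y_true_tags))  # ['2_3', '7_8_10', '13_14_15_17_18']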
tag of tokens | Train | Dev | Test |
---|---|---|---|
starting with B | 4182 | 4244 | 4424 |
starting with I | 0 | 0 | 0 |
I next to O | 0 | 0 | 0 |
Total | 4182 | 4244 | 4424 |
BC5CDR is a collection of 1,500 PubMed titles and abstracts selected from the CTD-Pfizer corpus and was used in the BioCreative V chemical-disease relation task (Li et al., 2016).
tag of tokens | Train | Dev | Test |
---|---|---|---|
starting with B | 5203 | 5347 | 5385 |
starting with I | 2 | 0 | 1 |
I next to O | 0 | 0 | 0 |
Total | 5205 | 5347 | 5386 |
An example of starting with I: test.tsv#L78550-L78598
Compound 10510854 553 O
7e - 562 O
, - 564 O
5 - 566 B
- - 567 I
{ - 568 I
2 - 569 I
- - 570 I
// -------------
1H - 637 I
- - 639 I
indol 10510854 641 I
- - 646 I
2 - 647 I
- - 648 I
one - 649 I
, - 652 O
// -------------
tag of tokens | Train | Dev | Test |
---|---|---|---|
starting with B | 4628 | 1065 | 5195 |
starting with I | 6 | 1 | 17 |
I next to O | 517 | 110 | 411 |
Total | 5151 | 1176 | 5623 |
ShARe/CLEFE eHealth Task 1 Corpus is a collection of 299 clinical free-text notes from the MIMIC II database (Suominen et al., 2013).
Please visit the website and sign up to obtain a copy of the dataset.
An example of I next to O: Test.tsv#L112-L118
You'd better check out these original files, too:
Task1Gold_SN2012/Gold_SN2012/00176-102920-ECHO_REPORT.txt#L2
Task1TestSetCorpus100/ALLREPORTS/00176-102920-ECHO_REPORT.txt#L21
The 00176-102920-ECHO_REPORT 426 O
left - 430 B
atrium - 435 I
is - 442 O
moderately - 445 O
dilated - 456 I
. - 463 O
- The aim of the relation-extraction task is to predict relations and their types between two entities mentioned in the sentences. The predicted relations and their types are compared with the annotated data.
- Following the implementation of the BLUE benchmark, we treat the relation-extraction task as sentence classification by replacing the two named-entity mentions of interest in the sentence with predefined tags (Lee et al., 2019).
- ORIGINAL: Citalopram protected against the RTI-76-induced inhibition of SERT binding.
- REPLACED: @CHEMICAL$ protected against the RTI-76-induced inhibition of @GENE$ binding.
- RELATION: citalopram and SERT have a chemical–gene relation.
- Evaluation:
- predict all classes, including the "false" class.
- aggregate TP, FN, and FP in each class.
- calculate metrics excluding the "false" class.
- Metrics: micro-average F1-score.
- We use sklearn.metrics.confusion_matrix() and compute TP, FP, FN and TN on each class.
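For illustration, a minimal sketch of this evaluation on toy ChemProt-style data (the toy labels are ours; for DDI the negative class is named "DDI-false"):

import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9", "false"]
y_true = ["CPR:4", "false", "CPR:3", "CPR:9", "false"]   # toy gold labels
y_pred = ["CPR:4", "CPR:3", "CPR:3", "false", "false"]   # toy predictions

cm = confusion_matrix(y_true, y_pred, labels=labels)
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp

keep = [i for i, name in enumerate(labels) if name != "false"]  # exclude the "false" class
TP, FP, FN = tp[keep].sum(), fp[keep].sum(), fn[keep].sum()

precision = TP / (TP + FP)
recall = TP / (TP + FN)
micro_f1 = 2 * precision * recall / (precision + recall)
print(f"micro-F1 (excluding false): {micro_f1:.4f}")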
class | Train | Dev | Test | note |
---|---|---|---|---|
DDI-advise | 633 | 193 | 221 | a recommendation or advice regarding a drug interaction is given. e.g. UROXATRAL should not be used in combination with other alpha-blockers. |
DDI-effect | 1212 | 396 | 360 | DDIs describing an effect or a pharmacodynamic (PD) mechanism. e.g. In uninfected volunteers, 46% developed rash while receiving SUSTIVA and clarithromycin. Chlorthalidone may potentiate the action of other antihypertensive drugs. |
DDI-int | 146 | 42 | 96 | a DDI appears in the text without providing any additional information. e.g. The interaction of omeprazole and ketoconazole has been established. |
DDI-mechanism | 946 | 373 | 302 | drug-drug interactions (DDIs) described by their pharmacokinetic (PK) mechanism. e.g. Grepafloxacin may inhibit the metabolism of theobromine. |
DDI-false | 15842 | 6240 | 4782 | |
Total | 2937 + 15842 | 1004 + 6240 | 979 + 4782 | |
The DDI extraction 2013 corpus is a collection of 792 texts selected from the DrugBank database and another 233 MEDLINE abstracts (Herrero-Zazo et al., 2013).
class | Train | Dev | Test | note |
---|---|---|---|---|
CPR:3 | 768 | 550 | 665 | UPREGULATOR|ACTIVATOR|INDIRECT_UPREGULATOR |
CPR:4 | 2251 | 1094 | 1661 | DOWNREGULATOR|INHIBITOR|INDIRECT_DOWNREGULATOR |
CPR:5 | 173 | 116 | 195 | AGONIST|AGONIST-ACTIVATOR|AGONIST-INHIBITOR |
CPR:6 | 235 | 199 | 293 | ANTAGONIST |
CPR:9 | 727 | 457 | 644 | SUBSTRATE|PRODUCT_OF|SUBSTRATE_PRODUCT_OF |
false | 15306 | 9404 | 13485 | |
Total | 4154 + 15306 | 2416 + 9404 | 3458 + 13485 | |
ChemProt comprises 1,820 PubMed abstracts with chemical–protein interactions and was used in the BioCreative VI text-mining chemical–protein interactions shared task (Krallinger et al., 2017).
class | Train | Dev | Test | note |
---|---|---|---|---|
PIP | 755 | 0 | 1448 | Medical problem indicates medical problem. |
TeCP | 158 | 8 | 338 | Test conducted to investigate medical problem. |
TeRP | 993 | 0 | 2060 | Test reveals medical problem. |
TrAP | 883 | 2 | 1732 | Treatment is administered for medical problem. |
TrCP | 184 | 0 | 342 | Treatment causes medical problem. |
TrIP | 51 | 0 | 152 | Treatment improves medical problem. |
TrNAP | 62 | 0 | 112 | Treatment is not administered because of medical problem. |
TrWP | 24 | 0 | 109 | Treatment worsens medical problem. |
false | 19050 | 86 | 36707 | They are in the same sentence, but do not fit into one of the above defined relationships. |
Total | 3110 + 19050 | 10 + 86 | 6293 + 36707 | |
i2b2 2010 shared task collection comprises 170 documents for training and 256 for testing (Uzuner et al., 2011).
The development dataset is very small, so it is difficult to determine the best model.
- The multilabel-classification task predicts multiple labels from the texts.
label | Train | Dev | Test |
---|---|---|---|
0 | 458 | 71 | 138 |
1 | 148 | 33 | 45 |
2 | 164 | 14 | 35 |
3 | 213 | 30 | 52 |
4 | 264 | 34 | 70 |
5 | 563 | 58 | 150 |
6 | 238 | 39 | 80 |
7 | 596 | 92 | 145 |
8 | 723 | 86 | 184 |
9 | 346 | 55 | 119 |
Labels: (IM) Activating invasion & metastasis, (ID) Avoiding immune destruction, (CE) Deregulating cellular energetics,
(RI) Enabling replicative immortality, (GS) Evading growth suppressors, (GI) Genome instability & mutation,
(A) Inducing angiogenesis, (CD) Resisting cell death, (PS) Sustaining proliferative signaling, (TPI) tumor promoting inflammation
Note: This table shows the number of each label on the sentence level, rather than on the abstract level.
- Train: 10,527 sentences / 1,108 articles
- Dev: 1,496 sentences / 157 articles
- Test: 2,896 sentences / 315 articles
HoC (the Hallmarks of Cancers corpus) comprises 1,580 PubMed publication abstracts manually annotated using ten currently known hallmarks of cancer (Baker et al., 2016).
- Evaluation:
- predict multi-labels for each sentence in the document.
- combine the labels in one document and compare them with the gold-standard.
- Metrics: example-based F1-score on the abstract level (Zhang and Zhou, 2014; Du et al., 2019).
- We use eval_hoc.py from the BLUE benchmark repository to calculate the metrics.
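For illustration, a minimal sketch of the abstract-level, example-based F1, assuming the per-sentence predictions have already been combined into one label set per document; the handling of documents without any label is our assumption, and eval_hoc.py remains the authoritative implementation:

def example_based_f1(gold_docs, pred_docs):
    scores = []
    for gold, pred in zip(gold_docs, pred_docs):
        if not gold and not pred:                 # no labels at all: count as perfect
            scores.append(1.0)
            continue
        overlap = len(gold & pred)
        scores.append(2 * overlap / (len(gold) + len(pred)))
    return sum(scores) / len(scores)

gold_docs = [{"PS", "A"}, {"CD"}, set()]          # toy abstract-level label sets
pred_docs = [{"PS"}, {"CD", "GI"}, set()]
print(example_based_f1(gold_docs, pred_docs))     # (2/3 + 2/3 + 1) / 3 ≈ 0.78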
- The inference task aims to predict whether the relationship between the premise and hypothesis sentences is contradiction, entailment, or neutral.
- Metrics: overall accuracy
- We use sklearn.metrics.confusion_matrix() and sklearn.metrics.accuracy_score() to compute TP, FP, FN and TN on each class and overall accuracy.
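For illustration, a minimal sketch with toy labels (the values are ours):

from sklearn.metrics import accuracy_score, confusion_matrix

classes = ["contradiction", "entailment", "neutral"]
y_true = ["entailment", "neutral", "contradiction", "neutral"]   # toy gold labels
y_pred = ["entailment", "neutral", "neutral", "neutral"]         # toy predictions

print(confusion_matrix(y_true, y_pred, labels=classes))  # per-class counts
print(accuracy_score(y_true, y_pred))                    # 0.75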
class | Train | Dev | Test |
---|---|---|---|
contradiction | 3744 | 465 | 474 |
entailment | 3744 | 465 | 474 |
neutral | 3744 | 465 | 474 |
Total | 11232 | 1395 | 1422 |
MedNLI is a collection of sentence pairs selected from MIMIC-III (Romanov and Shivade, 2018).
Please visit the website and sign up to obtain a copy of the dataset.
Following the practice in Peng et al. (2019), we use a macro-average of the Pearson scores and F1-scores to determine a pre-trained model's overall position.
The results are shown in the tables above.
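For reference, a minimal sketch of the macro-average using the ouBioBERT means from the table above; the small difference from the reported total comes from rounding of the per-task means:

task_scores = {
    "MedSTS": 84.9, "BIOSSES": 92.3, "BC5CDR-disease": 87.4,
    "BC5CDR-chemical": 93.7, "ShARe/CLEFE": 80.1, "DDI": 81.1,
    "ChemProt": 75.0, "i2b2-2010": 74.0, "HoC": 86.4, "MedNLI": 83.6,
}

total = sum(task_scores.values()) / len(task_scores)
print(round(total, 2))   # 83.85; the reported 83.8 is computed from unrounded per-seed scores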
If you use our work in your research, please kindly cite the following papers:
the original paper of the BLUE Benchmark
- Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019). 2019:58-65.
Our research
@misc{2005.07202,
Author = {Shoya Wada and Toshihiro Takeda and Shiro Manabe and Shozo Konishi and Jun Kamohara and Yasushi Matsumura},
Title = {A pre-training technique to localize medical BERT and enhance BioBERT},
Year = {2020},
Eprint = {arXiv:2005.07202},
}
This work was supported by Council for Science, Technology and Innovation (CSTI), cross-ministerial Strategic Innovation Promotion Program (SIP), "Innovative AI Hospital System" (Funding Agency: National Institute of Biomedical Innovation, Health and Nutrition (NIBIOHN)).
We are grateful to the authors of BERT for making their data and code publicly available. We thank the NVIDIA team, whose implementation of BERT for PyTorch enabled us to pre-train BERT models on our local machine. We would also like to thank Yifan Peng and the shared-task organizers for publishing the BLUE benchmark.
- Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP). 2019: 58-65.
- Wang Y, Afzal N, Fu S, Wang L, Shen F, Rastegar-Mojarad M, Liu H. MedSTS: a resource for clinical semantic textual similarity. Language Resources and Evaluation. 2018 Jan 1:1-6.
- Soğancıoğlu G, Öztürk H, Özgür A. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics. 2017 Jul 15; 33(14): i49–i58.
- Li J, Sun Y, Johnson RJ, Sciaky D, Wei CH, Leaman R, Davis AP, et al.. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database: the journal of biological databases and curation. 2016.
- Suominen H, Salanterä S, Velupillai S, Chapman WW, Savova G, Elhadad N, Pradhan S, et al.. Overview of the ShARe/CLEF eHealth evaluation lab 2013. Information Access Evaluation Multilinguality, Multimodality, and Visualization. 2013. Springer. 212–231.
- Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2019;36(4):1234-40.
- Herrero-Zazo M, Segura-Bedmar I, Martínez P, Declerck T. The DDI corpus: an annotated corpus with pharmacological substances and drug-drug interactions. Journal of biomedical informatics. 2013 46: 914–920.
- Krallinger M, Rabal O, Akhondi SA, Pérez MP, Santamaría JL, Rodríguez GP, Tsatsaronis G, et al.. Overview of the BioCreative VI chemical-protein interaction track. In Proceedings of BioCreative. 2017. 141–146.
- Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association (JAMIA). 2011 18: 552–556.
- Baker S, Silins I, Guo Y, Ali I, Högberg J, Stenius U, Korhonen A. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics (Oxford, England). 2016 32: 432–440.
- Zhang ML, Zhou ZH. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering. 2014 26(8): 1819–1837.
- Du J, Chen Q, Peng Y, Xiang Y, Tao C, Lu Z. ML-Net: multilabel classification of biomedical texts with deep neural networks. Journal of the American Medical Informatics Association (JAMIA). 2019 Nov; 26(11); 1279–1285.
- Romanov A, Shivade C. Lessons from Natural Language Inference in the Clinical Domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018: 1586-1596.