BIDS-Xu-Lab/Biomedical-NLP-Benchmarks

Large language models for biomedical natural language processing: benchmarks, baselines, and recommendations

This is the GitHub repository for the manuscript "A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations". The related data, models, and code are publicly available and described below.

1. Benchmarks and models

This study includes 12 benchmarks covering six biomedical natural language processing applications: named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification.

The benchmarks are under the benchmarks folder, as explained below.

1.1 Original datasets

Each benchmark has a full_set folder: the original training (train), development (dev), and testing (test) datasets from the existing studies are located in the benchmarks/{dataset_name}/datasets/full_set/ directory.
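
For a quick overview of what ships with each benchmark, the following sketch simply lists the split files under every full_set directory. It assumes the layout described above; the per-dataset file formats may differ between benchmarks, so nothing is parsed here.

```python
# Minimal sketch: enumerate the train/dev/test files shipped with each benchmark.
# Assumes the repository layout described above; file formats vary per dataset,
# so this only lists what is present rather than parsing it.
from pathlib import Path

benchmarks_root = Path("benchmarks")

for dataset_dir in sorted(benchmarks_root.iterdir()):
    full_set = dataset_dir / "datasets" / "full_set"
    if not full_set.is_dir():
        continue
    splits = sorted(p.name for p in full_set.iterdir())
    print(f"{dataset_dir.name}: {splits}")
```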

1.2 Prompts for zero- and few-shot

We also made the prompts used in the study publicly available. For each dataset, the zero- and few-shot prompts are provided in the benchmarks/{dataset_name}/ directory. For instance, the one-shot prompt for PubMedQA is as follows:

TASK: Your task is to answer biomedical questions using the given abstract. Only output yes, no, or maybe as answer. 
INPUT: The input is a question followed by an abstract.
OUTPUT: Answer each question by providing one of the following options: yes, no, maybe.
Example
INPUT: Does hippocampal atrophy on MRI predict cognitive decline? ["To investigate whether the presence of hippocampal atrophy (HCA) on MRI in Alzheimer's disease (AD) leads to a more rapid decline in cognitive function. To investigate whether cognitively unimpaired controls and depressed subjects with HCA are at higher risk than those without HCA of developing dementia.", 'A prospective follow-up of subjects from a previously reported MRI study.', 'Melbourne, Australia.', 'Five controls with HCA and five age-matched controls without HCA, seven depressed subjects with HCA and seven without HCA, and 12 subjects with clinically diagnosed probable AD with HCA and 12 without HCA were studied. They were followed up at approximately 2 years with repeat cognitive testing, blind to initial diagnosis and MRI result.', 'HCA was rated by two radiologists blind to cognitive test score results. Cognitive assessment was by the Cambridge Cognitive Examination (CAMCOG).', 'No significant differences in rate of cognitive decline, mortality or progression to dementia were found between subjects with or without HCA.']
OUTPUT: no
Input: {Input}
Output:

The example input and output come from an instance in the training set; {Input} is replaced with an instance from the testing set at inference time.
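
As a minimal sketch of how such a prompt is assembled, the snippet below fills the {Input} placeholder with a test instance. The template string mirrors the PubMedQA example above; the helper function and variable names are ours and do not reflect how the prompt files are stored in the repository.

```python
# Minimal sketch: fill the {Input} placeholder of a one-shot prompt with a test
# instance. The template text mirrors the PubMedQA example above; how the prompt
# files are actually stored on disk is not assumed here.
ONE_SHOT_TEMPLATE = """TASK: Your task is to answer biomedical questions using the given abstract. Only output yes, no, or maybe as answer.
INPUT: The input is a question followed by an abstract.
OUTPUT: Answer each question by providing one of the following options: yes, no, maybe.
Example
INPUT: {example_input}
OUTPUT: {example_output}
Input: {input}
Output:"""

def build_prompt(example_input: str, example_output: str, test_input: str) -> str:
    """Return a one-shot prompt for a single test instance."""
    return ONE_SHOT_TEMPLATE.format(
        example_input=example_input,
        example_output=example_output,
        input=test_input,
    )
```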

1.3 Preprocessed dataset for instruction fine-tuning

We also provide the preprocessed datasets for instruction fine-tuning via the links in the table below (an illustrative sketch of a possible record format follows the table).

| Dataset | Train/Dev | Test |
|---|---|---|
| [NER]BC5CDR-chemical | Train / Dev | Test |
| [NER]NCBI Disease | Train / Dev | Test |
| [RE]ChemProt | Train / Dev | Test |
| [RE]DDI2013 | Train / Dev | Test |
| [MLC]HoC | Train / Dev | Test |
| [MLC]LitCovid | Train / Dev | Test |
| [QA]MedQA(5-option) | Train / Dev | Test |
| [QA]PubMedQA | Train / Dev | Test |
| [Summarization]PubMed | Train / Dev | Test |
| [Summarization]MS^2 | Train / Dev | Test |
| [Simplification]Cochrane | Train / Dev | Test |
| [Simplification]PLOS | Train / Dev | Test |
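
The exact schema of the preprocessed files is not described here, so the record below is a hypothetical illustration only: one way a PubMedQA training instance could be flattened into an instruction-tuning record. The field names are assumptions; please check the linked files for the actual format.

```python
# Hypothetical illustration only: one way a PubMedQA training instance could be
# flattened into an instruction-tuning record. The field names are assumptions;
# the actual preprocessed files linked above may use a different layout.
import json

record = {
    "instruction": "Your task is to answer biomedical questions using the given "
                   "abstract. Only output yes, no, or maybe as answer.",
    "input": "Does hippocampal atrophy on MRI predict cognitive decline? [...]",
    "output": "no",
}

print(json.dumps(record, indent=2))
```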

1.4 Instruction fine-tuned models

We also made the instruction fine-tuned models used in the study publicly available via the links in the table below.

| Dataset | LLAMA | PMC-LLAMA |
|---|---|---|
| [NER]BC5CDR-chemical | LLAMA 2 13B | PMC-LLAMA 13B |
| [NER]NCBI Disease | LLAMA 2 13B | PMC-LLAMA 13B |
| [RE]ChemProt | LLAMA 2 13B | PMC-LLAMA 13B |
| [RE]DDI2013 | LLAMA 2 13B | PMC-LLAMA 13B |
| [MLC]HoC | LLAMA 2 13B | PMC-LLAMA 13B |
| [MLC]LitCovid | LLAMA 2 13B | PMC-LLAMA 13B |
| [QA]MedQA(5-option) | LLAMA 2 13B | PMC-LLAMA 13B |
| [QA]PubMedQA | LLAMA 2 13B | PMC-LLAMA 13B |
| [Summarization]PubMed | LLAMA 2 13B | PMC-LLAMA 13B |
| [Summarization]MS^2 | LLAMA 2 13B | PMC-LLAMA 13B |
| [Simplification]Cochrane | LLAMA 2 13B | PMC-LLAMA 13B |
| [Simplification]PLOS | LLAMA 2 13B | PMC-LLAMA 13B |

2. Inference

2.1 Inference for GPT models

The inference code for GPT models is under the GPT folder.

To generate predictions for the generative/reasoning tasks ([QA]MedQA(5-option), [QA]PubMedQA, [Summarization]PubMed, [Summarization]MS^2, [Simplification]Cochrane, [Simplification]PLOS), please use the following command:

python generative_tasks/run_gpt.py \
 --dataset {medqa5 | pubmedqa | pubmed | ms2 | cochrane | plos} \
 --model {gpt-35-turbo-16k | gpt-4-32k } \
 --setting {zero_shot | one_shot}

Predictions and corresponding gold labels are saved in JSON format, for example, ms2_gpt-4-32k_one_shot.json. The JSON files include both the predicted outputs and the gold standard labels for all examples within this dataset.
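
The structure of these JSON files is not spelled out in this README, so the snippet below is only a sketch of how one might pair predictions with gold labels; the top-level layout and the key names are assumptions to verify against an actual output file.

```python
# Minimal sketch: load a predictions file produced by run_gpt.py and pair each
# prediction with its gold label. The list-of-dicts layout and the key names
# ("prediction", "gold") are assumptions -- inspect one output file to confirm.
import json

with open("ms2_gpt-4-32k_one_shot.json") as f:
    results = json.load(f)

for item in results:
    print(item.get("prediction"), "|", item.get("gold"))
```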

To generate predictions for the extractive/classification tasks ([NER]BC5CDR-chemical, [NER]NCBI Disease, [RE]ChemProt, [RE]DDI2013, [MLC]HoC, [MLC]LitCovid), please use the following two commands:

python extractive_tasks/run_gpt.py

and

python extractive_tasks/run_convert_pred_2_json.py

to generate all predictions (six extractive tasks, for GPT-3.5 / GPT-4 in the zero_shot / one_shot settings) at once. Predictions and corresponding gold labels are saved in JSON format, for example, Hoc_gpt4_os.json. The JSON files include both the predicted outputs and the gold standard labels for all examples within this dataset.

2.2 Inference for Llama models

The inference code for Llama models is under the llama folder.

Please adhere to the instructions in the llama folder. Note that the evaluation script within this folder serves merely as a reference. For consistent results across all models — including Llama models and GPT models — we used run_eval.py for evaluations.

Predictions and corresponding gold labels are saved in JSON format, for example, ms2_llama2_13b_chat_one_shot.json. The JSON files include both the predicted outputs and the gold standard labels for all examples within this dataset.

3. Instruction fine-tuning Llama models

The instruction fine-tuning code is under the llmindcarft folder.

Please adhere to the instructions in the folder, which provides both the preprocessing scripts and fine-tuning docker images.

For NER and RE tasks, run:

./llama/scripts/run-NER.sh 

or

./llama/scripts/run-RE.sh 

The model argument can be set to any Hugging Face-based LLaMA model.
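
For instance, assuming the scripts accept any causal LLaMA checkpoint hosted on the Hugging Face Hub, such a checkpoint can be loaded as follows (the model id below is only an example; access to the official Llama 2 weights requires accepting Meta's license on the Hub):

```python
# Example only: load a Hugging Face-hosted LLaMA checkpoint that could be passed
# as the model argument. Substitute any other causal LLaMA checkpoint you have
# access to.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # example checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```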

4. Evaluation

Please use run_eval.py for evaluation.

Before evaluation, please download the BART checkpoint (required for the BART-based metrics).

To evaluate on various datasets or tasks, please use the following command:

python run_eval.py \
 --json_file {ms2_gpt-4-32k_one_shot.json | ms2_llama2_13b_chat_one_shot.json | ...} \
 --format_type {gpt | llama} \ 
 --task {NER | RE | MLC | QA | summarization | simplification}
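
run_eval.py remains the authoritative evaluation script for all results reported here. Purely as an illustration, the sketch below shows how a Rouge-L F1 of the kind reported for the summarization and simplification tasks can be computed with the rouge-score package; it is not the project's evaluation code.

```python
# Illustrative only: compute a Rouge-L F1 with the rouge-score package.
# run_eval.py is the script actually used for the reported results.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "aspirin reduced the risk of recurrent stroke in this cohort"
prediction = "aspirin lowered recurrent stroke risk"

score = scorer.score(reference, prediction)["rougeL"]
print(f"Rouge-L F1: {score.fmeasure:.4f}")
```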

5. Results

In the table below, 0s, 1s, and 5s denote the zero-, one-, and five-shot settings, respectively.

| Dataset | Main metrics | SOTA results before LLMs | GPT-3.5 0s | GPT-4 0s | LLAMA2 13B 0s | GPT-3.5 1s | GPT-4 1s | LLAMA2 13B 1s | GPT-3.5 5s | GPT-4 5s | LLAMA2 13B 5s | LLAMA2 13B fine-tuned | PMC LLAMA 13B fine-tuned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [NER]BC5CDR-chemical | Entity F1 | 0.9500 | 0.6274 | 0.7993 | 0.3944 | 0.7133 | 0.8327* | 0.6276 | 0.7228 | 0.7979 | 0.5530 | 0.9149 | 0.9063 |
| [NER]NCBI Disease | Entity F1 | 0.9090 | 0.4060 | 0.5827 | 0.2211 | 0.4817 | 0.5988 | 0.3811 | 0.4309 | 0.6389* | 0.4847 | 0.8682* | 0.8353 |
| [RE]ChemProt | Macro F1 | 0.7344 | 0.1345 | 0.3250 | 0.1392 | 0.1280 | 0.3391 | 0.0718 | 0.1758 | 0.3756 | 0.0967 | 0.4612* | 0.3111 |
| [RE]DDI2013 | Macro F1 | 0.7919 | 0.2004 | 0.2968 | 0.1305 | 0.2126 | 0.3312 | 0.1779 | 0.1706 | 0.3276 | 0.1663 | 0.6218 | 0.5700 |
| [MLC]HoC | Macro F1 | 0.8882 | 0.6722 | 0.7109 | 0.1285 | 0.6671 | 0.7093 | 0.3072 | 0.6994 | 0.7099 | 0.1797 | 0.6957* | 0.4221 |
| [MLC]LitCovid | Macro F1 | 0.8921 | 0.5967 | 0.5883 | 0.3825 | 0.6009 | 0.5901 | 0.4808 | 0.6179 | 0.6077 | 0.3305 | 0.5725* | 0.4273 |
| [QA]MedQA(5-option) | Accuracy | 0.4195 | 0.4988 | 0.7156 | 0.2522 | 0.5161 | 0.7439 | 0.2899 | 0.5208 | 0.7651* | 0.3504 | 0.4462* | 0.3975 |
| [QA]PubMedQA | Accuracy | 0.7340 | 0.6560 | 0.6280 | 0.5520 | 0.4600 | 0.7100 | 0.2660 | 0.6920 | 0.7580* | 0.6000 | 0.8040* | 0.7680 |
| [Summarization]PubMed | Rouge-L | 0.4316 | 0.2274 | 0.2419 | 0.1190 | 0.2351 | 0.2427 | 0.0989 | 0.2423 | 0.2444 | 0.1629 | 0.1857* | 0.1684 |
| [Summarization]MS^2 | Rouge-L | 0.2080 | 0.0889 | 0.1224 | 0.0948 | 0.1132 | 0.1248 | 0.0320 | 0.1013 | 0.1218 | 0.1205 | 0.0934* | 0.0059 |
| [Simplification]Cochrane | Rouge-L | 0.4476 | 0.2365 | 0.2375 | 0.2081 | 0.2447 | 0.2385 | 0.2207 | 0.2470 | 0.2469 | 0.2283 | 0.2355 | 0.2370 |
| [Simplification]PLOS | Rouge-L | 0.4368 | 0.2323 | 0.2253 | 0.2121 | 0.2449* | 0.2386 | 0.1836 | 0.2416 | 0.2409 | 0.1656 | 0.2583 | 0.2577 |
| Macro-average | | 0.6536 | 0.3814 | 0.4561 | 0.2362 | 0.3848 | 0.4750 | 0.2614 | 0.4052 | 0.4862 | 0.2866 | 0.5131 | 0.4422 |

6. Additional results

Additional results are under Supplementary_Materials.

7. Citation

If you use our work, please cite:

Chen, Q., Du, J., Hu, Y., Keloth, V.K., Peng, X., Raja, K., Zhang, R., Lu, Z. and Xu, H., 2023. Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations. arXiv preprint arXiv:2305.16326.