This is the official repository for the BioDEX paper.
BioDEX is a raw resource for drug safety monitoring that bundles full-text and abstract-only PubMed papers with drug safety reports. These reports contain structured information about the Adverse Drug Events (ADEs) described in the papers, and are produced by medical experts in real-world settings.
BioDEX contains 19k full-text papers, 65k abstracts, and over 256k associated drug-safety reports.
Our data and models are available on Hugging Face. If you're interested in full drug-reports, use BioDEX-ICSR. If you're here to only extract reactions (as in In-Context Learning for Extreme Multi-Label Classification), use BioDEX-Reactions.
This repository is structured as follows:
- demo.ipynb contains some quick demonstrations of the data.
- analysis/ contains the data and notebooks to reproduce all plots in the paper.
- src/ contains all code to represent the data objects and calculate the metrics.
- data_creation/ contains the code to create the Report-Extraction dataset starting from the raw resource. Code to create the raw resource from scratch will be released soon.
- tasks/icsr_extraction/ contains the code to train and evaluate models for the Report-Extraction task.
- Installation
- Demos
- Train and Evaluate models
- Limitations
- Contact
- Data License
- Citation
- BioDEX Data Schema
Create the conda environment and install the code:
conda create -n biodex python=3.9
conda activate biodex
pip install -r requirements.txt
pip install .
You can find the code for these demos in demo.ipynb or in the sections below.
import datasets
# load the raw dataset
dataset = datasets.load_dataset("BioDEX/raw_dataset")['train']
print(len(dataset)) # 65,648
# investigate an example
article = dataset[1]['article']
report = dataset[1]['reports'][0]
print(article['title']) # Case Report: Perioperative Kounis Syndrome in an Adolescent With Congenital Glaucoma.
print(article['abstract']) # A 12-year-old male patient suffering from congenital glaucoma developed bradycardia, ...
print(article['fulltext']) # ...
print(article['fulltext_license']) # CC BY
print(report['patient']['patientsex']) # 1
print(report['patient']['drug'][0]['activesubstance']['activesubstancename']) # ATROPINE SULFATE
print(report['patient']['drug'][0]['drugadministrationroute']) # 040
print(report['patient']['drug'][1]['activesubstance']['activesubstancename']) # MIDAZOLAM
print(report['patient']['drug'][1]['drugindication']) # Anaesthesia
print(report['patient']['reaction'][0]['reactionmeddrapt']) # Kounis syndrome
print(report['patient']['reaction'][1]['reactionmeddrapt']) # Hypersensitivity
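
Several report fields hold coded values rather than readable labels (for example `patientsex` and `drugadministrationroute`). The sketch below shows one way to decode them. The lookup tables are copied by hand from the report schema at the end of this README, and the `decode` helper and the stand-in `report` dict are hypothetical, not part of the BioDEX code.

```python
# Hypothetical lookup tables, copied from the report schema in this README.
PATIENT_SEX = {"0": "Unknown", "1": "Male", "2": "Female"}
# Only a few of the 67 route codes are listed here for brevity.
ADMIN_ROUTE = {
    "040": "Intravenous bolus",
    "041": "Intravenous drip",
    "042": "Intravenous (not otherwise specified)",
    "048": "Oral",
}


def decode(code, table):
    """Return the label for a coded value, or the raw code if it is not in the table."""
    if code is None:
        return None
    return table.get(str(code), str(code))


# A hand-written stand-in for dataset[1]['reports'][0], using the values printed above.
report = {
    "patient": {
        "patientsex": "1",
        "drug": [{"drugadministrationroute": "040"}],
    }
}

sex = decode(report["patient"]["patientsex"], PATIENT_SEX)
route = decode(report["patient"]["drug"][0]["drugadministrationroute"], ADMIN_ROUTE)
print(sex)    # Male
print(route)  # Intravenous bolus
```

Unknown codes are returned unchanged so that a partial table never hides data.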
Optionally, use our code to parse the raw resource into Python objects for easy manipulation:
import datasets
from src.utils import get_matches
# load the raw dataset
dataset = datasets.load_dataset("BioDEX/raw_dataset")['train']
dataset = get_matches(dataset)
print(len(dataset)) # 65,648
# investigate an example
article = dataset[1].article
report = dataset[1].reports[0]
print(article.title) # Case Report: Perioperative Kounis Syndrome in an Adolescent With Congenital Glaucoma.
print(article.abstract) # A 12-year-old male patient suffering from congenital glaucoma developed bradycardia, ...
print(article.fulltext) # ...
print(article.fulltext_license) # CC BY
print(report.patient.patientsex) # 1
print(report.patient.drug[0].activesubstance.activesubstancename) # ATROPINE SULFATE
print(report.patient.drug[0].drugadministrationroute) # 040
print(report.patient.drug[1].activesubstance.activesubstancename) # MIDAZOLAM
print(report.patient.drug[1].drugindication) # Anaesthesia
print(report.patient.reaction[0].reactionmeddrapt) # Kounis syndrome
print(report.patient.reaction[1].reactionmeddrapt) # Hypersensitivity
import datasets
# load the report-extraction dataset
dataset = datasets.load_dataset("BioDEX/BioDEX-ICSR")
print(len(dataset['train'])) # 9,624
print(len(dataset['validation'])) # 2,407
print(len(dataset['test'])) # 3,628
example = dataset['train'][0]
print(example['fulltext_processed'][:1000], '...') # TITLE: # SARS-CoV-2-related ARDS in a maintenance hemodialysis patient ...
print(example['target']) # serious: 1 patientsex: 1 drugs: ACETAMINOPHEN, ASPIRIN ...
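
The `target` field is a flat string of `key: value` pairs. As a minimal sketch, assuming the four keys seen in the examples in this README (`serious`, `patientsex`, `drugs`, `reactions`) and comma-separated lists, it can be split back into a dictionary. The `parse_target` helper is hypothetical and is not the parsing code shipped in src/.

```python
import re

# Assumed keys, taken from the example target strings in this README.
FIELDS = ["serious", "patientsex", "drugs", "reactions"]


def parse_target(target):
    """Split a flat 'key: value key: value' target string into a dict."""
    pattern = r"\b(" + "|".join(FIELDS) + r"):"
    # re.split with a capture group yields [prefix, key, value, key, value, ...]
    parts = re.split(pattern, target)
    parsed = {}
    for key, value in zip(parts[1::2], parts[2::2]):
        value = value.strip()
        if key in ("drugs", "reactions"):
            parsed[key] = [item.strip() for item in value.split(",") if item.strip()]
        else:
            parsed[key] = value
    return parsed


example_target = (
    "serious: 1 patientsex: 2 drugs: AMLODIPINE BESYLATE, LISINOPRIL "
    "reactions: Intentional overdose, Metabolic acidosis, Shock"
)
parsed = parse_target(example_target)
print(parsed["serious"])    # 1
print(parsed["drugs"])      # ['AMLODIPINE BESYLATE', 'LISINOPRIL']
print(parsed["reactions"])  # ['Intentional overdose', 'Metabolic acidosis', 'Shock']
```

This sketch would mis-split a value that itself contains one of the key names followed by a colon.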
from transformers import AutoTokenizer, T5ForConditionalGeneration
import datasets
# load the report-extraction dataset
dataset = datasets.load_dataset("BioDEX/BioDEX-ICSR")
# load the model
model_path = "BioDEX/flan-t5-large-report-extraction"
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# get an input and encode it
input_text = dataset['validation'][1]['fulltext_processed']
input_encoded = tokenizer(input_text, max_length=2048, truncation=True, padding="max_length", return_tensors='pt')
# forward pass
output_encoded = model.generate(**input_encoded, max_length=256)
output = tokenizer.batch_decode(output_encoded, skip_special_tokens=True)
output = output[0]
print(output) # serious: 1 patientsex: 2 drugs: AMLODIPINE BESYLATE, LISINOPRIL reactions: Intentional overdose, Metabolic acidosis, Shock
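
To get a rough feel for how close a generated report is to its reference, you can compare the predicted and gold reaction lists as sets. This is only an illustrative sketch: the official metrics live in src/ and may differ, the `set_prf` helper is hypothetical, and the gold list below is made up.

```python
def set_prf(predicted, gold):
    """Precision, recall and F1 between two lists of strings, compared case-insensitively."""
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Predicted reactions from the output above; the gold list is invented for illustration.
predicted = ["Intentional overdose", "Metabolic acidosis", "Shock"]
gold = ["Intentional overdose", "Shock", "Hypotension", "Lactic acidosis"]

p, r, f1 = set_prf(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```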
All code for this task is located in tasks/icsr_extraction/.
Make sure to activate the biodex environment!
cd tasks/icsr_extraction
python run_encdec_for_icsr_extraction.py \
--overwrite_cache False \
--seed 42 \
--dataset_name BioDEX/BioDEX-ICSR \
--text_column fulltext_processed \
--summary_column target \
--model_name_or_path google/flan-t5-large \
--output_dir ../../checkpoints/flan-t5-large-report-extraction \
--max_source_length 2048 \
--max_target_length 256 \
--do_train True \
--do_eval True \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \
--learning_rate 0.0001 \
--optim adafactor \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--eval_accumulation_steps 16 \
--num_train_epochs 5 \
--bf16 True \
--evaluation_strategy epoch \
--logging_strategy steps \
--save_strategy epoch \
--logging_steps 100 \
--save_total_limit 1 \
--report_to wandb \
--load_best_model_at_end True \
--metric_for_best_model loss \
--greater_is_better False \
--predict_with_generate True \
--generation_max_length 256 \
--num_beams 1 \
--repetition_penalty 1.0
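
The batch-size flags above combine into an effective batch size. The arithmetic below is a quick sketch, assuming a single GPU (`num_devices = 1` is an assumption, not something the command states) and the train split size printed earlier. The step counts reported by the trainer may differ slightly because of rounding.

```python
import math

train_examples = 9_624      # size of the BioDEX-ICSR train split, as printed above
per_device_batch_size = 1   # --per_device_train_batch_size
grad_accum_steps = 4        # --gradient_accumulation_steps
num_epochs = 5              # --num_train_epochs
num_devices = 1             # assumption: a single GPU

effective_batch_size = per_device_batch_size * grad_accum_steps * num_devices
updates_per_epoch = math.ceil(train_examples / effective_batch_size)
total_updates = updates_per_epoch * num_epochs

print(effective_batch_size)  # 4
print(updates_per_epoch)     # 2406
print(total_updates)         # 12030
```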
Thus far, we only consider fine-tuning encoder-decoder models in the paper. Training a decoder-only model is still a work in progress, but we've supplied some code at ./tasks/icsr_extraction/run_decoder_for_icsr_extraction.py.
Evaluate using our model on Hugging Face:
cd tasks/icsr_extraction
python run_encdec_for_icsr_extraction.py \
--overwrite_cache False \
--seed 42 \
--dataset_name BioDEX/BioDEX-ICSR \
--text_column fulltext_processed \
--summary_column target \
--model_name_or_path BioDEX/flan-t5-large-report-extraction \
--output_dir ../../checkpoints/flan-t5-large-report-extraction \
--max_source_length 2048 \
--max_target_length 256 \
--do_train False \
--do_eval True \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \
--learning_rate 0.0001 \
--optim adafactor \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--eval_accumulation_steps 16 \
--num_train_epochs 5 \
--bf16 True \
--evaluation_strategy epoch \
--logging_strategy steps \
--save_strategy epoch \
--logging_steps 100 \
--save_total_limit 1 \
--report_to wandb \
--load_best_model_at_end True \
--metric_for_best_model loss \
--greater_is_better False \
--predict_with_generate True \
--generation_max_length 256 \
--num_beams 1 \
--repetition_penalty 1.0
Add --do_predict True to get the results on the test set.
We use the DSP framework to perform in-context learning experiments.
At the time of writing, DSP does not support a truncation strategy. This is vital for our task given the long inputs. To fix this and reproduce our results, you need to replace the predict.py file of your local dsp package (path/to/local/dsp/primitives/predict.py) with the adapted version located at tasks/icsr_extraction/dsp_predict_path.py.
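
To illustrate why truncation matters, here is a minimal sketch of one possible strategy: keep the instructions and demonstrations intact and trim the paper text until the prompt fits the budget. This is not the code in the adapted predict.py and not DSP's API. The `truncate_to_budget` helper is hypothetical, and tokens are approximated by whitespace-separated words.

```python
def truncate_to_budget(instructions, demos, paper, max_prompt_tokens):
    """Trim the paper so that instructions + demos + paper fit in the token budget.

    Tokens are approximated by whitespace-separated words (an assumption; a real
    setup would count tokens with the model's own tokenizer).
    """
    def count(text):
        return len(text.split())

    fixed = count(instructions) + sum(count(d) for d in demos)
    budget = max(max_prompt_tokens - fixed, 0)
    words = paper.split()
    return " ".join(words[:budget])


instructions = "Extract the drug safety report from the paper."   # 8 words
demos = ["paper: short example report: serious: 1"]               # 6 words
paper = " ".join(f"word{i}" for i in range(100))                  # 100 words

truncated = truncate_to_budget(instructions, demos, paper, max_prompt_tokens=30)
print(len(truncated.split()))  # 16
```

Trimming only the paper keeps the demonstrations complete, at the cost of dropping the end of long papers.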
Run text-davinci-003:
cd tasks/icsr_extraction
python run_gpt3_for_icsr_extraction.py \
--max_dev_samples 100 \
--max_tokens 128 \
--max_prompt_length 4096 \
--n_demos 7 \
--output_dir ../../checkpoints/ \
--model_name text-davinci-003 \
--fulltext True
Run gpt-4:
cd tasks/icsr_extraction
python run_gpt3_for_icsr_extraction.py \
--max_dev_samples 100 \
--max_tokens 128 \
--max_prompt_length 4096 \
--n_demos 7 \
--output_dir ../../checkpoints/ \
--model_name gpt-4 \
--chat_model True \
--fulltext True
Add --validation_split test to get the results on the test set.
See section 9 of the BioDEX paper for limitations and ethical considerations.
Open an issue on this GitHub page or email karel[dot]doosterlinck[at]ugent[dot]be, preferably including "[BioDEX]" in the subject.
BioDEX bundles the following resources:
- Medline: This produces all article fields except fulltext and fulltext_license.
- FAERS: This produces all report fields and is covered under a CC0 license, as stated on their website.
- PubMed Central Open Access Subset: This produces the fulltext and fulltext_license fields for the article. The PubMed Central Open Access Subset covers papers that are copyrighted under Creative Commons or similarly liberal licenses. BioDEX features full-text papers from the commercial (CC0, CC BY, CC BY-SA, CC BY-ND) and non-commercial (CC BY-NC, CC BY-NC-SA, CC BY-NC-ND) sets. The license is denoted per applicable BioDEX example in the fulltext_license field of the article.
Medline was provided by courtesy of the U.S. National Library of Medicine (NLM). This does not imply the NLM has endorsed BioDEX. The data distributed in BioDEX does not reflect the most current/accurate data available from NLM.
Filter the raw resource to only include fulltext papers with a commercial license:
import datasets
# load the raw dataset
dataset = datasets.load_dataset("BioDEX/raw_dataset")['train']
print(len(dataset)) # 65,648
# remove all fulltext papers with no commercial license
commercial_licenses = {'CC0', 'CC BY', 'CC BY-SA', 'CC BY-ND'}
def remove_noncom_paper(example):
    # remove the fulltext if no commercial license, keep all the other data of the example
    if example['article']['fulltext_license'] not in commercial_licenses:
        example['article']['fulltext'] = None
    return example
dataset_commercial = dataset.map(remove_noncom_paper)
print(len(dataset_commercial)) # 65,648 (no examples were dropped, only some fulltext fields were removed)
If you want to train a report-extraction model on this commercial dataset, repeat the steps outlined in data_creation/icsr_extraction/icsr_extraction.ipynb with this new dataset_commercial to create a new report-extraction dataset.
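
Before rebuilding the dataset, it can help to check how many full texts each license covers. The sketch below runs on a few toy records rather than the real dataset. It assumes abstract-only examples carry `None` in both `fulltext` and `fulltext_license`, and the `license_summary` helper is hypothetical.

```python
from collections import Counter

commercial_licenses = {"CC0", "CC BY", "CC BY-SA", "CC BY-ND"}

# Toy records standing in for dataset rows; only the fields used here are included.
examples = [
    {"article": {"fulltext": "text a", "fulltext_license": "CC BY"}},
    {"article": {"fulltext": "text b", "fulltext_license": "CC BY-NC"}},
    {"article": {"fulltext": None, "fulltext_license": None}},
    {"article": {"fulltext": "text c", "fulltext_license": "CC0"}},
]


def license_summary(rows):
    """Count full texts per license and how many of them allow commercial use."""
    per_license = Counter(
        row["article"]["fulltext_license"]
        for row in rows
        if row["article"]["fulltext"] is not None
    )
    commercial = sum(n for lic, n in per_license.items() if lic in commercial_licenses)
    return per_license, commercial


per_license, commercial = license_summary(examples)
print(dict(per_license))  # {'CC BY': 1, 'CC BY-NC': 1, 'CC0': 1}
print(commercial)         # 2
```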
@misc{doosterlinck2023biodex,
title={BioDEX: Large-Scale Biomedical Adverse Drug Event Extraction for Real-World Pharmacovigilance},
author={Karel D'Oosterlinck and François Remy and Johannes Deleu and Thomas Demeester and Chris Develder and Klim Zaporojets and Aneiss Ghodsi and Simon Ellershaw and Jack Collins and Christopher Potts},
year={2023},
eprint={2305.13395},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Article fields (adapted from pubmed-parser):
fields | description |
---|---|
title | Title of the article |
pmid | PubMed ID |
issue | The Issue of the journal |
pages | Pages of the article in the journal publication |
abstract | Abstract of the article |
fulltext | The full text associated with the article from the PubMed Central Open Access Subset, if available |
fulltext_license | The license associated with the full text paper from the PubMed Central Open Access Subset, if available |
journal | Journal of the given paper |
authors | Authors, each separated by ';' |
affiliations | The affiliations of the authors |
pubdate | Publication date. Defaults to year information only. |
doi | DOI |
medline_ta | Abbreviation of the journal name |
nlm_unique_id | NLM unique identification |
issn_linking | ISSN linkage, typically used to link with the Web of Science dataset |
country | Country extracted from journal information field |
mesh_terms | List of MeSH terms with corresponding MeSH ID, each separated by ';' e.g. 'D000161:Acoustic Stimulation; D000328:Adult; ...' |
publication_types | List of publication types, each separated by ';' e.g. 'D016428:Journal Article' |
chemical_list | List of chemical terms, each separated by ';' |
keywords | List of keywords, each separated by ';' |
reference | String of PMID each separated by ';' or list of references made to the article |
delete | Boolean; 'False' means the paper got updated, so you might have two entries for the same paper |
pmc | PubMed Central ID |
other_id | Other IDs found, each separated by ';' |
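
Many article fields pack several items into one ';'-separated string. As a small sketch based on the `mesh_terms` example in the table above, the helpers below split such a string into Python values. Both function names are hypothetical.

```python
def split_field(value, sep=";"):
    """Split a separator-delimited field into a list of stripped, non-empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(sep) if item.strip()]


def parse_mesh_terms(value):
    """Turn 'ID:Term; ID:Term' into a list of (mesh_id, term) tuples."""
    terms = []
    for item in split_field(value):
        mesh_id, _, name = item.partition(":")
        terms.append((mesh_id.strip(), name.strip()))
    return terms


mesh = parse_mesh_terms("D000161:Acoustic Stimulation; D000328:Adult")
print(mesh)  # [('D000161', 'Acoustic Stimulation'), ('D000328', 'Adult')]
print(split_field(None))  # []
```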
Report fields (adapted from OpenFDA):
fields | description | values |
---|---|---|
authoritynumb | Populated with the Regulatory Authority’s case report number, when available. | Undefined |
companynumb | Identifier for the company providing the report. This is self-assigned. | Undefined |
duplicate | This value is `1` if earlier versions of this report were submitted to FDA. openFDA only shows the most recent version. | Undefined |
fulfillexpeditecriteria | Identifies expedited reports (those that were processed within 15 days). | 1: True, 2: False |
occurcountry | The name of the country where the event occurred. | name: Country codes, link: http://data.okfn.org/data/core/country-list |
patient.drug.items.actiondrug | Actions taken with the drug. | 1: Drug withdrawn, 2: Dose reduced, 3: Dose increased, 4: Dose not changed, 5: Unknown, 6: Not applicable |
patient.drug.items.activesubstance.activesubstancename | Product active ingredient, which may be different than other drug identifiers (when provided). | Undefined |
patient.drug.items.drugadditional | Dechallenge outcome information—whether the event abated after product use stopped or the dose was reduced. Only present when this was attempted and the data was provided. | 1: Yes, 2: No, 3: Does not apply |
patient.drug.items.drugadministrationroute | The drug’s route of administration. | 001: Auricular (otic), 002: Buccal, 003: Cutaneous, 004: Dental, 005: Endocervical, 006: Endosinusial, 007: Endotracheal, 008: Epidural, 009: Extra-amniotic, 010: Hemodialysis, 011: Intra corpus cavernosum, 012: Intra-amniotic, 013: Intra-arterial, 014: Intra-articular, 015: Intra-uterine, 016: Intracardiac, 017: Intracavernous, 018: Intracerebral, 019: Intracervical, 020: Intracisternal, 021: Intracorneal, 022: Intracoronary, 023: Intradermal, 024: Intradiscal (intraspinal), 025: Intrahepatic, 026: Intralesional, 027: Intralymphatic, 028: Intramedullar (bone marrow), 029: Intrameningeal, 030: Intramuscular, 031: Intraocular, 032: Intrapericardial, 033: Intraperitoneal, 034: Intrapleural, 035: Intrasynovial, 036: Intratumor, 037: Intrathecal, 038: Intrathoracic, 039: Intratracheal, 040: Intravenous bolus, 041: Intravenous drip, 042: Intravenous (not otherwise specified), 043: Intravesical, 044: Iontophoresis, 045: Nasal, 046: Occlusive dressing technique, 047: Ophthalmic, 048: Oral, 049: Oropharingeal, 050: Other, 051: Parenteral, 052: Periarticular, 053: Perineural, 054: Rectal, 055: Respiratory (inhalation), 056: Retrobulbar, 057: Subconjunctival, 058: Subcutaneous, 059: Subdermal, 060: Sublingual, 061: Topical, 062: Transdermal, 063: Transmammary, 064: Transplacental, 065: Unknown, 066: Urethral, 067: Vaginal |
patient.drug.items.drugauthorizationnumb | Drug authorization or application number (NDA or ANDA), if provided. | Undefined |
patient.drug.items.drugbatchnumb | Drug product lot number, if provided. | Undefined |
patient.drug.items.drugcharacterization | Reported role of the drug in the adverse event report. These values are not validated by FDA. | 1: Suspect (the drug was considered by the reporter to be the cause), 2: Concomitant (the drug was reported as being taken along with the suspect drug), 3: Interacting (the drug was considered by the reporter to have interacted with the suspect drug) |
patient.drug.items.drugcumulativedosagenumb | The cumulative dose taken until the first reaction was experienced, if provided. | Undefined |
patient.drug.items.drugcumulativedosageunit | The unit for `drugcumulativedosagenumb`. | 001: kg (kilograms), 002: g (grams), 003: mg (milligrams), 004: µg (micrograms) |
patient.drug.items.drugdosageform | The drug’s dosage form. There is no standard, but values may include terms like `tablet` or `solution for injection`. | Undefined |
patient.drug.items.drugdosagetext | Additional detail about the dosage taken. Frequently unknown, but occasionally including information like a brief textual description of the schedule of administration. | Undefined |
patient.drug.items.drugenddate | Date the patient stopped taking the drug. | Undefined |
patient.drug.items.drugenddateformat | Encoding format of the field `drugenddate`. Always set to `102` (YYYYMMDD). | Undefined |
patient.drug.items.drugindication | Indication for the drug’s use. | Undefined |
patient.drug.items.drugintervaldosagedefinition | The unit for the interval in the field `drugintervaldosageunitnumb`. | 801: Year, 802: Month, 803: Week, 804: Day, 805: Hour, 806: Minute, 807: Trimester, 810: Cyclical, 811: Trimester, 812: As necessary, 813: Total |
patient.drug.items.drugintervaldosageunitnumb | Number of units in the field `drugintervaldosagedefinition`. | Undefined |
patient.drug.items.drugrecurreadministration | Whether the reaction occurred after readministration of the drug. | 1: Yes, 2: No, 3: Unknown |
patient.drug.items.drugrecurrence.drugrecuraction | Populated with the Reaction/Event information if/when `drugrecurreadministration` equals `1`. | Undefined |
patient.drug.items.drugrecurrence.drugrecuractionmeddraversion | The version of MedDRA from which the term in `drugrecuraction` is drawn. | Undefined |
patient.drug.items.drugseparatedosagenumb | The number of separate doses that were administered. | Undefined |
patient.drug.items.drugstartdate | Date the patient began taking the drug. | Undefined |
patient.drug.items.drugstartdateformat | Encoding format of the field `drugstartdate`. Always set to `102` (YYYYMMDD). | Undefined |
patient.drug.items.drugstructuredosagenumb | The number portion of a dosage; when combined with `drugstructuredosageunit` the complete dosage information is represented. For example, *300* in `300 mg`. | Undefined |
patient.drug.items.drugstructuredosageunit | The unit for the field `drugstructuredosagenumb`. For example, *mg* in `300 mg`. | 001: kg (kilograms), 002: g (grams), 003: mg (milligrams), 004: µg (micrograms) |
patient.drug.items.drugtreatmentduration | The interval of the field `drugtreatmentdurationunit` for which the patient was taking the drug. | Undefined |
patient.drug.items.drugtreatmentdurationunit | None | 801: Year, 802: Month, 803: Week, 804: Day, 805: Hour, 806: Minute |
patient.drug.items.medicinalproduct | Drug name. This may be the valid trade name of the product (such as `ADVIL` or `ALEVE`) or the generic name (such as `IBUPROFEN`). This field is not systematically normalized. It may contain misspellings or idiosyncratic descriptions of drugs, for example combination products such as those used for birth control. | Undefined |
patient.patientagegroup | Populated with Patient Age Group code. | 1: Neonate, 2: Infant, 3: Child, 4: Adolescent, 5: Adult, 6: Elderly |
patient.patientdeath.patientdeathdate | If the patient died, the date that the patient died. | Undefined |
patient.patientdeath.patientdeathdateformat | Encoding format of the field `patientdeathdate`. Always set to `102` (YYYYMMDD). | Undefined |
patient.patientonsetage | Age of the patient when the event first occurred. | Undefined |
patient.patientonsetageunit | The unit for the interval in the field `patientonsetage`. | 800: Decade, 801: Year, 802: Month, 803: Week, 804: Day, 805: Hour |
patient.patientsex | The sex of the patient. | 0: Unknown, 1: Male, 2: Female |
patient.patientweight | The patient weight, in kg (kilograms). | Undefined |
patient.reaction.items.reactionmeddrapt | Patient reaction, as a MedDRA term. Note that these terms are encoded in British English. For instance, diarrhea is spelled `diarrhoea`. MedDRA is a standardized medical terminology. | name: MedDRA, link: http://www.fda.gov/ForIndustry/DataStandards/StructuredProductLabeling/ucm162038.htm |
patient.reaction.items.reactionmeddraversionpt | The version of MedDRA from which the term in `reactionmeddrapt` is drawn. | Undefined |
patient.reaction.items.reactionoutcome | Outcome of the reaction in `reactionmeddrapt` at the time of last observation. | 1: Recovered/resolved, 2: Recovering/resolving, 3: Not recovered/not resolved, 4: Recovered/resolved with sequelae (consequent health issues), 5: Fatal, 6: Unknown |
patient.summary.narrativeincludeclinical | Populated with Case Event Date, when available; does `NOT` include Case Narrative. | Undefined |
primarysource.literaturereference | Populated with the Literature Reference information, when available. | Undefined |
primarysource.qualification | Category of individual who submitted the report. | 1: Physician, 2: Pharmacist, 3: Other health professional, 4: Lawyer, 5: Consumer or non-health professional |
primarysource.reportercountry | Country from which the report was submitted. | Undefined |
primarysourcecountry | Country of the reporter of the event. | name: Country codes, link: http://data.okfn.org/data/core/country-list |
receiptdate | Date that the _most recent_ information in the report was received by FDA. | Undefined |
receiptdateformat | Encoding format of the `receiptdate` field. Always set to 102 (YYYYMMDD). | Undefined |
receivedate | Date that the report was _first_ received by FDA. If this report has multiple versions, this will be the date the first version was received by FDA. | Undefined |
receivedateformat | Encoding format of the `receivedate` field. Always set to 102 (YYYYMMDD). | Undefined |
receiver.receiverorganization | Name of the organization receiving the report. Because FDA received the report, the value is always `FDA`. | Undefined |
receiver.receivertype | The type of organization receiving the report. The value, `6`, is only specified if it is `other`, otherwise it is left blank. | 6: Other |
reportduplicate.duplicatenumb | The case identifier for the duplicate. | Undefined |
reportduplicate.duplicatesource | The name of the organization providing the duplicate. | Undefined |
reporttype | Code indicating the circumstances under which the report was generated. | 1: Spontaneous, 2: Report from study, 3: Other, 4: Not available to sender (unknown) |
safetyreportid | The 8-digit Safety Report ID number, also known as the case report number or case ID. The first 7 digits (before the hyphen) identify an individual report and the last digit (after the hyphen) is a checksum. This field can be used to identify or find a specific adverse event report. | Undefined |
safetyreportversion | The version number of the `safetyreportid`. Multiple versions of the same report may exist; it is generally best to only count the latest report and disregard others. openFDA will only return the latest version of a report. | Undefined |
sender.senderorganization | Name of the organization sending the report. Because FDA is providing these reports to you, the value is always `FDA-Public Use`. | Undefined |
sender.sendertype | The type of organization sending the report. Because FDA is providing these reports to you, the value is always `2`. | 2: Regulatory authority |
serious | Seriousness of the adverse event. | 1: The adverse event resulted in death, a life threatening condition, hospitalization, disability, congenital anomaly, or other serious condition, 2: The adverse event did not result in any of the above |
seriousnesscongenitalanomali | This value is `1` if the adverse event resulted in a congenital anomaly, and absent otherwise. | Undefined |
seriousnessdeath | This value is `1` if the adverse event resulted in death, and absent otherwise. | Undefined |
seriousnessdisabling | This value is `1` if the adverse event resulted in disability, and absent otherwise. | Undefined |
seriousnesshospitalization | This value is `1` if the adverse event resulted in a hospitalization, and absent otherwise. | Undefined |
seriousnesslifethreatening | This value is `1` if the adverse event resulted in a life threatening condition, and absent otherwise. | Undefined |
seriousnessother | This value is `1` if the adverse event resulted in some other serious condition, and absent otherwise. | Undefined |
transmissiondate | Date that the record was created. This may be earlier than the date the record was received by the FDA. | Undefined |
transmissiondateformat | Encoding format of the `transmissiondate` field. Always set to 102 (YYYYMMDD). | Undefined |
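
The date fields above are stored as strings with an accompanying format code. As a minimal sketch, assuming format `102` (YYYYMMDD) as the table describes, they can be converted to Python dates. The `parse_fda_date` helper and the sample values are hypothetical.

```python
from datetime import datetime


def parse_fda_date(value, date_format="102"):
    """Parse a report date; only format 102 (YYYYMMDD) is handled in this sketch."""
    if not value or date_format != "102":
        return None
    return datetime.strptime(value, "%Y%m%d").date()


# Made-up values for illustration, standing in for receivedate and receiptdate.
received = parse_fda_date("20210315")
latest = parse_fda_date("20210402")

print(received)                  # 2021-03-15
print((latest - received).days)  # 18
```

Returning `None` for missing values or other format codes lets callers skip those reports explicitly.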