Generating the database is done in five steps:

1. download and pre-process the structured product labels (`spl_processor.py`)
2. identify adverse reaction terms and construct feature sentence fragments (`construct_application_data.py`)
3. apply the model to score the feature sentence fragments (`predict.py`)
4. compile the results into csv datafiles for each label section (`create_onsides_datafiles.py`)
5. integrate the results with standard vocabularies and build the csv files (`build_onsides.py`)
The steps above are detailed below. However, the process is assisted by the Deployment Tracker (`deployment_tracker.py`). The Deployment Tracker walks through the process by checking for the necessary files; if any are missing, it prints the command needed to generate them to standard output. Some steps (SPL processing) need to be performed before running the tracker. In this case the tracker confirms a recent run and prompts the user to re-run if the data are potentially out of date. The final step is to generate the database files using `build_onsides.py`; the Deployment Tracker will provide instructions once the rest of the steps are complete.
To run the Deployment Tracker:

```bash
python3 src/deployment_tracker.py --release v2.0.0-AR
```

For convenience, the remaining commands can be piped to bash as follows:

```bash
python3 src/deployment_tracker.py --release v2.0.0-AR | bash
```
On a GPU-enabled machine, the `--gpu` flag can be used to specify which GPU to use for the steps that require it.

```bash
python3 src/deployment_tracker.py --release v2.0.0-AR --gpu 2
```
Run the Deployment Tracker for each section and release version. A full re-deployment of `v2.0.0` would require the following set of commands:

```bash
# check for any updates to the SPLs
python3 src/spl_processor.py --update

# deploy for the Adverse Reactions section
python3 src/deployment_tracker.py --release v2.0.0-AR | bash

# deploy for the Boxed Warnings section
python3 src/deployment_tracker.py --release v2.0.0-BW | bash

# generate database files
python3 src/build_onsides.py --vocab ./data/omop/vocab_5.4 --release v2.0.0
```
A trained model is required in the `./models/` directory to score and evaluate the extracted ADEs. The `deployment_tracker.py` script will automatically detect the model and will train a new one if no model exists in this directory; however, if you would like to use the original model trained for the public OnSIDES data, it is available to download here.
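If using the pre-trained model, the following is a minimal sketch of placing it where the later prediction commands expect it; the source path is a placeholder for wherever the download landed, and the filename matches the model referenced in the commands below:

```bash
# place the downloaded (or newly trained) model where the prediction step will look for it
mkdir -p models
mv /path/to/final-bydrug-PMB_14-AR-125-all_222_24_25_1e-05_256_32.pth models/
```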
The structured product labels (SPLs) are made available for download by DailyMed at https://dailymed.nlm.nih.gov/dailymed/spl-resources-all-drug-labels.cfm and are updated on daily, weekly, and monthly schedules. We implemented a script to manage the download and pre-processing of these files, assuming a one-time bulk download of a full release of prescription labels followed by periodic (assumed monthly) updates. The prescription drug label files and the parsed text files are stored in the data subdirectory at `./data/spl/rx/`.
To initiate a full release download, run `spl_processor.py` as follows:

```bash
python3 src/spl_processor.py --full
```
The latest full release of the prescription drug labels will be downloaded, checksum verified, and then pre-processed into json files.
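The checksum verification is handled by `spl_processor.py` itself, but a suspect download can also be checked by hand against the MD5 values published on the DailyMed download page. A minimal sketch (the archive name below is illustrative; use the file that was actually downloaded):

```bash
# compute the MD5 of a downloaded archive and compare it to the value listed by DailyMed
md5sum dm_spl_release_human_rx_part5.zip
```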
To make a periodic update, run `spl_processor.py` as follows:

```bash
python3 src/spl_processor.py --update
```
The processor will check the dates of the downloaded full release as well as any updates and look for additional available updates. If any are available, it will download the files, checksum verify them, and then pre-process them into json files.
To identify the ADR terms and construct the feature sentences, use the `construct_application_data.py` script. The feature method (`--method`), number of words (`--nwords`), label section (`--section`), label medicine type (`--medtype`), and the directory of the parsed label json files (`--dir`) are all required parameters. For example:

```bash
python3 src/construct_application_data.py --method 14 --nwords 125 --section AR --medtype rx --dir data/spl/rx/dm_spl_release_human_rx_part5
```
This will need to be run for each subdirectory of labels and for each section (Adverse Reactions, Boxed Warnings, etc.). The script will create a sentences file at the directory path provided. For example, the above command creates a file named:

```
data/spl/rx/dm_spl_release_human_rx_part5/sentences-rx_method14_nwords125_clinical_bert_application_set_AR.txt.gz
```
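As a quick sanity check, the generated file is a gzipped text file with a header row, and its first few lines can be inspected without decompressing it to disk:

```bash
# peek at the header and the first couple of feature fragments
zcat data/spl/rx/dm_spl_release_human_rx_part5/sentences-rx_method14_nwords125_clinical_bert_application_set_AR.txt.gz | head -n 3
```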
The trained model can then be applied to each of the feature files created in Step 2 using the `predict.py` script. The required parameters are the trained model path (`--model`) and the path to the feature file generated in Step 2 (`--examples`). For example:

```bash
python3 src/predict.py --model ./models/final-bydrug-PMB_14-AR-125-all_222_24_25_1e-05_256_32.pth --examples data/spl/rx/dm_spl_release_human_rx_part5/sentences-rx_method14_nwords125_clinical_bert_application_set_AR.txt.gz
```
The resulting gzipped csv file of model outputs will be saved in the same directory as the examples. For example, the above creates a file named:

```
data/spl/rx/dm_spl_release_human_rx_part5/final-bydrug-PMB-sentences-rx_ref14-AR-125-all_222_24_25_1e-05_256_32.csv.gz
```
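A quick way to confirm a run completed is to compare the input and output sizes; assuming one scored row per input fragment (allowing for header rows), the line counts should be close:

```bash
# rough sanity check: line counts of the feature file and the scored output should roughly match
zcat data/spl/rx/dm_spl_release_human_rx_part5/sentences-rx_method14_nwords125_clinical_bert_application_set_AR.txt.gz | wc -l
zcat data/spl/rx/dm_spl_release_human_rx_part5/final-bydrug-PMB-sentences-rx_ref14-AR-125-all_222_24_25_1e-05_256_32.csv.gz | wc -l
```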
This step can run into memory and compute-time errors with very large sentence files; most parts of the full release download tend to cause issues. To avoid these errors, and to speed up the process overall, the sentence files can be split, processed individually, and then recombined. The following bash snippet shows how this can be done for full release part5 as of October 2022. It splits the file into 100 MB chunks, each with its own header. A parameterized bash script is also available (`split_and_predict.sh`).
```bash
cd data/spl/rx/dm_spl_release_human_rx_part5/
mkdir -p splits
gunzip sentences-rx_method14_nwords125_clinical_bert_application_set_AR.txt.gz
tail -n +2 sentences-rx_method14_nwords125_clinical_bert_application_set_AR.txt | split -d -C 100m - --filter='sh -c "{ head -n1 sentences-rx_method14_nwords125_clinical_bert_application_set_AR.txt; cat; } > $FILE"' splits/sentences-rx_method14_nwords125_clinical_bert_application_set_AR_split
gzip sentences-rx_method14_nwords125_clinical_bert_application_set_AR.txt
cd -
```
Then run `predict.py` on each split:

```bash
for f in data/spl/rx/dm_spl_release_human_rx_part5/splits/*
do
    echo python3 src/predict.py --model ./models/final-bydrug-PMB_14-AR-125-all_222_24_25_1e-05_256_32.pth --examples $f
done | bash
```
Finally, recombine the results (into the results filename expected by Step 4) and archive them:

```bash
cd data/spl/rx/dm_spl_release_human_rx_part5/
zcat splits/*.csv.gz | gzip > final-bydrug-PMB-sentences-rx_ref14-AR-125-all_222_24_25_1e-05_256_32.csv.gz
rm -rf splits
cd -
```
This process will have to be repeated for each part of a full release and for each available update. Note that the Deployment Tracker (`deployment_tracker.py`) will automate this process for you (see above for how to run it).
The previous step produced scores for each ADR mention in each label. However, a single ADR is often mentioned multiple times per label. We collapse these different instances down into a single score using an aggregation function (e.g. the mean) and produce files that we can then load into an SQL database. We do this using `create_onsides_datafiles.py`. There are three required parameters to this script: the path to the results file (`--results`), the path to the sentences file (`--examples`), and the deployment release used to generate the scores (`--release`). The available releases can be found in the `experiments.json` file under `deployments`.
To compile the results for `v2.0.0-AR`, for example:

```bash
python3 src/create_onsides_datafiles.py --release v2.0.0-AR --results data/spl/rx/dm_spl_release_human_rx_part5/final-bydrug-PMB-sentences-rx_ref14-AR-125-all_222_24_25_1e-05_256_32.csv.gz --examples data/spl/rx/dm_spl_release_human_rx_part5/sentences-rx_method14_nwords125_clinical_bert_application_set_AR.txt.gz
```

This will create a "compiled" file in the labels directory:

```
data/spl/rx/dm_spl_release_human_rx_part5/compiled/v2.0.0/AR.csv.gz
```
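The compiled file holds one aggregated score per ADR term per label for the section; it can be spot-checked in the same way as the other intermediate files:

```bash
# inspect the first few aggregated rows of the compiled Adverse Reactions results
zcat data/spl/rx/dm_spl_release_human_rx_part5/compiled/v2.0.0/AR.csv.gz | head -n 5
```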
To build the final version of the OnSIDES database files, we leverage the standard vocabularies in the OMOP Common Data Model (CDM). These can be downloaded using the ATHENA tool made available by OHDSI.org. In this implementation we use OMOP CDM v5.4. Download the vocabularies through ATHENA (including MedDRA, which will require a EULA) and save them into a subdirectory of `./data`.
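Before building, confirm that the ATHENA export is located where `build_onsides.py` will be pointed. An ATHENA download contains the standard vocabulary tables (e.g. `CONCEPT.csv`, `CONCEPT_RELATIONSHIP.csv`), although the exact file set depends on the vocabularies selected:

```bash
# the --vocab path passed to build_onsides.py should contain the ATHENA vocabulary tables
ls ./data/omop/vocab_5.4
```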
The files are built by running the `build_onsides.py` script with the path to the downloaded vocabularies (`--vocab`) and the version number (`--release`). All sections that have a trained model specified in the `experiments.json` file for the provided version will be collated.

```bash
python3 src/build_onsides.py --vocab ./data/omop/vocab_5.4 --release v2.0.0
```
The `build_onsides.py` script will iterate through each of the SPL subdirectories looking for the available compiled results files (the results of Step 4). If any subdirectories are missing the compiled results, the script will throw an error and halt execution.
Once completed, the results will be saved to the `releases` subdirectory, organized by version number and date in `YYYYMMDD` format:

```
./releases/v2.0.0/20221029/
```