M-CURES is a risk stratification model, developed in response to the pandemic, that predicts clinical deterioration in hospitalized COVID-19 patients. Our objective was to create a simple and transferable machine learning model using demographic and clinical variables from electronic health record data. Using a novel paradigm for model development and code sharing that combines data-driven and clinician-driven feature selection, M-CURES was built at a single institution and achieved strong internal and external validation results across 13 medical centers in the United States. The model was validated both for detecting patients at risk of clinical deterioration and for detecting low-risk patients who could potentially be safely discharged. Our full paper is available at: https://doi.org/10.1136/bmj-2021-068576.
To assist other institutions in the validation and use of this model, all code and documentation are available here.
If you use M-CURES in your research, please cite the following publication:
@article{MCURES,
author = {Kamran, Fahad and Tang, Shengpu and Ötleş, Erkin and McEvoy, Dustin S and Saleh, Sameh N and Gong, Jen and Li, Benjamin Y and Dutta, Sayon and Liu, Xinran and Medford, Richard J and Valley, Thomas S and West, Lauren R and Singh, Karandeep and Blumberg, Seth and Donnelly, John P and Shenoy, Erica S and Ayanian, John Z and Nallamothu, Brahmajee K and Sjoding, Michael W and Wiens, Jenna},
title = "{Early identification of patients admitted to hospital for covid-19 at risk of clinical deterioration: model development and multisite external validation study}",
journal = {The BMJ},
publisher = {BMJ Publishing Group Ltd},
year = {2022},
volume = {376},
doi = {10.1136/bmj-2021-068576},
}
- Refer to `requirements.txt` for the necessary pip packages.
- preprocessing: Run `./run.sh`.
- evaluation: Run the `Evaluation_Primary.ipynb` and `Evaluation_Secondary.ipynb` notebooks to evaluate M-CURES. To save model predictions for a set of input data, run `calculate_score.py`.
An example usage of the pipeline is provided with dummy input data in `preprocessing/sample_input` and `evaluation/sample_cohort.csv`. The easiest way to use the code is to create local copies of `preprocessing` -> `preprocessing_UM` and `evaluation` -> `evaluation_UM` and replace the input files with real data. Please refer to the sample input files (and descriptions below) for formatting requirements.
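The local copies described above could be created by hand or with a few lines of Python. The sketch below is only an illustration: the helper name `make_local_copies` and the `repo_root` argument are hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def make_local_copies(repo_root, suffix="_UM"):
    """Copy preprocessing/ and evaluation/ to suffixed sibling directories.

    Hypothetical helper for illustration; `repo_root` is assumed to be the
    directory that contains the `preprocessing` and `evaluation` folders.
    """
    repo_root = Path(repo_root)
    created = []
    for name in ("preprocessing", "evaluation"):
        src = repo_root / name
        dst = repo_root / f"{name}{suffix}"
        if dst.exists():
            # Refuse to overwrite a copy that may already hold real data.
            raise FileExistsError(f"{dst} already exists")
        shutil.copytree(src, dst)
        created.append(dst)
    return created
```

After copying, the sample files inside `preprocessing_UM` and `evaluation_UM` can be replaced with real data while the originals stay available as a formatting reference.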
- `windows_map.csv` contains all 4h windows for all `hosp_id`s.
  - The `hosp_id` column is the unique identifier for the encounter.
  - The `window_id` column is the index of 4h windows for the current encounter.
  - The `ID` column is "{hosp_id}-{window_id}".
- `windows.csv` has the same content as the `ID` column in `windows_map.csv`.
- `sample_cohort.csv` is used by `Evaluation_Primary.ipynb`: predicting the composite outcome that happens within the first 5 days. It has the same `ID`, `hosp_id`, and `window_id` columns as `windows_map.csv`, and it contains an additional column `y` specifying the outcome label. The labels `y` for each window are defined as follows:
  - If a patient encounter experiences the outcome, then windows after the outcome window are not used for prediction and should not be included. Only windows before the outcome window are included, and they have a label of 1.
  - If a patient does not have an outcome, then all of their windows have a label of 0, and we only include up to the first 30 windows (first 5 days).

  Every encounter should have no more than 30 windows.
- `sample_cohort_outcome_ever_past_2days.csv` is used by `Evaluation_Secondary.ipynb`: predicting the composite outcome that happens after 48h using the first 48h of data. It has the same format as `sample_cohort.csv`, except it only contains encounters who have the outcome after two days, and the `y` label specifies if the outcome occurs ever (rather than within the first 5 days). Every encounter should have exactly 12 windows (48h worth of data).
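The labeling rules for the primary cohort can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function name `primary_cohort_rows` is hypothetical, `window_id` is assumed to start at 0 (check the sample files for the actual convention), and the outcome window index is assumed to be known for each encounter. An outcome that falls outside the first 30 windows is not covered by the description above, so the sketch raises an error rather than guessing.

```python
def primary_cohort_rows(hosp_id, n_windows, outcome_window=None, max_windows=30):
    """Build cohort rows (ID, hosp_id, window_id, y) for one encounter.

    Assumptions (not from the repository): window_id is 0-based and
    `outcome_window` is the index of the window in which the outcome occurs,
    or None if the encounter never experiences the outcome.
    """
    if outcome_window is None:
        # No outcome: label 0, keep at most the first 30 windows (5 days).
        keep, label = min(n_windows, max_windows), 0
    else:
        if outcome_window >= max_windows:
            raise ValueError("outcome outside the first 5 days is not handled here")
        # Outcome: keep only windows before the outcome window, label 1.
        keep, label = min(outcome_window, n_windows), 1
    return [
        {"ID": f"{hosp_id}-{w}", "hosp_id": hosp_id, "window_id": w, "y": label}
        for w in range(keep)
    ]
```

For example, an encounter with 40 windows and no outcome yields 30 rows labeled 0, while an encounter whose outcome occurs in window 3 yields three rows (windows 0 to 2) labeled 1.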
For details on the expected values of each variable, please refer to `preprocessing/metadata/out_*/{discretization|feature_names}.json`.
`demog.csv` contains three columns:
- age_value: numeric
- sex_value: ['M', 'F']
- race_value:
  - "African American"
  - "American Indian or Alaska Native"
  - "Asian"
  - "Caucasian"
  - "Native Hawaiian and Other Pacific Islander"
  - "Other"
  - "Patient Refused"
  - "Unknown"
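Before running the pipeline on real data, it may help to check demographic rows against the values listed above. The following is a small sketch under the assumption that each row is available as a dictionary keyed by column name (for example from `csv.DictReader`); the function `check_demog_row` is hypothetical and not part of the repository.

```python
SEX_VALUES = {"M", "F"}
RACE_VALUES = {
    "African American",
    "American Indian or Alaska Native",
    "Asian",
    "Caucasian",
    "Native Hawaiian and Other Pacific Islander",
    "Other",
    "Patient Refused",
    "Unknown",
}


def check_demog_row(row):
    """Return a list of problems found in one demog.csv row (hypothetical helper)."""
    problems = []
    try:
        float(row["age_value"])  # age_value must parse as a number
    except (KeyError, TypeError, ValueError):
        problems.append("age_value must be numeric")
    if row.get("sex_value") not in SEX_VALUES:
        problems.append("sex_value must be 'M' or 'F'")
    if row.get("race_value") not in RACE_VALUES:
        problems.append("race_value is not one of the listed categories")
    return problems
```

An empty list means the row matches the documented format; how missing values should be encoded is not specified here, so this sketch simply reports them.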
The other input data files all have four columns: ['ID', 't', 'variable_name', 'variable_value'].
- The `ID` column specifies a 4h window of a specific encounter and should be contained in the `windows_map.csv` file.
- The `t` column is measured in minutes relative to the start of the current 4h window.
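To illustrate the `ID` and `t` columns, the sketch below converts an observation timestamp, given in minutes since the start of the encounter, into a row in the four-column format. It assumes that windows are contiguous 4h blocks starting at the beginning of the encounter and that `window_id` is 0-based; both are assumptions for illustration, and the helper name `to_window_row` is hypothetical.

```python
WINDOW_MINUTES = 4 * 60  # length of one window in minutes


def to_window_row(hosp_id, minutes_since_start, variable_name, variable_value):
    """Map an observation to a row with columns ID, t, variable_name, variable_value.

    Assumes contiguous 4h windows with 0-based window_id (not confirmed by
    the repository); `t` is minutes relative to the start of the window.
    """
    if minutes_since_start < 0:
        raise ValueError("observation precedes the first window")
    window_id, t = divmod(minutes_since_start, WINDOW_MINUTES)
    return {
        "ID": f"{hosp_id}-{window_id}",
        "t": t,
        "variable_name": variable_name,
        "variable_value": variable_value,
    }
```

Under these assumptions, a heart rate recorded 250 minutes into encounter `123` lands in window `123-1` with `t` equal to 10.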
Below are the expected `variable_name`s in each file:
vitals.csv
- heartrate
- temperature
- sbp
- dbp
- respiratoryrate
- spo2
flow.csv (note the underscore prefix)
- '_307928' for "O2 flow rate"
- '_313030' for "Pulse Oximetry type"
  - "Intermittent"
  - "Continuous"
- '_314689' for "BP: Patient Position"
  - "Lying"
  - "Sitting"
  - "Standing"
- '_355444' for "Head of Bed Position"
  - "HOB at 15 degrees"
  - "HOB at 30 degrees"
  - "HOB at 45 degrees"
  - "HOB at 60 degrees"
  - "HOB at 90 degrees"
  - "HOB flat (medical condition)"
  - "Reverse Trendelenberg"
  - "other (see comments)"
labs.csv
- pH (Ven Blood Gas): '81723_value' and '81723_hilonormal_flag'
- pCO2 (Art Blood Gas): '84066_value' and '84066_hilonormal_flag'
meds.csv
- currently none supported
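A quick consistency check on variable names can catch formatting mistakes early. The sketch below covers only vitals.csv and flow.csv as listed above; the function `check_rows` and the dictionary layout are hypothetical, and rows are assumed to be dictionaries keyed by column name.

```python
EXPECTED_NAMES = {
    "vitals.csv": {"heartrate", "temperature", "sbp", "dbp", "respiratoryrate", "spo2"},
    "flow.csv": {"_307928", "_313030", "_314689", "_355444"},
}

# Allowed values for the categorical flowsheet variables listed above.
FLOW_CATEGORIES = {
    "_313030": {"Intermittent", "Continuous"},
    "_314689": {"Lying", "Sitting", "Standing"},
    "_355444": {
        "HOB at 15 degrees", "HOB at 30 degrees", "HOB at 45 degrees",
        "HOB at 60 degrees", "HOB at 90 degrees",
        "HOB flat (medical condition)", "Reverse Trendelenberg",
        "other (see comments)",
    },
}


def check_rows(filename, rows):
    """Return a list of problems for rows of one input file (hypothetical helper)."""
    problems = []
    allowed = EXPECTED_NAMES[filename]
    for i, row in enumerate(rows):
        name = row["variable_name"]
        if name not in allowed:
            problems.append(f"row {i}: unexpected variable_name {name!r}")
        elif name in FLOW_CATEGORIES and row["variable_value"] not in FLOW_CATEGORIES[name]:
            problems.append(f"row {i}: unexpected value for {name}")
    return problems
```

The category strings are copied exactly as listed above, including spelling, since the preprocessing presumably matches on them verbatim.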