Segen's Medical Dictionary proposes the following definition of patient pathway:
Patient Pathway: The route that a patient follows from the first contact with an NHS member of staff (typically his or her GP) through referral to the completion of treatment. The pathway also covers the period from entry into a hospital, or a treatment centre until discharge. It is a timeline on which every event relating to treatment can be entered, including consultations, diagnosis, treatment, medication, diet, assessment, teaching and preparing for discharge from hospital. The pathway provides an outline of the events likely to happen on the patient's journey and can be used both to inform the patient as well as to plan services as a template for common services and operations.
— "patient pathway." Segen's Medical Dictionary. 2011. Farlex, Inc. 18 Jun. 2021 https://medical-dictionary.thefreedictionary.com/patient+pathway
We classify pathway events and data along a set of predefined classes, and we represent their discretized values in a structured representation that encodes time. Our pathway extractor takes as input the CSV data generated by Synthea, and encodes the patient data into our pathway representation. More precisely, we consider the following classes from data generated by Synthea: demographics (patient details), observations (results of clinical exams and vitals), conditions (diagnoses and care plans), medications, procedures and outcomes (readmission, death, survival at point in time). Some data consist of isolated events (they occur at a specific point in time), others instead have a duration. The following figure is a representation of patient pathway as timeline: isolated events are shown as dots, and events with duration are represented as horizontal bars; the pathway includes several conditions, and it shows details of a set of observations happening at the same time (2016-08-01).
Longitudinal timeline representation of a pathway
The timeline representation of a pathway is easily understandable by humans, but it is not effective for automated analysis with machine learning or deep learning. For this reason we use the pathway extractor to produce an image-like representation of the pathway data. The process for building our image-like representation of a pathway consists of three steps:
- mapping of the data points in a 3-dimensional grid
- flattening into a bi-dimensional grid
- numerical encoding.
As a first step, we discretize all the values, and we arrange them in a 3-dimensional grid, where the first dimension gives the order of the events (time), the second dimension spans the different classes in our representation (i.e. demographics, observations, conditions, medications, procedures, and outcomes), and the third dimension represents concurrence of events (data points of a certain class that happen at the same time). The following figure shows the three-dimensional grid representation of the pathway timeline shown above.
Three dimensional grid representation of pathway
Note that along the time dimension we retain the order of events, but we do not encode their corresponding dates (although this would be possible with a relatively simple extension of the representation). We discretize values using a set of configurable rules that group values into custom bins. The rules are defined using spreadsheets that are easily interpretable by practitioners, and are parsed and executed using the Drools rule engine. We currently have 246 rules, covering demographics, medications dispenses, observations (based on age and gender of the patient, the LOINC code of the medication and its units), and outcomes.
As a second step, we flatten the three-dimensional grid into a bi-dimensional grid where the horizontal axis represents the order of events, and the vertical axis represents the different classes in our representation. Multiple concurrent data points having the same class are placed one after the other along the horizontal axis. In other words, we compute different bi-dimensional slices for every value of the time dimension, then we rotate those slices along the class dimension, and we concatenate them as shown in the following figure.
From the three-dimensional grid representation to the final bi-dimensional grid representation of pathway
Finally, we encode the values of the bi-dimensional grid in a numeric space (typically the set ℝ of real numbers, but it can also be the set ℕ of natural numbers, depending on the downstream analysis task; in some experiment we encode values in the RGB space, so that the final representation can be visualized as an image). This effectively produces a numeric representation of the patient pathway that encodes discretized values along distinct dimensions and time, and that is easily amenable to machine learning and deep learning tasks: for an example application see the following article: Nguyen-Duc, T., Natasha Mulligan, G. Mannu and J. Bettencourt-Silva. “Deep EHR Spotlight: a Framework and Mechanism to Highlight Events in Electronic Health Records for Explainable Predictions.” AMIA 2021 Virtual Informatics Summit (2021).
The Pathway Extractor is a stand-alone Java application (built with Spring Boot) to extract pathway images from patient data. It is currently designed to work with patient data generated by Synthea, but it can be extended to other patient data formats (More specifically, the current version is designed to process patient data generated by the Synthea code with commit id 5a44709c3a12eee32a3b38cb8d502b944580a276
.)
From the command line, go to the main project folder, and run ./mvnw clean package
. This will package the app into a single jar in the target/
directory.
All configuration is done using properties in the file application.properties
. The following is an example:
com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.data.path=/Users/marco/a/50k_patients_seed_3/csv/
com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.data.split.to.path=
com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.data.split.max.number.of.patients.per.chunk=
com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.included.medical.types=OBSERVATIONS, IMAGING_STUDIES, ALLERGIES, CONDITIONS, MEDICATIONS, PROCEDURES, CAREPLANS
com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.included.conditions.codes=53741008, 74400008, 49436004, 44054006, 22298006, 230690007, 410429000, 65966004, 431855005, 126906006
com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.produce.javascript.data.for.visualizations.for.patient.with.ids=0007782f-ca96-4ad1-bdf7-516ab5dc1ef5, 00ab992b-f309-444e-abd4-191cd7fd97e5, 00073f65-c57e-42c9-88e4-8a4e0e3b9457
com.ibm.research.drl.deepguidelines.pathways.extractor.output.PathwayImagesWriter=NIO
com.ibm.research.drl.deepguidelines.pathways.extractor.output.path=/Users/marco/a/pathways_images/
com.ibm.research.drl.deepguidelines.pathways.extractor.output.max.output.file.size.in.bytes=
com.ibm.research.drl.deepguidelines.pathways.extractor.output.pathway.image.max.columns=400
com.ibm.research.drl.deepguidelines.pathways.extractor.output.pathway.image.without.pathway.event.features.from.start.and.stop.pathway.events=true
com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.data.path
defines the input folder containing the data set generated by Synthea, and for which you want to extract the pathway imagescom.ibm.research.drl.deepguidelines.pathways.extractor.synthea.data.split.to.path
if the input Synthea data set is too big to be processed (for example 0.5 million patients), you can tell Pathway Extractor to split the input data set into smaller data sets, which will then be processed individually. This properties specifies a folder where the Pathway Extractor can write the smaller data sets created by splitting the input data set. Each of the smaller data sets will contain at most the number of patients specified by the value of the propertycom.ibm.research.drl.deepguidelines.pathways.extractor.synthea.data.split.max.number.of.patients.per.chunk
. If you leave these two options empty, the input data set will not be split.com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.included.medical.types
defines the list of medical types that you want to include as data. The medical types specified here must be values of the enumSyntheaMedicalTypes
com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.included.conditions.codes
specifies the condition codes that will be used to generate pathways. If the list is empty, the Pathway Extractor generates pathways for all conditions in the Synthea fileconditions.csv
. If the list is not empty, then Pathway Extractor generates pathways only for those conditions in the Synthea fileconditions.csv
whose code is contained in the list will.com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.produce.javascript.data.for.visualizations.for.patient.with.ids
defines a list of patient IDs for which you want to produce visualizations. If the list is empty, then no visualization is produced. Visualization will include, for each patient:- all pathways, as time line representation
- all
PathwayMatrix
instances, as a visualization of the 3D sparse matrix - all
PathwayImage
instances, as a visualization of the 2D sparse matrix - the
IntervalTree
instance (mainly for debug purposes)
com.ibm.research.drl.deepguidelines.pathways.extractor.output.PathwayImagesWriter
specifies the class for writing pathway images to the output file(s). Valid options for this property are are NIO or BOS: NIO uses Java NIO FileChannel, BOS uses a traditional BufferedOutputStream.com.ibm.research.drl.deepguidelines.pathways.extractor.output.path
specifies the output path where Pathway Extractor writes pathway images. If you leave this value empty, then the Pathway Extractor write output to the temporary folder of your computer (/tmp
on Linux, or$TMPDIR
on OS X).com.ibm.research.drl.deepguidelines.pathways.extractor.output.max.output.file.size.in.bytes
specifies the maximum size in bytes of the output file containing the pathway images. Note that the output file can grow big very quickly, because of the way we serialize pathway images to CSV. If the file size reaches the maximum value specified by this property, then Pathway Extractor will automatically create another file, and so on. The maximum value for this property isjava.lang.Integer.MAX_VALUE
, i.e. 2147483647.com.ibm.research.drl.deepguidelines.pathways.extractor.output.pathway.image.max.columns
specifies the maximum number of columns for a pathway image. Pathway images with fewer columns are padded by adding empty columns. Pathway images with more columns are discarded (not written to output file).com.ibm.research.drl.deepguidelines.pathways.extractor.output.pathway.image.without.pathway.event.features.from.start.and.stop.pathway.events
: whentrue
the code of the condition that originated a pathway will be removed from the corresponding pathway image (this is useful for training a neural network (or a classifier) to recognize the condition given the pathway image).
- Configure the execution by setting options in
application.properties
. - Copy your
application.properties
and executable jar file (for examplepatient_pathway_extractor-1.0.0.jar
) in the same folder - From the command line, run
java -Xmx8g -jar patient_pathway_extractor-1.0.0.jar
- it is recommended to use 8GB of memory to process a Synthea dataset with 50000 patients.
The Pathway Extractor produces a CSV file (or a set of CSV files depending on the configuration) containing the pathway images. Each line of the output file has the following format:
<pathway start date>,<pathway stop date>,<patient id>,<condition id>,<originating-condition-code>,<... pathway image serialized as a single CSV line ...>
- the first four fields are the
Pathway
ID - the fifth field
originating-condition-code
is the SNOMED-CT code of condition that originated the pathway - the pathway image serializes as a single CSV line is obtained by concatenating each row of the
PathwayImage
. Each row is possibly padded to a fixed length (see configuration propertycom.ibm.research.drl.deepguidelines.pathways.extractor.output.pathway.image.max.columns
) adding empty cells.
If the configuration property com.ibm.research.drl.deepguidelines.pathways.extractor.synthea.produce.javascript.data.for.visualizations.for.patient.with.ids
specifies a non-empty list of patient IDs, the Pathway Extractor will also produce HTML files for the corresponding visualizations, which will be saved to a temporary file. The file names of the visualiazions are listed in the LOG. The following are example of visualizations for the patient 0007782f-ca96-4ad1-bdf7-516ab5dc1ef5
from Synthea data set generated with options -p 50000 -s 3
.
- Horizontal bars represent events that have a start and stop date.
- Dots represent isolated events (they happen at a certain date).
- Hovering the mouse over the bars/dots displays data about the event (those data that we encode in the pathway images).
This is a visualization of the sparse 3D matrix that contains the data in PathwayMatrix
. Each slice along the time axis gives the values of the different dimensions (conditions, medications, observations, etc.) at that time.
This is a visualization of the sparse 2D matrix that contains the data in PathwayImage
. In this variant, each slice of the sparse 3D matrix above is trimmed to its last non empty cell, and then all slices are concatenated. The vertical axis corresponds to the different dimensions (conditions, medications, observations, etc.) shown in the PathwayMatrix
visualization. The horizontal axis is time. The image above is only a partial view of the entire PathwayImage
, which is much longer on the horizontal axis.
This is a visualization of the IntervalTree
that contains events with start/stop date (interval) for the patient. This visualization is mainly useful for debug purposes. Each node in the tree contains:
- the minimum of the interval, i.e. its start date
- the maximum of the interval, i.e. its stop date
- the maximum of the subtree rooted at this node
Hovering over the node you can see the
Set<PathwayEvent>
that belong to the node, i.e. allPathwayEvent
instances having the same start/stop date (interval represented by the node).
The folloing image illustrates a few high-level data structure used in the code, and their relationship.