LymphoML: An Interpretable Computational Method Identifies Morphologic Features that Correlate with Lymphoma Subtype
- Clone the lymphoma-ml repository:
git clone https://github.com/stanfordmlgroup/lymphoma-ml.git
- Make the virtual environment:
conda env create -f environment.yml
- Activate the virtual environment:
conda activate lymphoma
- Download the latest version of CellProfiler. We recommend running CellProfiler in headless mode (from the command line) when processing a large number of images.
The code in our study is organized into the following major components, which roughly correspond to the directory structure in this repo.
- Data Processing
- StarDist
- Deep-Learning Models
- CellProfiler
- Spatial Feature Extraction
- Interpretable Models
- Statistical Analysis
We walk through the steps to reproduce the results in our study.
Note: to reproduce the deep-learning results, you only need to run the notebooks/scripts specified in sections 1-3.
The processing directory contains files used to process the raw data for ingestion in deep-learning models or CellProfiler.
- First, run process_raw_data_to_hdf5.ipynb to extract patches from each TMA SVS file and save the results in HDF5 file format.
- Next, run cores_to_tiff.py to save each TMA core as a TIFF file.
- Finally, run process_hdf5_to_data_splits.ipynb, which splits the data into train/val/test splits.
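As a rough illustration of the splitting step, the sketch below assigns cores to train/val/test splits. It is not the logic in process_hdf5_to_data_splits.ipynb: the function name, the 70/15/15 fractions, and the choice to keep all cores from one patient in the same split are assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_patient(core_to_patient, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign cores to train/val/test so that no patient spans two splits.

    core_to_patient is a hypothetical mapping {core_id: patient_id};
    the fractions are illustrative, not the values used in the study.
    """
    patients = sorted(set(core_to_patient.values()))
    random.Random(seed).shuffle(patients)  # deterministic for a fixed seed

    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))

    split_of = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            split_of[patient] = "train"
        elif i < n_train + n_val:
            split_of[patient] = "val"
        else:
            split_of[patient] = "test"

    splits = defaultdict(list)
    for core, patient in sorted(core_to_patient.items()):
        splits[split_of[patient]].append(core)
    return dict(splits)
```

Splitting on the patient rather than the core is one way to avoid leaking near-duplicate tissue between training and evaluation; adjust it if your data are organized differently.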
The Stardist directory contains files used to run the StarDist algorithm for nuclei segmentation.
- Run build_stardist_segmentations.py to run a pre-trained StarDist model checkpoint over each TMA core.
- The stardist_tutorial.ipynb notebook displays the output of StarDist on sample patches/cores.
The deep-learning component consists of training and testing procedures for two types of deep-learning models: Self-Supervised ResNet on H&E images and the TripletNet architecture trained on the CAMELYON 16 challenge. These instructions assume that you have filled in the predictions and checkpoints entries of config.json with the desired paths.
- Having split the data into HDF5 files, we can train the models using the corresponding configuration files in the yaml directory. This directory has a few example configurations for naive training (the experiments we reported), as well as some partial results with Multiple Instance Learning (MIL) that are yet to be tested in depth. Generate the relevant configuration based on the documentation provided in the examples.
- Run train_naive.py to run naive training (and train_mil.py for the partial MIL experiments). This trains the model with the specified parameters (number of GPUs, learning rate, batch size, model architecture, etc.) and writes the final checkpoint to the required path.
- Run eval_naive.py to get the final CSV with the individual predictions per patch. These are aggregated in core_level_metrics.ipynb to give the final TMA core-level and patient-level metrics.
- Statistical analysis is described in the sections further below.
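To make the patch-to-core aggregation concrete, here is a minimal sketch that averages per-patch class probabilities within each core and takes the most probable class. Mean pooling and the input layout are assumptions for illustration; the actual aggregation lives in core_level_metrics.ipynb and may differ.

```python
from collections import defaultdict


def aggregate_to_core(patch_rows):
    """Average patch-level probabilities per core and pick the top class.

    patch_rows is a hypothetical iterable of (core_id, [p_class0, p_class1, ...]).
    Returns {core_id: (predicted_class_index, mean_probabilities)}.
    """
    sums = {}
    counts = defaultdict(int)
    for core_id, probs in patch_rows:
        if core_id not in sums:
            sums[core_id] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[core_id][k] += p
        counts[core_id] += 1

    result = {}
    for core_id, total in sums.items():
        mean = [s / counts[core_id] for s in total]
        result[core_id] = (mean.index(max(mean)), mean)
    return result
```

The same function can be reused for patient-level predictions by passing patient identifiers in place of core identifiers.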
The CellProfiler directory contains files used to run the CellProfiler pipeline on each TMA core and train/evaluate models for lymphoma subtype classification.
The pipelines subdirectory contains the CellProfiler project and pipeline files. Run the CellProfiler pipeline using the following command (e.g. for TMA 1):
cellprofiler -c -r -p stardist.cppipe -o ~/processed/cellprofiler_out/stardist/tma_1 -i ~/processed/cellprofiler_in/tma_1
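To run the same pipeline over several TMAs, the command above can be generated in a loop. The sketch below only reuses the flags shown in that command; the helper names, the `root` argument, and the list of TMA numbers are assumptions for this example.

```python
import subprocess
from pathlib import Path


def cellprofiler_command(tma, pipeline="stardist.cppipe", root="~/processed"):
    """Build the headless CellProfiler command for one TMA."""
    # subprocess does not expand "~", so resolve it explicitly.
    root = Path(root).expanduser()
    return [
        "cellprofiler", "-c", "-r",
        "-p", str(pipeline),
        "-o", str(root / "cellprofiler_out" / "stardist" / f"tma_{tma}"),
        "-i", str(root / "cellprofiler_in" / f"tma_{tma}"),
    ]


def run_all(tma_ids, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    commands = [cellprofiler_command(t) for t in tma_ids]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands
```

Calling `run_all([1, 2, 3])` prints the three commands without running anything; pass `dry_run=False` once the paths look right.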
The feature_processing subdirectory contains files used to process the output CellProfiler spreadsheets.
- Run patch_identifiers.py to assign a patch_id to each cell. The flags -p and -n can be used to specify the number of pixels per patch or the number of patches extracted per core, respectively.
Run the following command to extract nine (approximately) equally-sized patches from each core.
python patch_identifiers.py -n 9
- Run feature_aggregation.py to aggregate features across all cells with the same patch_id.
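The two steps above can be pictured with the following sketch, which places each cell on a square grid over its core and then averages features per patch. It is not the code in patch_identifiers.py or feature_aggregation.py: the row-major grid, the mean as the aggregate, and the field names (`x`, `y`, `area`) are assumptions for illustration.

```python
import math
from collections import defaultdict


def assign_patch_ids(cells, core_width, core_height, n_patches=9):
    """Add a 'patch_id' to each cell from its centroid position.

    cells is a hypothetical list of dicts with 'x' and 'y' keys.
    Patches are numbered row by row on a sqrt(n) x sqrt(n) grid.
    """
    side = math.isqrt(n_patches)
    if side * side != n_patches:
        raise ValueError("this sketch only handles square patch counts")
    for cell in cells:
        # Clamp so that centroids on the far edge fall in the last patch.
        col = min(int(cell["x"] / core_width * side), side - 1)
        row = min(int(cell["y"] / core_height * side), side - 1)
        cell["patch_id"] = row * side + col
    return cells


def aggregate_features(cells, feature_names):
    """Average the named features over all cells sharing a patch_id."""
    groups = defaultdict(list)
    for cell in cells:
        groups[cell["patch_id"]].append(cell)
    return {
        pid: {name: sum(c[name] for c in members) / len(members)
              for name in feature_names}
        for pid, members in groups.items()
    }
```

Other summary statistics (median, standard deviation, and so on) fit the same grouping pattern.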
The models subdirectory contains files used to train/evaluate gradient boosting models for lymphoma subtype classification.
- Run lgb_model.ipynb to train and evaluate a gradient boosting model on the CellProfiler features.
By default, this notebook runs eight-way lymphoma subtype classification using only nuclear morphological features. This notebook also contains options for performing different modifications of this base task:
- Set ENABLE_DLBCL_CLASSIFICATION to perform DLBCL vs. non-DLBCL classification.
- Set ENABLE_LABEL_GROUPING to group lymphoma subtypes into clinically relevant categories.
- Set FEATURES to experiment with other features (e.g. nuclear intensity/texture features, cytoplasmic features, or all features).
- Run immunostains.ipynb to preprocess the IHC stain data and group lymphoma subtypes if necessary. The immunostain experiments in the paper then use the same LightGBM model as coded in lgb_model.ipynb.
The spatial subdirectory contains files used to extract Ripley's K function values and concatenate them with the rest of the H&E features.
- The "Save Centroid Location info" section in spatial_features_processing.ipynb saves centroid locations from the StarDist output to CSV files.
- Run ripleyK.r to compute the spatial relationships between centroids on each saved patch.
- Run the "Concatenating R Output to Other H&E Features" section in spatial_features_processing.ipynb to combine the spatial features with your current feature dataframe.
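For readers unfamiliar with Ripley's K function, the sketch below computes a naive estimate, K(r) = A / (n(n-1)) times the number of ordered point pairs within distance r, with no edge correction. It is written in Python for illustration only and is not a port of ripleyK.r, which may use a different estimator or apply edge corrections.

```python
import math


def ripley_k(points, radii, area):
    """Naive Ripley's K estimate without edge correction.

    points: list of (x, y) centroids within one patch
    radii:  distances r at which to evaluate K
    area:   area A of the observation window (the patch)
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Distances over all ordered pairs (i, j) with i != j.
    dists = [
        math.dist(points[i], points[j])
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    return [area * sum(d <= r for d in dists) / (n * (n - 1)) for r in radii]
```

Under complete spatial randomness K(r) is approximately pi * r^2, so values well above that suggest clustering of nuclei at scale r and values below suggest dispersion.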
The stats subdirectory contains code that we use to compute confidence intervals for all our experimental results.
- Set num_replicates to change the number of bootstrapped samples to generate.
- Set per_class to compute confidence intervals for a specific class/label.
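The idea behind these two options can be sketched with a percentile bootstrap. The example below is an assumption-laden illustration, not the code in the stats subdirectory: it uses accuracy as the metric, treats `per_class` as "restrict to samples whose true label is this class", and the `alpha` and `seed` arguments are hypothetical.

```python
import random


def bootstrap_ci(y_true, y_pred, num_replicates=1000, alpha=0.05,
                 per_class=None, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    Returns (point_estimate, lower, upper). With per_class set, only
    samples of that true class are used, which makes the metric the
    recall for that class.
    """
    pairs = list(zip(y_true, y_pred))
    if per_class is not None:
        pairs = [(t, p) for t, p in pairs if t == per_class]
    if not pairs:
        raise ValueError("no samples to resample")

    def accuracy(sample):
        return sum(t == p for t, p in sample) / len(sample)

    rng = random.Random(seed)
    # Resample with replacement and recompute the metric each time.
    stats = sorted(
        accuracy(rng.choices(pairs, k=len(pairs)))
        for _ in range(num_replicates)
    )
    lower = stats[int((alpha / 2) * num_replicates)]
    upper = stats[min(int((1 - alpha / 2) * num_replicates), num_replicates - 1)]
    return accuracy(pairs), lower, upper
```

Increasing `num_replicates` gives a smoother estimate of the interval endpoints at the cost of more computation.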