HistoKernel: Whole Slide Image Level Maximum Mean Discrepancy Kernels for Pan-Cancer Predictive Modelling
This repository contains the code for the following manuscript:
HistoKernel: Whole Slide Image Level Maximum Mean Discrepancy Kernels for Pan-Cancer Predictive Modelling, submitted to Medical Image Analysis for review.
Computational Pathology (CPath) uses multi-gigapixel Whole Slide Images (WSIs) for various clinical tasks. However, due to the size of these images, current methods are forced to make patch-level predictions, which are then aggregated into WSI-level predictions. This work proposes a novel solution to the aggregation problem. By using Maximum Mean Discrepancy (MMD) to measure similarity between WSIs, we generate a WSI-level similarity kernel that kernel-based approaches can leverage. We comprehensively evaluate this approach on WSI retrieval (n = 9,362), drug sensitivity regression (n = 551), point mutation classification (n = 3,419), survival analysis (n = 2,291) and multi-modal learning (n = 956), outperforming existing methods. We also propose a novel perturbation-based method to provide patch-level explainability of our model. This work opens up avenues for further exploration of WSI-level predictive modelling with kernel-based methods.
An interactive demo of patch-level predictions for survival analysis in KIRC is available at: https://tiademos.dcs.warwick.ac.uk/bokeh_app?demo=HistoKernel
scipy
numpy
matplotlib
geomloss
pandas
torch
sksurv
lifelines
sklearn
seaborn
tqdm
Download the FFPE whole slide images from the GDC portal (https://portal.gdc.cancer.gov/).
Download the corresponding gene point mutation and Disease Specific Survival data from cBioPortal (https://www.cbioportal.org/).
Download drug sensitivity scores for breast cancer patients (https://github.com/engrodawood/HiDS).
Download patient topic data (https://github.com/engrodawood/HiGGsXplore).
For each WSI perform:
- Tile extraction: extract 1024x1024 tiles from the large WSI at a spatial resolution of 0.50 microns-per-pixel
- Tile filtering: tiles capturing less than 40% informative tissue (mean pixel intensity above 200) are discarded
- Feature extraction: extract a feature vector for each tile using RetCCL
Details can be found in the paper.
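The tissue filter described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes that a pixel counts as background when its mean RGB intensity is above 200 and that a tile is kept when at least 40% of its pixels are tissue. The function names `tissue_fraction` and `keep_tile` are hypothetical.

```python
import numpy as np


def tissue_fraction(tile_rgb, background_threshold=200):
    """Fraction of pixels whose mean RGB intensity is at or below the threshold.

    Assumption: bright pixels (mean intensity above the threshold) are background.
    """
    tile = np.asarray(tile_rgb, dtype=np.float64)
    per_pixel_mean = tile.mean(axis=-1)  # average over the colour channels
    return float((per_pixel_mean <= background_threshold).mean())


def keep_tile(tile_rgb, min_tissue=0.40, background_threshold=200):
    """Keep a tile only if enough of it is informative tissue (assumed rule)."""
    return tissue_fraction(tile_rgb, background_threshold) >= min_tissue


# Toy tiles (64x64 here for speed; the pipeline above uses 1024x1024 tiles).
white_tile = np.full((64, 64, 3), 255, dtype=np.uint8)  # pure background
half_tile = white_tile.copy()
half_tile[:32] = 120  # darker "tissue" in the top half

print(keep_tile(white_tile), keep_tile(half_tile))
```

The thresholds are exposed as parameters so that the 40% and 200 values can be checked against the paper before use.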
Use the code under MMD_distance_matrix_generator to generate the WSI-level MMD distance matrix.
Details can be found in the paper and MMD_distance_matrix_generator.
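To illustrate the idea behind a WSI-level MMD matrix, the sketch below treats each slide as a bag of tile feature vectors and computes the biased squared-MMD estimate between every pair of bags with a Gaussian base kernel. It is written in plain NumPy for clarity and is not the code in MMD_distance_matrix_generator, which may rely on geomloss (listed in the dependencies) and a different base kernel. The bandwidth `sigma`, the scale `gamma`, the exponential distance-to-kernel step and all function names are assumptions.

```python
import numpy as np


def gaussian_gram(x, y, sigma):
    """Gaussian kernel values between all rows of x and all rows of y."""
    sq_dist = (x**2).sum(1)[:, None] + (y**2).sum(1)[None, :] - 2.0 * x @ y.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two bags."""
    return (
        gaussian_gram(x, x, sigma).mean()
        + gaussian_gram(y, y, sigma).mean()
        - 2.0 * gaussian_gram(x, y, sigma).mean()
    )


def mmd_distance_matrix(bags, sigma=1.0):
    """Symmetric matrix of pairwise squared-MMD values between bags."""
    n = len(bags)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Clamp tiny negative values caused by floating-point rounding.
            dist[i, j] = dist[j, i] = max(mmd2(bags[i], bags[j], sigma), 0.0)
    return dist


def distance_to_kernel(dist, gamma=1.0):
    """Assumed transformation: similarity decays exponentially with distance."""
    return np.exp(-gamma * dist)


# Toy "slides": random features stand in for RetCCL tile embeddings.
rng = np.random.default_rng(0)
bags = [
    rng.normal(0.0, 1.0, size=(50, 8)),
    rng.normal(0.0, 1.0, size=(60, 8)),  # same distribution as the first bag
    rng.normal(3.0, 1.0, size=(40, 8)),  # shifted distribution
]
D = mmd_distance_matrix(bags, sigma=4.0)
K = distance_to_kernel(D)
print(np.round(D, 3))
```

Bags may contain different numbers of tiles, which is what lets MMD compare slides of different sizes without a separate aggregation step.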
To perform the downstream tasks mentioned in the paper (Point Mutation Prediction, Drug Sensitivity Prediction, Survival Analysis, WSI Retrieval and Multi-Modal Learning), navigate to the appropriate folder in this repository.
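As a rough illustration of how a precomputed WSI-level kernel can feed a kernel-based model, the sketch below fits kernel ridge regression in closed form. It is a generic example, not the model used in any of the task folders; the toy kernel, the targets, the `ridge` value and the function names are all assumptions.

```python
import numpy as np


def fit_kernel_ridge(k_train, y_train, ridge=1.0):
    """Solve (K + ridge * I) alpha = y for the dual coefficients alpha."""
    n = k_train.shape[0]
    return np.linalg.solve(k_train + ridge * np.eye(n), y_train)


def predict_kernel_ridge(k_test_train, dual_coef):
    """Predict from the kernel block between test rows and training columns."""
    return k_test_train @ dual_coef


# Toy data: a scalar per "slide" stands in for a real WSI-level kernel.
rng = np.random.default_rng(0)
z = rng.normal(size=30)
dist = (z[:, None] - z[None, :]) ** 2
kernel = np.exp(-dist)  # precomputed similarity kernel
targets = np.sin(z)  # stand-in for, e.g., a drug sensitivity score

train, test = np.arange(20), np.arange(20, 30)
ridge = 0.1
dual_coef = fit_kernel_ridge(kernel[np.ix_(train, train)], targets[train], ridge)
predictions = predict_kernel_ridge(kernel[np.ix_(test, train)], dual_coef)
print(np.round(predictions, 3))
```

Only the kernel blocks are needed at training and prediction time, so the tile features themselves never have to be loaded by the downstream model.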
Some intermediate data are provided in the data folder.