Multiplex tissue imaging assays are a collection of increasingly popular single-cell spatial proteomics and transcriptomics methods for characterizing biological tissues both compositionally and spatially. However, several technical issues limit the utility of multiplex tissue imaging, including the limited number of RNAs and proteins that can be assayed, tissue loss, and protein probe failure. In this work, we demonstrate how machine learning methods can address these limitations by imputing protein abundance at the single-cell level using multiplex tissue imaging datasets from a breast cancer cohort. We first compared the strengths and weaknesses of machine learning methods for imputing single-cell protein abundance. The methods used in this work include regularized linear regression, gradient-boosted regression trees, and deep learning autoencoders. We also incorporated cellular spatial information to improve imputation performance. Using machine learning, single-cell protein expression can be imputed with a mean absolute error ranging from 0.05 to 0.3 on a [0,1] scale. Our results demonstrate (1) the feasibility of imputing single-cell abundance levels for many proteins using machine learning to overcome the technical constraints of multiplex tissue imaging and (2) how including cellular spatial information can substantially enhance imputation results.
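The error metric mentioned above can be illustrated with a minimal sketch. It assumes expression values are min-max scaled to [0,1] before the mean absolute error (MAE) is computed; the helper names and the example values are hypothetical and are not taken from the code in src.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and imputed values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example values, already on a [0, 1] scale.
measured = [0.0, 0.5, 1.0]
imputed = [0.1, 0.5, 0.8]
mae = mean_absolute_error(measured, imputed)
print(f"MAE: {mae:.3f}")
```

On a [0,1] scale, an MAE of 0.1 means the imputed value is off by 10% of the observed range on average.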
- src: Python source code for all analyses
- data: Data used in this study
- figures: Figures and tables generated by this study
- results: Results of all experiments
- Python 3.9 or higher
- RAM: 32GB or higher
- CPU: 8 cores or higher
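The requirements above can be checked before starting a long run. The following is a minimal sketch using only the standard library; the function name `check_requirements` is hypothetical and not part of this repository. RAM is not checked, because the standard library offers no portable way to query it.

```python
import os
import sys

MIN_PYTHON = (3, 9)   # Python 3.9 or higher
MIN_CPU_CORES = 8     # 8 cores or higher


def check_requirements(python_version, cpu_cores):
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or higher is required"
        )
    # os.cpu_count() may return None when the count cannot be determined.
    if cpu_cores is None or cpu_cores < MIN_CPU_CORES:
        problems.append(f"At least {MIN_CPU_CORES} CPU cores are recommended")
    return problems


if __name__ == "__main__":
    for problem in check_requirements(sys.version_info, os.cpu_count()):
        print("WARNING:", problem)
```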
Hardware Note:
This research was performed and tested on an Intel platform and a Red Hat platform.
Installing the software libraries may require manual intervention on other hardware platforms, especially M1/M2/M3 Macs.
- Use venv or conda to create a virtual environment that includes Python 3.9, e.g.
python3 -m venv venv
or
conda create -n mti python=3.9
- Activate the virtual environment
- Install needed software libraries:
pip install -r requirements.txt
Run this script from the root dir to execute all analyses and plot results:
./run_research.sh
This script runs all experiments and scripts in order and then generates the figures.
IMPORTANT NOTE:
To replicate all results and findings, ALL scripts have to be executed in order.
We recommend executing every experiment >=30 times to achieve statistical significance. Executing every experiment >=30 times can take >2 weeks.
If you want to execute only parts of the research, use the scripts below.
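As a rough illustration of why repeated runs matter, the sketch below summarizes one score per run by its mean and sample standard deviation. It is only a sketch under assumptions: the function name `summarize_runs` and the example scores are made up, and the repository's own scripts may aggregate results differently.

```python
import statistics


def summarize_runs(scores):
    """Summarize one score per experiment run (e.g. MAE per run)."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed to estimate spread")
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
    }


# Made-up scores from four hypothetical runs of the same experiment.
example_scores = [0.10, 0.12, 0.11, 0.13]
summary = summarize_runs(example_scores)
print(summary)
```

With more runs, the estimate of the mean becomes more stable, which is the motivation for the >=30 recommendation.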
To download the data and create all spatial features, run the following scripts from the root dir:
./src/data_preparation/download_data.sh
./src/data_preparation/prepare_spatial_data.sh
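To give an idea of what a spatial feature can look like, here is a minimal sketch that computes, for each cell, the mean expression of all other cells within a given radius. This is an illustrative assumption only: the function name, the radius, and the feature definition are hypothetical, and the features actually produced by prepare_spatial_data.sh may be defined differently.

```python
import math


def neighborhood_mean(cells, radius):
    """For each cell, average the expression of other cells within `radius`.

    `cells` is a list of (x, y, expression) tuples. Cells without any
    neighbor inside the radius get 0.0. The pairwise loop is O(n^2), which
    is fine for a small example but too slow for whole-slide data.
    """
    features = []
    for i, (xi, yi, _) in enumerate(cells):
        neighbors = [
            expr
            for j, (xj, yj, expr) in enumerate(cells)
            if j != i and math.dist((xi, yi), (xj, yj)) <= radius
        ]
        features.append(sum(neighbors) / len(neighbors) if neighbors else 0.0)
    return features


# Made-up coordinates and expression values for four cells.
cells = [(0.0, 0.0, 0.2), (1.0, 0.0, 0.4), (0.0, 1.0, 0.6), (10.0, 10.0, 0.9)]
features = neighborhood_mean(cells, radius=1.5)
print(features)
```

A feature like this can be appended to each cell's marker vector so that a model sees the local tissue context in addition to the cell's own measurements.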
To run the Null Model experiments, run the following script from the root dir:
./src/null_model/run_experiments.sh
To run the Elastic Net experiments, run the following script from the root dir:
./src/en/run_experiments.sh <num_iterations>
E.g.
./src/en/run_experiments.sh 30
This will run the experiments 30 times.
To run the LGBM experiments, run the following script from the root dir:
./src/lgbm/run_experiments.sh <num_iterations>
To run the Auto Encoder experiments, run the following script from the root dir:
./src/ae/run_experiments.sh <num_iterations>
./src/cleanup/clean_score_datasets.sh
./src/data_preparation/create_ae_supplemental_files.sh
./src/classifier/run_downstream_classification.sh
To create all figures and tables, as well as the supplemental material, run the following script from the root dir:
./src/figures/create_figures.sh
Attention
To create the figures of the manuscript, ALL experiments have to be executed first. We recommend executing every experiment at least 30 times to achieve statistical significance.
In this preprint we show that it is possible to impute protein abundance levels in multiplex tissue imaging data using machine learning. We also show that including spatial information can improve imputation performance. We compare three different machine learning methods: Elastic Net, Light Gradient Boosting Machines, and Autoencoders. While all three methods perform well, Autoencoders offer the ability to impute multiple proteins at once, providing a distinct advantage over the other two methods.