The new repo includes a Nextflow pipeline and 6 additional methods.
This repository contains the analysis scripts and plots pertaining to the dissertation. There are scripts to run and evaluate five deconvolution methods (cell2location, MuSiC, stereoscope, RCTD, and SPOTlight). Later, I also apply cell2location and RCTD to real data.
To perform benchmarking, synthetic data first has to be created from a reference scRNA-seq dataset. The deconvolution methods are then run on the synthetic datasets and evaluated. I used seven scRNA-seq datasets to generate synthetic spatial data with the package synthvisium (not yet publicly available). The raw datasets and their download links are listed below.
Dataset | Direct download link |
---|---|
Brain cortex | Link |
Cerebellum (sc) | Link |
Cerebellum (sn) | |
Hippocampus | Link |
Kidney | Link |
PBMC | Link |
SCC (patient 5) | Link |
(Both cerebellum datasets can be downloaded from the link.)
I did not preprocess the scRNA-seq data myself, so I cannot share the scripts here, but the procedure is described in section 5.1 of the text. You can also follow this Seurat vignette for a quick preprocessing of the PBMC data.
As an alternative to synthvisium, you can generate synthetic data using scripts from SPOTlight, stereoscope, or cell2location. Some sample code for running these functions can be found at `Scripts/synthetic_data_generation`, although the cell2location functions have to be cloned from here.
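To illustrate the general idea behind synthetic spot generation, here is a minimal Python sketch that sums the counts of randomly drawn reference cells into one spot and records the resulting cell type proportions as ground truth. It is not the algorithm used by synthvisium, SPOTlight, stereoscope, or cell2location; the function name `make_synthetic_spot` and the toy reference values are assumptions made for this example only.

```python
import random
from collections import Counter


def make_synthetic_spot(cells, n_cells, rng):
    """Sum the counts of n_cells randomly drawn cells into one synthetic spot.

    cells: list of (cell_type, counts) pairs, where counts is a list of
           per-gene counts of equal length for every cell.
    Returns (spot_counts, proportions); proportions maps each sampled
    cell type to its fraction of the sampled cells.
    """
    sampled = [rng.choice(cells) for _ in range(n_cells)]  # draw with replacement
    n_genes = len(cells[0][1])
    spot_counts = [sum(counts[g] for _, counts in sampled) for g in range(n_genes)]
    type_counts = Counter(cell_type for cell_type, _ in sampled)
    proportions = {t: c / n_cells for t, c in type_counts.items()}
    return spot_counts, proportions


# Toy reference with three cells and two genes (hypothetical values).
reference = [
    ("T cell", [5, 0]),
    ("B cell", [0, 3]),
    ("T cell", [4, 1]),
]

rng = random.Random(0)  # fixed seed for reproducibility
spot_counts, proportions = make_synthetic_spot(reference, n_cells=10, rng=rng)
```

The known `proportions` are what a deconvolution method's output would later be compared against.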
The countsimQC reports comparing the different synthetic data generation algorithms can be found in the folder `countsimQC/`.
Scripts for running the deconvolution methods can be found at `Scripts/run_deconv`, along with a description for using those files. The deconvolution results are compiled in the folder `results/`.
For scripts to generate downsampled data and get the runtime of each method, check out `Scripts/run_deconv_downsample`.
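As a rough illustration of how runtime can be measured across downsampled inputs, the following Python sketch times a placeholder function on subsets of increasing size. It does not reproduce the scripts in this folder; `downsample`, `time_call`, and `dummy_method` are hypothetical names, and `dummy_method` merely stands in for a real deconvolution run.

```python
import random
import time


def downsample(items, n, seed=0):
    """Return n items drawn without replacement (reproducible via seed)."""
    return random.Random(seed).sample(items, n)


def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def dummy_method(spots):
    """Hypothetical placeholder for a deconvolution run."""
    return [sum(spot) for spot in spots]


# Toy data: 1000 spots with three genes each (hypothetical values).
spots = [[i, i + 1, i + 2] for i in range(1000)]

runtimes = {}
for n in (100, 500, 1000):
    subset = downsample(spots, n)
    _, runtimes[n] = time_call(dummy_method, subset)
```

A single call gives a noisy estimate, so repeating each measurement and reporting the median is usually more reliable.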
Evaluation scripts are found at `Scripts/` with the prefix `evaluation_`. These make use of the deconvolution results saved in `results/`.
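For reference, the RMSE between known and predicted cell type proportions can be sketched in Python as follows. This is a generic definition for one spot and may differ in detail from what the evaluation scripts compute (for example, in how values are aggregated across spots); the function name `rmse` and the example proportions are assumptions.

```python
import math


def rmse(true_props, pred_props):
    """Root-mean-square error between two equal-length lists of proportions."""
    if len(true_props) != len(pred_props):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_props, pred_props)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical proportions of three cell types in one spot.
true_props = [0.5, 0.3, 0.2]
pred_props = [0.4, 0.4, 0.2]
value = rmse(true_props, pred_props)  # approximately 0.0816
```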
The `summary files/` folder contains a few spreadsheets, e.g., the p-values of the pairwise Wilcoxon tests and the median RMSE values. Aside from that, the folder structure of `results/` is dataset → replicate → method output. Within each replicate folder (`rep` prefix), you will find:
- `all_metrics*.rds`: a file with the computed metrics (RMSE and six classification metrics) for all 8 dataset types
- `plots/`: UMAP plots shaded with inferred proportions of each cell type (not in dissertation)
- `corr_distribution/`: density plots of the correlation across all spots (not in dissertation)
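To locate the metric files in this layout programmatically, a short Python sketch like the one below can walk the dataset → replicate structure. It only lists file names and does not read the `.rds` contents, which would require R or a dedicated reader. The function name `find_metric_files` and the demo names `kidney`, `rep1`, and `all_metrics_kidney.rds` are hypothetical.

```python
import tempfile
from pathlib import Path


def find_metric_files(results_dir):
    """Map (dataset, replicate) to the all_metrics*.rds file names found there.

    Assumes the layout results_dir/<dataset>/rep*/all_metrics*.rds.
    """
    found = {}
    for path in sorted(Path(results_dir).glob("*/rep*/all_metrics*.rds")):
        replicate = path.parent.name
        dataset = path.parent.parent.name
        found.setdefault((dataset, replicate), []).append(path.name)
    return found


# Demo on a throwaway directory that mimics the layout (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    rep_dir = Path(tmp) / "kidney" / "rep1"
    rep_dir.mkdir(parents=True)
    (rep_dir / "all_metrics_kidney.rds").touch()
    demo = find_metric_files(tmp)
```

In the repository itself, the call would be `find_metric_files("results")`, run from the repository root.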
Along with high-resolution versions of the plots found in the thesis, you can also find the scripts used to generate them. The plots are in the directory `plots/`, where I try to link each plot to its corresponding script.
You can find the scripts for preprocessing, running deconvolution tools, and evaluating the liver dataset in the folder `Scripts/liver/`.