scRNA-seq + seqFISH as a case study for spatial transcriptomics

Overview and biological question

The first hackathon aimed to leverage the complementary strengths of sequencing and imaging-based single-cell transcriptomic profiling by using computational techniques to integrate scRNA-seq and seqFISH data in the mouse visual cortex. While single cells are considered the smallest units and building blocks of each tissue, they still require proper spatial and structural three-dimensional organization in order to assemble into a functional tissue that can exert its physiological function. In the last decade, single-cell RNA-seq (scRNA-seq) has played a key role in capturing single-cell gene expression profiles, allowing us to map different cell types and states in whole organisms. Despite this remarkable achievement, this technology is based on cellular dissociation and hence does not maintain spatial relationships between single cells. Emerging technologies can now profile the transcriptome of single cells within their original environment, offering the possibility to examine how gene expression is influenced by cell-to-cell interactions and how it is spatially organized. One such approach is sequential single-molecule fluorescence in situ hybridization (seqFISH [@doi:10.1038/nmeth.2892, @doi:10.1016/j.neuron.2016.10.001]), which can identify single molecules at (sub)cellular resolution with high sensitivity.

In contrast with scRNA-seq, seqFISH and many other spatial transcriptomic technologies often pose significant technological challenges, resulting in a small number of profiled genes per cell (10-100s). The newer generation of seqFISH technology (called seqFISH+ [@doi:10.1038/s41586-019-1049-y]) has dramatically enhanced its capacity to profile up to 10,000 genes, but this technology is more complex and costly than seqFISH.

New computational approaches are needed to integrate scRNA-seq and seqFISH data effectively. This first hackathon provided seqFISH and scRNA-seq data corresponding to the mouse visual cortex ([@doi:10.1038/nbt.4260], [@doi:10.1038/nn.4216]), and participants were challenged to accurately identify cell types. The scRNA-seq data included transcriptional profiles at a high molecular resolution, whereas the seqFISH data provided spatial characterization at a lower molecular resolution. Two key computational challenges were identified to obtain spatial information at high molecular resolution. First, we explored several strategies to identify the most likely cell types in the seqFISH dataset based on information obtained from the scRNA-seq dataset. Second, we sought to transfer spatial information obtained from the seqFISH dataset to the scRNA-seq dataset. Cell type labels derived from scRNA-seq analysis [@doi:10.1038/nn.4216] and from a previous seqFISH/scRNA-seq integration [@doi:10.1038/nbt.4260] were also provided as reference. Data were preprocessed by the organizers and consisted of 113 genes matched between the scRNA-seq dataset and the seqFISH dataset, with 1723 cells for the scRNA-seq data and 1597 cells for the seqFISH data.

{#fig:spatial width = 80%}

Caption Figure: Overview of seqFISH and scRNA-seq integration analysis. A. Assessment of cell type prediction using different data normalizations and classifiers. Normalization strategies included none (raw), counts per million (cpm), ComBat batch correction applied to cpm (cpm_combat), scRNA-seq and seqFISH scaled using the first eigenvalue (cpm_eigen), and latent variables retained for both datasets after applying Partial Least Squares regression to cpm_eigen normalized data (cpm_pls). Classification approaches included a supervised multinomial classifier with elastic net penalty (enet), a semi-supervised multinomial classifier with elastic net penalty (ssenet) and a Support Vector Machine (SVM, supervised). Each classifier was trained using the scRNA-seq data and the known (provided) cell type labels, then predicted the cell type labels in the seqFISH data; for the SVM, predictions from the original study were used (Challenge 1). Gower distance between each method-normalization pair is depicted on a multidimensional scaling plot. The first dimension (x-axis) separates methods that normalize the scRNA-seq and seqFISH data together (dashed) and separately (solid), showing that normalization had a stronger impact on cell type predictions than the classification method used. B. SVM classification models with different C parameters were trained with different numbers of genes in scRNA-seq data using Recursive Feature Elimination (RFE) to evaluate the minimal number of genes required for data integration. The results show that a smaller gene list than what the original study proposed was sufficient to identify cell types in both data types (Challenge 1). C. LIGER was applied to combine spatial and single cell transcriptomic datasets. From the separate and integrative analyses, plots of identified and known clusters were generated and metrics of integration performance were compared, showing some loss of information as a result of the integration (Challenge 1). D. Construction of a spatial network from cells' positions using Voronoi tessellation, where cell types were inferred from SVM trained on scRNA-seq data. Left: A neighbors aggregation method computes aggregation statistics on the seqFISH gene expression data for each node and its first order neighbors (Challenge 2). Right: Identification of spatially coherent areas that can contain one or several cell types and can be used to detect genes whose expression is modulated by spatial factors rather than cell type.

Computational challenges

Challenge 1: overlay of scRNA-seq onto seqFISH for resolution enhancement

The mouse visual cortex consists of multiple complex cell types. However, the seqFISH dataset was limited to 125 profiled genes, which were not prioritized based on their ability to discriminate between cell types. Assigning the correct cell identity therefore presents an important challenge. In contrast, the scRNA-seq dataset is transcriptome-wide and includes the 125 genes profiled by seqFISH. The proposed strategy was first to use all genes to identify the cell type label of each cell in the scRNA-seq data with high certainty. Next, this cell type information was leveraged to build a classifier based on a subset of the 125 genes shared between both datasets. The classifier was then applied to the seqFISH dataset to assign cell types.

During the hackathon, participants aimed to test various machine learning and data integration models (see Vignettes). Preliminary analyses highlighted that normalization strategies had a significant impact on the final results (Figure {@fig:spatial}A). In addition, although unique molecular identifier (UMI) based scRNA-seq and seqFISH can both be considered as count data, we observed dataset-specific biases that could be attributed either to platform effects (imaging vs. sequencing batch effects) or to sample-specific sources of variation. We opted to apply a quantile normalization approach that forces a similar expression distribution for each shared gene.
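The per-gene quantile normalization described above can be illustrated with a minimal sketch. It is not the participants' implementation: the function names (`quantile_map`, `quantile_normalize_genes`), the dictionary-based data layout, and the choice of scRNA-seq as the reference distribution are all assumptions made for illustration.

```python
from bisect import bisect_right


def quantile_map(values, reference):
    """Map `values` onto the empirical distribution of `reference`.

    Each value is replaced by the reference value found at the same
    empirical quantile, so the mapped values follow the reference
    distribution while keeping their original ordering.
    """
    ref_sorted = sorted(reference)
    val_sorted = sorted(values)
    n, m = len(val_sorted), len(ref_sorted)
    mapped = []
    for v in values:
        # Rank of v among the values (1..n); tied values share a rank.
        rank = bisect_right(val_sorted, v)
        # Matching position in the reference, computed as ceil(rank * m / n) - 1
        # with integer arithmetic to avoid floating-point rounding issues.
        idx = (rank * m + n - 1) // n - 1
        mapped.append(ref_sorted[idx])
    return mapped


def quantile_normalize_genes(seqfish, scrna):
    """Quantile-map each shared gene of `seqfish` onto `scrna`.

    Both inputs are dicts: gene name -> list of per-cell expression values.
    Using scRNA-seq as the reference is an assumption of this sketch.
    """
    shared = sorted(set(seqfish) & set(scrna))
    return {gene: quantile_map(seqfish[gene], scrna[gene]) for gene in shared}


# Toy example with a hypothetical gene name and made-up values.
seqfish_toy = {"geneA": [10, 30, 20]}
scrna_toy = {"geneA": [1.0, 2.0, 3.0], "geneB": [0.0, 4.0, 8.0]}
normalized = quantile_normalize_genes(seqfish_toy, scrna_toy)
```

After this mapping, each shared gene in the seqFISH data takes values drawn from the scRNA-seq distribution for that gene, and the relative ordering of cells within the seqFISH data is preserved.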

Two classification approaches were considered: generalized linear models regularized with an elastic net penalty, in supervised and semi-supervised form (enet and ssenet), and supervised support vector machines (SVM). The ssenet approach builds a model iteratively: it combines both datasets and initially retains only the highest-confidence labels, then gradually adds more cell type labels until all cells are classified (Figure {@fig:spatial}A). This type of self-training approach might be promising to generalize information to other datasets. To improve the SVM model, several combinations of kernels and hyperparameters were assessed using a combination of randomized and zoomed search. In addition, different flavors of gene selection using recursive feature elimination were considered to identify the optimal or minimal number of genes needed to correctly classify the majority of the cells (Figure {@fig:spatial}B). Different classification accuracy metrics, such as balanced classification error rates, were considered to account for the major class imbalance in the dataset, as more than 90% of cells were excitatory or inhibitory neurons. Finally, LIGER, an approach based on integrative non-negative matrix factorization (NMF), was applied to integrate both datasets in a subspace based on shared factors. This enabled the transfer of cell type labels using a nearest neighbor approach (Figure {@fig:spatial}C).
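To illustrate why a balanced error rate matters under the class imbalance noted above, the following sketch compares it with the overall error rate on toy labels. The function names and the toy cell type labels are hypothetical and chosen only for illustration; the balanced error rate is taken here as the unweighted mean of per-class error rates.

```python
from collections import defaultdict


def overall_error_rate(y_true, y_pred):
    """Fraction of cells whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def balanced_error_rate(y_true, y_pred):
    """Unweighted mean of the per-class error rates.

    Every class contributes equally, regardless of how many cells it has,
    so errors on rare cell types are not masked by abundant ones.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t != p:
            errors[t] += 1
    per_class = [errors[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)


# Toy data: 9 cells of an abundant type and 1 cell of a rare type,
# with a classifier that always predicts the abundant type.
y_true = ["neuron"] * 9 + ["astrocyte"]
y_pred = ["neuron"] * 10

overall = overall_error_rate(y_true, y_pred)    # 1 wrong cell out of 10
balanced = balanced_error_rate(y_true, y_pred)  # mean of 0/9 and 1/1
```

In this toy case the overall error rate looks low because the abundant class dominates, whereas the balanced error rate exposes that the rare class is never predicted correctly.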

Challenge 2: Identifying spatial expression patterns at the tissue level through the integration of gene expression and spatial cellular coordinates

While most tools originally developed for scRNA-seq data can be adapted for spatial transcriptomic datasets (see common challenges section), methods to extract sources of variation from spatial factors are still lacking. Novel methods that can integrate the information obtained from gene expression with that of the spatial coordinates from each cell or transcript (for sub-cellular resolution) within a tissue of interest are needed.

To identify spatial expression patterns in the seqFISH dataset, the participants first formed a spatial network based on Voronoi tessellation ([@doi:10.1101/701680]). The gene expression of each cell was spatially smoothed by calculating the average gene expression of all neighboring cells. UMAP was applied to the smoothed and aggregated data matrix to identify cell clusters with a density-based clustering approach (Figure {@fig:spatial}D). Interestingly, these results showed that the obtained clusters themselves are spatially separated and do not necessarily overlap with specific cell types, suggesting that the spatial dimension cannot be captured from the expression data only.
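The neighbor-averaging step described above can be sketched as follows. This is a simplified illustration, not the participants' code: the spatial network is assumed to be already available as a dictionary of neighbor lists (for example, derived from a Voronoi tessellation), the function name `smooth_expression` is hypothetical, and including each cell in its own average (the node and its first order neighbors) is an assumption of this sketch.

```python
def smooth_expression(expression, neighbors):
    """Average each cell's expression with that of its first-order neighbors.

    expression: dict mapping cell id -> list of gene expression values
                (same gene order for every cell).
    neighbors:  dict mapping cell id -> list of neighboring cell ids in the
                spatial network (assumed precomputed).
    Returns a dict mapping cell id -> list of smoothed expression values.
    """
    smoothed = {}
    for cell, profile in expression.items():
        # The cell itself plus its neighbors that have expression data.
        group = [cell] + [n for n in neighbors.get(cell, []) if n in expression]
        smoothed[cell] = [
            sum(expression[c][g] for c in group) / len(group)
            for g in range(len(profile))
        ]
    return smoothed


# Toy network of three cells on a line (a - b - c), two genes per cell.
expression_toy = {"a": [1.0, 0.0], "b": [3.0, 0.0], "c": [5.0, 6.0]}
neighbors_toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
smoothed_toy = smooth_expression(expression_toy, neighbors_toy)
```

The smoothed matrix blends each cell's profile with its local neighborhood, which is why clusters found on it can reflect spatial areas rather than individual cell types.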

An unanswered question is whether the identified combinatorial spatial patterns can be extracted directly from scRNA-seq data, as previous studies have shown cellular mapping between gene expression profiles and known spatial locations [@doi:10.1038/nbt.3192; @doi:10.1016/j.cell.2019.05.006]. However, this still constitutes both a technological and analytical challenge that will require careful benchmarking in the near future (see benchmarking section).