202_Das_SingleCellRNASeq.Rmd

# 202: Analysis of single-cell RNA-seq data: Dimensionality reduction, clustering, and lineage inference

Authors:
    Diya Das^[University of California at Berkeley, Berkeley, CA, USA],
    Kelly Street^[University of California at Berkeley, Berkeley, CA, USA],
    Davide Risso^[Weill Cornell Medicine, New York, NY, USA]
Last modified: 28 June, 2018.

## Overview

### Description

This workshop will be presented as a lab session (brief introduction followed by hands-on coding)
that instructs participants in a Bioconductor workflow for the analysis of single-cell RNA-sequencing data, in three parts:

1. dimensionality reduction that accounts for zero inflation, over-dispersion, and batch effects
2. cell clustering that employs a resampling-based approach resulting in robust and stable clusters
3. lineage trajectory analysis that uncovers continuous, branching developmental processes

We will provide worked examples for lab sessions, and a set of stand-alone notes in this repository.

Note: A previous version of this workshop was well-attended at BioC 2017,
but the tools presented have been significantly updated for
interoperability (most notably, through the use of the `SingleCellExperiment` class), and we have been receiving many requests to provide an
updated workflow. We plan to incorporate feedback from this workshop into a revised version of our F1000 Workflow.

### Pre-requisites

We expect basic knowledge of R syntax. Some familiarity with S4 objects may be helpful, but not required.
More importantly, participants should be familiar with the concept and design of RNA-sequencing experiments. Direct experience with single-cell RNA-seq is not required, and the main challenges of single-cell RNA-seq compared to bulk RNA-seq will be illustrated.

### Participation

This will be a hands-on workshop, in which each student, using their laptop, will analyze a provided example datasets. The workshop will be a mix of example code that the instructors will show to the students (available through this repository) and short exercises.

### _R_ / _Bioconductor_ packages used

1. _zinbwave_ : https://bioconductor.org/packages/zinbwave
2. _clusterExperiment_: https://bioconductor.org/packages/clusterExperiment
3. _slingshot_: https://bioconductor.org/packages/slingshot

### Time outline

| Activity                                   | Time |
|--------------------------------------------|------|
| Intro to single-cell RNA-seq analysis      | 15m  |
| zinbwave (dimensionality reduction)        | 30m  |
| clusterExperiment (clustering)             | 30m  |
| slingshot (lineage trajectory analysis)    | 30m  |
| Questions / extensions                     | 15m  |

### Workshop goals and objectives

Learning goals

* describe the goals of single-cell RNA-seq analysis 
* identify the main steps of a typical single-cell RNA-seq analysis
* evaluate the results of each step in terms of model fit 
* synthesize results of successive steps to interpret biological significance and develop biological models
* apply this workflow to carry out a complete analysis of other single-cell RNA-seq datasets

Learning objectives

* compute and interpret low-dimensional representations of single-cell data
* identify and remove sources of technical variation from the data
* identify sub-populations of cells (clusters) and evaluate their robustness
* infer lineage trajectories corresponding to differentiating cells
* order cells by developmental "pseudotime"
* identify genes that play an important role in cell differentiation 

## Getting started

```{r options, echo=FALSE, results="hide",message=FALSE, error=FALSE, include=FALSE, autodep=TRUE}
knitr::opts_chunk$set(cache=FALSE, error=FALSE, message=FALSE, warning=FALSE)
```

The workflow presented in this workshop consists of four main steps:

1. dimensionality reduction accounting for zero inflation and over-dispersion and adjusting for gene and cell-level covariates, using the `zinbwave` Bioconductor package;
2. robust and stable cell clustering using resampling-based sequential ensemble clustering, as implemented in the `clusterExperiment` Bioconductor package;
3. inference of cell lineages and ordering of the cells by developmental progression along lineages, using the `slingshot` R package; and
4. DE analysis along lineages.

```{r schema, echo=FALSE, out.width="90%", fig.cap="Workflow for analyzing scRNA-seq datasets. On the right, main plots generated by the workflow."}
knitr::include_graphics("202_Das_SingleCellRNASeq/schema_workflow.png")
```

Throughout the workflow, we use a single `SingleCellExperiment` object to store the scRNA-seq data along with any gene or cell-level metadata available from the experiment.

### The data

```{r stemcelldiff, echo=FALSE, out.width="60%", fig.align="center", fig.cap = "Stem cell differentiation in the mouse olfactory epithelium. This figure was reproduced with kind permission from Fletcher et al. (2017)."}
knitr::include_graphics("202_Das_SingleCellRNASeq/stemcelldiff_Fletcher2017_2e.png")
```

This workshop uses data from a scRNA-seq study of stem cell differentiation in the mouse olfactory epithelium (OE) [@Fletcher2017]. The olfactory epithelium contains mature olfactory sensory neurons (mOSN) that are continuously renewed in the epithelium via neurogenesis through the differentiation of globose basal cells (GBC), which are the actively proliferating cells in the epithelium. When a severe injury to the entire tissue happens, the olfactory epithelium can regenerate from normally quiescent stem cells called horizontal basal cells (HBC), which become activated to differentiate and reconstitute all major cell types in the epithelium.

The scRNA-seq dataset we use as a case study was generated to study the differentitation of HBC stem cells into different cell types present in the olfactory epithelium. To map the developmental trajectories of the multiple cell lineages arising from HBCs, scRNA-seq was performed on FACS-purified cells using the Fluidigm C1 microfluidics cell capture platform followed by Illumina sequencing. The expression level of each gene in a given cell was quantified by counting the total number of reads mapping to it. Cells were then assigned to different lineages using a statistical analysis pipeline analogous to that in the present workflow. Finally, results were validated experimentally using in vivo lineage tracing. Details on data generation and statistical methods are available in [@Fletcher2017; @Risso2017; @Street2017; @Risso2018].

In this workshop, we describe a sequence of steps to recover the lineages found in the original study, starting from the genes x cells matrix of raw counts publicly-available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE95601.

The following packages are needed.


```{r packages}
suppressPackageStartupMessages({
  # Bioconductor
  library(BiocParallel)
  library(SingleCellExperiment)
  library(clusterExperiment)
  library(scone)
  library(zinbwave)
  library(slingshot)
  # CRAN
  library(gam)
  library(RColorBrewer)
})
set.seed(20)
```

### Parallel computing

The `BiocParallel` package can be used to allow for parallel computing in `zinbwave`. Here, we use a single CPU to run the function, `register`ing the serial mode of `BiocParallel`. Users that have access to more than one core in their system are encouraged to use multiple cores to increase speed.

```{r parallel}
register(SerialParam())
```

## The `SingleCellExperiment` class

Counts for all genes in each cell are available as part of the GitHub R package `drisso/fletcher2017data`. Before filtering, the dataset has 849 cells and 28,361 detected genes (i.e., genes with non-zero read counts). 

```{r datain}
library(fletcher2017data)

data(fletcher)
fletcher
```

Throughout the workshop, we use the class `SingleCellExperiment` to keep track of the counts and their associated metadata within a single object. 

```{r sceschema, echo=FALSE, out.width="90%", fig.cap="Schematic view of the SingleCellExperiment class."}
knitr::include_graphics("202_Das_SingleCellRNASeq/SingleCellExperiment.png")
```

The cell-level metadata contain quality control measures, sequencing batch ID, and cluster and lineage labels from the original publication [@Fletcher2017]. Cells with a cluster label of `-2` were not assigned to any cluster in the original publication.

```{r}
colData(fletcher)
```


## Pre-processing

Using the Bioconductor package `scone`, we remove low-quality cells according to the quality control filter implemented in the function `metric_sample_filter` and based on the following criteria (Figure \@ref(fig:scone)): (1) Filter out samples with low total number of reads or low alignment percentage and (2) filter out samples with a low detection rate for housekeeping genes. See the [scone vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/scone/inst/doc/sconeTutorial.html) for details on the filtering procedure.

### Sample filtering

```{r scone, fig.cap="SCONE: Filtering of low-quality cells."}
# QC-metric-based sample-filtering
data("housekeeping")
hk = rownames(fletcher)[toupper(rownames(fletcher)) %in% housekeeping$V1]

mfilt <- metric_sample_filter(counts(fletcher), 
                              nreads = colData(fletcher)$NREADS,
                              ralign = colData(fletcher)$RALIGN,
                              pos_controls = rownames(fletcher) %in% hk,
                              zcut = 3, mixture = FALSE,
                              plot = TRUE)
```

```{r sconeFilt}
# Simplify to a single logical
mfilt <- !apply(simplify2array(mfilt[!is.na(mfilt)]), 1, any)
filtered <- fletcher[, mfilt]
dim(filtered)
```

After sample filtering, we are left with `r ncol(filtered)` good quality cells.

Finally, for computational efficiency, we retain only the 1,000 most variable genes. This seems to be a reasonnable choice for the illustrative purpose of this workflow, as we are able to recover the biological signal found in the published analysis ([@Fletcher2017]). In general, however, we recommend care in selecting a gene filtering scheme, as an appropriate choice is dataset-dependent.

We can use to functions from the `clusterExperiment` package to compute a filter statistics based on the variance (`makeFilterStats`) and to apply the filter to the data (`filterData`).


```{r}
filtered <- makeFilterStats(filtered, filterStats="var", transFun = log1p)
filtered <- filterData(filtered, percentile=1000, filterStats="var")
filtered
```

In the original work [@Fletcher2017], cells were clustered into 14 different clusters, with 151 cells not assigned to any cluster (i.e., cluster label of `-2`). 

```{r original}
publishedClusters <- colData(filtered)[, "publishedClusters"]
col_clus <- c("transparent", "#1B9E77", "antiquewhite2", "cyan", "#E7298A", 
              "#A6CEE3", "#666666", "#E6AB02", "#FFED6F", "darkorchid2", 
              "#B3DE69", "#FF7F00", "#A6761D", "#1F78B4")
names(col_clus) <- sort(unique(publishedClusters))
table(publishedClusters)
```

## Normalization and dimensionality reduction: ZINB-WaVE

In scRNA-seq analysis, dimensionality reduction is often used as a preliminary step prior to downstream analyses, such as clustering, cell lineage and pseudotime ordering, and the identification of DE genes. This allows the data to become more tractable, both from a statistical (cf. curse of dimensionality) and computational point of view. Additionally, technical noise can be reduced while preserving the often intrinsically low-dimensional signal of interest [@Peer2017; @Pierson2015; @Risso2017].

Here, we perform dimensionality reduction using the zero-inflated negative binomial-based wanted variation extraction (ZINB-WaVE) method implemented in the Bioconductor R package `zinbwave`. The method fits a ZINB model that accounts for zero inflation (dropouts), over-dispersion, and the count nature of the data. The model can include a cell-level intercept, which serves as a global-scaling normalization factor. The user can also specify both gene-level and cell-level covariates. The inclusion of observed and unobserved cell-level covariates enables normalization for complex, non-linear effects (often referred to as batch effects), while gene-level covariates may be used to adjust for sequence composition effects (e.g., gene length and GC-content effects). A schematic view of the ZINB-WaVE model is provided in Figure \@ref(fig:zinbschema). For greater detail about the ZINB-WaVE model and estimation procedure, please refer to the original manuscript [@Risso2017].

```{r zinbschema, echo=FALSE, out.width="95%", fig.cap="ZINB-WaVE: Schematic view of the ZINB-WaVE model. This figure was reproduced with kind permission from Risso et al. (2017)."}
knitr::include_graphics("202_Das_SingleCellRNASeq/zinb_schema.png")
```

As with most dimensionality reduction methods, the user needs to specify the number of dimensions for the new low-dimensional space. Here, we use `K = 50` dimensions and adjust for batch effects via the matrix `X`.

```{r zinb,eval=FALSE}
clustered <- zinbwave(filtered, K = 50, X = "~ Batch", residuals = TRUE, normalizedValues = TRUE)))
```

Note that the `fletcher2017data` package includes the object `clustered` that already contains the ZINB-WaVE factors. We can load such objects to avoid waiting for the computations.

```{r}
data(clustered)
```

### Normalization

The function `zinbwave` returns a `SingleCellExperiment` object that includes normalized expression measures, defined as deviance residuals from the fit of the ZINB-WaVE model with user-specified gene- and cell-level covariates. Such residuals can be used for visualization purposes (e.g., in heatmaps, boxplots). Note that, in this case, the low-dimensional matrix `W` is not included in the computation of residuals to avoid the removal of the biological signal of interest.

```{r norm}
assayNames(clustered)
norm <- assay(clustered, "normalizedValues")
norm[1:3,1:3]
```

### Dimensionality reduction

The `zinbwave` function's main use is to perform dimensionality reduction. The resulting low-dimensional matrix `W` is stored in the `reducedDim` slot named `zinbwave`.

```{r dm}
reducedDimNames(clustered)
W <- reducedDim(clustered, "zinbwave")
dim(W)
W[1:3, 1:3]
```


The low-rank matrix `W` can be visualized in two dimensions by performing multi-dimensional scaling (MDS) using the Euclidian distance. To verify that `W` indeed captures the biological signal of interest, we display the MDS results in a scatterplot with colors corresponding to the original published clusters (Figure \@ref(fig:mdsW)).  

```{r mdsW, fig.cap="ZINB-WaVE: MDS of the low-dimensional matrix W, where each point represents a cell and cells are color-coded by original published clustering."}
W <- reducedDim(clustered)
d <- dist(W)
fit <- cmdscale(d, eig = TRUE, k = 2)
plot(fit$points, col = col_clus[as.character(publishedClusters)], main = "",
     pch = 20, xlab = "Component 1", ylab = "Component 2")
legend(x = "topleft", legend = unique(names(col_clus)), cex = .5, fill = unique(col_clus), title = "Sample")
```

## Cell clustering: RSEC

The next step is to cluster the cells according to the low-dimensional matrix `W` computed in the previous step. We use the resampling-based sequential ensemble clustering (RSEC) framework implemented in the `RSEC` function from the Bioconductor R package `clusterExperiment`. Specifically, given a set of user-supplied base clustering algorithms and associated tuning parameters (e.g., _k_-means, with a range of values for _k_), RSEC generates a collection of candidate clusterings, with the option of resampling cells and using a sequential tight clustering procedure as in [@Tseng2005]. A consensus clustering is obtained based on the levels of co-clustering of samples across the candidate clusterings. The consensus clustering is further condensed by merging similar clusters, which is done by creating a hierarchy of clusters, working up the tree, and testing for differential expression between sister nodes, with nodes of insufficient DE collapsed. As in supervised learning, resampling greatly improves the stability of clusters and considering an ensemble of methods and tuning parameters allows us to capitalize on the different strengths of the base algorithms and avoid the subjective selection of tuning parameters. 

```{r rsec_50,eval=FALSE}
clustered <- RSEC(clustered, k0s = 4:15, alphas = c(0.1),
                  betas = 0.8, reduceMethod="zinbwave",
                  clusterFunction = "hierarchical01", minSizes=1,
                  ncores = NCORES, isCount=FALSE,
                  dendroReduce="zinbwave",
                  subsampleArgs = list(resamp.num=100,
                                       clusterFunction="kmeans",
                                       clusterArgs=list(nstart=10)),
                  verbose=TRUE,
                  consensusProportion = 0.7,
                  mergeMethod = "none", random.seed=424242,
                  consensusMinSize = 10)
```

Again, the previously loaded `clustered` object already contains the results of the `RSEC` run above, so we do not evaluate the above chunk here.

```{r}
clustered
```

Note that the results of the `RSEC` function is an object of the `ClusterExperiment` class, which extends the `SingleCellExperiment` class, by adding additional information on the clustering results.

```{r}
is(clustered, "SingleCellExperiment")
slotNames(clustered)
```

The resulting candidate clusterings can be visualized using the `plotClusters` function (Figure \@ref(fig:examinemakeConsensus)), where columns correspond to cells and rows to different clusterings. Each sample is color-coded based on its clustering for that row, where the colors have been chosen to try to match up clusters that show large overlap accross rows. The first row correspond to a consensus clustering across all candidate clusterings.

```{r examinemakeConsensus, fig.cap="RSEC: Candidate clusterings found using the function RSEC from the clusterExperiment package."}
plotClusters(clustered)
```

The `plotCoClustering` function produces a heatmap of the co-clustering matrix, which records, for each pair of cells, the proportion of times they were clustered together across the candidate clusters (Figure \@ref(fig:plotcoclust)). 

```{r plotcoclust, fig.cap="RSEC: Heatmap of co-clustering matrix."}
plotCoClustering(clustered)
```

The distribution of cells across the consensus clusters can be visualized in Figure \@ref(fig:barplotOurs) and is as follows:

```{r tableclust}
table(primaryClusterNamed(clustered))
```

```{r barplotOurs, fig.cap="RSEC: Barplot of number of cells per cluster for our workflow's RSEC clustering."}
plotBarplot(clustered, legend = FALSE)
```

The distribution of cells in our clustering overall agrees with that in the original published clustering (Figure \@ref(fig:addPublishedClusters)), the main difference being that several of the published clusters were merged here into single clusters. This discrepancy is likely caused by the fact that we started with the top 1,000 genes, which might not be enough to discriminate between closely related clusters.

```{r addPublishedClusters, fig.cap="RSEC: Barplot of number of cells per cluster, for our workflow's RSEC clustering, color-coded by original published clustering."}
clustered <- addClusterings(clustered, colData(clustered)$publishedClusters, 
                     clusterLabel = "publishedClusters")

## change default color to match with Figure 7
clusterLegend(clustered)$publishedClusters[, "color"] <- 
  col_clus[clusterLegend(clustered)$publishedClusters[, "name"]]

plotBarplot(clustered, whichClusters=c("makeConsensus", "publishedClusters"),
            xlab = "", legend = FALSE,missingColor="white")
```
 
```{r addPublishedClusters2, fig.cap="RSEC: Confusion matrix of number of cells per cluster, for our workflow's RSEC clustering and the original published clustering."}
plotClustersTable(clustered, whichClusters=c("makeConsensus","publishedClusters"))
```

Figure \@ref(fig:heatmapsClusters) displays a heatmap of the normalized expression measures for the 1,000 most variable genes, where cells are clustered according to the RSEC consensus.

```{r heatmapsClusters, fig.cap="RSEC: Heatmap of the normalized expression measures for the 1,000 most variable genes, where rows correspond to genes and columns to cells ordered by RSEC clusters."}
# Set colors for additional sample data
experimentColors <- bigPalette[1:nlevels(colData(clustered)$Experiment)]
batchColors <- bigPalette[1:nlevels(colData(clustered)$Batch)]
metaColors <- list("Experiment" = experimentColors,
                   "Batch" = batchColors)

plotHeatmap(clustered, 
            whichClusters = c("makeConsensus","publishedClusters"), clusterFeaturesData = "all",
            clusterSamplesData = "dendrogramValue", breaks = 0.99,
            colData = c("Batch", "Experiment"),
            clusterLegend = metaColors, annLegend = FALSE, main = "")
```

Finally, we can visualize the cells in a two-dimensional space using the MDS of the low-dimensional matrix `W` and coloring the cells according to their newly-found RSEC clusters (Figure \@ref(fig:mdsWce)); this is anologous to Figure \@ref(fig:mdsW) for the original published clusters.

```{r mdsWce, fig.cap="RSEC: MDS of the low-dimensional matrix W, where each point represents a cell and cells are color-coded by RSEC clustering."}
plotReducedDims(clustered,whichCluster="primary",reducedDim="zinbwave",pch=20,
	xlab = "Component1", ylab = "Component2",legendTitle="Sample",main="",
	plotUnassigned=FALSE
)
```

## Cell lineage and pseudotime inference: Slingshot

We now demonstrate how to use the Bioconductor package `slingshot` to infer branching cell lineages and order cells by developmental progression along each lineage. The method, proposed in [@Street2017], comprises two main steps: (1) The inference of the global lineage structure (i.e., the number of lineages and where they branch) using a minimum spanning tree (MST) on the clusters identified above by `RSEC` and (2) the inference of cell pseudotime variables along each lineage using a novel method of simultaneous principal curves. The approach in (1) allows the identification of any number of novel lineages, while also accommodating the use of domain-specific knowledge to supervise parts of the tree (e.g., known terminal states); the approach in (2) yields robust pseudotimes for smooth, branching lineages. 

This analysis is performed out by the `slingshot` function and the results are stored in a `SlingshotDataSet` object. The minimal input to this function is a low-dimensional representation of the cells and a set of cluster labels; these can be separate objects (ie. a matrix and a vector) or, as below, components of a `SingleCellExperiment` object. When a `SingleCellExperiment` object is provided as input, the ouput will be an updated object containing a `SlingshotDataSet` as an element of the `int_metadata` list, which can be accessed through the `SlingshotDataSet` function. For more low-level control of the lineage inference procedure, the two steps may be run separately via the functions `getLineages` and `getCurves`.

From the original published work, we know that the starting cluster should correspond to HBCs and the end clusters to MV, mOSN, and mSUS cells. Additionally, we know that GBCs should be at a junction before the differentiation between MV and mOSN cells (Figure \@ref(fig:stemcelldiff)). The correspondance between the clusters we found here and the original clusters is as follows.

```{r tabagain}
table(data.frame(original = publishedClusters, ours = primaryClusterNamed(clustered)))
```

Cluster name | Description | Color | Correspondence
-------------|-------------|-------| ----------
c1 | HBC | red | original 1, 5
c2 | mSUS | blue | original 4, 7
c3 | mOSN  | green | original 9, 12 
c4 | GBC | orange | original 2, 3, 11
c5 | Immature Neuron | purple | original 10, 14
c6 | MV | brown | original 15
c7 | mOSN | light blue | original 9

To infer lineages and pseudotimes, we will apply Slingshot to the 4-dimensional MDS of the low-dimensional matrix `W`. We found that the Slingshot results were robust to the number of dimensions _k_ for the MDS (we tried _k_ from 2 to 5). Here, we use a semi-supervised version of Slingshot, where we only provide the identity of the start cluster but not of the end clusters.

```{r}
pseudoCe <- clustered[,!primaryClusterNamed(clustered) %in% c("-1")]
X <- reducedDim(pseudoCe,type="zinbwave")
mds <- cmdscale(dist(X), eig = TRUE, k = 4)
lineages <- slingshot(mds$points, clusterLabels = primaryClusterNamed(pseudoCe), start.clus = "c1")
```

Before discussing the simultaneous principal curves, we examine the global structure of the lineages by plotting the MST on the clusters. This shows that our implementation has recovered the lineages found in the published work (Figure \@ref(fig:tree)). The `slingshot` package also includes functionality for 3-dimensional visualization as in Figure \@ref(fig:stemcelldiff), using the `plot3d` function from the package `rgl`.

```{r tree, fig.cap="Slingshot: Cells color-coded by cluster in a 4-dimensional MDS space, with connecting lines between cluster centers representing the inferred global lineage structure."}
colorCl<-convertClusterLegend(pseudoCe,whichCluster="primary",output="matrixColors")[,1]
pairs(lineages, type="lineages", col = colorCl)
```

Having found the global lineage structure, `slingshot` then constructed a set of smooth, branching curves in order to infer the corresponding pseudotime variables. Simultaneous principal curves are constructed from the individual cells along each lineage, rather than the cell clusters. During this iterative process, a cell may even be reassigned to a different lineage if it is significantly closer to the corresopnding curve. This makes `slingshot` less reliant on the original clustering and generally more stable. The final curves are shown in Figure \@ref(fig:curves).

```{r curves, fig.cap="Slingshot: Cells color-coded by cluster in a 4-dimensional MDS space, with smooth curves representing each inferred lineage."}
pairs(lineages, type="curves", col = colorCl)
```

```{r lineages}
lineages
```

As an alternative, we could have incorporated the MDS results into the `clustered` object and applied `slingshot` directly to it. Here, we need to specify that we want to use the MDS results, because `slingshot` would otherwise use the first element of the `reducedDims` list (in this case, the 10-dimensional `W` matrix from `zinbwave`).

```{r sling_sce}
reducedDim(pseudoCe, "MDS") <- mds$points
pseudoCe <- slingshot(pseudoCe, reducedDim = "MDS", start.clus = "c1")
pseudoCe
colData(pseudoCe)
```

The result of `slingshot` applied to a `ClusterExperiment` object is still of class `ClusterExperiment`. Note that we did not specify a set of cluster labels, implying that `slingshot` should use the default `primaryClusterNamed` vector.

In the workflow, we recover a reasonable ordering of the clusters using the largely unsupervised version of `slingshot`. However, in some other cases, we have noticed that we need to give more guidance to the algorithm to find the correct ordering. `getLineages` has the option for the user to provide known end cluster(s), which represents a constraint on the MST requiring those clusters to be leaf nodes. Here is the code to use `slingshot` in a supervised setting, where we know that clusters `c3`, `c6` and `c2` represent terminal cell fates.

```{r slingshotsupervised, eval=FALSE}
pseudoCeSup <- slingshot(pseudoCe, reducedDim = "MDS", start.clus = "c1",
                          end.clus = c("c3", "c6", "c2"))
```

## Differential expression analysis along lineages

After assigning the cells to lineages and ordering them within lineages, we are interested in finding genes that have non-constant expression patterns over pseudotime. 

More formally, for each lineage, we use the robust local regression method loess to model in a flexible, non-linear manner the relationship between a gene's normalized expression measures and pseudotime. We then can test the null hypothesis of no change over time for each gene using the `gam` package. We implement this approach for the neuronal lineage and display the expression measures of the top 100 genes by p-value in the heatmap of Figure \@ref(fig:heatmapsignificant). 	

```{r fitgam}
t <- colData(pseudoCe)$slingPseudotime_1
y <- transformData(pseudoCe)
gam.pval <- apply(y,1,function(z){
  d <- data.frame(z=z, t=t)
  tmp <- gam(z ~ lo(t), data=d)
  p <- summary(tmp)[4][[1]][1,5]
  p
})
```


```{r heatmapsignificant, fig.cap="DE: Heatmap of the normalized expression measures for the 100 most significantly DE genes for the neuronal lineage, where rows correspond to genes and columns to cells ordered by pseudotime."}
topgenes <- names(sort(gam.pval, decreasing = FALSE))[1:100]

pseudoCe1 <- pseudoCe[,!is.na(t)]
orderSamples(pseudoCe1)<-order(t[!is.na(t)])

plotHeatmap(pseudoCe1[topgenes,], clusterSamplesData = "orderSamplesValue", breaks = .99)
```