FlowCT
: A semi-automated workflow for deconvolution of immunophenotypic data and objective reporting on large datasets
FlowCT
should be used as an analysis pipeline to retrieve all available information from FCS files. It is scalable, being able to analyze from thousands to millions of cells and from tenth to hundreds of markers (depending on computational power availability). It includes different algorithms for batch removal and for cell clustering. Additionally, by using the SingleCellExperiment
object class, it has wide compatibility with existing packages designed for single-cell RNAseq analysis.
Here we propose an ideal pipeline and we will explain, step by step the most important functions of the package by applying it to a full example. This tutorial is based on a serie of paired samples (peripheral blood [PB] and bone marrow [BM] from the same subjects) from a total of 10 patients (data available here). For a deeper explanation and additional options in each function please carefully read package documentation (?function_name
).
Before starting: All the FCS files should include compensation before being analyzed with
FlowCT
. This step should be done with commonly available flow cytometry software. Additionally, all the samples should have been stained with the same panel. It is not important that the name of the marker is included within the FCS nor the channels have the same order.
Our pipeline accepts both unprocessed and pre-processed (i.e., manual doublet removal or population pre-selection) FCS files.
Choose the working directory which include the FCSs to be analyzed: setwd("your/path/with/fcs")
Verify that all FCSs has the same marker names. As already stated, this is an important step. We included a function to perform a quick check and fix (providing the same markers order) of eventual inconsistencies...
library(FlowCT)
unify.FCSheaders(directory = ".", pattern = "fcs", fix = F)
# names freq length
# 1 FSC-A:NA, ...truncated..., V450-A:CD27, V500-A:CD45RA 27 13
# 2 FSC-A:NA, ...truncated..., Pacific Blue-A:CD27, AmCyan-A:CD45RA 3 13
...if name inconsistencies are found, the option fix
should be changed to TRUE
and select the desired order within those showed by the function (if not provided, the most frequent one will be choosen). All the files with inconsistencies will be replaced by fixed ones (with the fixed preffix) while a copy of the original files will be maintained in the ./original_files/ folder (within the working directory). If some files have a different number of markers, they will be discarded in a new folder called ./discarded_files_because_diffs/.
The metadata matrix includes the most important information regarding the files to be analyzed. We have two possibilities to generate this matrix:
-
Outside R, by preparing a file which include all the information of the files to be analyzed (with Excel or text editors);
-
inside R, by manually creating the matrix.
The latter one is the easiest approach if there is need/possibility of quickly adding or removing files from the analysis at different time-points. To do that, we chose to use a specific code for the name of each FCS, by including all the relevant information separated by a specific character. We use the code condition_patientID_other.fcs that let us to calculate a new metadata matrix very quickly every time we add (or remove) new files to (from) our folder.
(filenames <- list.files(pattern = "fcs", path = "."))
md <- data.frame(filename = filenames, #mandatory
sample_id = 1:length(filenames),
condition = sapply(filenames,function(x) strsplit(x, split = "_|\\.")[[1]][1]),
patient_id = sapply(filenames,function(x) strsplit(x, split = "_|\\.")[[1]][2]))
# note: it could be useful to transform some variables to factor
md$condition <- as.factor(md$condition)
The SCE
class is a lightweight Bioconductor container initially born for storing and manipulating single-cell genomics data (see here a more detailed description). However, we found extremely useful its use with flow cytometry data. To do that, original fluorescence data and transformed data were used in the place of RNA counts and logcounts.
We designed a function (fcs.SCE
) which include several options to optimize the generation of the SCE from a bunch of FCS files.
fcs <- fcs.SCE(directory = ".", pattern = "fcs", metadata = md, events = 10000, transf.cofactor = 500, project.name = "ImmTh")
-
Taking into account that this pipeline has been firstly designed for the analysis of large number of files with an high events number from clinical trials, a downsampling (reduction of total events per file to be considered in the analysis) step is usually performed (
events
option). This step is optional and strongly dependent on the computational power of the system used for downstream analyses. We here selected 10000 events for each file. -
Furthermore, we added a transformation option to deal with the bad representation of negative data values with digital cytometer: the Log transformation from the cytometer leads to compression of data against the axes, poor visual representation of low intensity or unstained populations and errors in subsequent automatic gating/clustering algorithms. Accordingly, we used
arcsinh
transformation by using a cofactor of 500 (which should be optimized according to data in thetrans.cofactor
option; e.g. FCSs from the Aurora spectral cytometer would need a cofactor of 4000). Both the transformed and the non-transformed matrix are stored in theSCE
object within the"transformed"
and"raw"
assays, respectively.
In this step we will proceed to rename markers in a format more suitable for representation and to avoid conflicting symbols in R (e.g., "FITC-A: CD62L" will be transformed in "CD62L"). Additionally, we will select groups of markers to be used in downstream analyses (please note that even if not correct, we included FSC-A and SSC-A within the group of surface markers. This has been done to simplify the use of these markers in the clustering procedure).
marker.names(fcs) #see current channel names
new_names <- c("FSC_A","FSC_H","SSC_A","SSC_H", "CD62L", "CXCR3", "CD8", "CCR4",
"CCR6","CD4", "CD27", "CD45RA")
fcs <- marker.names(fcs, new.names = new_names)
# markers selection
surface_markers <- c("FSC_A", "SSC_A","CD62L", "CXCR3", "CD8", "CCR4", "CCR6",
"CD4", "CD27", "CD45RA")
physical_markers <- c("FSC_H", "SSC_H")
Once we have a SCE
built on events-reduced/names-checked/compensated files, we can start the quality check. To do that, we designed a one-step function which will perform an automatic detection/removal of unwanted anomalies and doublets. The algorithm removes suspected anomalies deriving from changes in the flow rate, signal acquisition instability and outliers/margin events in the dynamic range (based on the flowAI
package). Next, the function eliminates the doublets by using a code that automatically design a density-based ellipsoid gate in the FSC-A/FSC-H dotplot and use the 95th percentile as limit for singlet cells (code has been written by Droumeva R.).
fcs <- qc.and.removeDoublets(fcs, physical.markers = c("FSC_A", "FSC_H", "SSC_A",
"SSC_H"), return.fcs = F)
# with return.fcs = T, high QC FCS will be saved in a separate folder
Despite the great efforts in the standardization of flow cytometry protocols, biological and technical variability can be reduced but not completely abrogated. To deal with this problem we offer four different approaches which follow two different philosophies.
- The first two,
gaussNorm
andwarp
methods (included within theflowStats
package), intent to reduce the technical variation by "shifting" and aligning the area of high local density (landmarks) in each channel. This is a per channel modification and could be applied with the functionmarker.normalization
.
fcs <- marker.normalization(fcs.SCE = fcs, marker.to.norm = c("CCR6", "CCR4"),
norm.method = "warp")
- The second approach, more holistic, relies on the possibility of "aligning" clusters of cells and modify the expression matrix accordingly to the new position. This wrapper function (
batch.removal
) includes the single-cell (RNAseq) methodsSeurat
andharmony
(please, refer to the original development sites, [1] and [2], for additional information).
fcs <- batch.removal(fcs, method = "harmony", batch = "patient_id")
While
gaussNorm
andwarp
methods can easily handle millions of events, the computational power required for bothharmony
and (especially)Seurat
usually limits the number of events that could be processed to hundreds of thousands.
From an operative point of view, the first step to perform in this process is to look at the distribution of peaks in each channel and identify the parameter(s) to be aligned.
multidensity(fcs.SCE = fcs, assay.i = "transformed", subsampling = 100)
This function present two additional options that could be useful in certain scenarios:
-
It is possible to limit the parameters to be shown through the option
show.markers = c("param1", "param2")
. -
It is possible to select if histograms should be shown as "stacked" or "overlapped" histogram. The latter simply rely on the decision of a cut-off (
ridgeline.lim = x
): if your FCSs are beyond this value they will be represented as "stacked" otherwise we will see an "overlay" histogram.
By looking at the histograms we identified many markers which are not perfectly "aligned". Anyway, by taking into account that about 50% of the whole cellularity is composed by granulocytes which could interfere with the batch removal procedure, we decided to apply this step lately to the lymphocytes population only.
At this point, it could be useful to check for possible confounding variables that may affect further downstream analysis, such as important differences in the total number of cells between conditions. It is possible to choose the variables to use to group patients (x.axis
option).
cell.count.bx(fcs.SCE = fcs, x.axis = "condition", assay.i = "transformed")
As reported in the Figure 2A, we have a significant difference in the cell number of PB and BM samples, which could be expected due to the better quality of blood samples... but it will barely affect our subsequent analysis.
Dimensionality reduction (DR) algorithms aim to maintain the structure of the data (i.e., the distance between the different subpopulations or patients in our case) while reducing the dimensions needed to observe differences (by jointing similar objects and separating different ones). PCA is one of these techniques and perform DR by projecting data to new coordinates preserving the variance.
Mimicking a transcriptomic approach, Nowicka M et al. proposed to use the median marker expression calculated over all cells to highlight similarities/dissimilarities between samples. We implemented this approach with median.dr
and median.heatmap
functions. We also added the possibility to use kmeans
to cluster patients according to their characteristics (ellipses will be eventually generated, in the DR, and identified clusters will be available for downstream analyses).
cell.count.bx(fcs.SCE = fcs, x.axis = "condition")
median.dr(fcs, color.by = "filename", label.by = "condition", assay.i = "transformed",
num.k = NULL) # number of kmeans to calculate (and drawing)
median.heatmap(fcs, assay.i = "transformed",
not.metadata = c("sample_id", "filename")) # hide some features from annotation
Of note, both representations well discriminate PB from BM samples. This could be due to the different cellular composition that could affect median fluorescence values (e.g., PB does not include erythroid precursors which are negatives for all markers). This point will be clarified by performing a quick look at "single-cell level".
We decided to perform DR (both PCA and UMAP [Uniform Manifold Approximation and Projection]) and to generate a clustered heatmap at single-cell level (by subsampling 500 cells from each patient).
fcs500 <- sub.samples(fcs.SCE = fcs, subsampling = 500)
fcs500 <- dim.reduction(fcs500, dr.method = c("UMAP", "pca"), assay.i = "transformed")
dr.plotting(fcs500, plot.dr = "PCA", color.by = "condition", size = 0.8,
assay.i = "transformed", colors = c("#c03728", "#919c4c"))
dr.plotting(fcs500, plot.dr = "UMAP", color.by = "condition", size = 0.8,
assay.i = "transformed", colors = c("#c03728", "#919c4c"))
sc.heatmap(fcs500, markers.to.use = surface_markers, assay.i = "transformed")
As reported in Figure 3, single-cell PCA (A) demonstrated an almost superimposable distribution between samples irrespective of the grouping according to tissue origin. Even better, the UMAP (B) reports an almost identical distribution of cells derived from PB or BM, with the exception of a specific cluster of cells more abundant in the BM, presenting a complete negativity for the expression of all surface markers (as reported in the heatmap, C)... as we mentioned, they are likely to be erythroblasts, physiologically absent in PB.
As already stated before, analyzing datasets of flow cytometry data with 8 or more colors (for exploratory/discovery purpose) with traditional gating strategies is absolutely not easy, time-consuming and difficult to reproduce. Additionally, our capability to work with two-dimensional plots (or rarely with three-dimensional ones) only, renders quite impossible to keep an overview of all markers behavior on selected cell types, increasing the risk of missing valuable information.
To overcome these limits, several methods of unsupervised cell clustering have been developed. Among the different available methods, we included in our pipeline 4 different clustering approaches: FlowSOM
, Phenograph
, Seurat
and the recently introduced PARC
. Each of these methods have its drawbacks that should be carefully evaluated at the time of clustering method decision. It is often a good idea to compare clustering results before choosing with which one proceed to next steps. FlowSOM
is the quickest one, while Phenograph
and Seurat
are the most precise. PARC
is the newest one and it seems to balance precision and speed.
The function to perform cell clustering is clustering.flow
and include different options that could be added to each specific method (parc
, som
, Seurat
, phenograph
). Here we will use FlowSOM
to quickly identify and select lymphocytes for subsequent, more detailed, analyses. Within the num.k
option it is possible to select the number of final clusters we expect to identify (please take into account that the FlowSOM
algorithm works better if you "overcluster" your dataset (i.e., you select a number of clusters higher than the populations you expect to identify).
fcs <- clustering.flow(fcs, method = "SOM", assay.i = "transformed", num.k =15)
The application of dimensionality reduction algorithms (already described previously) to single-cell expression data leads to the production of beautiful easy-to-interpret images that give us the idea of distance between groups of cells. Additionally, each cell could be colored according to the relative expression of each marker analyzed or to the previously identified cluster. Due to the high computational power necessary to perform such analyses, it is often convenient to reduce the cells number to be analyzed for each sample just before running the dimensionality reduction code. To be noted that by performing this step at this point, we do not affect the clustering process that has already been performed on the whole dataset.
For our example, taking into account that our aim is to identify blood and BM macro-populations, we decided to perform dimensionality reduction (PCA and UMAP) on 500 cells per sample. By side, our dim.reduction
function offer the possibility to calculate the t-stochastic neighbour embedding (t-SNE), currently less used due to its limitations such as the high computational demand, long computational times and difficulty to effectively reduce large datasets or a very large number of dimensions (usually in single-cell -omics, where dimensions for each cell are thousands, t-SNE is calculated on PCA-reduced dimensions to reduce the computational load). The obtained reduced dimensions could be plotted in a ggplot2
graph by using the function dr.plotting as
showed below.
fcs500 <- sub.samples(fcs.SCE = fcs, subsampling = 500)
fcs500 <- dim.reduction(fcs500, dr.method = c("UMAP", "pca"), assay.i = "transformed")
dr.plotting(fcs500, plot.dr = "UMAP", assay.i = "transformed", colors = color_clusters,
color.by = "SOM.k15") # UMAP according to FlowSOM clusters
# color.by is not expecified, all marker will be shown (faceted)
dr.plotting(fcs500, plot.dr = "UMAP", assay.i = "transformed")
median.heatmap(fcs.SCE = fcs, assay.i = "transformed",
cell.clusters = "SOM.k15") # heatmap grouped according to FlowSOM clusters
Tips: the
dr.plotting
function allows for figure faceting according to a selected variable. This could be done by adding the optionfacet.by
to the function.
Once that all the processes of clustering and dimensional reduction have been completed, we can proceed to the next step which include the identification of each cluster as a concrete population. This could be done by exporting an "enriched" FCS and analyzing it outside of R, or within R by combining UMAP, heatmaps and dot-plots images.
We can export our down-sampled dataset in a new FCS file containing for each cell: the DR coordinates (e.g., UMAP1, UMAP2, PCA1, PCA2); the number of clusters according to selected method; other metadata variables and the fluorescence intensity value from the untrasformed matrix. This file could be loaded in any flow cytometry software and subsequently analyzed to identify the population to which each cluster belongs. At this point the analysis becomes familiar to researchers with experience in flow cytometry, indeed, what should be done now is to identify for each cluster, by manual gating, the cell population which it belongs.
In Figure 5B, we drawn a gate around the cluster number 1 and by looking at the distribution in bidimensional dot plots (physical parameters and CD8-SSC dot plots are reported as the most useful) we could conclude that these cells belong to the "granulocytes" group. In our dataset, we assigned to 21 out of 40 clusters the name "lymphocytes", and in Figure 5C are reported the reliability of these results. Specifically, by manual gating we were able to identify the 22.76% of total cellularity as lymphocytes, a result almost identical to the one obtained by merging the 10 clusters identified by FlowSOM
that we recognized as lymphocytes (23.56%). By fusing the XYZ lymphocytes clusters, we observed that the algorithm includes erroneously in these groups about a 3% of cells (0.71% of total cellularity) that are not lymphocytes; on the other side, no CD4+ or CD8+ lymphocytes could be found in the remaining clusters. This error, that could already be considered acceptable, could be eventually reduced in further subclustering analyses focused on lymphocytes group only (see after).
Within FlowCT
, we included a specific function, flowplot
, that allows users to depict flow cytometry dot-plot and histograms to help to identify the cell population included within the cluster. To be noted that it is possible to select the clusters to be plotted, which additionally improves the possibility to correctly identify each cluster.
flowplot(fcs500, x.axis = c("CD62L", "CD4", "CD4"), y.axis = c("CD45RA", "CD8",
"SSC_H"), color.by = "SOM.k25", assay.i = "transformed")
Of note, it is always better to plot expression data from the original, "untouched", matrix as compared to the batch-removed one.
Once all clusters have been correctly identified, it is possible to produce final images reporting the merged clusters as reduced derived clusters with the chosen name. In our dataset the 25 clusters were divided into the subsequent macro-populations: 18/25 lymphocytes, 3/25 granulocytes, 2/25 erythroblasts/debris, 1/25 eosinophils, 1/25 monocytes. It is possible to upload the new cluster names in 2 ways: by entering them within the script, or by providing a spreadsheet including old and new names (we will use the first approach here and the second subsequently).
# prepare data.frame with pairs: cluster-population (it could be done on the fly)
replacedata <- data.frame(original_cluster = levels(fcs$SOM.k25))
replacedata$new_cluster <- c(rep("granulocytes", 3), "lymphocytes", "eosinophils",
"granulocytes", rep("lymphocytes", 3), "erythroblasts/debris", "monocytes",
"lymphocytes", "lymphocytes", "erythroblasts/debris", "monocytes",
rep("granulocytes", 10))
fcs500$SOM_named <- clusters.rename(fcs500$SOM.k25, cluster = replacedata$original_cluster,
name = replacedata$new_cluster)
fcs$SOM_named <- clusters.rename(fcs$SOM.k25, cluster = replacedata$original_cluster,
name = replacedata$new_cluster)
dr.plotting(fcs500, plot.dr = "UMAP", color.by = "SOM_named", assay.i = "transformed",
colors = color_clusters)
median.heatmap(fcs.SCE = fcs500, assay.i = "transformed", cell.clusters = "SOM_named",
colors = color_clusters)
At this point, the next step is to perform a deeper analysis on specific subpopulation identified with the FlowSOM
clustering followed by cluster merging. Our panel have been designed to evaluate T cells population, and, accordingly, we decided to perform a further analysis on lymphocytes. In this view, it could be useful (especially in term of speed), in specific settings, to prepare a new SCE object including all the events belonging to lymphocytes population.
fcsL <- fcs[,fcs$SOM_named == "lymphocytes"]
Again, we calculated the cells number for each condition (to be noted how, obviously, PB samples present a higher number of lymphocytes as compared to BM ones) and calculate an heatmap on the median expression for each marker.
cell.count.bx(fcsL, assay.i = "transformed", x.axis = "condition")
median.heatmap(fcsL, assay.i = "transformed", not.metadata = c("sample_id", "filename"))
At this point, we decided to proceed with the Seurat
method for batch removal. It is important to disclose, within the function, the variable that include the batch (in our case all patients have been acquired in different days and this should be considered the most important batch; we assume that BM and PB of the same patients, being stained and acquired simultaneously by the same person are not affected by any batch) and the name of the new assay that will be generated. Additionally, further options specific for each method could be passed to the function by using the argument seurat.params
or harmony.params
.
Optional: the
Harmony
's method is often a good and quick alternative to Seurat, however, in our experience, we observed Seurat as a better method when handling less than 12 markers. Important: this could be a looooooooooong step.
fcsL <- batch.removal(fcsL, method = "seurat", batch = "patient_id",
new.matrix.name = "normalized", seurat.params = list(reduction = "cca", dims = 1:10))
We can now look at the histograms before and after batch correction:
multidensity(fcs.SCE = fcsL, assay.i = "transformed", subsampling = 100)
multidensity(fcs.SCE = fcsL, assay.i = "normalized", subsampling = 100)
We can further perform a check in a dimensionality reduced plot to observe the presence and eventual disappearance of batch effect after batch.removal function.
fcsL <- dim.reduction(fcsL, dr.method = c("PCA", "UMAP"),assay.i = "normalized")
fcsL_old <- dim.reduction(fcsL, dr.method = c("PCA", "UMAP"),assay.i = "transformed")
dr.plotting(fcsL_old, plot.dr = "UMAP", assay.i = "transformed",
colors = color_clusters, size = 0.3,color.by = "patient_id")
dr.plotting(fcsL, plot.dr = "UMAP", assay.i = "normalized", colors = color_clusters,
size = 0.3, color.by = "patient_id")
Tips: Due to the low number of dimensions used, in some analysis
Seurat
could produce infinite or NAs values within the expression matrix which could impairs some subsequent functions. If this happens a workaround could be to assign a "0" value toNA
s (this should be carefully took into account) and a "-99999" or "99999" to positive or negative infinite values.assay(fcsL, "normalized")[is.infinite(assay(fcsL, "normalized"))] <- -99999 assay(fcsL, "normalized")[is.na(assay(fcsL, "normalized"))] <- 0
As done before, the next steps will be to apply a similar workflow to the new reduced object to obtain detailed information about all the lymphocyte subpopulations present in our dataset. We will then perform a clustering with the PARC
algorithm and will apply it to the previously calculated PCA and UMAP. Lastly, we will perform a cluster identification (by either exporting a new FCS containing only the lymphocytes subpopulation or by taking advantages of FlowCT tools) to correctly characterize each subpopulation.
fcsL <- clustering.flow(fcsL, method = "PARC", assay.i = "normalized")
dr.plotting(fcsL, plot.dr = "UMAP",color.by = "PARC", assay.i = "normalized",
colors = color_clusters, size = 0.3)
median.heatmap(fcs.SCE = fcsL, assay.i = "normalized", cell.clusters = fcsL$PARC)
dr.plotting(fcsL, plot.dr = "UMAP", assay.i = "normalized", colors = color_clusters,
size = 0.3)
export.metaFCS(fcs.SCE = fcsL, output.name = "subclust1")
Finally, in this last step of the clustering process, we could assign the population name to each cluster and update the clustering representations to show us the correct names. As shown in Figure 11, we were able to identify 28 different lymphocytes subpopulations (represented in UMAP and heatmap on median marker expression). All subpopulations are well separated and recognizable.
replacedataL <- readxl::read_excel("Th_new_clustering.xlsx", col_types = "text")
fcsL$PARC_L_named <- clusters.rename(fcsL$PARC, cluster = replacedataL$original_cluster,
name = replacedataL$new_cluster)
dr.plotting(fcsL, plot.dr = "UMAP", color.by = "PARC_L_named", colors = color_clusters,
size = 0.3)
median.heatmap(fcs.SCE = fcsL, assay.i = "normalized", cell.clusters = fcsL$PARC_L_named)
dr.plotting(fcsL, plot.dr = "UMAP", color.by = "PARC_L_named", colors = color_clusters,
size = 0.3, facet.by = "condition")
Another useful tool for check marker's implication on each cell population:
## differential dotplots
dotplot.DE(fcs.SCE = fcsL_rm, markers.to.use = surface_markers,
clusters.named = "PARC_L_named")
To have a better idea of distribution and phenotype of each population we represented our results in a phylogenic tree taking advantage of the recently developed ggtree
package. We designed a wrapper function, circ.tree
, which initially let us show the nodes to select as population dividers (see below) and then let us color them accordingly. The heatmap associated to each cell population reports the median expression of each marker, while the dimension of the circle is proportional to the relative abundance of the population itself: Naïve CD4 cells are the most abundant T population in our dataset.
circ.tree(fcs.SCE = fcsL, cell.clusters = "PARC_L_named", nodes = "display")
circ.tree(fcs.SCE = fcsL, cell.clusters = "PARC_L_named", nodes = c(29, 36,31))
The most useful thing that we can obtain from this analysis, in a clinical trial perspective, is a table including all the abundance (in term of both absolute count and percentage) or median fluorescence of each identified population for each patient. Such table indeed, could be subsequently used in any statistical software (even out of R) and population proportions could be correlated with clinical parameters such as time to endpoint, response to treatment, etc.
Within FlowCT
, the relative abundance of each population could be reported in a barplot (including a bar dedicated to each sample and grouped for tissue of origin), and as a table, including all results, readily available for downstream use.
prop_tableL <- barplot.cell.pops(fcs.SCE = fcsL_rm, cell.clusters = "PARC_L_named",
count.by = "sample_id", facet.by = "condition", return.mode = "percentage")
dataset <- data.frame(md, as.data.frame.matrix(t(prop_tableL)))
write.table(dataset, file = "results.txt", sep = "\t")
med <- median.values(fcs.SCE = fcsL_rm) # median values (from non-transformed data)
dataset2 <- data.frame(md, med)
write.table(dataset2, file = "results_median.txt", sep = "\t")
Tips: Often, clusters which include debris or unknown populations are identified. It is important (and useful) to remove these clusters for statistical analysis. Please, be aware of what you are removing.
delete_pops <- c("Unc", "debris") fcsL_rm <- remove.pop(fcsL, clusters.named = "PARC_L_named", population = delete_pops)
Subsequently, we performed a correlation analysis between PB and BM to investigate potential differences in the distribution of each subpopulation. To this aim, we provide a function, corplot.condition
, which allow us to decide which condition to correlate and which parameters should be investigated for correlation. Interestingly, we found a direct correlation for almost all populations between PB and BM.
corplot.conditions(fcs.SCE = fcsL_rm, cell.clusters = "PARC_L_named",
condition.column = "condition")
Lastly, we compared the abundance of each population according to tissue of origin. We provide different methods to visualize comparison results (and respective significance levels) which are available with the following code.
## boxplot
boxplot.cell.clustering(fcs.SCE = fcsL_rm, cell.clusters = "PARC_L_named",
facet = T, return.stats = F, plot.only.sig = c(T, 0.05), facet.free.scale = "free")
sig <- boxplot.cell.clustering(fcs.SCE = fcsL_rm, cell.clusters = "PARC_L_named",
return.stats = T)
## Dumbbell plot
dumbPlot(fcs.SCE = fcsL_rm, cell.clusters = "PARC_L_named", return.stats = F,
condition.column = "condition")
## parallel plot
parallel.plot(fcs.SCE = fcsL_rm, cell.clusters = "PARC_L_named", return.stats = F,
condition.column = "condition")
Between all machine learning (ML) modeling advantages, feature selection is extremelly useful for identify those populations with a strong implication with our event of interest (for example, disease progression). Thus, we also parepared a (very basic) wrapper including some ML approaches for "playing" with our identified immune populations:
- biosigner, which helps to assess the relevance of variables through different binary (i.e., they don't take into account the survival times).
- randomForestSRC... the classical random forest approach, also being able to use survival times for regression (and not only for classfication, like the previous one).
- SurvBoost helps to select more relevant variables (i.e., immune populations) through gradient boosting. Note: for installing this package, please, go to indicated link (it's not available within CRAN repository).
They use either percentages (or absolute number of events) or categorized values (according different cutoffs calculations through FlowCT::pop.cutoff
) depending of cutoff.type
's values. Because each ML method has a lot of internal parameters for tunning them, you can also include them within our function, always as list. Let's see some minimal examples with SurvBoost
:
# add survival data
s <- readxl::read_excel("survival_data.xlsx")
colData(fcsL_rm) <- merge(colData(fcsL_rm), s, by = "patient_id")
head(colData(fcsL_rm))
# DataFrame with 6 rows and 9 columns
# patient_id filename sample_id condition cell_id PARC PARC_L_named PFS censored
# <character> <character> <integer> <factor> <character> <factor> <factor> <numeric> <numeric>
# 1 1 BM_1_017294.red.fcs 1 BM 1.4 12 CD8_TM_CXCR3p_CD27p 10 1
# 2 1 BM_1_017294.red.fcs 1 BM 1.9 12 CD8_TM_CXCR3p_CD27p 10 1
# 3 1 BM_1_017294.red.fcs 1 BM 1.17 6 CD4_N 10 1
# 4 1 BM_1_017294.red.fcs 1 BM 1.30 5 B_cells 10 1
# 5 1 BM_1_017294.red.fcs 1 BM 1.32 1 NK_cells 10 1
# 6 1 BM_1_017294.red.fcs 1 BM 1.33 10 CD8_EM_CXCR3p_CD27p 10 1
ml1 <- prog.pop.selection(fcs.SCE = fcs, cell.clusters = "PARC_L_named",
time.var = "PFS", event.var = "censored",
cutoff.type = "terciles",
# these are internal parameters for controlling survboost
method = "survboost",
method.params = list(rate = 0.1,
control_method = "num_selected",
control_parameter = list(num_select = 5)))
ml1bis <- prog.pop.selection(fcs.SCE = fcs, cell.clusters = "PARC_L_named",
time.var = "PFS", event.var = "censored",
cutoff.type = "none", # percentages are used as they are
method = "survboost",
method.params = list(rate = 0.1,
control_method = "likelihood"))
And, of course, you can also use the classical predict
function when using training and validation datasets (in the case of SurvBoost
is slightly different, just for knowing):
# select the 70% of your samples for training (set.seed, IMPORTANT!!!)
set.seed(333)
train_idx <- sample(unique(fcs$filename), length(unique(fcs$filename))*0.7)
ml2 <- prog.pop.selection(fcs.SCE = fcs, cell.clusters = "PARC_L_named",
time.var = "PFS", event.var = "censored", train.index = train_idx,
cutoff.type = "maxstat", method = "survboost",
method.params = list(rate = 0.1,
control_method = "num_selected",
control_parameter = list(num_select = 3)),
return.ML.object = T) #TRUE, for later predict
SurvBoost::predict.boosting(ml2$ML.object, newdata = ml2$survival.data$data$test)
We designed FlowCT
keeping in mind the possibility to interconnect it with pipelines already available for single cell analysis. Within the R environment, scater
and Seurat
represent 2 of the most widely used toolkits and each of them includes tools that could be useful for flow cytometry analysis. By using the SCE
structure for storing flow cytometry data, we made transition within the different packages easy.
To use scater
commands with FlowCT
objects, the easiest thing to do is to create a new assay named "logcounts"
and transfer there the ones we would like to use:
library(scater)
assay(fcsL, "logcounts")<-assay(fcsL, "transformed")
We can now use all the useful utilities for data visualization included in scater
such as: plotExpression
, which could be used in combination with flowplot
to correctly assign cluster identities; the dimensionality reduction functions derived from plotReducedDim
including plotUMAP
; ggcells
function, which perfectly interact with ggplot2
for custom graph such as density plot:
plotExpression(fcsL, rownames(fcsL)[9], x = "condition", colour_by="condition")+
scale_fill_manual(values = color_clusters) + theme(legend.position="none")
plotExpression(fcsL, rownames(fcsL)[7], x = "PARC", colour_by="PARC") +
scale_fill_manual(values = color_clusters) + theme(legend.position="none")
plotUMAP(fcsL, colour_by = "CD8")
ggcells(fcsL, mapping=aes(x=UMAP.1, y=UMAP.2, colour=CD4)) +
geom_point() + stat_density_2d() +
facet_wrap(~condition) +
scale_colour_distiller(direction = 1) + theme_minimal()
Furthermore, if needed, it is possible to calculate dimensionality reductions from scater
commands including runUMAP
or runPCA
.
The conversion from a SCE
experiment to a Seurat
object has been recently made "painless". Indeed, the Seurat
package provides a function dedicated to this kind of "migration"; importantly, it should be indicated the assays that should be passed to the counts and data slots:
library(seurat)
fcsL_Se <- as.Seurat(fcsL, counts = "raw", data = "transformed")
fcsL_Se <- ScaleData(fcsL_Se, features = rownames(fcsL_Se))
At this point, all the functions commonly used for single cell RNAseq analysis are available for flow cytometry data. Of note, Seurat
batch removal and clustering functions are already included in FlowCT
, but it is possible to perform also these analysis by using the Seurat
object. All dimensional reductions could be calculated and plotted within Seurat
and visualizing tools including heatmaps could be easily implemented in flow cytometry pipeline.
We can use the UMAP representation to visualize cell populations:
DimPlot(fcsL_Se, reduction = "UMAP", group.by = "PARC_L_named", repel = T, label = T) +
scale_color_manual(values = color_clusters) + theme(legend.position="none")
Or visualize marker expression, with the possibility to set upper and lower cut-off to improve visualization or to "blend" two markers to visualize double expressing cells
FeaturePlot(fcsL_Se, features = c("CCR4"), pt.size=0.1, min.cutoff = "q10",
split.by = "condition", max.cutoff = "q90") +
scale_colour_gradientn(colours = brewer.pal(n = 9, name = "PuBu")[2:9])
FeaturePlot(fcsL_Se, features = c("CD4", "CXCR3"), blend = TRUE, blend.threshold = 0.1)
Furthermore, we can identify the markers which are significantly over-represented in each population and plot them in a heatmap:
Idents(object = fcsL_Se) <- "PARC_L_named"
fcsL.markers <- FindAllMarkers(fcsL_Se, only.pos = TRUE, min.pct = 0.25,
logfc.threshold = 0.25)
top5 <- fcsL.markers %>% group_by(cluster) %>% top_n(n = 5, wt = avg_logFC)
DoHeatmap(subset(fcsL_Se, downsample = 100), features = top5$gene, size = 3) +
NoLegend()
Or we can use all others visualization tools present in seurat such as dotplot and ridgeplot
DotPlot(fcsL_Se, features = rownames(fcsL_Se), cols = c("blue", "red"),
dot.scale = 8, split.by = "condition") + RotatedAxis()
RidgePlot(fcsL_Se, features = c("CD8","CD4"), ncol = 2)+ NoLegend()
The objective of pseudotime analysis is to measure the relative progression of each cell along a specific process (such as maturation or differentiation) allowing us to understand the (pseudo)temporal behaviour without explicit time series. Of note, this analysis is possible when elements within the dataset behave asynchronously and each is at a different stage of "evolution". The algorithm, by creating a relative ordering of the cell, can define a series of "states" that constitute a trajectory for the process of interest.
To explain this step, we will use a dataset including the CD8 population from the 24 colors example. Batch effect has been removed by using the canonical correlation algorithm from Seurat
package and the corrected expression matrix has been saved under the "Seurat"
assay. As first step, we will calculate a DiffusionMap
by using the scater
package:
fcsL <- runDiffusionMap(fcsL, exprs_values = "Seurat")
plotDiffusionMap(fcsL, colour_by = 'PARC') + scale_fill_manual(values = color_clusters)
# if marker expression on DiffusionMap is planned to be visualized, it is important,
# as previously reported, to create the "logcounts" assay within the SCE object
assay(fcsL, "logcounts")<-assay(fcsL, "transformed")
We then selected the clusters including the CD8 populations only:
fcsL_pseudo <- fcsL[, !fcsL$PARC=="5" & !fcsL$PARC=="0" & !fcsL$PARC=="8" &
!fcsL$PARC=="12" & !fcsL$PARC=="14"& !fcsL$PARC=="13"& !fcsL$PARC=="17"&
!fcsL$PARC=="15"& !fcsL$PARC=="23"& !fcsL$PARC=="20"& !fcsL$PARC=="29"&
!fcsL$PARC=="27"& !fcsL$PARC=="16"]
plotDiffusionMap(fcsL_pseudo,colour_by = 'PARC') +
scale_fill_manual(values = color_clusters)
To calculate the "pseudotime" trajectory we take advantage of the slingshot
package. As reported by the authors, the purpose of this package "is to use clusters of cells to uncover global structure and convert this structure into smooth lineages represented by one-dimensional variables, called "pseudotime". It is important to note that slingshot does not require previous knowledge on cell cluster evolution (i.e., it does not need a "start" and "end" cluster). Due to computational power necessary for subsequent steps, we subsampled our dataset just to take 3000 CD8 lymphocytes from each sample. Next, we performed pseudotime calculation on PARC
clusters and DiffusionMap
DR by using the slingshot
command.
library(slingshot)
library(RColorBrewer)
fcsL_sling <- sub.samples(fcs.SCE = fcsL_pseudo, subsampling = 3000)
plotDiffusionMap(fcsL_pseudo, colour_by = 'PARC') +
scale_fill_manual(values = color_clusters)
sim <- slingshot(fcsL_sling, clusterLabels = 'PARC', reducedDim = 'DiffusionMap')
Pseudotime calculation will be stored within the sim
object. It could be easily moved within the original object for downstream analysis.
fcsL_sling$Pseudotime<-sim$slingPseudotime_1
plotDiffusionMap(fcsL_sling[,!is.na(fcsL_sling$Pseudotime)], colour_by="Pseudotime")
Lastly, we can take advantage of the package monocle 2 to depict a "smoothed" heatmap showing all the significant changes that happens in markers involved in pseudotime progression. To do that, we will convert our SCE
object in a Seurat
object and then in a monocle
object (a working conversion tool from SCE
to monocle
is not available at this moment)
library(monocle)
set.seed(333)
fcsL_sling_SEU <- as.Seurat(fcsL_sling, counts = "raw", data = "transformed")
fcsL_sling_mon <- as.CellDataSet(subset(fcsL_sling_SEU, downsample = 150), assay = "RNA")
Next, we will create a new object that only contains the cells for which a pseudotime value has been calculated. This will be then used for the identification of genes that are significantly modulated along "progression". Those genes will be then represented in an heatmap with the plot_pseudotime_heatmap
function. Of note, it is possible to automatically group markers that are modulated in the same way to help understanding the biological substrate of pseudotime. Detailed information on slingshot
and monocle
are available on the developer's websites.
cds_subset<-fcsL_sling_mon[,!is.na(fcsL_sling_mon$Pseudotime)]
diff_test_res <- differentialGeneTest(cds_subset, fullModelFormulaStr = "~sm.ns(Pseudotime)")
sig_gene_names <- row.names(subset(diff_test_res, qval < 0.05))
plot_pseudotime_heatmap(cds_subset[sig_gene_names[c(1:5,7:14)],], num_clusters = 3,
cores = 2, show_rownames = T, return_heatmap = TRUE, norm_method = "vstExprs")