In Single-cell Expression Atlas we're interested in relating cell groupings (clusters, cell types) between experiments and across species boundaries. There is a naive way of doing this using marker genes, and more advanced ways such as SAMap. This workflow executes both of those methods (note: the workflow was written before the most recent SAMap developments, so it is a little out of date in that regard), and compares the number of cross-species intersecting cell groups that each method predicts. See this presentation for some initial results.
Note that SAMap is executed with a workflow that parallelises a BLAST step.
It is possible to compare cell groups (clusters, cell types) across species via the 'marker' genes of each group. Since each set of marker genes is a product of the context in which it was derived (sorted cell population, sub-tissue, tissue, whole organism), that context must be matched for comparison of marker gene sets (via ortholog relationships) to be valid. With that in mind this workflow will:
- Take anndata objects from SCXA analysis for two experiments containing comparable 'organism' parts, even if they're not labelled at the same granularity.
- Match the organism parts between experiments, using the Uberon ontology to re-label where the granularity of organism part annotation is not consistent.
- Subset the experiments to only the common organism parts.
- Re-filter, re-normalise, and re-derive marker genes using Scanpy.
- Derive mapped pairs of cell groupings between the two experiments.
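The final step in the list above can be pictured as an overlap between marker gene sets after translating one species' gene IDs through the ortholog table. The following is a minimal sketch under stated assumptions, not the workflow's actual code: the function names (`translate_markers`, `jaccard`, `map_cell_groups`), the choice of a Jaccard score, the `min_score` threshold and the cluster labels are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple


def translate_markers(markers: Set[str], orthologs: Dict[str, Set[str]]) -> Set[str]:
    """Map species-A gene IDs to species-B IDs; genes with no ortholog are dropped."""
    translated: Set[str] = set()
    for gene in markers:
        translated |= orthologs.get(gene, set())
    return translated


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_cell_groups(
    markers_a: Dict[str, Set[str]],
    markers_b: Dict[str, Set[str]],
    orthologs: Dict[str, Set[str]],
    min_score: float = 0.1,  # assumed cut-off, not taken from the workflow
) -> List[Tuple[str, str, float]]:
    """Return (group_a, group_b, score) pairs whose marker overlap reaches min_score."""
    pairs = []
    for group_a, genes_a in markers_a.items():
        genes_a_in_b = translate_markers(genes_a, orthologs)
        for group_b, genes_b in markers_b.items():
            score = jaccard(genes_a_in_b, genes_b)
            if score >= min_score:
                pairs.append((group_a, group_b, score))
    # Best-scoring pairs first
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy data: gene IDs come from the example ortholog table, cluster names are made up.
orthologs = {
    "ENSG00000198888": {"ENSMUSG00000064341"},
    "ENSG00000198763": {"ENSMUSG00000064345"},
}
human_markers = {"human_cluster_1": {"ENSG00000198888", "ENSG00000198763"}}
mouse_markers = {
    "mouse_cluster_1": {"ENSMUSG00000064341"},
    "mouse_cluster_2": {"ENSMUSG00000064357"},
}
print(map_cell_groups(human_markers, mouse_markers, orthologs))
# [('human_cluster_1', 'mouse_cluster_1', 0.5)]
```

Because marker sets depend on the context they were derived in, a comparison like this is only meaningful after the organism-part matching and re-derivation steps have been applied to both experiments.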
The Snakemake workflow in this repository performs the above steps given the following inputs:
- Two AnnData files with .project.h5ad extensions, stored in an 'inputs' directory.
- The two species names
- A .obo ontology file in the inputs directory.
- An ortholog mapping file in the inputs directory.
Ortholog mappings can be derived with biomaRt and should look like:
homo_sapiens_gene_id homo_sapiens_gene_name mus_musculus_gene_id mus_musculus_gene_name
ENSG00000198888 MT-ND1 ENSMUSG00000064341 mt-Nd1
ENSG00000198763 MT-ND2 ENSMUSG00000064345 mt-Nd2
ENSG00000198804 MT-CO1 ENSMUSG00000064351 mt-Co1
ENSG00000198712 MT-CO2 ENSMUSG00000064354 mt-Co2
ENSG00000198899 MT-ATP6 ENSMUSG00000064357 mt-Atp6
ENSG00000198938 MT-CO3 ENSMUSG00000064358 mt-Co3
ENSG00000198840 MT-ND3 ENSMUSG00000064360 mt-Nd3
ENSG00000212907 MT-ND4L ENSMUSG00000065947 mt-Nd4l
ENSG00000198886 MT-ND4 ENSMUSG00000064363 mt-Nd4
The file should therefore be tab-delimited, and its 'gene_id' columns must be prefixed by the two input species names (for example homo_sapiens_gene_id and mus_musculus_gene_id).
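Before running the pipeline it can be useful to check that an ortholog file meets this format. The sketch below uses only the Python standard library; the function name `read_ortholog_pairs` is hypothetical and is not part of the workflow, and the only assumption about the file is the one stated above (tab-delimited, with `<species>_gene_id` columns).

```python
import csv
import io


def read_ortholog_pairs(handle, species_a, species_b):
    """Read a tab-delimited ortholog table and return (gene_id_a, gene_id_b) pairs.

    Raises ValueError if either '<species>_gene_id' column is missing.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    col_a = f"{species_a}_gene_id"
    col_b = f"{species_b}_gene_id"
    header = reader.fieldnames or []
    missing = [col for col in (col_a, col_b) if col not in header]
    if missing:
        raise ValueError(f"ortholog file is missing column(s): {', '.join(missing)}")
    # Skip rows where either side of the pair is empty or absent
    return [(row[col_a], row[col_b]) for row in reader if row[col_a] and row[col_b]]


# In-memory example using the first row of the table above
sample = (
    "homo_sapiens_gene_id\thomo_sapiens_gene_name\t"
    "mus_musculus_gene_id\tmus_musculus_gene_name\n"
    "ENSG00000198888\tMT-ND1\tENSMUSG00000064341\tmt-Nd1\n"
)
pairs = read_ortholog_pairs(io.StringIO(sample), "homo_sapiens", "mus_musculus")
print(pairs)
```

For a file on disk, the same function can be called with an open file handle, for example `open(path, newline="")`, where `path` points at the mapping file in the inputs directory.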
The Snakemake config file can then be constructed like the example, and the pipeline run like:
snakemake --use-conda --cores 2