Pipeline for benchmarking atlas-level single-cell integration

This repository contains the snakemake pipeline for our benchmarking study of data integration tools. In this study, we benchmark 16 methods (see Tools) with 4 combinations of preprocessing steps, leading to 68 method combinations on 85 batches of gene expression and chromatin accessibility data. The pipeline uses the scib package and allows for reproducible and automated analysis of the different steps and combinations of preprocessing and integration methods.

Resources

On our website we visualise the results of the study.
The scib package that is used in this pipeline can be found here.
For reproducibility and visualisation we have a dedicated repository: scib-reproducibility.
The data used in the study is available on figshare.

Please cite:

Luecken, M.D., Büttner, M., Chaichoompu, K. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods 19, 41–50 (2022). https://doi.org/10.1038/s41592-021-01336-8

Installation

To reproduce the results from this study, two separate conda environments are needed for python and R operations. Please make sure you have either mambaforge or conda installed on your system to be able to use the pipeline. We recommend using mamba, which is also available for conda, for faster package installations with a smaller memory footprint.

We provide Python and R environment YAML files in envs/, together with an installation script that sets up the correct environments in a single command, based on the R version you want to use. The pipeline currently supports R 3.6 and R 4.0, and we generally recommend R 4.0. Call the script as follows, e.g. for R 4.0:

bash envs/create_conda_environments.sh -r 4.0

Check the script's help output for the full list of arguments it accepts:

bash envs/create_conda_environments.sh -h

Once installation is successful, you will have the python environment scib-pipeline-R<version> and the R environment scib-R<version> that you must specify in the config file.

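| R version | Python environment name | R environment name | Test data config YAML file |
|-----------|-------------------------|--------------------|----------------------------|
| 4.0 | `scib-pipeline-R4.0` | `scib-R4.0` | `configs/test_data-R4.0.yaml` |
| 3.6 | `scib-pipeline-R3.6` | `scib-R3.6` | `configs/test_data-R3.6.yaml` |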
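
The naming scheme in the table can be sketched as a small helper, for instance when a script needs to fill in the environment names for a given R version. The function name `environment_names`, the tuple `SUPPORTED_R_VERSIONS`, and the dictionary keys are hypothetical and not part of the pipeline.

```python
# Hypothetical helper: maps a supported R version to the conda environment
# names created by the installation script (names taken from the table above).
SUPPORTED_R_VERSIONS = ("4.0", "3.6")


def environment_names(r_version: str) -> dict:
    """Return the Python and R environment names for a supported R version."""
    if r_version not in SUPPORTED_R_VERSIONS:
        raise ValueError(
            f"Unsupported R version {r_version!r}; expected one of {SUPPORTED_R_VERSIONS}"
        )
    return {
        "python": f"scib-pipeline-R{r_version}",  # illustrative key name
        "r": f"scib-R{r_version}",                # illustrative key name
    }


envs = environment_names("4.0")
print(envs)
```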

Note: The installation script only works for the environments listed in the table above. The environments used in our study are included for reproducibility purposes and are described in envs/.

For a more detailed description of the environment files and how to install the different environments manually, please refer to the README in envs/.

Running the Pipeline

This repository contains a snakemake pipeline to run integration methods and metrics reproducibly for different data scenarios and preprocessing setups.

Generate Test data

A script in data/ can be used to generate test data. This is useful for checking that the installation was successful before moving on to a larger dataset. The pipeline expects an anndata object with normalised and log-transformed counts in adata.X and raw counts in adata.layers['counts']. More information on how to use the data generation script can be found in data/README.md.
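
As a rough illustration of the expected input layout, the sketch below builds a toy object with transformed values in `adata.X` and raw counts in `adata.layers['counts']`. It is not the script in data/. The function names are hypothetical, and the library-size normalisation to 10,000 counts per cell is an assumption that stands in for whichever normalisation you actually use.

```python
import numpy as np


def normalise_log1p(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each cell to `target_sum` total counts, then apply log(1 + x).

    Assumption: simple library-size normalisation, used here only as a stand-in.
    """
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def build_adata(counts: np.ndarray):
    """Wrap a cells x genes count matrix in the layout described above."""
    import anndata as ad  # imported lazily so the helper above needs only NumPy

    adata = ad.AnnData(X=normalise_log1p(counts))
    adata.layers["counts"] = counts.copy()  # raw counts kept alongside adata.X
    return adata


# Toy data: 5 cells x 3 genes of Poisson-distributed counts (illustrative only).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(5, 3)).astype(np.float32)
log_norm = normalise_log1p(counts)
```

With anndata installed, `build_adata(counts)` returns an object whose `X` holds the transformed values and whose `layers['counts']` holds the raw counts; `adata.write_h5ad(path)` would then save it to disk.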

Setup Configuration File

The parameters and input files are specified in config files. A description of the config formats and example files can be found in configs/. You can use the example config that uses the test data to get the pipeline running quickly, and then modify a copy of it to work with your own data.

Pipeline Commands

To call the pipeline on the test data, e.g. using R 4.0, run:

snakemake --configfile configs/test_data-R4.0.yaml -n

The -n flag performs a dry run, which gives you an overview of the jobs that will be run without executing them. To execute these jobs with up to 10 cores, call

snakemake --configfile configs/test_data-R4.0.yaml --cores 10

More snakemake commands can be found in the documentation.
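
If you prefer to drive the two commands above from a script, one possible sketch is shown below. It only uses the flags already shown (`--configfile`, `-n`, `--cores`); the function names `snakemake_command` and `run_pipeline` are hypothetical and not part of this repository.

```python
import shlex
import subprocess
from typing import List, Optional


def snakemake_command(configfile: str, cores: Optional[int] = None) -> List[str]:
    """Return a dry-run command when `cores` is None, otherwise a real run."""
    command = ["snakemake", "--configfile", configfile]
    if cores is None:
        command.append("-n")  # dry run: list jobs without executing them
    else:
        command += ["--cores", str(cores)]
    return command


def run_pipeline(configfile: str, cores: int) -> None:
    """Dry run first; start the real run only if the dry run succeeds."""
    subprocess.run(snakemake_command(configfile), check=True)
    subprocess.run(snakemake_command(configfile, cores), check=True)


config = "configs/test_data-R4.0.yaml"
dry_run = snakemake_command(config)
full_run = snakemake_command(config, cores=10)
print(shlex.join(dry_run))
print(shlex.join(full_run))
```

Calling `run_pipeline(config, cores=10)` from the repository root, with the pipeline environment active, would then execute both steps in order.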

Visualise the Workflow

A dependency graph of the workflow can be created at any time and is useful for gaining a general understanding of the workflow. Snakemake can create a graphviz representation of the rules, which can be piped into an image file.

snakemake --configfile configs/test_data-R3.6.yaml --rulegraph | dot -Tpng -Grankdir=TB > dependency.png

Tools

Tools that are compared include:
