Ensemblex: an accuracy-weighted ensemble genetic demultiplexing framework for single-cell RNA sequencing
- Introduction
- Installation
- Step 1: Set up
- Step 2: Preparation of input files
- Step 3: Genetic demultiplexing by constituent tools
- Step 4: Application of Ensemblex
- Tutorial
- Contributing
- Change log
- License
- Acknowledgement
Ensemblex is an accuracy-weighted ensemble framework for genetic demultiplexing of pooled single-cell RNA sequencing (scRNAseq) data. Ensemblex can be used to demultiplex pools with or without prior genotype information. When demultiplexing with prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools:
- Demuxalot (Rogozhnikov et al.)
- Demuxlet (Kang et al.)
- Souporcell (Heaton et al.)
- Vireo-GT (Huang et al.)
When demultiplexing without prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools:
- Demuxalot (Rogozhnikov et al.)
- Freemuxlet (Kang et al.)
- Souporcell (Heaton et al.)
- Vireo (Huang et al.)
Upon demultiplexing pools with each of the four constituent genetic demultiplexing tools, Ensemblex processes the output files in a three-step pipeline to identify the most probable sample label for each cell based on the predictions of the constituent tools:
Step 1: Probabilistic-weighted ensemble
Step 2: Graph-based doublet detection
Step 3: Ensemble-independent doublet detection
As output, Ensemblex returns its own cell-specific sample labels with the corresponding assignment probabilities and singlet confidence scores, as well as the sample labels and corresponding assignment probabilities from each of its constituent tools. The demultiplexed sample labels can then be used in downstream analyses.
To facilitate the application of Ensemblex, we provide a pipeline that demultiplexes pooled cells by each of the individual constituent genetic demultiplexing tools and processes the outputs with the Ensemblex algorithm.
The pipeline comprises four distinct steps:
- Selecting the Ensemblex pipeline and establishing the working directory (Set up)
- Preparing the input files for the constituent genetic demultiplexing tools
- Genetic demultiplexing by the constituent demultiplexing tools
- Applying the Ensemblex framework
Below we provide a quick-start guide for using Ensemblex. For comprehensive documentation, please see the Ensemblex site. In the Ensemblex documentation, we outline each step of the Ensemblex pipeline, illustrate how to run the pipeline, define best practices, and provide a tutorial with publicly available datasets.
Install the Ensemblex container and load Apptainer:
## Download the Ensemblex container
curl "https://zenodo.org/records/11639103/files/ensemblex.pip.zip?download=1" --output ensemblex.pip.zip
## Unzip the Ensemblex container
unzip ensemblex.pip.zip
## Load Apptainer
module load apptainer/1.2.4
To test if the Ensemblex container is installed properly, run the following code:
## Define the path to ensemblex.pip
ensemblex_HOME=/path/to/ensemblex.pip
## Print help message
bash $ensemblex_HOME/launch_ensemblex.sh -h
This should return the following help message:
-------------------
Usage: /home/fiorini9/scratch/ensemblex.pip/launch_ensemblex.sh [arguments]
mandatory arguments:
-d (--dir) = Working directory (where all the outputs will be printed) (give full path)
--steps = Specify the steps to execute. Begin by selecting either init-GT or init-noGT to establish the working directory.
For GT: vireo, demuxalot, demuxlet, souporcell, ensemblexing
For noGT: vireo, demuxalot, freemuxlet, souporcell, ensemblexing
optional arguments:
-h (--help) = See helps regarding the pipeline arguments
--vcf = The path of vcf file
--bam = The path of bam file
--sortout = The path snd nsme of vcf generated using sort
-------------------
For a comprehensive help, visit https://neurobioinfo.github.io/ensemblex/site/ for documentation.
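Before running any step, it can help to confirm that the two things the installation relies on are in place: the unzipped `ensemblex.pip` directory containing `launch_ensemblex.sh`, and an `apptainer` executable on the `PATH`. The sketch below is one way to do that. `check_ensemblex_install` is a hypothetical helper written for this guide and is not part of Ensemblex.

```shell
# Hypothetical helper (not shipped with Ensemblex): sanity-check an install.
# Usage: check_ensemblex_install /path/to/ensemblex.pip
# Returns 0 and prints "ok" if the launcher script exists and apptainer is
# on the PATH; returns 1 or 2 otherwise, with a message on stderr.
check_ensemblex_install() (
    home=$1
    # The launcher is expected at the top level of the unzipped container.
    if [ ! -f "$home/launch_ensemblex.sh" ]; then
        echo "launch_ensemblex.sh not found in: $home" >&2
        exit 1
    fi
    # Apptainer must be loaded (e.g. via 'module load') before launching.
    if ! command -v apptainer >/dev/null 2>&1; then
        echo "apptainer is not on the PATH; load it before running Ensemblex" >&2
        exit 2
    fi
    echo "ok"
)
```

For example, `check_ensemblex_install "$ensemblex_HOME"` can be run once after `module load apptainer/1.2.4`.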
Initiate the pipeline for demultiplexing with prior genotype information:
## Create and navigate to the working directory
mkdir -p /path/to/working_directory
cd /path/to/working_directory
## Define the path to ensemblex.pip
ensemblex_HOME=/path/to/ensemblex.pip
## Define the path to the working directory
ensemblex_PWD=/path/to/working_directory
## Initiate the pipeline for demultiplexing with prior genotype information
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-GT
Initiate the pipeline for demultiplexing without prior genotype information:
## Create and navigate to the working directory
mkdir -p /path/to/working_directory
cd /path/to/working_directory
## Define the path to ensemblex.pip
ensemblex_HOME=/path/to/ensemblex.pip
## Define the path to the working directory
ensemblex_PWD=/path/to/working_directory
## Initiate the pipeline for demultiplexing without prior genotype information
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-noGT
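The choice between `init-GT` and `init-noGT` depends only on whether a VCF with the genotypes of the pooled samples is available. In a script that handles many pools, that choice could be made automatically, as in the minimal sketch below. `init_step` is a hypothetical helper, and the rule it applies (a non-empty sample VCF means `init-GT`) is an assumption made for illustration.

```shell
# Hypothetical helper: pick the initialisation step for one pool.
# Usage: init_step /path/to/pooled_samples.vcf
# Prints "init-GT" if the given sample VCF exists and is non-empty,
# otherwise prints "init-noGT".
init_step() {
    if [ -s "$1" ]; then
        echo "init-GT"
    else
        echo "init-noGT"
    fi
}
```

The result can be passed straight to the launcher, for example `bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step "$(init_step /path/to/pooled_samples.vcf)"`.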
The following files are required when demultiplexing with prior genotype information:
File | Description |
---|---|
gene_expression.bam | Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) |
gene_expression.bam.bai | Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) |
barcodes.tsv | Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) |
pooled_samples.vcf | vcf file describing the genotypes of the pooled samples |
genome_reference.fa | Genome reference fasta file (e.g., 10X Genomics genome.fa) |
genome_reference.fa.fai | Genome reference fasta index file (e.g., 10X Genomics genome.fa.fai) |
genotype_reference.vcf | Population reference vcf file (e.g., 1000 Genomes Project) |
Prepare the input files for demultiplexing with prior genotype information:
## Define all of the required files
BAM=/path/to/possorted_genome_bam.bam
BAM_INDEX=/path/to/possorted_genome_bam.bam.bai
BARCODES=/path/to/barcodes.tsv
SAMPLE_VCF=/path/to/pooled_samples.vcf
REFERENCE_VCF=/path/to/genotype_reference.vcf
REFERENCE_FASTA=/path/to/genome.fa
REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai
## Define the path to the working directory
ensemblex_PWD=/path/to/working_directory
## Copy the files to the input_files directory in the working directory
cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam
cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai
cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv
cp $SAMPLE_VCF $ensemblex_PWD/input_files/pooled_samples.vcf
cp $REFERENCE_VCF $ensemblex_PWD/input_files/reference.vcf
cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa
cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai
The following files are required when demultiplexing without prior genotype information:
File | Description |
---|---|
gene_expression.bam | Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) |
gene_expression.bam.bai | Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) |
barcodes.tsv | Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) |
genome_reference.fa | Genome reference fasta file (e.g., 10X Genomics genome.fa) |
genome_reference.fa.fai | Genome reference fasta index file (e.g., 10X Genomics genome.fa.fai) |
genotype_reference.vcf | Population reference vcf file (e.g., 1000 Genomes Project) |
Prepare the input files for demultiplexing without prior genotype information:
## Define all of the required files
BAM=/path/to/possorted_genome_bam.bam
BAM_INDEX=/path/to/possorted_genome_bam.bam.bai
BARCODES=/path/to/barcodes.tsv
REFERENCE_VCF=/path/to/genotype_reference.vcf
REFERENCE_FASTA=/path/to/genome.fa
REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai
## Define the path to the working directory
ensemblex_PWD=/path/to/working_directory
## Copy the files to the input_files directory in the working directory
cp $BAM $ensemblex_PWD/input_files/pooled_bam.bam
cp $BAM_INDEX $ensemblex_PWD/input_files/pooled_bam.bam.bai
cp $BARCODES $ensemblex_PWD/input_files/pooled_barcodes.tsv
cp $REFERENCE_VCF $ensemblex_PWD/input_files/reference.vcf
cp $REFERENCE_FASTA $ensemblex_PWD/input_files/reference.fa
cp $REFERENCE_FASTA_INDEX $ensemblex_PWD/input_files/reference.fa.fai
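After copying, a quick check that every expected file is present in `input_files` can catch a mistyped path before the constituent tools are launched. The sketch below uses the destination file names from the `cp` commands above. `missing_inputs` is a hypothetical helper, and treating an empty file as missing is an assumption of this sketch.

```shell
# Hypothetical helper: list the input files missing from a working directory.
# Usage: missing_inputs /path/to/working_directory GT|noGT
# Prints one missing (or empty) file name per line and returns 1 if any are
# missing; prints nothing and returns 0 otherwise.
missing_inputs() (
    dir=$1/input_files
    mode=$2
    # Files needed in both modes (names taken from the cp commands).
    files="pooled_bam.bam pooled_bam.bam.bai pooled_barcodes.tsv reference.vcf reference.fa reference.fa.fai"
    # The sample genotype VCF is only needed with prior genotype information.
    if [ "$mode" = "GT" ]; then
        files="$files pooled_samples.vcf"
    fi
    status=0
    for f in $files; do
        if [ ! -s "$dir/$f" ]; then
            echo "$f"
            status=1
        fi
    done
    exit "$status"
)
```

For example, `missing_inputs "$ensemblex_PWD" GT || echo "fix the files listed above"`.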
When demultiplexing with prior genotype information, demultiplex the pooled cells with each of Ensemblex's constituent tools:
## Define the paths to Ensemblex and the working directory
ensemblex_HOME=/path/to/ensemblex.pip
ensemblex_PWD=/path/to/working_directory
## Demuxalot
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot
## Demuxlet
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxlet
## Souporcell
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell
## Vireo
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo
When demultiplexing without prior genotype information, demultiplex the pooled cells with each of Ensemblex's constituent tools:
## Define the paths to Ensemblex and the working directory
ensemblex_HOME=/path/to/ensemblex.pip
ensemblex_PWD=/path/to/working_directory
## Freemuxlet
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step freemuxlet
## Souporcell
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell
## Vireo
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo
## Demuxalot
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot
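The four launcher calls for a given mode can also be wrapped in a loop. The sketch below takes the step names from the help message and either prints or runs the corresponding commands. `constituent_steps`, `run_constituents`, and the `DRY_RUN` variable are hypothetical names introduced here. The sketch assumes each launcher call returns only once its step has finished; if your installation submits jobs to a scheduler instead, keep the individual commands.

```shell
# Step names per mode, as listed in the launch_ensemblex.sh help message.
constituent_steps() {
    case $1 in
        GT)   echo "vireo demuxalot demuxlet souporcell" ;;
        noGT) echo "vireo demuxalot freemuxlet souporcell" ;;
        *)    echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Hypothetical wrapper: run each constituent tool in turn.
# Usage: run_constituents /path/to/ensemblex.pip /path/to/working_directory GT|noGT
# With DRY_RUN=1 the commands are printed instead of executed.
run_constituents() (
    home=$1
    workdir=$2
    mode=$3
    steps=$(constituent_steps "$mode") || exit 1
    for step in $steps; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "bash $home/launch_ensemblex.sh -d $workdir --step $step"
        else
            # Stop at the first step that fails.
            bash "$home/launch_ensemblex.sh" -d "$workdir" --step "$step" || exit 1
        fi
    done
)
```

Running `DRY_RUN=1 run_constituents "$ensemblex_HOME" "$ensemblex_PWD" noGT` first shows the commands that would be launched.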
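Apply the Ensemblex framework to the outputs of the constituent tools. The command is the same with or without prior genotype information:
## Define the paths to Ensemblex and the working directory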
ensemblex_HOME=/path/to/ensemblex.pip
ensemblex_PWD=/path/to/working_directory
## Compute ensemble classification
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step ensemblexing
Contributions and suggestions for improving the Ensemblex pipeline are welcome and appreciated. If you encounter any issues, please open an issue in the GitHub repository. Alternatively, you are welcome to email the developers directly; for any questions, please contact Michael Fiorini: michael.fiorini@mail.mcgill.ca
Every release is documented on the GitHub Releases page.
This project is licensed under the MIT License.
The Ensemblex pipeline was produced for projects funded by the Canadian Institutes of Health Research and the Michael J. Fox Foundation Parkinson's Progression Markers Initiative (MJFF PPMI) in collaboration with The Neuro's Early Drug Discovery Unit (EDDU), McGill University. It was written by Michael Fiorini and Saeid Amiri with supervision from Rhalena Thomas and Sali Farhan at the Montreal Neurological Institute-Hospital. Copyright belongs to MNI BIOINFO CORE.