joechanlab/nf-scrnaseq

Introduction

A bioinformatics pipeline that preprocesses single-cell RNA-seq data. It takes a samplesheet and CellRanger raw and filtered count matrices as input, then performs ambient RNA correction, doublet detection, sample aggregation, and batch integration.

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows. Each row contains the sample name, a raw count HDF5 file and a filtered count HDF5 file from CellRanger, whether demultiplexing is needed, and the expected number of droplets.

samplesheet.csv:

sample, raw_h5, filtered_h5, demultiplexing, expected_droplets
CONTROL_REP1, raw_feature_bc_matrix.h5, filtered_feature_bc_matrix.h5, false, 50000

Note

See suggestions on choice of the expected_droplets parameter here.
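
Before launching the pipeline, it can help to check the samplesheet for obvious mistakes. The following is a minimal sketch, not part of the pipeline: the function name `read_samplesheet` is hypothetical, and the column names and value formats are assumed from the example above. It uses only the Python standard library.

```python
import csv
import io

# Column names taken from the example samplesheet header.
EXPECTED_COLUMNS = ["sample", "raw_h5", "filtered_h5", "demultiplexing", "expected_droplets"]


def read_samplesheet(handle):
    """Parse a samplesheet and return one validated dict per sample (hypothetical helper)."""
    # skipinitialspace=True tolerates the space after each comma in the example.
    reader = csv.DictReader(handle, skipinitialspace=True)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    samples = []
    for index, row in enumerate(reader, start=1):
        # DictReader stores surplus fields under the key None and fills missing ones with None.
        if None in row or None in row.values():
            raise ValueError(f"data row {index}: expected {len(EXPECTED_COLUMNS)} fields")
        row = {key: value.strip() for key, value in row.items()}
        if not row["sample"]:
            raise ValueError(f"data row {index}: sample name is empty")
        if row["demultiplexing"].lower() not in ("true", "false"):
            raise ValueError(f"data row {index}: demultiplexing must be true or false")
        if not row["expected_droplets"].isdigit():
            raise ValueError(f"data row {index}: expected_droplets must be a whole number")
        samples.append({
            "sample": row["sample"],
            "raw_h5": row["raw_h5"],
            "filtered_h5": row["filtered_h5"],
            "demultiplexing": row["demultiplexing"].lower() == "true",
            "expected_droplets": int(row["expected_droplets"]),
        })
    return samples


# Usage with the example row from above; pass open("samplesheet.csv", newline="") for a real file.
example = io.StringIO(
    "sample, raw_h5, filtered_h5, demultiplexing, expected_droplets\n"
    "CONTROL_REP1, raw_feature_bc_matrix.h5, filtered_feature_bc_matrix.h5, false, 50000\n"
)
samples = read_samplesheet(example)
print(samples[0]["sample"], samples[0]["expected_droplets"])
```

The check only covers the shape of the file. It does not confirm that the HDF5 files exist or that the pipeline accepts every value it lets through.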

Next, prepare a parameter YAML file that looks as follows:

params.yml:

samplesheet: "./samplesheet.csv"      # path to the sample sheet
outdir: "./out/"                      # directory containing the outputs
experiment:
    name: "experiment"                # experiment name for prefix
atac: false                           # whether the data is 10X Multiome data
aggregation: true                     # whether to aggregate samples
remove_doublets: false                # whether to remove cells marked as doublets
remove_outliers: false                # whether to remove cells marked as outliers
scvi:                                 # scvi parameters (see bin/scvi_norm.py)
    n_latent: 50
    n_top_genes: 1000                 # Note: large values might give error in writing h5ad
postprocessing:                       # Postprocessing parameters (see bin/postprocessing.py)
    n_pca_components: 100
    n_diffmap_components: 20
    metadata: "./metadata.csv"        # path to metadata variables
celltypist:
    model: "Human_Lung_Atlas.pkl"     # model to use for celltypist
report:
    plot: "./markers.csv"             # custom variables to plot
seacells:                             # SEACells parameters (see bin/run_SEACells.py)
    n_SEACells: 25
    build_kernel_on: "X_pca"
    n_waypoint_eigs: 10
mount: "/home,/data1"                 # path to mount for singularity
with_gpu: true                        # using GPU
maxForks: 2                           # maximum number of processes in parallel (e.g., number of GPUs)
max_memory: "6.GB"                    # memory information
max_time: "6.h"                       # wall time information
max_cpus: 6                           # cpu information
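
When several experiments share most settings, the parameter file can also be generated from a script. Nextflow's `-params-file` option accepts JSON as well as YAML, so the sketch below writes JSON using only the Python standard library. The helper names `build_params` and `write_params` are hypothetical, and only a subset of the keys from the example above is shown. Whether the pipeline runs with keys omitted is not stated here, so add the remaining keys in the same way.

```python
import json
from pathlib import Path


def build_params(experiment_name, samplesheet="./samplesheet.csv", outdir="./out/"):
    """Return a parameter dict using key names from the example params.yml (subset only)."""
    return {
        "samplesheet": samplesheet,
        "outdir": outdir,
        "experiment": {"name": experiment_name},
        "aggregation": True,
        "remove_doublets": False,
        "remove_outliers": False,
        "scvi": {"n_latent": 50, "n_top_genes": 1000},
    }


def write_params(params, path):
    """Write the parameters as JSON so they can be passed to -params-file."""
    Path(path).write_text(json.dumps(params, indent=4) + "\n")


# Build the parameters for a hypothetical experiment called "pilot" and show them.
params = build_params("pilot")
print(json.dumps(params, indent=4))
```

Calling `write_params(params, "./params.json")` would then produce a file to pass in place of `./params.yml`.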

Now you can run the pipeline. For a local run on an HPC system with Singularity installed, execute the following command:

nextflow run ./main.nf \
   -profile singularity \
   -params-file ./params.yml \
   -w ./work/

If you are using MSKCC lilac, you can use the pre-defined lilac profile, which uses the LSF executor:

nextflow run ./main.nf \
   -profile lilac \
   -params-file ./params.yml \
   -w ./work/
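
The two commands above differ only in the profile name, so a small wrapper can assemble them. This is a sketch under the assumption that the paths match the examples above. The function `build_command` is hypothetical, and the flags are exactly those already shown.

```python
import shlex


def build_command(profile, params_file="./params.yml", work_dir="./work/", main_script="./main.nf"):
    """Assemble the nextflow invocation shown above for a given profile (hypothetical helper)."""
    return [
        "nextflow", "run", main_script,
        "-profile", profile,
        "-params-file", params_file,
        "-w", work_dir,
    ]


# Print the command for the lilac profile so it can be reviewed before running.
command = build_command("lilac")
print(shlex.join(command))

# To launch it from Python instead of the shell:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues if a path contains spaces.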