Skip to content

Transcriptome data processing

Marloes edited this page Nov 8, 2021 · 5 revisions

All steps described below can be executed by executing a single SnakeMake workflow

  1. Go to the folder containing the raw reads. Then run:
scmo_workflow.py cs2

This creates the Snakefile containing all the required steps and a config file.

  1. Edit the config file (config.json), make sure to set the right path to a reference Fasta file and optionally pick a different mapper or bin sizes.

  2. Execute the workflow locally by running snakemake or run on a cluster by using of the commands shown by scmo_workflow.py. Important: when running from a conda environment, do not forget to use --use-conda!

    Example submission using submission.py and a conda environment: submission.py "source activate conda;snakemake --use-conda --cluster slurm_wrapper.py --jobs 20 --restart-times 3" -y -time 50 -m 2 -N cs2_snakemake

The workflow code can be found here


This workflow:

  • Starts off from a folder containing all fastq files
  • Detects libraries
  • Demultiplexes per library, automatically detecting the right barcodes
  • Trims using cutadapt
  • Maps using STAR, sorts and indexes the reads per library
  • uses featureCounting
  • Tags and deduplicates
  • Creates count tables for deduplicated and non-deduplicated counts: -> Count table containing rows with gene information and columns with cell information

Demultiplexing

Demultiplexing adds UMI, cell, sample and sequencing index information to the header of the FastQ file. This information is later used to figure out which reads belong to the same molecule and to what cell every molecule belongs. The demultiplexer uses barcodes available in the barcodes folder. The right set of barcodes is automatically detected if -use is not supplied. The demultiplexer also trims the random primer from read 2 and stores it as meta-information, this information is later used for IVT deduplication.

Assume you are located in a directory with fastq files:

example_HYLVHBGX5_S7_L001_R1_001.fastq.gz
example_HYLVHBGX5_S7_L001_R2_001.fastq.gz
example_HYLVHBGX5_S7_L002_R1_001.fastq.gz
example_HYLVHBGX5_S7_L002_R2_001.fastq.gz
example_HYLVHBGX5_S7_L003_R1_001.fastq.gz
example_HYLVHBGX5_S7_L003_R2_001.fastq.gz
example_HYLVHBGX5_S7_L004_R1_001.fastq.gz

The script runs the following command:

demux.py -merge _ {input.fastqfiles} -o processed --y -use CS2C8U6

'-use CS2C8U6' ensures that demultiplexing is done based on celseq barcodes. If your library contains other barcodes, you should change this. It can be useful to additionally supply -hd 1, which will assign fragments with a single differing base to the closest cell barcode. This gains 5~20% of demultiplexed reads. The effect can be assessed by running with -hd 1 but not supplying the --y, this will perform a dry run on a small subset of reads.

All output files will be written to ./processed.

Inside the folder there will be 4 files:

demultiplexedR1.fastq.gz # accepted and demultiplexed R1 reads
demultiplexedR2.fastq.gz # accepted and demultiplexed R2 reads
rejectsR1.fastq.gz # rejected R1 reads
rejectsR2.fastq.gz # rejected R2 reads

These are removed as soon as the next step is done.

Adapter trimming

removes residual primers, this improves mapping performance.

The script runs the following command:

cutadapt -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz demultiplexedR1.fastq.gz demultiplexedR2.fastq.gz -m 3 -a "IlluminaSmallAdapterConcatBait=GGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTT" -a "IlluminaIndexAdapter=GGAATTCTCGGGTGCCAAGGAACTCCAGTCACN{6}ATCTCGTATGCCGTCTTCTGCTTG"  -A "IlluminaPairedEndPCRPrimer2.0=AGATCGGAAGAGCGN{6}CAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "universalPrimer=GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT;min_overlap=5" -a  "IlluminaGEX=TTTTTAATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC;min_overlap=5" -a "IlluminaMultiplexingPCRPrimer=GGAACTCCAGTCACN{6}TCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "Aseq=TGGCACCCGAGAATTCCA" -a "Aseq=TGGCACCCGAGAATTCCA"  -a "illuminaSmallRNAAdapter=TCGTATGCCGTCTTCTGCTTGT"

Mapping

The trimmed fastq files can now be mapped using a mapper of your choice. Currently, only STAR is supported.

The script runs the following command:

STAR --runThreadN 8 --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 20 --outMultimapperOrder Random --outSAMmultNmax 1 --readFilesCommand zcat  --readFilesIn {TrimmedR2} \
      --genomeDir {Reference_file.fa} --outFileNamePrefix {output_directory}

Feature counting

The script runs the following command:

featureCounts -s 1 -T 1 -R BAM -a {GTF_file} -o processed/{libraryName}/fCounts.txt {input.bamfile}

Tagging and deduplication

First, the bam files need to be sorted and indexed. Then all the information that was put into the read name for alignment is placed back into the bam file as tags. Reads are also flagged in order to create deduplicated csv output files in the next step. Errors in the UMI are not allowed, the hamming distance is set at 0.

The script runs the following command:

bamtagmultiome.py -method cs_feature_counts -umi_hamming_distance 0 {input_bamFile} -o {output.bamFile}

Count tables

Both a deduplicated and non-deduplicated count table will be generated. Rows will contain the gene name and chromosomal location, and columns will contain sample names.

The script runs the following command:

bamToCountTable.py --noNames {input.bam} -o {output.countTable} -joinedFeatureTags XT,chrom -sampleTags SM