Skip to content

TAPS data processing

Buys de Barbanson edited this page Jan 4, 2021 · 12 revisions

TAPS

Add paper and protocol information

Data processing

To obtain a count table the following steps have to be performed:

  • Demultiplexing
  • Adapter trimming
  • Mapping
  • Molecule assignment
  • Count table generation

Install package

As described on the main page, install singlecellmultiomics package.

git clone https://github.com/BuysDB/SingleCellMultiOmics
pip install ./SingleCellMultiOmics

Demultiplexing

Demultiplexing adds UMI, cell, sample, and sequencing index information to the header of the FastQ file.

The demultiplexer uses barcodes available in the barcodes folder. The right set of barcodes is automatically detected if -use is not supplied.

Assume you are located in a directory with fastq files:

example_HYLVHBGX5_S7_L001_R1_001.fastq.gz
example_HYLVHBGX5_S7_L001_R2_001.fastq.gz
example_HYLVHBGX5_S7_L002_R1_001.fastq.gz
example_HYLVHBGX5_S7_L002_R2_001.fastq.gz
example_HYLVHBGX5_S7_L003_R1_001.fastq.gz
example_HYLVHBGX5_S7_L003_R2_001.fastq.gz
example_HYLVHBGX5_S7_L004_R1_001.fastq.gz

To then autodetect the right barcodes and demultiplex run:

demux.py *.gz --y -merge _

It can be useful to additionally supply -hd 1, which will assign fragments with a single differing base to the closest cell barcode. This gains 5~20% of demultiplexed reads. The effect can be assessed by running with -hd 1 but not supplying the --y, this will perform a dry run on a small subset of reads.

This will create a folder ./raw_demultiplexed Inside the folder there will be 4 files:

demultiplexedR1.fastq.gz # accepted and demultiplexed R1 reads
demultiplexedR2.fastq.gz # accepted and demultiplexed R2 reads
rejectsR1.fastq.gz # rejected R1 reads
rejectsR2.fastq.gz # rejected R2 reads

Adapter trimming

removes residual primers, this improves mapping performance:

cutadapt -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz demultiplexedR1.fastq.gz demultiplexedR2.fastq.gz -m 3 -a "IlluminaSmallAdapterConcatBait=GGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTT" -a "IlluminaIndexAdapter=GGAATTCTCGGGTGCCAAGGAACTCCAGTCACN{6}ATCTCGTATGCCGTCTTCTGCTTG"  -A "IlluminaPairedEndPCRPrimer2.0=AGATCGGAAGAGCGN{6}CAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "universalPrimer=GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT;min_overlap=5" -a  "IlluminaGEX=TTTTTAATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC;min_overlap=5" -a "IlluminaMultiplexingPCRPrimer=GGAACTCCAGTCACN{6}TCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "Aseq=TGGCACCCGAGAATTCCA" -a "Aseq=TGGCACCCGAGAATTCCA"  -a "illuminaSmallRNAAdapter=TCGTATGCCGTCTTCTGCTTGT"

Mapping

Use your mapper of preference to map the reads. Make sure to use a reference which includes spike-ins.

Obtain conversion rate

estimateTapsConversionEfficiency.py ./sorted.bam -o conv_efficiency -ref [reference_fasta_path] -method nla

Extract methylation calls:

tapsTabulator.py -min_phred_score 20 -ref [reference_fasta_path] | gzip > calls.tsv.gz