-
Notifications
You must be signed in to change notification settings - Fork 9
TAPS data processing
Add paper and protocol information
To obtain a count table the following steps have to be performed:
- Demultiplexing
- Adapter trimming
- Mapping
- Molecule assignment
- Count table generation
As described on the main page, install singlecellmultiomics package.
git clone https://github.com/BuysDB/SingleCellMultiOmics
pip install ./SingleCellMultiOmics
Demultiplexing adds UMI, cell, sample, and sequencing index information to the header of the FastQ file.
The demultiplexer uses barcodes available in the barcodes folder. The right set of barcodes is automatically detected if -use is not supplied.
Assume you are located in a directory with fastq files:
example_HYLVHBGX5_S7_L001_R1_001.fastq.gz
example_HYLVHBGX5_S7_L001_R2_001.fastq.gz
example_HYLVHBGX5_S7_L002_R1_001.fastq.gz
example_HYLVHBGX5_S7_L002_R2_001.fastq.gz
example_HYLVHBGX5_S7_L003_R1_001.fastq.gz
example_HYLVHBGX5_S7_L003_R2_001.fastq.gz
example_HYLVHBGX5_S7_L004_R1_001.fastq.gz
To then autodetect the right barcodes and demultiplex run:
demux.py *.gz --y -merge _
It can be useful to additionally supply -hd 1
, which will assign fragments with a single differing base to the closest cell barcode. This gains 5~20% of demultiplexed reads. The effect can be assessed by running with -hd 1
but not supplying the --y
, this will perform a dry run on a small subset of reads.
This will create a folder ./raw_demultiplexed Inside the folder there will be 4 files:
demultiplexedR1.fastq.gz # accepted and demultiplexed R1 reads
demultiplexedR2.fastq.gz # accepted and demultiplexed R2 reads
rejectsR1.fastq.gz # rejected R1 reads
rejectsR2.fastq.gz # rejected R2 reads
removes residual primers, this improves mapping performance:
cutadapt -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz demultiplexedR1.fastq.gz demultiplexedR2.fastq.gz -m 3 -a "IlluminaSmallAdapterConcatBait=GGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTT" -a "IlluminaIndexAdapter=GGAATTCTCGGGTGCCAAGGAACTCCAGTCACN{6}ATCTCGTATGCCGTCTTCTGCTTG" -A "IlluminaPairedEndPCRPrimer2.0=AGATCGGAAGAGCGN{6}CAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "universalPrimer=GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT;min_overlap=5" -a "IlluminaGEX=TTTTTAATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC;min_overlap=5" -a "IlluminaMultiplexingPCRPrimer=GGAACTCCAGTCACN{6}TCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "Aseq=TGGCACCCGAGAATTCCA" -a "Aseq=TGGCACCCGAGAATTCCA" -a "illuminaSmallRNAAdapter=TCGTATGCCGTCTTCTGCTTGT"
Use your mapper of preference to map the reads. Make sure to use a reference which includes spike-ins.
estimateTapsConversionEfficiency.py ./sorted.bam -o conv_efficiency -ref [reference_fasta_path] -method nla
tapsTabulator.py -min_phred_score 20 -ref [reference_fasta_path] | gzip > calls.tsv.gz