Skip to content

NLA III data processing

Buys de Barbanson edited this page Jan 27, 2022 · 22 revisions

All steps described below can be executed by executing a single SnakeMake workflow

  1. Go to the folder containing the raw reads. Then run:
scmo_workflow.py nlaIII

This creates the Snakefile containing all the required steps and a config file.

  1. Edit the config file (config.json), make sure to set the right path to a reference Fasta file and optionally pick a different mapper or bin sizes.

  2. Execute the workflow locally by running snakemake or run on a cluster by using of the commands shown by scmo_workflow.py.

The workflow code can be found here


To obtain a count table the following steps have to performed:

  • Demultiplexing
  • Adapter trimming
  • Mapping
  • Molecule assignment
  • Count table generation

Demultiplexing

Demultiplexing adds UMI, cell, sample and sequencing index information to the header of the FastQ file. This information is later used to figure out which reads belong to the same molecule and to what cell every molecule belongs. The demultiplexer uses barcodes available in the barcodes folder. The right set of barcodes is automatically detected if -use is not supplied. The demultiplexer also trims the random primer from read 2 and stores it as meta-information, this information is later used for IVT deduplication.

Assume you are located in a directory with fastq files:

example_HYLVHBGX5_S7_L001_R1_001.fastq.gz
example_HYLVHBGX5_S7_L001_R2_001.fastq.gz
example_HYLVHBGX5_S7_L002_R1_001.fastq.gz
example_HYLVHBGX5_S7_L002_R2_001.fastq.gz
example_HYLVHBGX5_S7_L003_R1_001.fastq.gz
example_HYLVHBGX5_S7_L003_R2_001.fastq.gz
example_HYLVHBGX5_S7_L004_R1_001.fastq.gz

To then autodetect the right barcodes and demultiplex run:

demux.py *.gz --y -merge _

It can be useful to additionally supply -hd 1, which will assign fragments with a single differing base to the closest cell barcode. This gains 5~20% of demultiplexed reads. The effect can be assessed by running with -hd 1 but not supplying the --y, this will perform a dry run on a small subset of reads.

This will create a folder ./raw_demultiplexed Inside the folder there will be 4 files:

demultiplexedR1.fastq.gz # accepted and demultiplexed R1 reads
demultiplexedR2.fastq.gz # accepted and demultiplexed R2 reads
rejectsR1.fastq.gz # rejected R1 reads
rejectsR2.fastq.gz # rejected R2 reads

Adapter trimming

removes residual primers, this improves mapping performance:

cutadapt -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz demultiplexedR1.fastq.gz demultiplexedR2.fastq.gz -m 3 -a "IlluminaSmallAdapterConcatBait=GGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTT" -a "IlluminaIndexAdapter=GGAATTCTCGGGTGCCAAGGAACTCCAGTCACN{6}ATCTCGTATGCCGTCTTCTGCTTG"  -A "IlluminaPairedEndPCRPrimer2.0=AGATCGGAAGAGCGN{6}CAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "universalPrimer=GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT;min_overlap=5" -a  "IlluminaGEX=TTTTTAATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGACGATC;min_overlap=5" -a "IlluminaMultiplexingPCRPrimer=GGAACTCCAGTCACN{6}TCTCGTATGCCGTCTTCTGCTTG;min_overlap=5" -A "Aseq=TGGCACCCGAGAATTCCA" -a "Aseq=TGGCACCCGAGAATTCCA"  -a "illuminaSmallRNAAdapter=TCGTATGCCGTCTTCTGCTTGT"

Mapping

The trimmed fastq files can now be mapped using a mapper of your choice. In this example we will use BWA.If you are mapping to a SNP masked genome use Bowtie or STAR.

bwa mem -t 4 -M {your.reference} trimmed.R1.fastq.gz trimmed.R2.fastq.gz | samtools view -b - > ./unsorted.bam; samtools sort -T ./temp_sort -@ 4 ./unsorted.bam > ./sorted.unfinished.bam; 
mv ./sorted.unfinished.bam ./sorted.bam; 
samtools index ./sorted.bam; rm ./unsorted.bam;

Molecule assignment and mappability annotation

The aim of this step is to, for every fragment in sorted.bam find NLA fragments and assign them to a molecule. This can later be used for deduplication. Fragments with the same NLA cut site, cell barcode, umi, library and strand are assumed to belong to the same molecule. The NLAIII cut site (CATG) is expected to be present at the start of R1. The cut site location is recorded into the DS tag. Sometimes the read 1 will be shifted a little, for example it will start with ATG, then the cut-site assigned will be the mapping start coordinate +1. When alleles are specified using -alleles, the molecule assignment is split up by allele, this means that if two fragments map to the same location and share the same umi, but contain SNPs which indicate differing alleles, the reads are not assigned to the same molecule. For every fragment the recorded NLAIII sequence is recorded into the RZ tag, (CATG, ATG, TG).

Mappability annotation (optional)

In order to get a good estimate of coverage we need to correct for the mappability of the molecules. Bamtagmultiome annotates how well a fragment can be mapped to the mp tag. In order to estimate mappability an index is required which can be generated using createMappabilityIndex.py

Run the molecule assignment

bamtagmultiome.py sorted.bam -method nla  -o ./tagged.bam -mapfile /references/ref.mappability.stats.bgzf

If you have phased variants available these can be supplied to add allele information to every molecule:

bamtagmultiome.py sorted.bam -method nla  -o ./tagged.bam  -alleles ./allelic_information.vcf.gz -mapfile /references/ref.mappability.stats.bgzf

The starting location of read 2 of every fragment is the result of random priming, this means that for a single molecule there can be multiple mapping locations for R2 which are indicative of multiple RT copies from the same RNA template. The random hexamer sequence is stored in the rS tag and the random primer starting coordinate is stored in the rP tag. Read Molecule-iteration for more information about accessing RT reactions and molecules in Python.

Statistics

Read the Library-statistics-plots page for information about library quality statistics.

Copy number estimate

bamCopyNumber.py tagged.bam -bin_size 500_000 -rawmat raw_normalised_counts.csv -rawmatplot raw_counts.png -gcmat gc_corrected_counts.csv -gcmatplot gc_corrected_counts.png  -ref /path/to/reference/file.fasta -threads 20

Count table

The locations of the NLA cut sites can be binned into bins of a given size. In this example 250kb The minMQ parameter defines what the lowest mapping quality for a molecule should be to be counted. The --filterXA makes filters reads which map to multiple locations (excludes alternative loci) The -bin parameter defines the size of every bin The -sampleTags parameter defines what row identifiers the resulting table should have. If you would like to know the associated barcode you would use "BC" instead of "SM". The --dedup flag makes sure every molecule is counted only once, if this flag is not specified every fragment in the BAM file with a DS tag (Which is a valid NLA site) will be counted.

bamToCountTable.py -minMQ 30 ./tagged/sorted.bam -o ./count_table.csv -sampleTags SM -bin 250_000 -binTag DS --dedup -joinedFeatureTags reference_name --filterMP

If you supplied a allele vcf file the count table can be generated using allele information by adding DA to the joinedFeatureTags parameter.

bamToCountTable.py -minMQ 30 ./tagged/sorted.bam -o ./count_table.csv -sampleTags SM -bin 250_000 -binTag DS --dedup -joinedFeatureTags reference_name,DA  --filterMP

[The original code for determining the (allele specific) copy numbers used in Kester et al. can be found here](https://github.com/BuysDB/TumorEvolutionReconstruction), `bamToCountTable.py` is a more streamlined command line version of that code.

A sliding window based table can be generated by adding the -sliding flag, for example for windows offset by 50k use:

bamToCountTable.py -sliding 50_000 -minMQ 30 ./tagged/sorted.bam -o ./count_table.csv -sampleTags SM -bin 250_000 -binTag DS --dedup -joinedFeatureTags reference_name  --filterMP

Variant calling

The generated BAM file(s) contain a read group for each cell and are directly compatible with at least BCFtools, FreeBayes and GATK. Once variant calling is completed with one of those callers. Absence and presence of variants be determined by phasing with heterozygous germline variants. For this the script bamExtractVariants.py can be used. This process requires a VCF file with germline variants.