
paired-end reads different UMIs #49

Closed
KaiyangZ opened this issue Oct 9, 2016 · 6 comments

@KaiyangZ commented Oct 9, 2016

Hi all,

I have a set of UMI-tagged paired-end targeted sequencing data, and the UMIs attached to read1 and read2 are different. I am wondering how UMI-tools handles this kind of data during deduplication.

I tried to use bwa mem for alignment, but got the error "paired reads have different names", since the UMIs of read1 and read2 are different and they were attached to the read names. Do you have any suggestions for handling this?

Here is an example of the reads after the UMIs were attached to the names:
Read1:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACC 1:N:0:ATCACGAC+NAGGTTCA
CTGCCCCTGTTCCCTCCAACTCATCTTTTCATTTTTGTTGTAACTTGACTTTTCTGCTTTCTTTGTAACCTGCCCTCCCCACAGCGAATTGCGACGATCGTTGCATTAACTCGCGAATCACGAC
+
EEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEE6EEEEEE/EEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEAA

Read2:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGTGGG 2:N:0:ATCACGAC+NAGGTTCA
GAGGGCAGGTTACAAAGAAAGCAGAAAAGTCAAGTTACAACAAAAATGAAAAGATGAGTTGGAGGGAACAGGGGCAGGGTCCA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEE

Thank you very much for your help in advance!

Kaiyang

@IanSudbery (Member)

Hi Kaiyang,

You need to extract the UMIs from your reads in paired-end mode, using the --read2-in, --read2-out, --split-barcode and --bc-pattern2 switches to umi_tools extract. This will create a super barcode that is added to both reads. Thus the two read names will become:

@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACCTGTGGG 1:N:0:ATCACGAC+NAGGTTCA
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACCTGTGGG 2:N:0:ATCACGAC+NAGGTTCA
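
Something like this should work (untested - the file names are placeholders, and the NNNNNN patterns assume the 6-base UMIs shown in your example):

umi_tools extract --bc-pattern=NNNNNN --bc-pattern2=NNNNNN --split-barcode \
    --read2-in=read2.fastq.gz --read2-out=read2_extracted.fastq.gz \
    -I read1.fastq.gz -S read1_extracted.fastq.gz -L extract.log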

@KaiyangZ (Author)

Thank you very much Ian! I was using my own home-made script to extract the UMIs and attach them to the read names. Great to know that UMI-tools can handle this; I am going to reformat the UMI read names.

Thank you again.

Kaiyang

@KaiyangZ (Author)

Hi Ian,

My problem is quite similar to #31, but I did not find a solution there.

The deduplication is very slow, and the run time varies a lot between files. I was expecting larger BAMs to take longer, but that does not seem to be the case for me: a 2.4 GB BAM took 16 hours to finish, while a 787 MB BAM is still running after 41 hours. I am wondering if I am missing some settings. Here is the command I used:

umi_tools dedup --method=adjacency --edit-distance-threshold=1 --paired --output-stats="dedupStats" --further-stats -L _log -E _error -I input.bam -S output.bam

The input.bam files are sorted by coordinate (bwa mem was used for mapping); multiple mappings are filtered out, but supplementary alignments are still there. MarkDuplicates shows that both files have a PERCENT_DUPLICATION of around 96% (these are deep targeted sequencing data, so high duplication rates are not surprising).

I have no idea why it is so slow. Can sorting the BAMs by name speed up deduplication for paired-end data? I can send you the two BAM files mentioned above if they are needed.

Thank you very much in advance.

@IanSudbery (Member)

Hi Kaiyang,

The easiest way to reduce the run time is going to be to leave off --output-stats and --further-stats. This can have an order-of-magnitude effect on the time taken. What we've suggested before is to do the whole dedup without the stats and then run on a one-chromosome subset with stats to get an idea of them.
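
For example, something along these lines (the samtools subsetting and the chr1 chromosome are just illustrative):

umi_tools dedup --method=adjacency --edit-distance-threshold=1 --paired -L _log -E _error -I input.bam -S output.bam

samtools view -b input.bam chr1 > input.chr1.bam
samtools index input.chr1.bam
umi_tools dedup --method=adjacency --edit-distance-threshold=1 --paired --output-stats=dedupStats --further-stats -I input.chr1.bam -S output.chr1.bam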

The time taken scales poorly with the length of your UMIs, and looking at what you've said above, you have 12-base UMIs, which is getting on the longer end. High duplication rates will also result in long run times - we need to compare each base of each UMI to each base of all the other UMIs present at a single position. You might be seeing the difference because MarkDuplicates is going to be telling you about the number of reads at each position, not the number of unique UMIs.
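
To see why, here is a toy sketch (not the actual UMI-tools code - Hamming distance stands in for the edit distance, and the UMIs are made up): with n unique UMIs of length L at a position, every pair is compared base by base, i.e. roughly n^2 * L work.

from itertools import combinations

def hamming(a, b):
    # number of mismatched bases between two equal-length UMIs
    return sum(x != y for x, y in zip(a, b))

def adjacency_edges(umis, threshold=1):
    # connect UMIs whose distance is within the threshold
    return [(a, b) for a, b in combinations(umis, 2)
            if hamming(a, b) <= threshold]

# three 12-base UMIs observed at a single position
print(adjacency_edges(["TGGACCTGTGGG", "TGGACCTGTGGC", "AAAACCTGTGGG"]))
# -> [('TGGACCTGTGGG', 'TGGACCTGTGGC')]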

We can't run if you sort by name, so don't do that.

What version are you using? I think the version on GitHub is probably much quicker than the version on conda or PyPI. Finally, on the chrom_order branch I've been working on a way to make it quicker still.
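
(Something like pip install git+https://github.com/CGATOxford/UMI-tools.git should get you the GitHub version.)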

Ian

@KaiyangZ (Author)

Hi Ian,

Thank you for the suggestions. I am using the version from PyPI; I am going to change to the version on GitHub and try it.

I thought lower duplication rates would lead to longer run times, as Tom mentioned in #31:
"You have a relatively low duplication rate so most positions contains just a single read. The outputting therefore requires many calls to the pysam"

@TomSmithCGAT (Member)

Hi Kaiyang,

Since I made that comment we've improved the outputting of paired-end reads so that we no longer use the pysam.mate function to find the mate of a read. As I understand it, pysam.mate literally goes through the BAM looking for the mate reads, which is a very time-consuming process for paired-end samples with very low duplication rates and hence many outputted reads. We now output the read pairs in chromosome chunks (see class TwoPassPairWriter in dedup.py - lines 250-236).
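
In outline, the idea is something like the following simplified sketch (not the actual TwoPassPairWriter code - write_pairs and kept_reads are illustrative names):

import pysam

def write_pairs(in_path, out_path, kept_reads):
    # kept_reads: the read1 alignments that dedup decided to keep
    inbam = pysam.AlignmentFile(in_path, "rb")
    outbam = pysam.AlignmentFile(out_path, "wb", template=inbam)

    # first pass: write the kept reads and note where their mates map
    wanted = {}  # mate chromosome -> names of reads whose mate we still need
    for read in kept_reads:
        outbam.write(read)
        wanted.setdefault(read.next_reference_name, set()).add(read.query_name)

    # second pass: one fetch per chromosome chunk, rather than one
    # whole-BAM search per read as with pysam.mate()
    for chrom, names in wanted.items():
        for read in inbam.fetch(chrom):
            if read.is_read2 and read.query_name in names:
                outbam.write(read)

    outbam.close()
    inbam.close()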

Ian's right that the most time-consuming step is now the network building, which takes longer with long UMIs and positions with many unique UMIs (as one might expect with high duplication rates).

Tom
