
paired-end reads different UMIs #49

Closed
KaiyangZ opened this issue Oct 9, 2016 · 6 comments

@KaiyangZ commented Oct 9, 2016

Hi all,

I have a set of UMI-tagged paired-end targeted sequencing data, and the UMIs attached to read1 and read2 are different. I am wondering how UMI-tools handles this kind of data during deduplication.

I tried to use bwa mem for alignment, but got the error "paired reads have different names", since the UMIs of read1 and read2 are different and they were attached to the read names. Do you have any suggestions for handling this?

Here is an example of the reads after the UMIs were attached to the names:
Read1:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACC 1:N:0:ATCACGAC+NAGGTTCA
CTGCCCCTGTTCCCTCCAACTCATCTTTTCATTTTTGTTGTAACTTGACTTTTCTGCTTTCTTTGTAACCTGCCCTCCCCACAGCGAATTGCGACGATCGTTGCATTAACTCGCGAATCACGAC
+
EEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEE6EEEEEE/EEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEAA

Read2:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGTGGG 2:N:0:ATCACGAC+NAGGTTCA
GAGGGCAGGTTACAAAGAAAGCAGAAAAGTCAAGTTACAACAAAAATGAAAAGATGAGTTGGAGGGAACAGGGGCAGGGTCCA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEE

Thank you very much for your help in advance!

Kaiyang

@IanSudbery (Member)

Hi Kaiyang,

You need to extract the UMIs from your reads in paired-end mode, using the --read2-in, --read2-out, --split-barcode and --bc-pattern2 switches to umi_tools extract. This will create a super barcode that is added to both reads. Thus the two read names will become:

@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACCTGTGGG 1:N:0:ATCACGAC+NAGGTTCA
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACCTGTGGG 2:N:0:ATCACGAC+NAGGTTCA
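
Something like this should work (untested - the file names are placeholders, and the NNNNNN patterns assume the 6-base UMIs shown in your example):

umi_tools extract --bc-pattern=NNNNNN --bc-pattern2=NNNNNN --split-barcode \
    --read2-in=read2.fastq.gz --read2-out=read2_extracted.fastq.gz \
    -I read1.fastq.gz -S read1_extracted.fastq.gz -L extract.log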

@KaiyangZ (Author)

Thank you very much Ian! I was using my own home-made script to extract the UMIs and attach them to the read names. Great to know that UMI-tools can handle this; I am going to reformat the UMI read names.

Thank you again.

Kaiyang

@KaiyangZ (Author)

Hi Ian,

My problem is quite similar to #31, but I did not find a solution there.

The deduplication is very slow, and the run time varies a lot between files. I was expecting larger BAMs to take longer, but that does not seem to be the case for me: a 2.4 GB BAM took 16 hours to finish, while a 787 MB BAM is still running after 41 hours. I am wondering if I am missing some settings. Here is the command I used:

umi_tools dedup --method=adjacency --edit-distance-threshold=1 --paired --output-stats="dedupStats" --further-stats -L _log -E _error -I input.bam -S output.bam

The input.bam files are sorted by coordinate (bwa mem was used for mapping); multiple mappings are filtered out, but supplementary alignments are still there. MarkDuplicates shows that both files have a PERCENT_DUPLICATION of around 96% (these are deep targeted sequencing data, so high duplication rates are not surprising).

I have no idea why it is so slow. Can sorting the BAMs by name speed up deduplication for paired-end data? I can send you the two BAM files mentioned above if they are needed.

Thank you very much in advance.

@IanSudbery (Member)

Hi Kaiyang,

The easiest way to reduce the run time is going to be to leave off --output-stats and --further-stats. This can have an order-of-magnitude effect on the time taken. What we've suggested before is to do the whole dedup without the stats and then run on a one-chromosome subset with stats to get an idea of them.
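
For example, something along these lines (the samtools subsetting and the chr1 chromosome are just illustrative):

umi_tools dedup --method=adjacency --edit-distance-threshold=1 --paired -L _log -E _error -I input.bam -S output.bam

samtools view -b input.bam chr1 > input.chr1.bam
samtools index input.chr1.bam
umi_tools dedup --method=adjacency --edit-distance-threshold=1 --paired --output-stats=dedupStats --further-stats -I input.chr1.bam -S output.chr1.bam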

The time taken scales poorly with the length of your UMIs, and looking at what you've said above, you have 12-base UMIs, which is getting on the longer end. High duplication rates will also result in long run times - we need to compare each base of each UMI to each base of all the other UMIs present at a single position. You might be seeing the difference because MarkDuplicates is going to be telling you about the number of reads at each position, not the number of unique UMIs.
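
To see why, here is a toy sketch (not the actual UMI-tools code - Hamming distance stands in for the edit distance, and the UMIs are made up): with n unique UMIs of length L at a position, every pair is compared base by base, i.e. roughly n^2 * L work.

from itertools import combinations

def hamming(a, b):
    # number of mismatched bases between two equal-length UMIs
    return sum(x != y for x, y in zip(a, b))

def adjacency_edges(umis, threshold=1):
    # connect UMIs whose distance is within the threshold
    return [(a, b) for a, b in combinations(umis, 2)
            if hamming(a, b) <= threshold]

# three 12-base UMIs observed at a single position
print(adjacency_edges(["TGGACCTGTGGG", "TGGACCTGTGGC", "AAAACCTGTGGG"]))
# -> [('TGGACCTGTGGG', 'TGGACCTGTGGC')]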

We can't run if you sort by name, so don't do that.

What version are you using? I think the version on GitHub is probably much quicker than the version on conda or PyPI. Finally, on the chrom_order branch I've been working on a way to make it quicker still.
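
(Something like pip install git+https://github.com/CGATOxford/UMI-tools.git should get you the GitHub version.)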

Ian

@KaiyangZ (Author)

Hi Ian,

Thank you for the suggestions. I am using the version from PyPI; I am going to change to the version on GitHub and try it.

I thought lower duplication rates would lead to longer run times, as Tom mentioned in #31:
"You have a relatively low duplication rate so most positions contains just a single read. The outputting therefore requires many calls to the pysam"

@TomSmithCGAT (Member)

Hi Kaiyang,

Since I made that comment we've improved the outputting of paired-end reads so that we no longer use the pysam.mate function to find the mate of a read. As I understand it, pysam.mate literally goes through the BAM looking for the mate reads, which is a very time-consuming process for paired-end samples with very low duplication rates and hence many outputted reads. We now output the read pairs in chromosome chunks (see class TwoPassPairWriter in dedup.py - lines 250-236).
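
In outline, the idea is something like the following simplified sketch (not the actual TwoPassPairWriter code - write_pairs and kept_reads are illustrative names):

import pysam

def write_pairs(in_path, out_path, kept_reads):
    # kept_reads: the read1 alignments that dedup decided to keep
    inbam = pysam.AlignmentFile(in_path, "rb")
    outbam = pysam.AlignmentFile(out_path, "wb", template=inbam)

    # first pass: write the kept reads and note where their mates map
    wanted = {}  # mate chromosome -> names of reads whose mate we still need
    for read in kept_reads:
        outbam.write(read)
        wanted.setdefault(read.next_reference_name, set()).add(read.query_name)

    # second pass: one fetch per chromosome chunk, rather than one
    # whole-BAM search per read as with pysam.mate()
    for chrom, names in wanted.items():
        for read in inbam.fetch(chrom):
            if read.is_read2 and read.query_name in names:
                outbam.write(read)

    outbam.close()
    inbam.close()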

Ian's right that the most time-consuming step is now the network building, which takes longer with long UMIs and positions with many unique UMIs (as one might expect with high duplication rates).

Tom
