-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
paired-end reads different UMIs #49
Comments
Hi Kaiyang, You need to extract the UMI from your reads in paired-end mode, using the @NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACCTGTGGG 1:N:0:ATCACGAC+NAGGTTCA |
Thank you very much lan! I was using my home-made script to extract UMIs and attach them to read names. Great to know that UMItools can handle this and I am going to reformat the umi-readnames. Thank you again. Kaiyang |
Hi Ian, My problem is quite similar to #31, but I did not find a solution there. The deduplication is very slow and it varies a lot between different files, I was expecting larger bams will take longer time but it seems not in my case: a 2.4GB bam took 16 hours to finish while a 787MB bam is still running after 41 hours. I am wondering if I missing something settings, Here is the command I used: umi_tools dedup --method=adjacency --edit-distance-threshold=1 --paired --output-stats="dedupStats" --further-stats -L _log -E _error -I input.bam -S output.bam The input.bam files are sorted by coordinates (bwa-mem was used for mapping), and the multiple mappings are filtered out, supplimentary alignments are still there. MarkDuplicates shows that both files have PERCENT_DUPLICATION around 96% (these are target deep sequencing data, high duplication rates are not surprising). I have no idea why it is so slow. Can sorting bams by names speed up deduplication in paired-end data? I can send you the two bam files mentioned above if they are needed. Thank you very much in advance. |
Hi Kaiyang, The easiest way to reduce the time taken to run is going to be to leave off the The time taken scales poorly with the length of your UMIs, and I looking at what you've said above, you have 12 base UMIs, which getting on the longer end. High duplication rates will also result in high times - we need to compare each base of each UMI to each base of all the other UMIs present at a single position. You might be seeing the difference because the MarkDuplicates is going to be telling you about the number of reads at each position, not the number of UMIs. We can't run if you sort by name, so don't do that. What version are you using? I think the version on github is probably much quicker than the version on conda or pypy. Finally, on the chrom_order branch I've been working on a way to make it quicker again. Ian |
Hi Ian, Thank you for the suggestions. I am using the version on pypy, gonna to change to the version on github and try. I thought lower duplication rates will lead to longer time, as Tom mentioned in #31: |
Hi Kaiyang, Since I made that comment we've improved the outputting of paired end reads so that we no longer use the pysam.mate function to find the paired mate of a read. As I understand it, pysam.mate literally goes through the BAM looking for the mate reads, which is a very time consuming process for paired end samples with very low duplication rates and hence many outputted reads. We now output the read pairs in chromosome chunks (see class TwoPassPairWriter in dedup.py - lines 250-236) Ian's right that the most time consuming step is now the network building which takes longer with long UMIs and positions with many unique UMIs (as one might expect with high duplication rates). Tom |
Hi all,
I have a set of UMI-tagged paired-end target sequencing data, and the UMIs attached to the read1 and read2 are different. I am wondering how UMI-tools handle this kind of data during deduplication.
I was try to use bwamem for alignment, but got the error " paired reads have different names" as the UMIs of read1 and read2 are different and they were attached to read names. Do you have suggestions to handle this?
Here are an example of the reads after UMI attached to names:
Read1:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACC 1:N:0:ATCACGAC+NAGGTTCA
CTGCCCCTGTTCCCTCCAACTCATCTTTTCATTTTTGTTGTAACTTGACTTTTCTGCTTTCTTTGTAACCTGCCCTCCCCACAGCGAATTGCGACGATCGTTGCATTAACTCGCGAATCACGAC
+
EEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEE6EEEEEE/EEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEAA
Read2:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGTGGG 2:N:0:ATCACGAC+NAGGTTCA
GAGGGCAGGTTACAAAGAAAGCAGAAAAGTCAAGTTACAACAAAAATGAAAAGATGAGTTGGAGGGAACAGGGGCAGGGTCCA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEE
Thank you very much for your help in advance!
Kaiyang
The text was updated successfully, but these errors were encountered: