
dedup dir-adjacency too slow/does not complete #31

Closed
royfrancis opened this issue May 24, 2016 · 59 comments


@royfrancis

royfrancis commented May 24, 2016

I started with a 12 GB BAM (77 million reads) using 3 cores and 24 GB RAM. It did not complete in 2.5 hours, or even come close; the output BAM after 2.5 hours was 2.3 MB. I then extracted one chromosome to create a 314 MB BAM. Dedup was run on this small BAM for 1 hour and it was still incomplete; a truncated BAM file of 12.4 MB was exported. I ran qualimap bamqc on both files (the one-chromosome BAM before dedup and after dedup).

file        before dedup   after dedup
reads       5,104,918      197,280
dup rate    38.74%         2.27%

So, I suppose the deduplication must be working.

Extrapolating from this run time, my 12 GB BAM would take a lot more than 39 hours. Would this be an expected run time? I was wondering if I was doing something wrong or if some settings/switches are not optimal.

This is my script:

umi_tools dedup \
--method="directional-adjacency" \
--paired \
--output-stats="file-dedup.stats" \
-v 3 \
-E "file-dedup.error" \
-L "file-dedup.log" \
-I "file.bam" \
-S "file-dedup.bam"
@TomSmithCGAT
Member

Hi Roy,

I've not used qualimap bamqc, but from a quick read of the documentation it appears to detect duplicates using the start position. If you're using an aligner which can "clip" the reads, it may mis-estimate the duplication rate. Would you mind running Picard MarkDuplicates as well to compare the output?

Regarding the time taken, this is indeed very slow. Would you be OK sharing the one-chromosome 314 MB undeduped BAM with me, so I can see whether I get the same performance and, if so, where the bottleneck is?

@royfrancis
Author

royfrancis commented May 24, 2016

Hi Tom,
Here is the BAM file for one chromosome. Paired-end .fq files were processed using umi_tools extract and mapped to the zebrafish genome using STAR (default settings with a GTF). I haven't used MarkDuplicates, but I can look into that.

@royfrancis
Author

MarkDuplicates does not seem to complete (no output BAM or metrics report) with the truncated BAM file.

@TomSmithCGAT
Member

Quick side note. You have lots of multimapped reads in your BAM.

samtools view file.bam|cut -f12|sort|uniq -c|sort gives:

26598 NH:i:10
28680 NH:i:9
32512 NH:i:8
34716 NH:i:7
47274 NH:i:6
65130 NH:i:5
132028 NH:i:4
284914 NH:i:3
735984 NH:i:2
3717082 NH:i:1

As it stands, for a read which maps to 10 locations, each mapping location will be considered independently. Thus, the final output will contain between 0 and N alignments for each read, where N is the number of alignments (NH tag). I'm going to add an option to discount secondary and supplementary alignments, which you might want to consider.
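For illustration, a minimal pysam sketch of pre-filtering to primary alignments yourself (file names are placeholders; samtools view -b -F 0x900 achieves the same thing):

import pysam

# Keep only primary alignments: drop secondary (0x100) and supplementary (0x800)
# records so each read contributes a single mapping location to dedup.
with pysam.AlignmentFile("file.bam", "rb") as infile, \
     pysam.AlignmentFile("file.primary.bam", "wb", template=infile) as outfile:
    for read in infile.fetch(until_eof=True):
        if read.is_secondary or read.is_supplementary:
            continue
        outfile.write(read)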

@TomSmithCGAT
Member

On to the real issue of the time taken.

This looks to be solely due to the time taken to find the paired end mates. You can test this yourself by dropping the --paired option. I don't have the profile stats yet because it's still not finished 2 hours later(!).

You have a relatively low duplication rate, so most positions contain just a single read. The outputting therefore requires many calls to the pysam.AlignmentFile.mate() function, which is not an efficient way to retrieve the paired-end read. Ideally, when dealing with paired reads, the BAM is sorted by read name; however, we need a coordinate-sorted BAM in order to perform the duplicate detection. It looks like we might need to implement some sort of caching to speed up the retrieval of the paired ends.
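For context, a minimal sketch of the retrieval pattern being described, assuming a coordinate-sorted and indexed BAM; each call to pysam.AlignmentFile.mate() performs a fresh index lookup, which is what makes this expensive when most positions hold a single read:

import pysam

bamfile = pysam.AlignmentFile("file.bam", "rb")

def get_mate_slow(read):
    # One index-backed lookup per retained read1: seeks to the mate's
    # recorded coordinates and scans the reads there for a matching name.
    try:
        return bamfile.mate(read)
    except ValueError:  # raised if the read is unpaired or the mate is unmapped
        return None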

@AndreasHeger Do you have any suggestions for how to best speed up the paired end retrieval? Is caching the right way to go?

@AndreasHeger
Member

Hi Tom, mate is indeed best avoided in large datasets. Caching is better, but might be complicated to implement if you want to output a sorted file. It might be easier doing two passes on the BAM. To conserve memory, you could do the double pass per chromosome or smaller region. Best, Andreas

@IanSudbery
Member

Thanks for that @AndreasHeger. The output is already non-sorted, so perhaps we should consider caching.

@TomSmithCGAT
Member

@IanSudbery

A very simplistic "two-pass" approach would be to run through the bamfile one chromosome at a time, store the reads as an ordered dictionary mapping read ids to their paired reads, then re-pass through the keys to perform the dedup, emptying the dictionary after each chromosome. I'll test this out as an immediate solution.

Alternatively, we could cache as we go, but this would be a little more complicated within our current workflow. I'll have a think about how best to perform the caching.

Either way, this would have to be done on a per-chromosome basis to avoid potentially huge memory requirements for chimeric read pairs. Perhaps we should fall back on the .mate() function for these read pairs?
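For illustration only, a rough sketch of the per-contig two-pass idea (hypothetical code, not the implementation; the dedup_bundle callable stands in for the directional-adjacency collapsing, and chimeric pairs are simply ignored here rather than falling back to .mate()):

import pysam

def dedup_paired_two_pass(inpath, outpath, dedup_bundle):
    # Two passes per contig: first cache the read2s by query name, then
    # dedup the read1s and emit each surviving pair from the cache.
    with pysam.AlignmentFile(inpath, "rb") as infile, \
         pysam.AlignmentFile(outpath, "wb", template=infile) as outfile:
        for contig in infile.references:
            mates = {}
            for read in infile.fetch(contig):                 # pass 1: cache mates
                if read.is_read2:
                    mates[read.query_name] = read
            for read1 in dedup_bundle(infile.fetch(contig)):  # pass 2: dedup read1s
                outfile.write(read1)
                mate = mates.get(read1.query_name)
                if mate is not None:
                    outfile.write(mate)
            mates.clear()                                     # free memory per contig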

What do you think?

@IanSudbery
Member

@TomSmithCGAT
Two-pass is definitely easier to implement, but we would need to do very careful memory profiling. When I was caching whole chromosomes before, memory usage was getting up into the tens of GB for large files. But that was caching at the read-bundle level (i.e. pre-cluster collapsing), so it might be better when caching after this step.

I think falling back on the .mate() for chimeric reads is the way to go.

A dynamic cache that saved read2s in a dict, and then threw them away or outputted them as their mates were processed, would probably be more memory efficient, but harder to do. Most reads would not spend long in the cache because their mate would be nearby. It might also be slower, though, because for every read processed you'd have to check whether its mate was in the cache, and add it to the cache if not.
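Something along these lines, as a rough sketch of the dynamic cache (the keep_pair callable is a hypothetical stand-in for the dedup decision):

def write_pairs_with_cache(infile, outfile, keep_pair):
    # Whichever read of a pair arrives first waits in the dict; when its
    # mate turns up the pair is resolved and both leave the cache. Most
    # pairs are close together, so reads spend little time cached.
    cache = {}
    for read in infile.fetch(until_eof=True):
        mate = cache.pop(read.query_name, None)
        if mate is None:
            cache[read.query_name] = read       # mate not seen yet: park it
            continue
        if keep_pair(read, mate):               # dedup decision (stand-in)
            outfile.write(mate)
            outfile.write(read)
    return cache                                # leftovers: chimeric/orphan pairs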

@TomSmithCGAT
Member

@IanSudbery
Could you take a look at a 2pass option I've implemented in 35c776c?

I had to change things around a bit to allow the 2pass option to store the read pairs on a per-contig basis. Note the new class "Writer", which takes the reads, passes them to the "ClusterAndReducer" functor and has methods to write out the resultant reads and stats.

I've tested it on the above single-chromosome BAM (5M reads, 2M unique read pairs). The original code took 19 hours! With "--2pass" it takes 22 mins. The maximum virtual memory size according to "ps" was 2.1 GB.

@IanSudbery
Member

@TomSmithCGAT: A couple of points

  1. What happens if there are multimappers?
  2. Isn't it more memory efficient to dedup, save the read1s you want to output, and then go through looking for the read2s?
  3. I think the execution path is getting a bit complicated.

I'm working on a suggestion.

@TomSmithCGAT
Member

@IanSudbery

  1. I think this is a general area that needs more consideration. I'm not sure there is any way to sensibly deal with secondary/supplementary alignments in the deduplication unless we keep another dictionary mapping primary to secondary/supplementary alignments and output all alignments if the primary is retained. We wouldn't want to output secondary/supplementary alignments in cases where the primary was removed, would we? As it stands, the secondary/supplementary alignments are skipped.

  2. You're right, that's a much better approach, i.e. run the two passes the other way around - thanks!

  3. I agree. I guess the execution path will change for your implementation based on point 2?

@IanSudbery
Member

IanSudbery commented May 27, 2016 via email

@TomSmithCGAT
Member

I agree this should be up to the user. On reflection perhaps the secondary/supp. skipping should be optional rather than enforced. I think we should mention multi-mapping in the documentation.

I think we should take a similar approach to chimeric reads - leave them in (pull out the paired end with .mate() as we discussed) and have an option to skip them entirely at the user's choosing. Out of interest, what TLEN do chimeric reads get in the BAM?

@IanSudbery
Member

IanSudbery commented May 28, 2016

I'd go further than that. I'd say that if they want to filter out secondary/supplementary alignments, let them do it themselves, in the way they see fit.

I'm not really sure how mate pairing works in the presence of supplementary alignments. Is the mate of the primary read1 the primary read2, or just the next one in the chain? I suspect this varies from mapper to mapper. The same could be true of TLEN.

Check out my proposal for the two-pass pairing in bbbe020. It's a bit more complicated than I'd anticipated because it has to deal with the chimeric alignment problem, but I think it still represents less radical surgery to the execution path. I profiled using:

time -v umi_tools dedup -I in.bam -S /dev/null --paired -v5 [--2pass]

(I didn't add a separate option for 2pass, but made it the way paired retrieval works.)

It's about 25% slower than your solution (5 min 30 s rather than 4 min 20 s for the test BAM), but uses a quarter as much RAM (resident set 500 MB vs 1.8 GB).

@TomSmithCGAT
Member

@IanSudbery I like the simplicity and cleanness of re-defining the outfile object and its write and close methods for paired-end reads. Nice!

I believe lines 959-964 need to be removed so that read pairs are only written once when using the --ignore-umi option.

I'm surprised it's any slower to be honest but the gain in memory efficiency more than outweighs this small difference.

@TomSmithCGAT
Member

Regarding secondary/supplementary alignments in paired-end data, STAR appears to output a proper pair of alignments, so each alignment1 has its own alignment2, even where the alignment1 is the same in both cases. I expect this will indeed vary for other mappers. Let's do as you say and leave it to the user to generate a suitable BAM, but include some more information in the documentation regarding multimapping.

@IanSudbery
Member

@TomSmithCGAT Thanks for catching that - I've sorted it and submitted a PR.

WRT secondary/supplementary - I was more worried about supplementary than secondary, i.e. where different parts of read1 are mapped to several locations, as are several parts of read2. I think STAR outputs these to a separate file, but BWA puts them all in the same file.

@TomSmithCGAT
Member

@royfrancis Could you update UMI-Tools to the latest version (v0.2.0)? This should speed up the deduplication stage considerably.

@royfrancis
Author

royfrancis commented Jun 1, 2016

Hi Tom,
I just tried it on the same one-chromosome BAM and it completed in 60 mins (2 cores + 16 GB RAM). For comparison, I filtered uniquely mapped reads into a new BAM file and ran dedup again; this time it finished in 37 mins. The good news is that dedup reaches completion. I estimate my original BAM file (unique filtered) would take about 16 hours. Btw, does this new version v0.2.0 incorporate the changes in extract #26?

@IanSudbery
Member

Hi @royfrancis,

I don't suppose it's much help, but you are probably getting hammered by the stats computation: running your chr1 BAM on my machine takes 2 min 30 s without stats and 25 mins with stats. This is probably due to us not having found an efficient way to randomly sample reads from a BAM file.

@royfrancis
Author

royfrancis commented Jun 2, 2016

I ran my full BAM (unique filtered, 7 GB) yesterday for 18 hours and it still failed. Is there some way to output the progress, so as to estimate how much of the BAM file was completed?

@IanSudbery
Member

Hi Roy,

If you turn the verbosity up to 5 with -v5, a message should be printed to the log every 10,000 alignments processed.

@IanSudbery
Member

Hi @royfrancis,
While we work on the speed and memory efficiency of the stats module, perhaps one solution would be to run dedup without stats on your complete data set, and just run the stats module on a one-chromosome subset.

@royfrancis
Author

royfrancis commented Jun 7, 2016

It still did not work for me after a 10 hr run time. I am not sure what stats is, but I excluded --output-stats=, like so:

umi_tools dedup \
--method="directional-adjacency" \
--paired \
-v 5 \
-E "fn-error.txt" \
-L "fn-dedup-log.txt" \
-I "file.bam" \
-S "file-dedup.bam"

@IanSudbery
Member

Yes, exactly.

@IanSudbery
Member

Hi Roy, was there anything in the error or log file?

@royfrancis
Author

royfrancis commented Jun 15, 2016

Here is the log file after running the full BAM for 5 hours (3 cores, 12 GB RAM) without stats. The run was incomplete and timed out. A 1.5 GB dedup BAM file was produced. The error file was empty.
log.txt

@TomSmithCGAT
Member

I'll have a play with the breadth_first_search and see if we can reduce the run time by removing some of the set operations.

I downloaded the sample you linked to, but there don't appear to be any UMIs in this sample?

AAACCCCACC:HWI-D00611:179:C7MR1ANXX:5:1213:10274:14684	163	chr1	26547764	255	43M	=	26547775	49	GGCCTGGTGTGGTGGCTCATGCCGGTAATCCCAGCACTTTGGG	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:80	nM:i:0	NM:i:0	MD:Z:43	jM:B:c,-1	jI:B:i,-1	RG:Z:foo

@gpratt
Contributor

gpratt commented Mar 25, 2017

Whoops, I forgot to adjust the barcode from our in-house location to yours. For some reason, years ago I decided that the UMI should go in front of the read name; AAACCCCACC is the UMI in this case.

Either way, I've uploaded a fixed version of the BAM file to the same location; it should work out of the box now.

I also did what I said would happen: got bored and wrote the fix :). It worked well on my test case that previously finished running, and now I'm testing it on the files that were problematic in the past. I'll let you know soon whether those worked/finished.

@jblachly

jblachly commented Apr 3, 2017

Hi @TomSmithCGAT and @IanSudbery

Congratulations on your paper! Have been following since the blog post. I've just loaded up UMI-tools for the first time this weekend as we finally have our data back, and have also been struck by the incredibly slow speed. Found this issue and thought I would contribute some data. Perhaps it is a misconfiguration on my end? Installed with sudo pip and have also run it once using sudo umi_tools dedup as instructed. There is no CPU or memory pressure on this system. Is it possible cython is not being engaged properly?

Anyway, using ca. 1.5M reads of paired-end targeted resequencing data (where an individual locus can have high [>1000X] depth), I am seeing 90-minute runtimes. Unfortunately the worsening is exponential -- the latest run took 9 hrs for 2.5M reads. I wrote a script to downsample a ~1.5M read BAM file to 10%, 20%, 40%, and 80% and check times both with and without stats. FWIW, the proportional contribution of stats generation to the overall runtime seems to shrink the more reads one has, so I don't think the suggestion earlier in the thread that this could be stats-related is entirely accurate in my case.

Anyway, here are the data; values are in seconds. Yes, the last row is downsampled to 99% (not 100%) as the script was easier that way =) Please let me know if you have suggestions to speed this up, as it is kind of a blocker ...

reads      no stats   with stats
149749     69         98
298075     237        307
593073     837        1031
1174608    3232       3664
1446610    4766       5312

@gpratt
Contributor

gpratt commented Apr 3, 2017 via email

@jblachly

jblachly commented Apr 3, 2017

@gpratt I have the latest commit from the master branch, version 0.4.3, installed through pip. They certainly are on the ball with pip.

@TomSmithCGAT
Member

Hi @jblachly. Thanks for the detailed breakdown of the run time. From what you say, it sounds like the issue is the network-building stage, which is O(n^2) since we perform an all-vs-all comparison between the UMIs. Would you mind sharing your BAM so we can profile the script and see what the bottleneck is?
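To make the scaling concrete, the per-position network build looks roughly like the sketch below (umi_counts is assumed to be a dict of UMI -> read count at one position; the directional criterion shown - connect A to B when the UMIs are within one edit and count(A) >= 2 * count(B) - 1 - follows the description in the documentation):

from itertools import combinations

def edit_distance(a, b):
    # Hamming distance between two equal-length UMIs.
    return sum(x != y for x, y in zip(a, b))

def directional_edges(umi_counts):
    # All-vs-all comparison of the distinct UMIs at one position, hence
    # O(n^2) in the number of UMIs - this dominates at very deep positions.
    edges = []
    for a, b in combinations(umi_counts, 2):
        if edit_distance(a, b) > 1:
            continue
        if umi_counts[a] >= 2 * umi_counts[b] - 1:
            edges.append((a, b))
        elif umi_counts[b] >= 2 * umi_counts[a] - 1:
            edges.append((b, a))
    return edges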

@IanSudbery
Member

Hi @TomSmithCGAT, except that I don't think the stats/no-stats difference tallies with that: when building stats, we do the all-vs-all comparisons with the randomly sampled reads as well as the actual reads at the locus. Shouldn't stats then at the very least double the amount of time spent building edit-distance matrices? This suggests that once again it is the connected-component finding, rather than the edit-distance calculation, that is to blame, as we don't actually build the components for the stats.

Interestingly, the increase we are seeing in those numbers is sub-quadratic (every doubling in reads leads to a roughly three-fold increase in time, whereas quadratic scaling would predict a four-fold increase). To me, the problem here is not the increase in time with increasing depth, but the large amount of time taken to process even the smallest sample. However, the jump from 1.5 hrs to 9 hrs for 1.5M -> 2.5M reads doesn't seem to fit that pattern.

@jblachly Most of our users are doing RNA-seq and often see their run times dominated by the single highest-depth locus. They are seeing runtimes of days caused by deduplication problems at single locations with tens or even hundreds of thousands of reads - but they only have a single locus like this. By having thousands of reads per locus, but many such loci, you may have hit on a particularly pathological case. Nonetheless, 90 seconds for 100,000 reads feels wrong.

Your signature states you are an MD, and you are doing resequencing, so I realise this might be patient data that you are unable to share. However, one solution I can see is to strip the sequence out of the BAM files - BAM files are still valid if the sequence is replaced with '*'. There are two ways to do this: the tool bam2bam from the cgat package has a --strip-sequence option, or you could use samtools view to convert to SAM and then use awk to remove the sequence column. We could then rebuild the BAM at our end. If you send SAM, make sure you zip it first!

If you are worried about us knowing the identity of the genes you are resequencing, then I suggest adding a secret number to each read position (the same number to every read, or even a different number for every gene - as long as it was the same number within a gene).

@jblachly

jblachly commented Apr 3, 2017

Dear @IanSudbery, thanks for your willingness to help with this. I have sent you a note under separate cover with a link to the BAM files.

@IanSudbery
Member

Okay, I have located the problem here...

You have many reads where read1 is on a different chromosome to its mate. This causes some issues: at the end of each chromosome we output all the mates from that chromosome, and if the mates aren't found, they accumulate in a buffer and we go searching for them at the end. This exhaustive search at the end is slow, and is only there as a last resort.

It is particularly slow in your case because, to do the search, all reads that overlap the supposed location are retrieved and searched one by one for the mate. As you have many reads at one location, this is slow. I'm working on something better.

@jblachly

jblachly commented Apr 4, 2017

@IanSudbery -- thank you so much for tracking this down. We are working with cancer samples, so mates mapping to a different contig are going to be a common theme for us (and others in this field). Let me know if I can help in any way with the improvement (I can test other branches etc.).

@IanSudbery
Member

Dear @jblachly,

I've pushed a branch "getting_mates" (see commit c96dd45). It definitely makes things faster for you: on my machine your samples finish in 2 minutes and 3 minutes respectively without stats. The only question is what the changes do for other people.

@TomSmithCGAT Instead of trying to fetch all the reads at the mate location and then sorting through them looking for the read in question, this commit just reopens the input BAM file and scans through the whole thing looking for mates. For the files at hand this takes seconds (1.5M and 2.5M reads respectively). I don't know what it would do for someone with 100M reads where the mates were not all concentrated in one location.
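In outline, the idea is something like the sketch below (illustrative code rather than the commit itself; missing is assumed to be a set of query names whose read1 was written but whose mate was never encountered):

import pysam

def scan_for_orphan_mates(inpath, outfile, missing):
    # Last-resort pass: reopen the BAM and stream through it once, writing
    # the read2 of every pair whose read1 survived dedup but whose mate was
    # never seen (e.g. chimeric pairs), instead of one index fetch per mate.
    with pysam.AlignmentFile(inpath, "rb") as infile:
        for read in infile.fetch(until_eof=True):
            if read.is_read2 and read.query_name in missing:
                outfile.write(read)
                missing.discard(read.query_name)
                if not missing:        # stop early once everything is found
                    break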

@TomSmithCGAT
Member

@IanSudbery At worst this will only ever mean passing through the whole BAM file once more, which is tolerable compared to the pathologically poor performance in this case when using the index to try to specifically retrieve the missing pairs. It will also obviously only be triggered when the user has chimeric reads, since otherwise the read1s set will be emptied with each call to write_mates. I've made a couple of comments on the commit. I'm very happy to merge this into master when it's ready.

It looks like this method of outputting the orphan read pairs was implemented in the first instance of TwoPassPairWriter, so it could well explain some of the other slow run times.

@IanSudbery
Member

IanSudbery commented Apr 4, 2017 via email

@TomSmithCGAT
Member

TomSmithCGAT commented Apr 5, 2017

P.S. Amazing improvement in run time, by the way, @IanSudbery. That's >1 hour to <5 mins, right?! Thanks @jblachly for bringing this to our attention and sharing your data with us. Hopefully we've covered most of the unusual BAMs now and the run time issues have been largely resolved 🤞

@crutching

Adding my own experience: I have been assessing this collection of tools for use in one of our processing pipelines. Using the 0.4.3 release, I am running a paired-end analysis on a 36-million-line BAM produced from amplicon alignments. One run is on chromosome 21 only, and the other is on the whole file. The chr21 run looked to be done in less than a minute, but has since been running for over 2 hours on the unmatched-mates stage (~3,500 unmatched). The other run claims 110k unmatched mates, and took maybe 30 min to make the first pass through the file.

Pulling the gather_mates branch seems to improve the situation dramatically. I am now seeing no more than a minute spent on the finding mates step, at least when run on the full data set. When running on the single chromosome, I see the following error:

UnboundLocalError: local variable 'gene_tag' referenced before assignment

Not a big deal, it doesn't look like I will need to scatter gather the dedup step now that the runtime is more reasonable, but I thought it was worth bringing up.

Last thing, not completely related to the thread. I generally see entries in the final output that are not completely sorted. Up until this point, I have had to sort before and after dedup so that the next tool in the pipeline won't break. Is this something that you already know about?

Thanks for all the work on this tool!

@IanSudbery
Member

IanSudbery commented May 5, 2017 via email

@TomSmithCGAT
Member

Hi jhl667. Thanks for mentioning the error with the --chrom option. It turns out this option was not covered by testing, so this bug was missed when it was introduced. The fix will be included in v0.4.4.

@TomSmithCGAT
Member

@IanSudbery I'm not sure how we can output in alignment start-position order, unless we cache all reads per contig, which could be very memory intensive. We might be able to reduce the memory usage somewhat with a sort of 'rolling cache' where we output reads from the cache once they are sufficiently upstream of the reads being read in. Even then, chimeric read pairs would cause problems, unless we retrieve them with calls to get_mate(), which would have performance issues. If we did implement sorted output order, I think we would definitely want to make it optional and warn users about the memory and performance issues.

@IanSudbery
Member

No, I think we would have to output to a temp file and then sort to the output file.
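i.e. something along these lines (a sketch only; pysam.sort and pysam.index just dispatch to samtools, and the file names are illustrative):

import pysam

# dedup writes its output to a temporary, unsorted BAM first ...
pysam.sort("-o", "file-dedup.sorted.bam", "file-dedup.unsorted.bam")
pysam.index("file-dedup.sorted.bam")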

@TomSmithCGAT
Member

Ah OK. Yeah, I see no reason not to add this as an option. Perhaps make it clear to the user that this is how it's going to work so they don't try to pipe the output - not that I think anyone is actually doing this.

@crutching

@TomSmithCGAT @IanSudbery OK, so just to be clear, the sensible thing to do in a workflow utilizing umi_tools dedup is to sort before and after? I haven't actually assessed the speed difference between sorting or not before dedup, but I am assuming it would make a significant difference? Honestly, for me, including an extra sort step doesn't make that much of a difference, though I still like to keep things as optimized as possible.

@TomSmithCGAT
Member

Hi @jh667. An option to sort the output will be included in version 0.5. See #120.

@TomSmithCGAT
Member

I'm closing this issue now as the run time issues seem to have been dealt with. Any future issues with run time can be discussed in a separate issue.
