
dedup dir-adjacency too slow/does not complete #31

Closed
royfrancis opened this issue May 24, 2016 · 59 comments


@royfrancis

royfrancis commented May 24, 2016

I started with a 12 GB BAM (77 million reads) using 3 cores and 24 GB RAM. It did not complete in 2.5 hours, or even come close; the output BAM after 2.5 hours was 2.3 MB. I then extracted one chromosome to create a 314 MB BAM. Dedup was run on this small BAM for 1 hour and it was still incomplete; a truncated BAM file of 12.4 MB was exported. I ran qualimap bamqc on both files (the one-chromosome BAM before dedup and after dedup).

file        before dedup   after dedup
reads       5,104,918      197,280
dup rate    38.74%         2.27%

So, I suppose the deduplication must be working.

Extrapolating from this run time, my 12 GB BAM would take a lot more than 39 hours. Would this be an expected run time? I was wondering if I was doing something wrong or if some settings/switches are not optimal.

This is my script:

umi_tools dedup \
--method="directional-adjacency" \
--paired \
--output-stats="file-dedup.stats" \
-v 3 \
-E "file-dedup.error" \
-L "file-dedup.log" \
-I "file.bam" \
-S "file-dedup.bam"
@TomSmithCGAT
Member

Hi Roy,

I've not used qualimap bamqc, but from a quick read of the documentation it appears to detect duplicates using the start position. If you're using an aligner which can "clip" the reads, it may mis-estimate the duplication rate. Would you mind running Picard MarkDuplicates as well to compare the output?

Regarding the time taken, this is indeed very slow. Would you be OK sharing the one-chromosome 314 MB undeduped BAM with me, so I can see whether I get the same performance and, if so, where the bottleneck is?

@royfrancis
Author

royfrancis commented May 24, 2016

Hi Tom,
Here is the BAM file for one chromosome. Paired-end .fq files were processed using umi_tools extract and mapped to the zebrafish genome using STAR (default settings with a GTF). I haven't used MarkDuplicates, but I can look into that.

@royfrancis
Author

MarkDuplicates does not seem to complete (no output BAM or metrics report) with the truncated BAM file.

@TomSmithCGAT
Member

Quick side note. You have lots of multimapped reads in your BAM.

samtools view file.bam|cut -f12|sort|uniq -c|sort gives:

26598 NH:i:10
28680 NH:i:9
32512 NH:i:8
34716 NH:i:7
47274 NH:i:6
65130 NH:i:5
132028 NH:i:4
284914 NH:i:3
735984 NH:i:2
3717082 NH:i:1

As it stands, for a read which maps to 10 locations, each mapping location will be considered independently. Thus, the final output will contain between 0 and N alignments for each read, where N is the number of alignments (NH tag). I'm going to add an option to discount secondary and supplementary alignments, which you might want to consider.
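For illustration, a minimal pysam sketch of pre-filtering to primary alignments yourself (file names are placeholders; samtools view -b -F 0x900 achieves the same thing):

import pysam

# Keep only primary alignments: drop secondary (0x100) and supplementary (0x800)
# records so each read contributes a single mapping location to dedup.
with pysam.AlignmentFile("file.bam", "rb") as infile, \
     pysam.AlignmentFile("file.primary.bam", "wb", template=infile) as outfile:
    for read in infile.fetch(until_eof=True):
        if read.is_secondary or read.is_supplementary:
            continue
        outfile.write(read)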

@TomSmithCGAT
Member

On to the real issue of the time taken.

This looks to be solely due to the time taken to find the paired end mates. You can test this yourself by dropping the --paired option. I don't have the profile stats yet because it's still not finished 2 hours later(!).

You have a relatively low duplication rate, so most positions contain just a single read. The outputting therefore requires many calls to the pysam.AlignmentFile.mate() function, which is not an efficient way to retrieve the paired-end read. Ideally, when dealing with paired reads, the BAM is sorted by read name; however, we need a coordinate-sorted BAM in order to perform the duplicate detection. It looks like we might need to implement some sort of caching to speed up the retrieval of the paired ends.
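For context, a minimal sketch of the retrieval pattern being described, assuming a coordinate-sorted and indexed BAM; each call to pysam.AlignmentFile.mate() performs a fresh index lookup, which is what makes this expensive when most positions hold a single read:

import pysam

bamfile = pysam.AlignmentFile("file.bam", "rb")

def get_mate_slow(read):
    # One index-backed lookup per retained read1: seeks to the mate's
    # recorded coordinates and scans the reads there for a matching name.
    try:
        return bamfile.mate(read)
    except ValueError:  # raised if the read is unpaired or the mate is unmapped
        return None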

@AndreasHeger Do you have any suggestions for how to best speed up the paired end retrieval? Is caching the right way to go?

@AndreasHeger
Member

Hi Tom, mate is indeed best avoided in large datasets. Caching is better, but might be complicated to implement if you want to output a sorted file. It might be easier doing two passes on the BAM. To conserve memory, you could do the double pass per chromosome or smaller region. Best, Andreas

@IanSudbery
Member

Thanks for that @AndreasHeger. The output is already non-sorted, so perhaps we should consider caching.

@TomSmithCGAT
Member

@IanSudbery

A very simplistic "two-pass" approach would be to run through the bamfile one chromosome at a time, store the reads as an ordered dictionary mapping read ids to their paired reads, then re-pass through the keys to perform the dedup, emptying the dictionary after each chromosome. I'll test this out as an immediate solution.

Alternatively, we could cache as we go, but this would be a little more complicated within our current workflow. I'll have a think about how best to perform the caching.

Either way, this would have to be done on a per-chromosome basis to avoid potentially huge memory requirements for chimeric read pairs. Perhaps we should fall back on the .mate() function for these read pairs?
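For illustration only, a rough sketch of the per-contig two-pass idea (hypothetical code, not the implementation; the dedup_bundle callable stands in for the directional-adjacency collapsing, and chimeric pairs are simply ignored here rather than falling back to .mate()):

import pysam

def dedup_paired_two_pass(inpath, outpath, dedup_bundle):
    # Two passes per contig: first cache the read2s by query name, then
    # dedup the read1s and emit each surviving pair from the cache.
    with pysam.AlignmentFile(inpath, "rb") as infile, \
         pysam.AlignmentFile(outpath, "wb", template=infile) as outfile:
        for contig in infile.references:
            mates = {}
            for read in infile.fetch(contig):                 # pass 1: cache mates
                if read.is_read2:
                    mates[read.query_name] = read
            for read1 in dedup_bundle(infile.fetch(contig)):  # pass 2: dedup read1s
                outfile.write(read1)
                mate = mates.get(read1.query_name)
                if mate is not None:
                    outfile.write(mate)
            mates.clear()                                     # free memory per contig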

What do you think?

@IanSudbery
Member

@TomSmithCGAT
Two-pass is definitely easier to implement, but we would need to do very careful memory profiling. When I was caching whole chromosomes before, memory usage was getting up into the tens of GB for large files. But that was caching at the read-bundle level (i.e. pre-cluster collapsing), so it might be better when caching after this step.

I think falling back on the .mate() for chimeric reads is the way to go.

A dynamic cache that saved read2s in a dict, and then threw them away or outputted them as their mates were processed, would probably be more memory efficient, but harder to do. Most reads would not spend long in the cache because their mate would be nearby. It might also be slower, though, because for every read processed you'd have to check whether its mate was in the cache, and add it to the cache if not.
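Something along these lines, as a rough sketch of the dynamic cache (the keep_pair callable is a hypothetical stand-in for the dedup decision):

def write_pairs_with_cache(infile, outfile, keep_pair):
    # Whichever read of a pair arrives first waits in the dict; when its
    # mate turns up the pair is resolved and both leave the cache. Most
    # pairs are close together, so reads spend little time cached.
    cache = {}
    for read in infile.fetch(until_eof=True):
        mate = cache.pop(read.query_name, None)
        if mate is None:
            cache[read.query_name] = read       # mate not seen yet: park it
            continue
        if keep_pair(read, mate):               # dedup decision (stand-in)
            outfile.write(mate)
            outfile.write(read)
    return cache                                # leftovers: chimeric/orphan pairs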

@TomSmithCGAT
Member

@IanSudbery
Could you take a look at a 2pass option I've implemented in 35c776c?

I had to change things around a bit to allow the 2pass option to store the read pairs on a per-contig basis. Note the new class "Writer", which takes the reads, passes them to the "ClusterAndReducer" functor and has methods to write out the resultant reads and stats.

I've tested it on the above single-chromosome BAM (5M reads, 2M unique read pairs). The original code took 19 hours! With "--2pass" it takes 22 mins. The maximum virtual memory size according to "ps" was 2.1 GB.

@IanSudbery
Member

@TomSmithCGAT: A couple of points

  1. What happens if there are multimappers?
  2. Isn't it more memory efficient to dedup, save the read1s you want to output, and then go through looking for the read2s?
  3. I think the execution path is getting a bit complicated.

I'm working on a suggestion.

@TomSmithCGAT
Member

@IanSudbery

  1. I think this is a general area that needs more consideration. I'm not sure there is any way to sensibly deal with secondary/supplementary alignments in the deduplication unless we keep another dictionary mapping primary to secondary/supplementary alignments and output all alignments if the primary is retained. We wouldn't want to output secondary/supplementary alignments in cases where the primary was removed, would we? As it stands, the secondary/supplementary alignments are skipped.

  2. You're right, that's a much better approach, i.e. run the two passes the other way around - thanks!

  3. I agree. I guess the execution path will change for your implementation based on point 2?

@IanSudbery
Member

IanSudbery commented May 27, 2016 via email

@TomSmithCGAT
Member

I agree this should be up to the user. On reflection perhaps the secondary/supp. skipping should be optional rather than enforced. I think we should mention multi-mapping in the documentation.

I think we should take a similar approach to chimeric reads - leave them in (pull out the paired end with .mate() as we discussed) and have an option to skip them entirely at the user's choosing. Out of interest, what TLEN do chimeric reads get in the BAM?

@IanSudbery
Member

IanSudbery commented May 28, 2016

I'd go further than that. I'd say that if they want to filter out secondary/supplementary alignments, let them do it themselves, in the way they see fit.

I'm not really sure how mate pairing works in the presence of supplementary alignments. Is the mate of the primary read1 the primary read2, or just the next one in the chain? I suspect this varies from mapper to mapper. The same could be true of TLEN.

Check out my proposal for the two-pass pairing in bbbe020. It's a bit more complicated than I'd anticipated because it has to deal with the chimeric alignment problem, but I think it still represents less radical surgery to the execution path. I profiled using:

time -v umi_tools dedup -I in.bam -S /dev/null --paired -v5 [--2pass]

(I didn't add a separate option for 2pass, but made it the way paired retrieval works.)

It's about 25% slower than your solution (5 min 30 s rather than 4 min 20 s for the test BAM), but uses a quarter as much RAM (resident set 500 MB vs 1.8 GB).

@TomSmithCGAT
Member

@IanSudbery I like the simplicity and cleanness of re-defining the outfile object and its write and close methods for paired-end reads. Nice!

I believe lines 959-964 need to be removed so that read pairs are only written once when using the --ignore-umi option.

I'm surprised it's any slower to be honest but the gain in memory efficiency more than outweighs this small difference.

@TomSmithCGAT
Member

Regarding secondary/supplementary alignments in paired-end data, STAR appears to output a proper pair of alignments, so each alignment1 has its own alignment2, even where the alignment1 is the same in both cases. I expect this will indeed vary for other mappers. Let's do as you say and leave it to the user to generate a suitable BAM, but include some more information in the documentation regarding multimapping.

@IanSudbery
Member

@TomSmithCGAT Thanks for catching that - I've sorted it and submitted a PR.

WRT secondary/supplementary - I was more worried about supplementary than secondary, i.e. where different parts of read1 are mapped to several locations, as are several parts of read2. I think STAR outputs these to a separate file, but BWA puts them all in the same file.

@TomSmithCGAT
Member

@royfrancis Could you update UMI-Tools to the latest version (v0.2.0)? This should speed up the deduplication stage considerably.

@royfrancis
Author

royfrancis commented Jun 1, 2016

Hi Tom,
I just tried it on the same one-chromosome BAM and it completed in 60 mins (2 cores + 16 GB RAM). For comparison, I filtered uniquely mapped reads into a new BAM file and ran dedup again; this time it finished in 37 mins. The good news is that dedup reaches completion. I estimate my original BAM file (unique filtered) would take about 16 hours. Btw, does this new version v0.2.0 incorporate the changes in extract #26?

@IanSudbery
Member

Hi @royfrancis,

I don't suppose it's much help, but you are probably getting hammered by the stats computation: running your chr1 BAM on my machine takes 2 min 30 s without stats and 25 mins with stats. This is probably due to us not having found an efficient way to randomly sample reads from a BAM file.

@royfrancis
Author

royfrancis commented Jun 2, 2016

I ran my full BAM (unique filtered, 7 GB) yesterday for 18 hours and it still failed. Is there some way to output the progress, so as to estimate how much of the BAM file was completed?

@IanSudbery
Member

Hi Roy,

If you turn the verbosity up to 5 with -v5, a message should be printed to the log every 10,000 alignments processed.

@IanSudbery
Member

Hi @royfrancis,
While we work on the speed and memory efficiency of the stats module, perhaps one solution would be to run dedup without stats on your complete data set, and just run the stats module on a one-chromosome subset.

@royfrancis
Author

royfrancis commented Jun 7, 2016

It still did not work for me after a 10 hr run time. I am not sure what stats is, but I excluded --output-stats=, like so:

umi_tools dedup \
--method="directional-adjacency" \
--paired \
-v 5 \
-E "fn-error.txt" \
-L "fn-dedup-log.txt" \
-I "file.bam" \
-S "file-dedup.bam"

@IanSudbery
Member

Yes, exactly.

@IanSudbery
Member

Hi Roy, was there anything in the error or log file?

@royfrancis
Author

royfrancis commented Jun 15, 2016

Here is the log file after running the full BAM for 5 hours (3 cores, 12 GB RAM) without stats. The run was incomplete and timed out. A 1.5 GB dedup BAM file was produced. The error file was empty.
log.txt

@TomSmithCGAT
Member

I'll have a play with the breadth_first_search and see if we can reduce the run time by removing some of the set operations.

I downloaded the sample you linked to, but there don't appear to be any UMIs in this sample?

AAACCCCACC:HWI-D00611:179:C7MR1ANXX:5:1213:10274:14684	163	chr1	26547764	255	43M	=	26547775	49	GGCCTGGTGTGGTGGCTCATGCCGGTAATCCCAGCACTTTGGG	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:80	nM:i:0	NM:i:0	MD:Z:43	jM:B:c,-1	jI:B:i,-1	RG:Z:foo

@gpratt
Contributor

gpratt commented Mar 25, 2017

Whoops, I forgot to adjust the barcode from our in-house location to yours. For some reason, years ago I decided that the UMI should go in front of the read name; AAACCCCACC is the UMI in this case.

Either way, I've uploaded a fixed version of the BAM file to the same location; it should work out of the box now.

I also did what I said would happen: got bored and wrote the fix :). It worked well on my test case that previously finished running, and now I'm testing it on the files that were problematic in the past. I'll let you know soon whether those worked/finished.

@jblachly

jblachly commented Apr 3, 2017

Hi @TomSmithCGAT and @IanSudbery

Congratulations on your paper! Have been following since the blog post. I've just loaded up UMI-tools for the first time this weekend as we finally have our data back, and have also been struck by the incredibly slow speed. Found this issue and thought I would contribute some data. Perhaps it is a misconfiguration on my end? Installed with sudo pip and have also run it once using sudo umi_tools dedup as instructed. There is no CPU or memory pressure on this system. Is it possible cython is not being engaged properly?

Anyway, using ca. 1.5M reads of paired-end targeted resequencing data (where an individual locus can have high [>1000X] depth), I am seeing 90-minute runtimes. Unfortunately the worsening is exponential -- the latest run took 9 hrs for 2.5M reads. I wrote a script to downsample a ~1.5M read BAM file to 10%, 20%, 40%, and 80% and check times both with and without stats. FWIW, the proportional contribution of stats generation to the overall runtime seems to shrink the more reads one has, so I don't think the suggestion earlier in the thread that this could be stats-related is entirely accurate in my case.

Anyway, here are the data; values are in seconds. Yes, the last row is downsampled to 99% (not 100%) as the script was easier that way =) Please let me know if you have suggestions to speed this up, as it is kind of a blocker ...

reads      no stats   with stats
149749     69         98
298075     237        307
593073     837        1031
1174608    3232       3664
1446610    4766       5312

@gpratt
Contributor

gpratt commented Apr 3, 2017 via email

@jblachly

jblachly commented Apr 3, 2017

@gpratt I have the latest commit from the master branch, version 0.4.3, installed through pip. They certainly are on the ball with pip.

@TomSmithCGAT
Member

Hi @jblachly. Thanks for the detailed breakdown of the run time. From what you say, it sounds like the issue is the network-building stage, which is O(n^2) since we perform an all-vs-all comparison between the UMIs. Would you mind sharing your BAM so we can profile the script and see what the bottleneck is?
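To make the scaling concrete, the per-position network build looks roughly like the sketch below (umi_counts is assumed to be a dict of UMI -> read count at one position; the directional criterion shown - connect A to B when the UMIs are within one edit and count(A) >= 2 * count(B) - 1 - follows the description in the documentation):

from itertools import combinations

def edit_distance(a, b):
    # Hamming distance between two equal-length UMIs.
    return sum(x != y for x, y in zip(a, b))

def directional_edges(umi_counts):
    # All-vs-all comparison of the distinct UMIs at one position, hence
    # O(n^2) in the number of UMIs - this dominates at very deep positions.
    edges = []
    for a, b in combinations(umi_counts, 2):
        if edit_distance(a, b) > 1:
            continue
        if umi_counts[a] >= 2 * umi_counts[b] - 1:
            edges.append((a, b))
        elif umi_counts[b] >= 2 * umi_counts[a] - 1:
            edges.append((b, a))
    return edges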

@IanSudbery
Member

Hi @TomSmithCGAT, except that I don't think the stats/no-stats difference tallies with that: when building stats, we do the all-vs-all comparisons with the randomly sampled reads as well as the actual reads at the locus. Shouldn't stats then at the very least double the amount of time spent building edit-distance matrices? This suggests that once again it is the connected-component finding, rather than the edit-distance calculation, that is to blame, as we don't actually build the components for the stats.

Interestingly, the increase we are seeing in those numbers is sub-quadratic (every doubling in reads leads to a roughly three-fold increase in time, whereas quadratic scaling would predict a four-fold increase). To me, the problem here is not the increase in time with increasing depth, but the large amount of time taken to process even the smallest sample. However, the jump from 1.5 hrs to 9 hrs for 1.5M -> 2.5M reads doesn't seem to fit that pattern.

@jblachly Most of our users are doing RNA-seq and often see their run times dominated by the single highest-depth locus. They are seeing runtimes of days caused by deduplication problems at single locations with tens or even hundreds of thousands of reads - but they only have a single locus like this. By having thousands of reads per locus, but many such loci, you may have hit on a particularly pathological case. Nonetheless, 90 seconds for 100,000 reads feels wrong.

Your signature states you are an MD, and you are doing resequencing, so I realise this might be patient data that you are unable to share. However, one solution I can see is to strip the sequence out of the BAM files - BAM files are still valid if the sequence is replaced with '*'. There are two ways to do this: the tool bam2bam from the cgat package has a --strip-sequence option, or you could use samtools view to convert to SAM and then use awk to remove the sequence column. We could then rebuild the BAM at our end. If you send SAM, make sure you zip it first!

If you are worried about us knowing the identity of the genes you are resequencing, then I suggest adding a secret number to each read position (the same number to every read, or even a different number for every gene - as long as it was the same number within a gene).

@jblachly

jblachly commented Apr 3, 2017

Dear @IanSudbery, thanks for your willingness to help with this. I have sent you a note under separate cover with a link to the BAM files.

@IanSudbery
Member

Okay, I have located the problem here...

You have many reads where read1 is on a different chromosome to its mate. This causes some issues: at the end of each chromosome we output all the mates from that chromosome, and if the mates aren't found, they accumulate in a buffer and we go searching for them at the end. This exhaustive search at the end is slow, and is only there as a last resort.

It is particularly slow in your case because, to do the search, all reads that overlap the supposed location are retrieved and searched one by one for the mate. As you have many reads at one location, this is slow. I'm working on something better.

@jblachly

jblachly commented Apr 4, 2017

@IanSudbery -- thank you so much for tracking this down. We are working with cancer samples, so mates mapping to a different contig are going to be a common theme for us (and others in this field). Let me know if I can help in any way with the improvement (I can test other branches etc.).

@IanSudbery
Member

Dear @jblachly,

I've pushed a branch "getting_mates" (see commit c96dd45). It definitely makes things faster for you: on my machine your samples finish in 2 minutes and 3 minutes respectively without stats. The only question is what the changes do for other people.

@TomSmithCGAT Instead of trying to fetch all the reads at the mate location and then sorting through them looking for the read in question, this commit just reopens the input BAM file and scans through the whole thing looking for mates. For the files at hand this takes seconds (1.5M and 2.5M reads respectively). I don't know what it would do for someone with 100M reads where the mates were not all concentrated in one location.
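In outline, the idea is something like the sketch below (illustrative code rather than the commit itself; missing is assumed to be a set of query names whose read1 was written but whose mate was never encountered):

import pysam

def scan_for_orphan_mates(inpath, outfile, missing):
    # Last-resort pass: reopen the BAM and stream through it once, writing
    # the read2 of every pair whose read1 survived dedup but whose mate was
    # never seen (e.g. chimeric pairs), instead of one index fetch per mate.
    with pysam.AlignmentFile(inpath, "rb") as infile:
        for read in infile.fetch(until_eof=True):
            if read.is_read2 and read.query_name in missing:
                outfile.write(read)
                missing.discard(read.query_name)
                if not missing:        # stop early once everything is found
                    break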

@TomSmithCGAT
Member

@IanSudbery At worst this will only ever mean passing through the whole BAM file once more, which is tolerable compared to the pathologically poor performance in this case when using the index to try to specifically retrieve the missing pairs. It will also obviously only be triggered when the user has chimeric reads, since otherwise the read1s set will be emptied with each call to write_mates. I've made a couple of comments on the commit. I'm very happy to merge this into master when it's ready.

It looks like this method of outputting the orphan read pairs was implemented in the first instance of TwoPassPairWriter, so it could well explain some of the other slow run times.

@IanSudbery
Member

IanSudbery commented Apr 4, 2017 via email

@TomSmithCGAT
Member

TomSmithCGAT commented Apr 5, 2017

P.S. Amazing improvement in run time, by the way, @IanSudbery. That's >1 hour to <5 mins, right?! Thanks @jblachly for bringing this to our attention and sharing your data with us. Hopefully we've covered most of the unusual BAMs now and the run time issues have been largely resolved 🤞

@crutching

Adding my own experience: I have been assessing this collection of tools for use in one of our processing pipelines. Using the 0.4.3 release, I am running a paired-end analysis on a 36-million-line BAM produced from amplicon alignments. One run is on chromosome 21 only, and the other is on the whole file. The chr21 run looked to be done in less than a minute, but has since been running for over 2 hours on the unmatched-mates stage (~3,500 unmatched). The other run claims 110k unmatched mates, and took maybe 30 min to make the first pass through the file.

Pulling the gather_mates branch seems to improve the situation dramatically. I am now seeing no more than a minute spent on the finding mates step, at least when run on the full data set. When running on the single chromosome, I see the following error:

UnboundLocalError: local variable 'gene_tag' referenced before assignment

Not a big deal, it doesn't look like I will need to scatter gather the dedup step now that the runtime is more reasonable, but I thought it was worth bringing up.

Last thing, not completely related to the thread. I generally see entries in the final output that are not completely sorted. Up until this point, I have had to sort before and after dedup so that the next tool in the pipeline won't break. Is this something that you already know about?

Thanks for all the work on this tool!

@IanSudbery
Member

IanSudbery commented May 5, 2017 via email

@TomSmithCGAT
Member

Hi jhl667. Thanks for mentioning the error with the --chrom option. It turns out this option was not covered by testing, so this bug was missed when it was introduced. The fix will be included in v0.4.4.

@TomSmithCGAT
Member

@IanSudbery I'm not sure how we can output in alignment start-position order, unless we cache all reads per contig, which could be very memory intensive. We might be able to reduce the memory usage somewhat with a sort of 'rolling cache' where we output reads from the cache once they are sufficiently upstream of the reads being read in. Even then, chimeric read pairs would cause problems, unless we retrieve them with calls to get_mate(), which would have performance issues. If we did implement sorted output order, I think we would definitely want to make it optional and warn users about the memory and performance issues.

@IanSudbery
Member

No, I think we would have to output to a temp file and then sort to the output file.
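i.e. something along these lines (a sketch only; pysam.sort and pysam.index just dispatch to samtools, and the file names are illustrative):

import pysam

# dedup writes its output to a temporary, unsorted BAM first ...
pysam.sort("-o", "file-dedup.sorted.bam", "file-dedup.unsorted.bam")
pysam.index("file-dedup.sorted.bam")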

@TomSmithCGAT
Member

Ah OK. Yeah, I see no reason not to add this as an option. Perhaps make it clear to the user that this is how it's going to work so they don't try to pipe the output - not that I think anyone is actually doing this.

@crutching

@TomSmithCGAT @IanSudbery OK, so just to be clear, the sensible thing to do in a workflow utilizing umi_tools dedup is to sort before and after? I haven't actually assessed the speed difference between sorting or not before dedup, but I am assuming it would make a significant difference? Honestly, for me, including an extra sort step doesn't make that much of a difference, though I still like to keep things as optimized as possible.

@TomSmithCGAT
Member

Hi @jh667. An option to sort the output will be included in version 0.5. See #120.

@TomSmithCGAT
Member

I'm closing this issue now as the run time issues seem to have been dealt with. Any future issues with run time can be discussed in a separate issue.
