Mark duplicates in files with many contigs #361
BTW, you can get around the compiler issue by using
Thanks for the fix @dpryan79! I'll take a careful look. And yes, we are updating to the latest LLVM and LDC.
Thanks @pjotrp! Can you ping me when you tag a new release? I'd like to update our pipeline with the new version then :)
Will do, on the mailing list.
I just realised there are one or two problems with this PR. The first is the coordinate comparison at line 765. It may be the correct thing to do, but it will change results, so we have to check what samtools does here. The second is that we read the library_id and ref_id from the samtools-style index file, where they are defined as the shorter versions. I am not sure what samtools/htslib does there right now. It would be good to check the sizes they are promoting. We may diverge from the index format, but that is obviously something we have to do carefully. wdyt?
The addition on line 765 doesn't change anything, it was part of
Ok, I'll check the index format. Added a reminder: #365. I just managed to create a reproducible build with ldc 1.10 and LLVM 6. Will push a release soon.
This is a resolution to #326 and also something we observed internally (maxplanck-ie/snakepipes#233), where markdup is limited to files with <=16383 contigs. This PR modifies the size of the structures such that all BAM files should be supported, regardless of the number of contigs/scaffolds.
In my tests this produces identical results on files already supported by sambamba markdup, while requiring a small (10-20% peak) increase in RAM. There's no appreciable run-time difference that I've seen.
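For readers wondering where a limit like 16383 could come from: 16383 equals 2^14 - 1, which would be consistent with the reference ID being kept in a 14-bit field. That is an assumption for illustration only; this thread does not confirm the actual layout, and the names below (NARROW_BITS, WIDE_BITS, store_ref_id, collides) are hypothetical and not taken from the sambamba source. A minimal Python sketch of how a narrow field would make distinct contigs indistinguishable, and how widening it avoids that:

```python
# Hypothetical illustration only: this is NOT sambamba's actual data layout.
NARROW_BITS = 14  # assumed width that would yield the 16383 limit
WIDE_BITS = 32    # assumed wider field after resizing the structures


def max_id(bits: int) -> int:
    """Largest unsigned value that fits in `bits` bits."""
    return (1 << bits) - 1


def store_ref_id(ref_id: int, bits: int) -> int:
    """Simulate storing ref_id in a fixed-width unsigned field.

    Bits above the field width are silently dropped.
    """
    return ref_id & max_id(bits)


def collides(a: int, b: int, bits: int) -> bool:
    """True if two different reference IDs look identical once stored."""
    return a != b and store_ref_id(a, bits) == store_ref_id(b, bits)


if __name__ == "__main__":
    print(max_id(NARROW_BITS))
    print(collides(0, 16384, NARROW_BITS))
    print(collides(0, 16384, WIDE_BITS))
```

Under that assumption, a wider field costs more bytes per stored read, which would fit the modest RAM increase reported above.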