Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark duplicates in files with many contigs #361

Merged
merged 2 commits into from
Aug 29, 2018
Merged

Conversation

dpryan79
Copy link
Contributor

This is a resolution to #326 and also something we observed internally (maxplanck-ie/snakepipes#233 ) where markdup is limited to files with <=16383 contigs. This PR modifies the size of the structures such that all BAM files should be supported, regardless of the number of contigs/scaffolds.

In my tests this produces identical results on files already supported by sambamba markdup, while requiring a small (10-20% peak) increase in RAM. There's no appreciable run-time difference that I've seen.

@dpryan79
Copy link
Contributor Author

BTW, you can get around the compiler issue by using LLVM 6.0.0 and LDC 1.10.0.

@pjotrp
Copy link
Member

pjotrp commented Aug 29, 2018

Thanks for the fix @dpryan79! I'll take a careful look. And yes, we are updating to LLVM and LDC latest.

@pjotrp pjotrp merged commit d4d46c4 into biod:master Aug 29, 2018
@dpryan79 dpryan79 deleted the 64bit_enum branch August 29, 2018 11:17
@dpryan79
Copy link
Contributor Author

Thanks @pjotrp! Can you ping me when you tag a new release? I'd like to update our pipeline with the new version then :)

@pjotrp
Copy link
Member

pjotrp commented Aug 30, 2018

Will do on the mailing list

@pjotrp
Copy link
Member

pjotrp commented Sep 9, 2018

I just realised there are one or two problems with this PR. The first one is the coordinate comparison at line 765. It may be the correct thing to do, but it will change results. We have to check what samtools is doing here.

The second problem is that to read the library_id and ref_id we fetch them from the samtools-style index file which are defined as the shorter versions. I am not sure what samtools/htslib is doing there right now. Be good to check the sizes they are promoting. We may diverge with the index format, but that is obviously something we have to do carefully.

wdyt?

@dpryan79
Copy link
Contributor Author

dpryan79 commented Sep 9, 2018

The addition on line 765 doesn't change anything, it was part of *cast(ulong*)(&e1) == *cast(ulong*)(&e2) previously, but since the relevant part of e1 and e2 no longer fit in a ulong the comparison needs to be made to ensure that duplicates aren't over called. In my tests with real files this doesn't change any results.

ref_id is 32 bit in htslib, though since it's signed only half of that is actually usable. We rarely need to use read groups, so I'm not overly familiar with how that's handled in htslib.

@pjotrp
Copy link
Member

pjotrp commented Sep 9, 2018

Ok I'll check the index format. Added a reminder #365.

I just managed to create a reproducable build with ldc 1.10 and LLVM 6. Will push a release soon.

pjotrp added a commit to pjotrp/sambamba that referenced this pull request Sep 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants