Fix indexing bug with "placed" unmapped reads. #1352
An example file:
Build indices, using an older samtools binary and this PR:
Then query them:
I've verified that Picard's BuildBamIndex files are also compatible, and they act more like this PR does (with samtools able to query them correctly). Note this is related to #1152, specifically implementing what was previously suggested in #1152 (comment). In theory it should fix this problem, but does not appear to do so. However the two issues are orthogonal:
Looking back at old versions, it appears it's once again the linear index which broke the ability to cope with placed reads. From 45715e4 onwards we don't return the first record with
Actually I was just looking at the test suite, and then ran away screaming. I have a total mental block in understanding that hideous Perl script (hence preferring the simpler Bourne shell equivalent when I can get away with it). It does need something though, I'll agree. As for who reported it, I was simply going on the most recent comment (via email) from yesterday. At the time I wrote the commit message I didn't realise you had already identified this particular bug. I could recall there being prior work, but my (wrong) recollection was that it was related to the linear index and not the main index. Frankly the entire indexing strategy is also something that causes a mental block. As soon as I stop working on it, it goes out of my head again. :/
e4a24ad to 804ff07
For HTSlib (and testing of its API functions in particular), I'm quite a fan of the precision available from writing tests directly in C…
f5d11f9 to 2b99b13
Hmm, not sure why Linux can auto-index when writing a BAM but not Mac or Windows. I could change it to two commands (test_view and test_index), but this seems to be hinting at a bug somewhere so I'd rather not paper over it. I have msys locally so can test it tomorrow.
Unmapped-but-placed (having REF/POS) reads are not included in the index. Hence if a placed unmapped read is the first record in a bin, then it may not be returned. Note most aligners write out mapped followed by unmapped, which does not trigger this problem.
The SAM spec states that all unmapped placed reads should be considered as having an alignment length of 1. While it doesn't seem to explicitly state these must therefore be in the index, it does imply it. It appears that Picard also indexes placed reads in this manner.
Originally reported by John Marshall. Fixes samtools#1142
73d2207 to 1108b12
Things I learnt today: getopt differs between Linux, macOS and Windows. The Linux one can do "cmd arg -x opt" fine, but on Windows/macOS it must be in the form "cmd -x opt arg". Something to unlearn then. (I'm probably just too used to
See also this rant and the linked archaeology, and the first paragraph of GNU Coding Standards §4.8. This is one of my personal hobby horses and I'm surprised I haven't tweeted about it more often! 😄
Proof-of-principle implementation that allows pulling reads from regions together with their mates, even when the mates fall outside the regions or are unmapped.
Note:
- so far this was tested on small data only
- tests need to be added
- everything is stored in memory; with many regions, external temporary storage may be needed
- some unmapped reads may be missed; this depends on the success of the samtools/htslib#1352 pull request
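The idea of pulling in mates can be illustrated with a simple in-memory two-pass selection. This is only a sketch under stated assumptions and not the implementation from that commit: `rec`, `overlaps` and `select_with_mates` are hypothetical, reads are matched purely by query name, and the quadratic name lookup stands in for whatever lookup structure a real implementation would use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory record; tid is -1 for an unplaced read. */
typedef struct {
    const char *qname;
    int         tid;
    int64_t     beg, end; /* 0-based half-open interval on the reference */
} rec;

static int overlaps(const rec *r, int tid, int64_t beg, int64_t end)
{
    return r->tid == tid && r->beg < end && r->end > beg;
}

/*
 * Mark the records to output in keep[]: 1 for reads overlapping the
 * region, 2 for reads kept only because they share a name with such a
 * read (mates elsewhere, or unmapped). Returns the number of kept records.
 */
static size_t select_with_mates(const rec *recs, size_t n, int tid,
                                int64_t beg, int64_t end, unsigned char *keep)
{
    size_t i, j, kept = 0;

    assert(recs != NULL && keep != NULL);

    /* Pass 1: reads that overlap the region themselves. */
    for (i = 0; i < n; i++)
        keep[i] = overlaps(&recs[i], tid, beg, end) ? 1 : 0;

    /* Pass 2: any other read with the same name as a pass-1 hit. */
    for (i = 0; i < n; i++) {
        if (keep[i] != 1)
            continue;
        for (j = 0; j < n; j++)
            if (!keep[j] && strcmp(recs[j].qname, recs[i].qname) == 0)
                keep[j] = 2;
    }

    for (i = 0; i < n; i++)
        if (keep[i])
            kept++;
    return kept;
}
```

Holding every record in memory mirrors the limitation listed in the note; with many regions or large files the name set would need to be spilled to temporary storage instead.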
I suspect that after applying this PR,
Unmapped-but-placed (having REF/POS) reads are not included in the
index. Hence if a placed unmapped read is the first record in a bin, then
it may not be returned. Note most aligners write out mapped followed
by unmapped which does not trigger this problem.
The SAM spec states that all unmapped placed reads should be
considered as having an alignment length of 1. While it doesn't seem
to explicitly state these must therefore be in the index, it does
imply it. It appears that picard also indexes placed reads in this
manner.
Reported by Petr Danecek