-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve handling of CRAM "alignment span" for unmapped data. #1388
Conversation
Hmm, maybe we shouldn't fix the output to match the spec. That sounds radical and in theory the fix should happen, but doing so also means more files which trigger the bug in #1387 with older htslibs / samtools. The consequence of not writing span=0 is, as far as I can see, inconsequential (given no one spotted it until now). No right answer here. Thoughts? |
Assuming that samtools versions both old and new return correct results on files with span 1, I guess I'd give it a lukewarm 👍. Writing non-compliant files seems a step further than just tolerating non-compliance on read as was done for the case of the obsolete data series, but it might be prudent in this one case given that the failure mode is silently returning incorrect results. |
d9e0478
to
eb6d756
Compare
After discussion today at our weekly Samtools meeting and also the GA4GH File Formats meeting we took a pragmatic middle ground.
This spec relaxation is a cheat so that this PR and all previous htslib's are now technically spec compliant, so we don't feel compelled to be purists and fix the software and in the process harm users of earlier versions. We did ponder simply stating that span=1 should be the recommended (not required) value for unmapped data, but it feels a bit unfair on implementations that did the correct thing. I'd have been more inclined if it was an issue with regular index querying, but explicitly fetching unmapped only is a less common operation and the workaround of updating the library is not an insurmountable hill to climb. |
@@ -1923,8 +1930,13 @@ int cram_encode_container(cram_fd *fd, cram_container *c) { | |||
} | |||
|
|||
c->ref_seq_id = c->slices[0]->hdr->ref_seq_id; | |||
c->ref_seq_start = c->slices[0]->hdr->ref_seq_start; | |||
c->ref_seq_span = c->slices[0]->hdr->ref_seq_span; | |||
if (c->ref_seq_id == -1 && CRAM_ge31(fd->version)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a good idea to have a short comment here, explaining why there's a check on the file version (and also in cram_update_current_slice
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do, but it'd probably be best to comment it to simply see the relevant commit message as it'd be unwieldy to document the full reasoning. I'll try and think of something appropriate.
When reading, the code that filters data by the requested range now has an exception for ref_seq -1 (unmapped) so that alignment span is not checked. This fixes the check for end-start-1 failing. Data output (correctly) by htsjdk triggered a bug where range filtering was exiting early, making e.g. 'samtools view in.file "*"' return nothing. We didn't spot this before because htslib (wrongly) writes span=1. See next commit. Fixes samtools#1387
The specification states unmapped data must have alignment start and alignment span both zero. Previously span was 1 (from end-start+1). We now adhere to the specification for 3.1 onwards, but it is left as-is (incorrect) for 3.0. See previous commit for implications on reading. Outputting spec compliant CRAM files would make older builds of htslib/samtools fail to return unmapped data from a CRAM index query. This is not helpful, so in this case we feel the specification is best amended to permit other values in the alignment span field (albeit to keep the existing values as recommendations).
eb6d756
to
0f30746
Compare
When writing, the specification states unmapped data must have alignment start and alignment span both zero. Previously span was 1 (from end-start+1).
When reading, the code that filters data by the requested range now has an exception for ref_seq -1 (unmapped) so span is not checked. This fixes the check for end-start-1 failing. Note this previous triggered issues on files (correctly) output from htsjdk.
Fixes #1387