Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of CRAM "alignment span" for unmapped data. #1388

Merged
merged 2 commits into from
Feb 4, 2022

Conversation

jkbonfield
Copy link
Contributor

When writing, the specification states unmapped data must have alignment start and alignment span both zero. Previously span was 1 (from end-start+1).

When reading, the code that filters data by the requested range now has an exception for ref_seq -1 (unmapped) so span is not checked. This fixes the check for end-start-1 failing. Note this previous triggered issues on files (correctly) output from htsjdk.

Fixes #1387

@jkbonfield
Copy link
Contributor Author

Hmm, maybe we shouldn't fix the output to match the spec. That sounds radical and in theory the fix should happen, but doing so also means more files which trigger the bug in #1387 with older htslibs / samtools. The consequence of not writing span=0 is, as far as I can see, inconsequential (given no one spotted it until now).

No right answer here. Thoughts?

@cmnbroad
Copy link

Assuming that samtools versions both old and new return correct results on files with span 1, I guess I'd give it a lukewarm 👍. Writing non-compliant files seems a step further than just tolerating non-compliance on read as was done for the case of the obsolete data series, but it might be prudent in this one case given that the failure mode is silently returning incorrect results.

@jkbonfield
Copy link
Contributor Author

After discussion today at our weekly Samtools meeting and also the GA4GH File Formats meeting we took a pragmatic middle ground.

  • Htslib will now correctly do range queries on unmapped data regardless of span contents (also further improved on previous fix).
  • Writing CRAM 3.0 will continue to output span=1, to ensure the best compatibility with legacy software. Failure to do so would mean a much wider proliferation of CRAM files that the legacy samtools versions couldn't query properly.
  • Writing CRAM 3.1 onwards will output span=0, as these versions are not in regular use.
  • The CRAM specification will be amended to state the start and span should not be used when decoding unmapped data, that the recommended values are (still) start=0 and span=0 for such data, but that other values are permitted.

This spec relaxation is a cheat so that this PR and all previous htslib's are now technically spec compliant, so we don't feel compelled to be purists and fix the software and in the process harm users of earlier versions.

We did ponder simply stating that span=1 should be the recommended (not required) value for unmapped data, but it feels a bit unfair on implementations that did the correct thing. I'd have been more inclined if it was an issue with regular index querying, but explicitly fetching unmapped only is a less common operation and the workaround of updating the library is not an insurmountable hill to climb.

@@ -1923,8 +1930,13 @@ int cram_encode_container(cram_fd *fd, cram_container *c) {
}

c->ref_seq_id = c->slices[0]->hdr->ref_seq_id;
c->ref_seq_start = c->slices[0]->hdr->ref_seq_start;
c->ref_seq_span = c->slices[0]->hdr->ref_seq_span;
if (c->ref_seq_id == -1 && CRAM_ge31(fd->version)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a good idea to have a short comment here, explaining why there's a check on the file version (and also in cram_update_current_slice).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do, but it'd probably be best to comment it to simply see the relevant commit message as it'd be unwieldy to document the full reasoning. I'll try and think of something appropriate.

When reading, the code that filters data by the requested range now
has an exception for ref_seq -1 (unmapped) so that alignment span is
not checked.  This fixes the check for end-start-1 failing.  Data
output (correctly) by htsjdk triggered a bug where range filtering was
exiting early, making e.g. 'samtools view in.file "*"' return nothing.

We didn't spot this before because htslib (wrongly) writes span=1.
See next commit.

Fixes samtools#1387
The specification states unmapped data must have alignment start and
alignment span both zero.  Previously span was 1 (from end-start+1).
We now adhere to the specification for 3.1 onwards, but it is left
as-is (incorrect) for 3.0.

See previous commit for implications on reading.  Outputting spec
compliant CRAM files would make older builds of htslib/samtools fail
to return unmapped data from a CRAM index query.  This is not helpful,
so in this case we feel the specification is best amended to permit
other values in the alignment span field (albeit to keep the existing
values as recommendations).
@daviesrob daviesrob merged commit 0f30746 into samtools:develop Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect count of unmapped reads in a CRAM file using samtools view
3 participants