Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error messages for CRAM reference mismatches. #1427

Merged
merged 2 commits into from
Apr 27, 2022

Conversation

jkbonfield
Copy link
Contributor

If the user specifies the wrong reference, the CRAM slice header
MD5sum checks fail. We now report the SQ line M5 string too so it is
possible to validate against the whole chr in the ref.fa file.

The error message has also been improved to report the reference name
instead of #num.

Finally, we now hint at the likely cause, which counters the
misleading samtools supplied error of "truncated or corrupt" file.

See samtools/samtools#1640.

Note we could go one step further and pull the entire chromosome out, md5sum that, and compare to the SQ line. This would then be direct evidence for the wrong reference being specified, but it seems overkill. I'm sure 99% of times it's justified to simply state it's the wrong reference anyway, or at least the first thing to check.

rname, s->ref_start, s->ref_end);
hts_log_error("CRAM : %s", md5_print(s->hdr->md5, M));
hts_log_error("Ref : %s", md5_print(digest, M));
kstring_t ks = {0};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use the initialiser since it exists (also I think the kstring is being leaked):

Suggested change
kstring_t ks = {0};
kstring_t ks = KS_INITIALIZE;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops forgot about freeing it! Duh. Thanks.

I'll change it to use KS_INITIALIZE. I did consider it, but for some baffling reason it's the "house style" (just 10 out of 150 initialisations are done this way in htslib) so just went with the flow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the house style is mostly { 0, 0, NULL }, mainly because KS_INITIALIZE was added so late (PR #879 in 2019).

I expected that {0} (5 current instances) would lead to a “missing initialiser for field” warning with ‑Wextra, but it turns out there's a special case for the universal zero initializer. Learning every day… 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the {0} case is a wonderfully useful for initialising large local arrays without having to resort to explicit memset calls. I use that quite a bit. :-)

You do need to still be aware of {{0}} though for 2-dimensional arrays to avoid compiler warnings with older compilers (eg int a[10][20] = {0}; gives a warning with clang 5.0), even though I think initialising 2D arrays with a 1D list is legal.

Copy link
Contributor Author

@jkbonfield jkbonfield Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course technically {0} is not the same as {0, 0, NULL} as there is no requirement in C for NULL to be all-bits-zero, only to compare to 0 as equality. Blergh!

Practically speaking if we ever found such a system all hell would break loose, as we and others routinely make this assumption (which thankfully holds on all systems built in the last few decades).

Edit: oops after reading more I'm wrong. The missing values in an initialiser aren't set to 0-bits, but to the same value they get set when used in a static allocation. This is explicitly stated to be 0 for numerics and NULL for pointers. Obviously it's the same thing for current implementations anyway, but the pedantry wasn't true. As you say - live and learn every day :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those members without initializers are initialised in a type-appropriate way, so in fact leaving off NULL is fine even on a weird platform. (C99 §6.7.8/21: “If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate […], the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration”, which for pointers means initialised to a platform-appropriate null pointer.)

I couldn't find anything special in the standard about {0} beyond that, so I think “universal zero initializer” is a useful term that common practice and the implementations have come up with for for purposes of this warning.

It's thinking calloc() or memset() will adequately zero pointers and floating point values that is theoretically non-portable. But POSIX reassures us, at least about the pointers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I'd already corrected myself above.

I didn't know POSIX had finally modernised things though and were mandating that NULL can be set using memset/calloc. That's great and about time!

Not that I'm losing sleep over it. :-) So much code would fall over with non-bit-0 NULL pointers that htslib would be the least of our worries!

If the user specifies the wrong reference, the CRAM slice header
MD5sum checks fail.  We now report the SQ line M5 string too so it is
possible to validate against the whole chr in the ref.fa file.

The error message has also been improved to report the reference name
instead of #num.

Finally, we now hint at the likely cause, which counters the
misleading samtools supplied error of "truncated or corrupt" file.

See samtools/samtools#1640.
It's trying to spot error messages starting with lowercase letters,
but in doing so forbids things like "@sq" as it's not capital.
@jkbonfield jkbonfield force-pushed the cram_ref_mismatch_err branch from deba481 to d7cc10d Compare April 27, 2022 10:48
@daviesrob daviesrob merged commit d7cc10d into samtools:develop Apr 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants