Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tolerate lower-case nans in QUAL #1364

Merged
merged 6 commits into from
Jun 24, 2019

Conversation

karenfeng
Copy link
Contributor

@karenfeng karenfeng commented May 8, 2019

Description

We sometimes see VCFs with lower-case nan's in QUAL. This doesn't match the Java-style NaN, and therefore causes VCF parsing to break.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@codecov-io
Copy link

codecov-io commented May 8, 2019

Codecov Report

Merging #1364 into master will increase coverage by 0.036%.
The diff coverage is 89.474%.

@@               Coverage Diff               @@
##              master     #1364       +/-   ##
===============================================
+ Coverage     68.007%   68.043%   +0.036%     
- Complexity      8352      8368       +16     
===============================================
  Files            570       571        +1     
  Lines          33848     33889       +41     
  Branches        5661      5665        +4     
===============================================
+ Hits           23019     23059       +40     
+ Misses          8645      8640        -5     
- Partials        2184      2190        +6
Impacted Files Coverage Δ Complexity Δ
...n/java/htsjdk/variant/variantcontext/Genotype.java 59.036% <0%> (ø) 80 <0> (ø) ⬇️
.../htsjdk/variant/variantcontext/VariantContext.java 77.714% <100%> (ø) 246 <0> (ø) ⬇️
...dk/variant/variantcontext/GenotypeLikelihoods.java 84.971% <100%> (ø) 74 <0> (ø) ⬇️
...main/java/htsjdk/variant/vcf/AbstractVCFCodec.java 76.218% <100%> (+0.573%) 100 <0> (+2) ⬆️
src/main/java/htsjdk/variant/vcf/VCFUtils.java 49.275% <100%> (+4.831%) 22 <5> (+5) ⬆️
...java/htsjdk/variant/variantcontext/CommonInfo.java 47.863% <50%> (ø) 47 <0> (ø) ⬇️
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 78.947% <0%> (-10.526%) 3% <0%> (-1%)
...dk/samtools/cram/structure/SubstitutionMatrix.java 88.189% <0%> (-3.564%) 25% <0%> (+10%)
...samtools/cram/structure/CramCompressionRecord.java 92.424% <0%> (-0.545%) 114% <0%> (+1%)
.../samtools/cram/build/CompressionHeaderFactory.java 89.6% <0%> (-0.134%) 59% <0%> (-5%)
... and 4 more

@yfarjoun
Copy link
Contributor

yfarjoun commented May 8, 2019 via email

@karenfeng
Copy link
Contributor Author

karenfeng commented May 8, 2019

VCF reading doesn't break on NaN in the QUAL or GQ fields (as in the associated unit test), but I'm not completely confident on how spec-compliant that is. The spec seems to suggest that because QUAL is a Float, it should be ok; the reason GQ doesn't break on NaN today is because it's parsed via Double rather than Integer.

@jmarshall
Copy link
Member

The 4.3 spec is clear that NaNs are valid in QUAL (§1.6.1 states QUAL is a Float; §1.3 says VCF Floats include NaN). The earlier specs don't have §1.3 and appear to be silent on what text a Float might allow.

See also NaN vs. nan discussions in samtools/bcftools#755 (comment) and samtools/hts-specs#89 (comment) and nearby.

@yfarjoun yfarjoun requested a review from lbergelson May 10, 2019 14:35
* Parses a String as a Double, being tolerant for lower-case NaN (nan).
*/
private static final Double decodeDouble(final String string) {
if (Pattern.matches("[+-]?nan", string)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see -+ nans defined anywhere, only "nan" without a sign...have you seen that in the wild? -+Infinity I have seen...but not NaN.

Could you please:

  1. first attempt to call Double.valueOf(string) and catch the exception if it fails (for performance)
  2. compile the pattern into a static final (also for performance)
  3. make the pattern match case-independent Pattern.compile(regex, Pattern.CASE_INSENSITIVE) and remove the [+-] part.
  4. also add checks (and tests) for [-+]?inf(inity)? and encode into either Double.POSITIVE_INFINITY or Double.NEGATIVE_INFINITY

Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaNs are NaNs and their sign is surely meaningless, but nonetheless the IEEE NaN format has a sign bit and C printf will print it (in glibc and other implementations, but not the BSD/macOS one — though IMHO that's not standard compliant). And Java will parse it.

So IMHO you should keep the [+-]? part. My suggestion for what the VCF spec should allow for NaN/∞ is samtools/hts-specs#409.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to object to allowing ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional comments:

  1. Replying to @yfarjoun point 1: Usually I'm not a fan of catching exceptions as part of the standard flow of a method, but in this case I agree that it makes sense for performance reasons.
  2. This should be a replacement for Double.parseDouble and return a primitive double instead of the boxed Double

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After digging out the code that implements Double.parseDouble (called by valueOf anyways) it turns out to be such monstrosity start from parseDouble in here that a quick checks to identify "nan" (or "infinity") with charAt(length()-1) == 'n' might be better than surrounding with a try-catch ... I don't know what the performance penalty depending on what penalty you paying for wrapping a mostly successful code in a try { ... }.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karenfeng Thank you for this PR. It's a good idea. We've made some suggestions for changes before this can be merged.

/**
* Parses a String as a Double, being tolerant for lower-case NaN (nan).
*/
private static final Double decodeDouble(final String string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expose this somewhere public and maybe rename it to be clearer how it's different from the standard methods. Maybe VCFUtils.parseDoubleAccordingToVcfSpec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of additional places that this should be used as well I believe:

GenotypeLikelihoods.parseDeprecatedGLString line 270 parseDouble CommonInfo299 and 323 that useDouble.valueOf VariantContext1632Genotype` 529

@lbergelson lbergelson added the Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond label May 15, 2019
@karenfeng
Copy link
Contributor Author

@lbergelson Thank you for the feedback; let me know if there any further changes you would like me to make.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, I think that the point was missed regarding how parseDouble should be used to avoid boxing...

src/main/java/htsjdk/variant/variantcontext/Genotype.java Outdated Show resolved Hide resolved
src/main/java/htsjdk/variant/vcf/VCFUtils.java Outdated Show resolved Hide resolved
src/main/java/htsjdk/variant/vcf/VCFUtils.java Outdated Show resolved Hide resolved
import java.util.stream.Collectors;

public class VCFUtils {

private static Pattern infOrNanPattern = Pattern.compile("^(?<sign>[-+]?)((?<inf>(INF|INFINITY))|(?<nan>NAN))$", Pattern.CASE_INSENSITIVE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be final, and should have an uppercase name since it's a static, e.g.

Suggested change
private static Pattern infOrNanPattern = Pattern.compile("^(?<sign>[-+]?)((?<inf>(INF|INFINITY))|(?<nan>NAN))$", Pattern.CASE_INSENSITIVE);
private static final Pattern INF_OR_NAN_PATTERN = Pattern.compile("^(?<sign>[-+]?)((?<inf>(INF|INFINITY))|(?<nan>NAN))$", Pattern.CASE_INSENSITIVE);

/**
* Parses a String as a Double, being tolerant for lower-case NaN (nan).
*/
public static double parseDoubleAccordingToVcfSpec(final String str) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize I'm late to this PR, but how about parseVcfDouble()? Saying "according to spec" seems redundant to me, unless we need to distinguish this from a VCF double parsing that's not done according to a spec.

@@ -250,6 +254,32 @@ else if (vcfFile.getAbsolutePath().endsWith(IOUtil.COMPRESSED_VCF_FILE_EXTENSION
return output;
}

/**
* Parses a String as a Double, being tolerant for lower-case NaN (nan).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This javadoc doesn't seem sufficient given that this supports INF and NaN. It would be nice to document its behavior more completely, and/or provide a link to part of the spec that this code is implementing.

}
}
return ret;
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary else after return

ret = Double.NaN;
} else {
if (matcher.group("sign").equals("-")) {
ret = Double.NEGATIVE_INFINITY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test below appears to test this case but it's showing up as not covered. Are you sure that this code path is being tested by the code below?

}

@Test(dataProvider = "caseIntolerantDoubles")
public void testCaseIntolerantDoubles(String vcfInput, Double value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary to use a boxed type here as value is never null

Suggested change
public void testCaseIntolerantDoubles(String vcfInput, Double value) {
public void testCaseIntolerantDoubles(String vcfInput, double value) {

}
return ret;
} else {
throw e;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a unit test to VCFUtilsTest that verifies this negative case?

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now. Thank you @karenfeng

@lbergelson lbergelson dismissed yfarjoun’s stale review June 24, 2019 18:20

Yossi gave verbal thumbs up.

@lbergelson lbergelson merged commit 37ea940 into samtools:master Jun 24, 2019
@alecw
Copy link
Contributor

alecw commented Jul 10, 2019

Hi @lbergelson , is a release with this change coming any time soon? We would really appreciate it.

Thanks, Alec

@lbergelson
Copy link
Member

Yes. I've wanted to do a release for a while but I've been busy with personal stuff (mostly baby related) and because of incompatibilities I've had trouble running the downstream tests successfully so it's taken longer than intended. I hope to have one out this week but next week at the latest. I can publish a release candidate that hasn't been fully tested if that would be helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants