Tolerate lower-case nans in QUAL #1364
Codecov Report
@@               Coverage Diff                @@
##             master     #1364       +/-   ##
===============================================
+ Coverage    68.007%   68.043%   +0.036%
- Complexity     8352      8368       +16
===============================================
  Files           570       571        +1
  Lines         33848     33889       +41
  Branches       5661      5665        +4
===============================================
+ Hits          23019     23059       +40
+ Misses         8645      8640        -5
- Partials       2184      2190        +6
Does the spec allow for NaN in the QUAL field?
…On Wed, May 8, 2019 at 3:00 PM Karen Feng ***@***.***> wrote:
Description
We sometimes see VCFs with lower-case nan's in QUAL. This doesn't match
the Java-style NaN, and therefore causes VCF parsing to break.
Checklist
- Code compiles correctly
- New tests covering changes and new functionality
- All tests passing
- Extended the README / documentation, if necessary
- Is not backward compatible (breaks binary or source compatibility)
Commit Summary
- Tolerate lower-case nans
File Changes
- *M* src/main/java/htsjdk/variant/vcf/AbstractVCFCodec.java
<https://github.com/samtools/htsjdk/pull/1364/files#diff-0> (16)
- *M* src/test/java/htsjdk/variant/vcf/AbstractVCFCodecTest.java
<https://github.com/samtools/htsjdk/pull/1364/files#diff-1> (15)
- *A* src/test/resources/htsjdk/variant/test_withNanQual.vcf
<https://github.com/samtools/htsjdk/pull/1364/files#diff-2> (8)
Patch Links:
- https://github.com/samtools/htsjdk/pull/1364.patch
- https://github.com/samtools/htsjdk/pull/1364.diff
VCF reading doesn't break on |
The 4.3 spec is clear that NaNs are valid in QUAL (§1.6.1 states QUAL is a Float; §1.3 says VCF Floats include NaN). The earlier specs don't have §1.3 and appear to be silent on what text a Float might allow. See also |
 * Parses a String as a Double, being tolerant for lower-case NaN (nan).
 */
private static final Double decodeDouble(final String string) {
    if (Pattern.matches("[+-]?nan", string)) {
I don't see -+ nans defined anywhere, only "nan" without a sign... have you seen that in the wild? -+Infinity I have seen, but not NaN.
Could you please:
- first attempt to call Double.valueOf(string) and catch the exception if it fails (for performance)
- compile the pattern into a static final (also for performance)
- make the pattern match case-independent with Pattern.compile(regex, Pattern.CASE_INSENSITIVE) and remove the [+-] part
- also add checks (and tests) for [-+]?inf(inity)? and encode into either Double.POSITIVE_INFINITY or Double.NEGATIVE_INFINITY
Thanks!
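Taken together, the four requests in this comment could be sketched roughly as follows. This is only an illustration of the suggested approach, not the code that was merged: the class name LenientDoubleSketch, the method name parseLenient, and the exact regular expression are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the review suggestions; not htsjdk's actual code.
final class LenientDoubleSketch {
    // Compiled once into a static final. Case-insensitive, so "nan", "NAN",
    // "inf" and "INFINITY" all match. Group 1 = nan, group 2 = sign of an infinity.
    private static final Pattern NAN_OR_INF =
            Pattern.compile("(nan)|([-+]?)inf(?:inity)?", Pattern.CASE_INSENSITIVE);

    private LenientDoubleSketch() {}

    static double parseLenient(final String s) {
        try {
            // Fast path: ordinary numbers, plus Java-style "NaN" and "Infinity".
            return Double.parseDouble(s);
        } catch (final NumberFormatException e) {
            final Matcher m = NAN_OR_INF.matcher(s);
            if (!m.matches()) {
                throw e; // genuinely malformed input: rethrow the original exception
            }
            if (m.group(1) != null) {
                return Double.NaN;
            }
            return "-".equals(m.group(2))
                    ? Double.NEGATIVE_INFINITY
                    : Double.POSITIVE_INFINITY;
        }
    }
}
```

With this sketch, parseLenient("nan") yields NaN, parseLenient("-inf") yields Double.NEGATIVE_INFINITY, and parseLenient("foo") still throws NumberFormatException. The regular expression is only consulted after the standard parser has already failed, so well-formed numbers never touch it.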
NaNs are NaNs and their sign is surely meaningless, but nonetheless the IEEE NaN format has a sign bit and C printf will print it (in glibc and other implementations, but not the BSD/macOS one, though IMHO that's not standard compliant). And Java will parse it.
So IMHO you should keep the [+-]? part. My suggestion for what the VCF spec should allow for NaN/∞ is samtools/hts-specs#409.
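The claim about Java can be checked directly. Below is a minimal sketch (the class name SignedNaNDemo and the method javaAccepts are made up for illustration) that reports whether Double.parseDouble accepts a given spelling.

```java
// Hypothetical demo class; it only probes the standard Double.parseDouble.
final class SignedNaNDemo {
    private SignedNaNDemo() {}

    // Returns true if Double.parseDouble accepts the text, false if it throws.
    static boolean javaAccepts(final String text) {
        try {
            Double.parseDouble(text);
            return true;
        } catch (final NumberFormatException e) {
            return false;
        }
    }
}
```

javaAccepts("-NaN") and javaAccepts("+NaN") return true, because Java's grammar for doubles allows an optional sign before NaN. javaAccepts("nan") returns false, because the keyword is case-sensitive, and that is the failure this PR addresses.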
I'd like to object to allowing ∞
...
Some additional comments:
- Replying to @yfarjoun point 1: Usually I'm not a fan of catching exceptions as part of the standard flow of a method, but in this case I agree that it makes sense for performance reasons.
- This should be a replacement for Double.parseDouble and return a primitive double instead of the boxed Double
After digging out the code that implements Double.parseDouble (called by valueOf anyway), it turns out to be such a monstrosity (start from parseDouble in here) that a quick check to identify "nan" (or "infinity") with charAt(length()-1) == 'n' might be better than surrounding the call with a try-catch... I don't know what performance penalty you pay for wrapping mostly successful code in a try { ... }.
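A rough sketch of that alternative, for comparison with the try-catch approach. The class name LastCharCheckSketch is hypothetical, it only handles the unsigned "nan" case, and it makes no claim about which variant is actually faster; that would need measuring.

```java
// Hypothetical sketch of the "look at the last character first" idea.
final class LastCharCheckSketch {
    private LastCharCheckSketch() {}

    static double parse(final String s) {
        if (!s.isEmpty()) {
            final char last = s.charAt(s.length() - 1);
            // Ordinary decimal numbers never end in 'n' or 'N', so this cheap
            // test filters out almost every input before any string comparison.
            if ((last == 'n' || last == 'N') && s.equalsIgnoreCase("nan")) {
                return Double.NaN;
            }
        }
        // Everything else, including signed Java-style "-NaN", goes to the
        // standard parser, which throws NumberFormatException on bad input.
        return Double.parseDouble(s);
    }
}
```

Here no exception is thrown for "nan" at all; the cost is one extra character comparison on every call, including the common case of a plain number.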
@karenfeng Thank you for this PR. It's a good idea. We've made some suggestions for changes before this can be merged.
/**
 * Parses a String as a Double, being tolerant for lower-case NaN (nan).
 */
private static final Double decodeDouble(final String string) {
I would expose this somewhere public and maybe rename it to be clearer how it's different from the standard methods. Maybe VCFUtils.parseDoubleAccordingToVcfSpec.
There are a number of additional places that this should be used as well, I believe:
- GenotypeLikelihoods.parseDeprecatedGLString line 270 parseDouble
- CommonInfo 299 and 323 that use Double.valueOf
- VariantContext 1632
- Genotype 529
@lbergelson Thank you for the feedback; let me know if there are any further changes you would like me to make.
Thanks for the changes, I think that the point was missed regarding how parseDouble should be used to avoid boxing...
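For context on the boxing point, the two standard entry points differ only in their return type. A minimal sketch (the class and method names are hypothetical wrappers around the real JDK calls):

```java
// Hypothetical wrappers that make the return types of the two JDK calls explicit.
final class BoxingSketch {
    private BoxingSketch() {}

    // Double.valueOf(String) returns a boxed java.lang.Double object.
    static Double boxed(final String s) {
        return Double.valueOf(s);
    }

    // Double.parseDouble(String) returns a primitive double, with no wrapper object.
    static double primitive(final String s) {
        return Double.parseDouble(s);
    }
}
```

A helper declared to return Double forces callers that want a primitive to unbox the result, which is the overhead the reviewers are asking to avoid.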
src/main/java/htsjdk/variant/variantcontext/VariantContext.java
import java.util.stream.Collectors;

public class VCFUtils {

    private static Pattern infOrNanPattern = Pattern.compile("^(?<sign>[-+]?)((?<inf>(INF|INFINITY))|(?<nan>NAN))$", Pattern.CASE_INSENSITIVE);
this should be final, and should have an uppercase name since it's a static, e.g.
- private static Pattern infOrNanPattern = Pattern.compile("^(?<sign>[-+]?)((?<inf>(INF|INFINITY))|(?<nan>NAN))$", Pattern.CASE_INSENSITIVE);
+ private static final Pattern INF_OR_NAN_PATTERN = Pattern.compile("^(?<sign>[-+]?)((?<inf>(INF|INFINITY))|(?<nan>NAN))$", Pattern.CASE_INSENSITIVE);
/**
 * Parses a String as a Double, being tolerant for lower-case NaN (nan).
 */
public static double parseDoubleAccordingToVcfSpec(final String str) {
I realize I'm late to this PR, but how about parseVcfDouble()? Saying "according to spec" seems redundant to me, unless we need to distinguish this from a VCF double parsing that's not done according to a spec.
@@ -250,6 +254,32 @@ else if (vcfFile.getAbsolutePath().endsWith(IOUtil.COMPRESSED_VCF_FILE_EXTENSION
        return output;
    }

    /**
     * Parses a String as a Double, being tolerant for lower-case NaN (nan).
This javadoc doesn't seem sufficient given that this supports INF and NaN. It would be nice to document its behavior more completely, and/or provide a link to part of the spec that this code is implementing.
        }
    }
    return ret;
} else {
unnecessary else after return
    ret = Double.NaN;
} else {
    if (matcher.group("sign").equals("-")) {
        ret = Double.NEGATIVE_INFINITY;
The test below appears to test this case but it's showing up as not covered. Are you sure that this code path is being tested by the code below?
}

@Test(dataProvider = "caseIntolerantDoubles")
public void testCaseIntolerantDoubles(String vcfInput, Double value) {
unnecessary to use a boxed type here as value is never null
- public void testCaseIntolerantDoubles(String vcfInput, Double value) {
+ public void testCaseIntolerantDoubles(String vcfInput, double value) {
    }
    return ret;
} else {
    throw e;
Can you add a unit test to VCFUtilsTest that verifies this negative case?
This looks good to me now. Thank you @karenfeng
Hi @lbergelson, is a release with this change coming any time soon? We would really appreciate it. Thanks, Alec
Yes. I've wanted to do a release for a while, but I've been busy with personal stuff (mostly baby related), and because of incompatibilities I've had trouble running the downstream tests successfully, so it's taken longer than intended. I hope to have one out this week, or next week at the latest. I can publish a release candidate that hasn't been fully tested if that would be helpful.