-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VCFHeader and VCFHeaderLine rewrite/refactoring #1581
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. @@ Coverage Diff @@
## master #1581 +/- ##
===============================================
+ Coverage 69.856% 69.878% +0.022%
- Complexity 9695 9832 +137
===============================================
Files 703 706 +3
Lines 37772 38022 +250
Branches 6139 6155 +16
===============================================
+ Hits 26386 26569 +183
- Misses 8929 8965 +36
- Partials 2457 2488 +31
|
9a9ea52
to
f66252a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little nitpicky detail, some uses of "its" and "it's" in comments are backwards.
for (final VCFHeader sourceHeader : headers) { | ||
for (final VCFHeaderLine line : sourceHeader.getMetaDataInSortedOrder()) { | ||
final String key = line.getKey(); | ||
if (VCFHeaderVersion.isFormatString(key) || key.equals(VCFHeader.CONTIG_KEY)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we've had this conversation before, but just putting this here to see what the other reviewers think of this idea. This seems like another case where internally representing the fileformat line as just another header line doesn't seem to make much sense, because its semantics are special.
VCFHeader::getMetaDataIn{Input,Sorted}Order
returns the fileformat line along with all other lines, but it needs to be dropped, special-cased, or converted back into a VCFHeaderVersion
for a lot of our applications. A similar pattern (scanning for the fileformat line and dropping it) occurs in VCFMetaDataLines::validateMetaDataLines
and VCFWriter:: writeHeader
. The functionality of storing the version is duplicated by the VCFHeaderVersion
field as well, and the two have to be kept in sync (technically triplicated also by the version field in VCFHeader
).
Not storing it as a line and only as aVCFHeaderVersion
field, and converting fileformat lines to version objects when they’re added to the header might be cleaner in my opinion. VCFHeader::getMetaDataIn{Input,Sorted}Order
would still have to include the fileformat line to keep behaviour the same.
Obviously we'll still have to accept fileformat lines at the API boundary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the redundant VCFHeaderVersion from VCFHeader makes sense now I think - we'll just need to query VCFMetaDataLines before and after every change, but thats cheap.
Changing VCFMetaData lines to not model the version as a line in the list will result in awkward semantics and implementation for things like removeMetaDataLine
, findEquivalentHeaderLine
, getOtherHeaderLines
, but I don't have a strong opinion on that - as long as we keep the awkward public header API, there will be awkward semantics one way or another. I'll make the change and we can see which way we think is better.
None of that will address the external consumers though, such as VCFWriter
, which is still going to have to strip the file format line manually, since we need to hand them out at the public boundary to be compatible with past behavior. The right long term fix is to have a real codec that encodes as well as decodes, such as we have in the CZI package.
dict.add(idLine.getID()); | ||
seen.add(idLine.getID()); | ||
if (!line.isIDHeaderLine()) { | ||
//is there a better way to ensure that shouldBeAddedToDictionary==true only when isIDHeaderLine==true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether shouldBeAddedToDictionary
should be removed from VCFHeaderLine and replaced with something like a static method here, since it's a BCF 2.2 specific implementation detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do that in the BCF branch.
return Optional.of(validationFailure); | ||
} else { | ||
// warn for older versions - this line can't be used as a v4.3 line | ||
logger.warn(validationFailure.getFailureMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether it makes more sense to move the logging (and checking validation strictness) out of these functions making them pure, and into the functions that actually consume them and have side effects, which can choose to throw or log based on the validation strictness.
These functions also aren't totally consistent, the one in VCFHeaderLine
doesn't check validation strictness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of passing back the error and letting the caller handle it, but it looks like the superclass handler is intended to ignore validation strictness so I'm not sure that it would be easy to redesign.
I would lean towards making the validation stringency a parameter instead of grabbing it through defaults because it makes that dependency much less hidden and also easier to test.
// to retain backward compatibility with previous implementations, we accept (and repair) and the line here. | ||
updateGenericField(NUMBER_ATTRIBUTE, "0"); | ||
lineCount = 0; | ||
logger.warn(String.format("FLAG fields must have a count value of 0, but saw count %d for header line %s. A value of 0 will be used", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small detail, but this logging should come before lineCount
is updated, otherwise the message will print the contradictory warning "FLAG fields must have a count value of 0, but saw count 0".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good catch - I've been meaning to fix that since it looks weird every time I see it.
// We only allow going forward to a newer version, not backwards to an older one, since there | ||
// is really no way to validate old header lines (pre vcfV4.2). The only way to create a header with | ||
// an old version is to create it that way from the start. | ||
// to be created with the old version from the start. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment looks like an old line got left in accidentally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
final Set<VCFHeaderVersion> headerVersions = new HashSet<>(2); | ||
// a global mutable static - is there an alternative ? | ||
// there isn't any other reasonable place to keep this state | ||
private static boolean vcfStrictVersionValidation = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be better to have this in Defaults
making it immutable over the course of one process because the semantics of turning it on or off during one run might be unexpected. Starting with strict validation off then turning it on doesn't trigger validation of existing header lines for example, so we might still be outputting sloppily validated lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This setting is already hooked up to Defaults, but I forgot to delete this variable - so this just needs to be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (removed).
@@ -330,6 +326,7 @@ public void TestWritingLargeVCF(final String extension) throws FileNotFoundExcep | |||
@DataProvider(name = "vcfExtensionsDataProvider") | |||
public Object[][]vcfExtensionsDataProvider() { | |||
return new Object[][] { | |||
//TODO: fix this BCF problem! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make a note to fix these tests in my BCF branch, but from what I can tell, these failures are expected, 2 of these (TestWritingLargeVCF
, testBasicWriteAndRead
) make sense to fix and are from a missing standard header line (DP), because BCF requires all VC attributes to have a corresponding line in the header.
The other 2 (testWriteAndReadVCFHeaderless
and testWriteWithEmptyHeader
) don't make sense at all to test for BCF, because headerless or empty header BCFs aren't well formed.
{ "key=<", new VCFHeaderLine("key", "<") }, | ||
// taken from Funcotator test file as ##ID=<Description="ClinVar Variation ID"> | ||
// technically, this is invalid due to the lack of an "ID" attribute, but it should still parse | ||
// into a VCFHeaderLine (but noa VCFSimpleHeaderLine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in "noa", missing closing )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
return null; | ||
} else if (lineList.size() > 1) { | ||
throw new TribbleException( | ||
String.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This format string is missing a %s placeholder for the list of duplicated "other" lines argument
Note: this needs to be resolved/updated to account for the final disposition of structured vs. unstructured lines, attribute ordering, and quoting, as described in samtools/hts-specs#642 #1610, and also samtools/hts-specs#620. |
We've been discussing the lack of support for VCF 4.4 in IGV (igvteam/igv#1289) and started looking at the current htsjdk / VCF interaction. This PR looks like a major step forward to resolve issues around VCF 4.3 - thanks for all your work on this! Any chance of merging this soon so we can consider what changes would be required for 4.4? |
@ohofmann Hi Oliver - yes, this PR addresses quite a few issues with VCF support in htsjdk, and is a prerequisite for a couple of other large PRs for VCF/BCF support. The team has been sidetracked for quite a while with upgrading several of these repos to Java 17, which turned out to be a lot more work than anticipated, but we finally got the main PR for that merged last week. We discussed this PR in our team meeting this morning, and we do plan to get it merged sometime in the next quarter. It has had a fair amount of review/discussion already, but its quite large and will require a major version release, and has significant downstream test impact on projects such as GATK (I haven't tried to build IGV with it, but I expect the IGV fallout to be minimal). Hope that help.s |
@cmnbroad Thanks for the update! It does help, though we may have to consider postponing the VCF 4.4 announcement / PR drive. I wouldn't want the first experience of many users to be an htsjdk error. Will keep an eye on this PR in the meantime. |
…BCF2.2 and bug fixes.
…or customs serialization.
@lbergelson I've rebased this on master/Java 17. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad I have a lot of comments. Some are more important than others. Some of them are much older than others also... There's a bunch of places where we can make use of new 17 (and soon 21) styles stuff to clean up code and I've mentioned some of them. Feel free to ignore that though, we can always revisit.
I think it looks good overall. I found some things that I think need fixing but I think the overall approach is a huge improvement.
@@ -134,6 +139,7 @@ public class Defaults { | |||
SAM_FLAG_FIELD_FORMAT = SamFlagField.valueOf(getStringProperty("sam_flag_field_format", SamFlagField.DECIMAL.name())); | |||
SRA_LIBRARIES_DOWNLOAD = getBooleanProperty("sra_libraries_download", false); | |||
DISABLE_SNAPPY_COMPRESSOR = getBooleanProperty(DISABLE_SNAPPY_PROPERTY_NAME, false); | |||
STRICT_VCF_VERSION_VALIDATION = getBooleanProperty("strict_version_validation", true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make the property name also "strict_vcf_version_validation".
* order in the list | ||
*/ | ||
//TODO: this method ignores (and actually mutates) the sequenceRecord's contig index to make it match | ||
// the record's relative placement in the dictionary's internal list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, that's a nasty side effect that we've been living with forever.
} | ||
writer.write(header.getHeaderFields().stream() | ||
.map(f -> f.name()) | ||
.collect(Collectors.joining(VCFConstants.FIELD_SEPARATOR)).toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this too string is unecessary?
// collect metadata lines until we hit the required header line, or a non-metadata line, | ||
// in which case throw since there was no header line | ||
while (lineIterator.hasNext()) { | ||
final String line = lineIterator.peek(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BUG! This goes into an infinite loop if it hits a line that doesn't start with either of these.
return this.header; | ||
} | ||
} | ||
throw new TribbleException.InvalidHeader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if there was information available at this point to say WHICH vcf file had the error but I don't see anyway to get that at the moment.
|
||
// List of expected tags (for this base class, its ID only; subclasses with more required tags | ||
// should use a custom tag order if more required tags are expected | ||
protected static final List<String> expectedTagOrder = Collections.unmodifiableList( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yay for List.of()
* A class representing a VCF validation failure. | ||
* @param <T> a type representing the object that is being validated | ||
*/ | ||
class VCFValidationFailure<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is returned by the public method validateForVersion
so probably it should be public. Or maybe the validate method should be protected.
// have to be able to "repair" header lines (via a call to updateGenericField) during constructor validation. | ||
// | ||
// Otherwise the values here should never change during the lifetime of the header line. | ||
private final Map<String, String> genericFields = new LinkedHashMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<>
// without string literal escaping and quoting the regex would be: replaceAll( ([^\])" , $1\" ) | ||
// ie replace: something that's not a backslash ([^\]) followed by a double quote | ||
// with: the thing that wasn't a backslash ($1), followed by a backslash, followed by a double quote | ||
return value.replaceAll("([^\\\\])\"", "$1\\\\\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be possible to write this more legibly with text blocks nowadays?
|
||
@Override | ||
public Optional<VCFValidationFailure<VCFHeaderLine>> validateForVersion(final VCFHeaderVersion vcfTargetVersion) { | ||
if (!vcfTargetVersion.isAtLeastAsRecentAs(VCFHeaderVersion.VCF4_0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You use the pattern here of either warning or returning an optional here. It seems like it's done a number of times.
It seem a bit weird, an extra layer of optionalness. Should the decision to log a warning or not be left up to the caller? If some options are only errors during strict validation maybe that should be recorded in the ValidationFailure
object and then the caller can decide what to do with all of them. If you don't like that idea, maybe pull out a method that either returns the option or warns and swallows it.
I guess the reason behind it is probably because it's necessary since the super call might return a higher priority error, but that could be reworked so it's called first or the two are compared, or even both are returned.
Not super important, but it seems awkward.
This PR is resurrected from the PR graveyard, updated and rebased. It contains:
Fixes numerous issues, including: