-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding argument to GenotypeGVCFs to keep only RAW_GT_COUNT #7996
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #7996 +/- ##
===============================================
+ Coverage 79.301% 86.637% +7.336%
- Complexity 36613 38937 +2324
===============================================
Files 2328 2335 +7
Lines 181684 182658 +974
Branches 19950 20052 +102
===============================================
+ Hits 144077 158250 +14173
+ Misses 30805 17370 -13435
- Partials 6802 7038 +236
|
As discussed, let's quickly see if we can generalize this to allow one to specify a set of raw annotations to keep. If it looks like that will be a reach, just let me know and I can go ahead and review. Thanks! |
Github actions tests reported job failures from actions build 2921600692
|
@samuelklee I made an attempt to make the argument more general. It should be ready for review. Thanks! |
Thanks, @meganshand! Just a heads up, I’m mostly out this week since it’s a vacation week for our daycare. But will try to sneak this in if I can. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @meganshand! Sorry for the delay. Some minor changes, and some potentially major/breaking ones that might merit more discussion.
If you need this merged quickly, I'll trust you to make some of the corresponding decisions. But feel free to ping me if you'd like to discuss!
@@ -930,7 +928,8 @@ public void testRawGtCountAnnotation() { | |||
args.addReference(b37_reference_20_21) | |||
.addVCF(reblockedGVCF) | |||
.addOutput(output) | |||
.add("keep-combined-raw-annotations", true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this replaces the previous test in favor of this new one. Perhaps we should consider just adding a new test, so that we cover both 1) keeping all combined annotations, and 2) keeping only the specified raw annotations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've now made these two arguments (keep all combined and keeping only specified) mutually exclusive (not at the time of your review, but since then based on other comments). So I'm going to leave this test as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, just to be clear, you mean you'll revert this test to its original behavior, right? In that case, will you add another test for the keep-specified behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original behavior of --keep-combined-raw-annotations
set to true while --keep-specified-raw-annotations
is not empty is no longer allowed due to the mutex
. So I think I am not reverting this test to the original behavior. Does that make sense? This test sets --keep-specified-raw-annotations
to RawGtCount
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, by "original behavior" I mean the current behavior of the test as it is in master. This test sets --keep-combined-raw-annotations true
and hence additionally retains RAW_MQandDP
.
In contrast, you modify the test in your branch so that only RAW_GT_COUNT
is retained and we check that RAW_MQandDP
is not. This means the code path that is responsible for retaining RAW_MQandDP
is no longer run in this test, at least (and is perhaps not run for this particular input test file in any test at all). Even though the original test does not explicitly check for any particular RAW_MQandDP
output, this is still an effective drop in test coverage.
Probably not anything to sweat about, but that's all that I wanted to point out in my original comment. I'll leave it up to you to decide if --keep-combined-raw-annotations true
is sufficiently covered by the other tests so that we don't have to worry about replacing the original test here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Sorry, I see what you're saying now. I think I originally added this test as a way of testing RawGTCount
, rather than --keep-combined-raw-annotations true
, so I overlooked that it did actually test the latter too. I do think the other tests sufficiently cover --keep-combined-raw-annotations true
but I very much see how that is not clear from this change.
/** | ||
* Keep only the specific combined raw annotations specified (removing the other raw annotations if keep-combined-raw-annotations is not set). | ||
*/ | ||
@Argument(fullName= KEEP_SPECIFIED_RAW_ANNOTATION_LONG_NAME, shortName = KEEP_SPECIFIED_RAW_ANNOTATION_SHORT_NAME, optional = true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we perhaps use mutex
to ensure we don't use both -keep-raw
and -keep-combined
simultaneously?
Do you have an idea of the usefulness of -keep-combined
(i.e., the convenience of being able to keep all combined raw annotations, even if we don't know what those annotations might be) and where it might be currently used (e.g., WARP WDLs, etc.)? How much of a pain would it be to require the use of -keep-raw
in all cases---presumably one would always be able to know which annotations one wants to keep?
In any case, -keep-raw
and -keep-combined
(and the corresponding long names) don't really give the sense of -keep
and -keep-all
that is implemented here. Assuming we have the freedom to muck with the name of -keep-combined
, do you think it might be worth changing? And should we also change the short/long names of -keep-raw
to somehow include combined
(as you note in the Javadoc here)?
A general comment: as a relative neophyte to this annotation framework, I will say that I find the lack of clear documentation/definitions for terms like "combined" and "raw" makes it hard to be sure what these arguments (or even the tool) are doing, especially from the doc strings and tool docs alone. Given that you have more experience with this framework, do you feel the same, or do you think things are clear as is?
Happy to hash all of this out further offline/elsewhere, seems a bit thorny!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if -keep-combined
is used anywhere currently. I'm assuming it's mostly useful for debugging/development, but I don't know. I would rather not break backwards compatibility though, so I think we should still leave it in and I have a preference to not rename it.
I can add some more documentation for these arguments though to help make it clearer. I agree that the names don't well represent what they are doing.
} | ||
variantOverlapAnnotator = initializeOverlapAnnotator(dbSNPInput, featureInputs); | ||
reducibleKeys = new LinkedHashSet<>(); | ||
useRawAnnotations = useRaw; | ||
keepRawCombinedAnnotations = keepCombined; | ||
for (String rawAnnot : rawAnnotationsToKeep) { | ||
if (!variantAnnotationKeys.contains(rawAnnot)) { | ||
throw new UserException("Requested --" + GenotypeGVCFs.KEEP_SPECIFIED_RAW_ANNOTATION_LONG_NAME + ": " + rawAnnot + " is not available. Add requested annotation with --" + StandardArgumentDefinitions.ANNOTATION_LONG_NAME + "."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's confusing that the command line looks something like: --keep-specific-raw-annotation: RAW_GT_COUNT --annotation RawGtCount
(since the latter is derived from the Java class name). Is this pattern something that could be cleaned up at this stage, or has that ship long sailed?
And this exception message, while helpful, doesn't actually tell you what value you need to supply to the annotation
argument---not sure if that's actually that trivial to look this up and output it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantAnnotatorEngine.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantAnnotatorEngine.java
Outdated
Show resolved
Hide resolved
*/ | ||
@Argument(fullName= KEEP_SPECIFIED_RAW_ANNOTATION_LONG_NAME, shortName = KEEP_SPECIFIED_RAW_ANNOTATION_SHORT_NAME, optional = true, | ||
doc="Keep only the specific combined raw annotations specified (removing the other raw annotations).") | ||
protected List<String> keepSpecifiedRawAnnotations = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on what you do with -keep-combined
above, you may want to add additional argument documentation/validation/sanitization here. E.g., add to the doc string that duplicate values will be ignored, and ensure that downstream code enforces/respects this as early as possible.
@@ -253,6 +274,14 @@ public Map<String, Object> combineAnnotations(final List<Allele> allelesList, Ma | |||
public VariantContext finalizeAnnotations(VariantContext vc, VariantContext originalVC) { | |||
final Map<String, Object> variantAnnotations = new LinkedHashMap<>(vc.getAttributes()); | |||
|
|||
//save annotations that have been requested to be kept | |||
final Map<String, Object> savedRawAnnotations = new LinkedHashMap<>(); | |||
for(String rawAnnot : rawAnnotationsToKeep) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about final here. Note that it's actually used in other loops in this method...up to you if you want to make it consistent everywhere.
@@ -942,6 +941,7 @@ public void testRawGtCountAnnotation() { | |||
Assert.assertEquals(rawGtCount.get(0), "."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's going on with the hom-ref count? I see the TODO in the RawGtCount class but I don't understand the fundamental limitation.
Incidentally, perhaps we can go ahead and expand the header-line documentation of this annotation to explicitly give the hom-ref/het/hom-var order. Maybe this is obvious to everyone else, but why not document it for dummies like me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that RawGtCount
is combined by GenomicsDB as a sum. There is no RawGtCount
annotation on RefBlocks so the HomRef count always ends up being 0 in this case. It seems like ExcessHet
isn't actually using RawGtCount
as I expected. It recalculates the counts in GenotypeUtils.computeDiploidGenotypeCounts
which doesn't seem to use RawGtCount
so I'm not sure why RawGtCount
is the rawKey
associated with ExcessHet
. Seems like this whole thing could be cleaned up so that we use the same method ExcessHet
does to calculate the count of Het and HomVar genotypes at a site, but I'm not sure why we didn't do that in the first place.
Also I added the documentation to the header line.
Github actions tests reported job failures from actions build 3129008425
|
Github actions tests reported job failures from actions build 3160429275
|
I think tests are failing only because I changed a header line. I will update the expected VCFs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@meganshand Some changes requested, hope they make sense. I didn't really look at the VariantAnnotatorEngine changes, but presumably someone else is reviewing those.
@@ -82,4 +81,7 @@ public boolean getDisableToolDefaultAnnotations() { | |||
public boolean getEnableAllAnnotations() { | |||
return enableAllAnnotations; | |||
} | |||
|
|||
@Override | |||
public List<String> getKeepSpecifiedCombinedAnnotationNames() { return Collections.emptyList(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the new argument is specific to GenotypeGVCFs
and GenotypeGVCFsAnnotationArgumentCollection
, we definitely don't want to add this method here. It should only be in GenotypeGVCFsAnnotationArgumentCollection
. See my comments on GenotyeGVCFs
elsewhere for more details on how to make this unnecessary.
GATKReadFilterPluginDescriptor readFilterDescriptor = new GATKReadFilterPluginDescriptor(getDefaultReadFilters()); | ||
return useVariantAnnotations()? | ||
Arrays.asList(readFilterDescriptor, new GATKAnnotationPluginDescriptor( | ||
new GenotypeGVCFsAnnotationArgumentCollection(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the branch I gave you, I had instantiated GenotypeGVCFsAnnotationArgumentCollection
directly like this, but it should really be a GenotypeGVCFs
instance variable. Then you can just pass the instance variable to the annotation plugin here, and reference it below when you need to get call getKeepSpecifiedCombinedAnnotationNames
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, make sure that when you add the instance variable, you don't annotate it as an @ArgumentCollection
(and maybe even add a comment saying that @ArgumentCollection
is left out on purpose) since if you do the parser will complain about duplicate arg names. i.e., just add:
// @ArgumentCollection deliberately omitted since this is passed to the annotation plugin
final GenotypeGVCFsAnnotationArgumentCollection genotypeGVCFsAnnotationArgs = new GenotypeGVCFsAnnotationArgumentCollection();
return this.allDiscoveredAnnotations; | ||
} | ||
|
||
public GATKAnnotationArgumentCollection getUserArgs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getUserArgs
can also be removed from this class (see the comments elsewhere).
final Collection<Annotation> variantAnnotations = makeVariantAnnotations(); | ||
final GATKAnnotationPluginDescriptor pluginDescriptor = getCommandLineParser().getPluginDescriptor(GATKAnnotationPluginDescriptor.class); | ||
final List<String> annotationStringsToKeep = pluginDescriptor.getUserArgs().getKeepSpecifiedCombinedAnnotationNames(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you add an instance variable of type GenotypeGVCFsAnnotationArgumentCollection
called genotypeGVCFsAnnotationArgs
(or something) as suggested above, then you can access that directly here, instead of asking the plugin descriptor for it. It will have it's actual type (as opposed to the base class type GATKAnnotationArgumentCollection
), and you'll be able to call the GATKAnnotationArgumentCollection::getKeepSpecifiedCombinedAnnotationNames
method on it directly. Then the getKeepSpecifiedCombinedAnnotationNames
method can be removed from the GATKAnnotationArgumentCollection base
class; and getUserArgs
can be removed from the plugin descriptor. Only GenotypeGVCF's will know anything about the new arg collection or the new method:
final Collection<Annotation> variantAnnotations = makeVariantAnnotations(); | |
final GATKAnnotationPluginDescriptor pluginDescriptor = getCommandLineParser().getPluginDescriptor(GATKAnnotationPluginDescriptor.class); | |
final List<String> annotationStringsToKeep = pluginDescriptor.getUserArgs().getKeepSpecifiedCombinedAnnotationNames(); | |
final Collection<Annotation> variantAnnotations = makeVariantAnnotations(); | |
final GATKAnnotationPluginDescriptor pluginDescriptor = getCommandLineParser().getPluginDescriptor(GATKAnnotationPluginDescriptor.class); | |
final List<String> annotationStringsToKeep = genotypeGVCFsAnnotationArgs.getKeepSpecifiedCombinedAnnotationNames(); |
final Map<String, Annotation> allDiscoveredAnnotations = pluginDescriptor.getAllDiscoveredAnnotations(); | ||
final Set<Annotation> annotationsToKeep = annotationStringsToKeep.stream().map(allDiscoveredAnnotations::get).collect(Collectors.toSet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be preferable to add an override of makeVariantAnnotations
into this class (adjacent to the getPluginDescriptors
override), and move the computation of which annotations to keep into that, rather than inlining it here.
Also, you're basically giving priority to the new arg in the case where it conflicts with, say, an exclude arg, which I think is fine if thats what you want, but you might want to document that as part of the new arg if thats what you go with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is I still need the regular list of annotations (in addition to the annotationsToKeep
). I can still move it to a separate function though.
Also I think I resolved this by not using allDiscoveredAnnotations
so now excluded args should get handled correctly.
@@ -37,4 +37,7 @@ public abstract class GATKAnnotationArgumentCollection implements Serializable { | |||
* Returns {@code true} if all annotations are enabled; {@code false} otherwise. | |||
*/ | |||
public abstract boolean getEnableAllAnnotations(); | |||
|
|||
|
|||
public abstract List<String> getKeepSpecifiedCombinedAnnotationNames(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this can be removed.
@@ -497,4 +496,12 @@ public Class<?> getClassForPluginHelp(final String pluginName) { | |||
return allDiscoveredAnnotations.containsKey(pluginName) ? allDiscoveredAnnotations.get(pluginName).getClass() : null; | |||
} | |||
|
|||
public Map<String,Annotation> getAllDiscoveredAnnotations() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep this method, I'd suggest:
- moving it so its adjacent to
getResolvedInstances
- renaming it to
getDiscoveredInstances
- adding a javadoc comment saying that it returns every annotation that is discovered anywhere on the classpath, and that it should only be used when
getResolvedInstances
isn't sufficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @cmnbroad! I knew it wasn't right, but I was missing the key step of declaring the Argument Collection in GenotpeGVCFs without labeling it an ArgumentCollection.
Hopefully it's cleaner now and should be ready for re-review.
final Map<String, Annotation> allDiscoveredAnnotations = pluginDescriptor.getAllDiscoveredAnnotations(); | ||
final Set<Annotation> annotationsToKeep = annotationStringsToKeep.stream().map(allDiscoveredAnnotations::get).collect(Collectors.toSet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is I still need the regular list of annotations (in addition to the annotationsToKeep
). I can still move it to a separate function though.
Also I think I resolved this by not using allDiscoveredAnnotations
so now excluded args should get handled correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@meganshand The argument stuff looks much better now, since everything specific to GenotypeGVCFs is private to it now, but see my comments about the list reconciliation.
...in/java/org/broadinstitute/hellbender/cmdline/GATKPlugin/GATKAnnotationPluginDescriptor.java
Outdated
Show resolved
Hide resolved
* Returns a map of the String to Annotations only in the resolved instances. Should only be used if | ||
* getResolvedInstances() is not sufficient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "Should be used...." warning can be removed. That comment was intended for a previous revision where you were exposing allDiscoveredAnnotations
.
final GATKAnnotationPluginDescriptor pluginDescriptor = getCommandLineParser().getPluginDescriptor(GATKAnnotationPluginDescriptor.class); | ||
final List<String> annotationStringsToKeep = genotypeGVCFsAnnotationArgs.getKeepSpecifiedCombinedAnnotationNames(); | ||
final Map<String, Annotation> resolvedInstancesMap = pluginDescriptor.getResolvedInstancesMap(); | ||
return annotationStringsToKeep.stream().map(resolvedInstancesMap::get).collect(Collectors.toSet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to be a bit more defensive in how you reconcile the "resolved" entries returned by getResolvedInstancesMap()
with the entries in annotationStringsToKeep
. You're not guaranteed that resolvedInstancesMap
will contain an entry for every annotation the user specified in annotationStringsToKeep
, so you need to handle the case where resolvedInstancesMap::get
returns null here.
For instance, I don't think there is anything that will detect that the user made up some non-existent annotation name and specified it as a keeper, since they're just strings up until this point. Also, it's possible for the user to have both excluded an annotation AND specified it as a keeper (in which case it won't be in the resolved list). This code needs to detect and reject those cases so it doesn't wind up handing the annotator engine null annotations, preferably with some tests to ensure they're rejected properly.
You probably won't be able to tell exactly WHY a null is returned here, but I think rejecting that case with an error message saying something like, the annotation zzz either does not exist or was excluded
. It's a bit of a pain - a lot of the logic in the plugin descriptor is to do exactly this kind of reconciliation and error reporting, but since the keep list is specific to GenotypeGVCFs, reconciling part of it has to be done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the "missing annotation" check here and added a test.
…/GATKAnnotationPluginDescriptor.java Co-authored-by: Chris Norman <cnorman@broadinstitute.org>
Thanks @cmnbroad, back to you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@meganshand I mostly focused on the arg part (which looks good now - phew!) on the theory that Sam reviewed the rest of this. I added one more nit/comment, up to you if you want to use it or not. Otherwise LGTM.
Utils.nonNull(featureInputs, "comparisonFeatureInputs is null"); | ||
infoAnnotations = new ArrayList<>(); | ||
genotypeAnnotations = new ArrayList<>(); | ||
jumboInfoAnnotations = new ArrayList<>(); | ||
jumboGenotypeAnnotations = new ArrayList<>(); | ||
final List<String> variantAnnotationKeys = new ArrayList<>(); | ||
final List<String> rawVariantAnnotationKeysToKeep = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other minor nit: why introduce this intermediate alias ? To me having the additional alias with a slightly different name just obfuscates whats going on - I'd suggest just allocating and using the rawAnnotationsToKeep
instance variable directly, since this winds up being assigned there anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, good catch. I fixed this and removed variantAnnotationKeys
since it wasn't being used anymore (after moving the error message to GenotypeGVCFs)
@cmnbroad I think Sam has reviewed the rest of this PR. Thank you for helping me get the argument part working! Assuming tests pass this is ready for a final review. |
@meganshand you've got a prior approval from me, but let me know if you think I should re-review. Thanks @cmnbroad for helping out with the argument code and reviewing! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Currently, there is an argument to keep all raw annotations, but the flow based pipeline has a use case that needs to keep
RAW_GT_COUNT
without keeping the other raw annotations. This new argument allows the user to keepRAW_GT_COUNT
while still cleaning up the other raw annotations in GenotypeGVCFs.