Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding new argument to control variant output interval filtering. #6388

Merged
merged 9 commits into from
Sep 18, 2024

Conversation

lbergelson
Copy link
Member

  • Adding a new GATKTool level argument --variant-output-interval-filtering-mode which allows filtering output variants according to the input interval list. This replaces --only-output-calls-starting-in-intervals which was available in GenotypeGvcfs and GnarlyGenotyper.

It works by adding a filtering decorator to the vcf writers created through GATKTool.createVCFWriter.
There are several different filtering modes:
STARTS_IN, ENDS_IN, OVERLAPS, CONTAINED, and ANYWHERE

The default for tools is not to apply the decorator, but they may optionally change that behavior by overriding the new getDefaultVariantOutputFilterMode.

--variant-output-interval-filtering-mode STARTS_IN is equivalent to the previous behavior of --only-output-calls-starting-in-intervals true

MockVcfWriter is now a testUtils class. The naming is a bit awkward so improvements would be helpful.

This doesn't fix the weird behavior in HaplotypeCaller but does allow subsetting unique shards with SelectVariants and other variant outputting tools. We could adapt this to apply to bam outputs as well if that seems useful.

@droazen droazen self-assigned this Jan 21, 2020
@droazen droazen self-requested a review January 21, 2020 16:38
@lbergelson
Copy link
Member Author

@droazen @samuelklee I rebased this. It's ready for review I think. I removed an existing argument in this pr, if we want I can deprecated it instead, it will probably take some extra logic to account for both arguments existing though.

@gatk-bot
Copy link

gatk-bot commented Aug 5, 2020

Travis reported job failures from build 31089
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk11 31089.14 logs

@samuelklee
Copy link
Contributor

@droazen any chance we could get this merged? There's been interest again from our MalariaGEN collaborators at Sanger. Happy to rebase and address any comments you might have, if @lbergelson is too busy---just let me know.

@samuelklee
Copy link
Contributor

BTW, would be great if we could get this in by the next release.

@droazen
Copy link
Collaborator

droazen commented May 26, 2021

@samuelklee Thanks for resurrecting this old PR -- I'll take a look at the state of the branch tomorrow and see whether @lbergelson and I can get it over the finish line.

@samuelklee
Copy link
Contributor

samuelklee commented Nov 17, 2021

@droazen @lbergelson can we get this in soon? We need it for the upcoming MalariaGEN Pf8 release. Thanks!

@lbergelson lbergelson force-pushed the lb_starting_in_intervals branch 2 times, most recently from 6a95b50 to ceea713 Compare December 8, 2021 21:15
@samuelklee
Copy link
Contributor

@droazen @lbergelson could I get an ETA on this? Louis chatted with me in early December and gave me an update, just wanted to make sure everything is on track---or if it's not going to get in in time for Pf8, to let our collaborators know. @kbergin @vruano might also want to keep an eye on this.

@samuelklee
Copy link
Contributor

Just chatted with @lbergelson and we’re going to try to get this in over the next week or so, so that @vruano can slot it in between rounds of evaluations that he’s running. Sounded like some tests might need to be added, but I’ll try to do an initial review today—wouldn’t mind if anyone else wants to add comments, either!

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @lbergelson! Just a quick, relatively superficial skim. Happy to take a second look after you add the tests you mentioned, but might be better to have at least one more set of eyes from the engine team. I also don't have strong opinions about the breaking changes.

}

@DataProvider
public Object[][] getIntervalsAndOverlapMode(){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to check the actual intervals retained, rather than just the total number?

*/
STARTS_IN{
@Override
boolean test(OverlapDetector<? extends Locatable> detector, final VariantContext query) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final OverlapDetector...

},

/**
* Always matches, may be used to not perform any filtering, alternatively a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete comment?

* @param intervals the intervals to compare against, note that these are not merged so if they should be merged than the input list should be preprocessed
* @param mode the matching mode to use
*/
public IntervalFilteringVcfWriter(final VariantContextWriter writer, List<SimpleInterval> intervals, Mode mode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finals

|| variant.getAttributeAsInt(VCFConstants.DEPTH_KEY,0) == 0
|| (onlyOutputCallsStartingInIntervals && !intervals.stream().anyMatch(interval -> interval.contains(variantStart)))) {
|| variant.getAttributeAsInt(VCFConstants.DEPTH_KEY,0) == 0 )
// todo this changes is a slight de-optimization since we will now process some sites whihc were previously ignored
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whihc -> which. Not sure if you think this TODO will be worth addressing, but thanks for making a note for now.

@@ -45,6 +49,11 @@
import org.broadinstitute.hellbender.utils.reference.ReferenceUtils;
import org.broadinstitute.hellbender.utils.variant.GATKVariantContextUtils;
import org.broadinstitute.hellbender.utils.variant.writers.ShardingVCFWriter;
import org.broadinstitute.hellbender.utils.variant.writers.IntervalFilteringVcfWriter;

//TODO:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this entails, but don't forget it!

@lbergelson
Copy link
Member Author

I've responded to the ancient comments, rebased, and added some test improvements.

@@ -46,7 +46,7 @@ private StandardArgumentDefinitions(){}
public static final String INVALIDATE_PREVIOUS_FILTERS_LONG_NAME = "invalidate-previous-filters";
public static final String SORT_ORDER_LONG_NAME = "sort-order";
public static final String FLOW_ORDER_FOR_ANNOTATIONS = "flow-order-for-annotations";

public static final String VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME = "variant-output-interval-filtering-mode";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is a mess, does anyone have suggestions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just "variant-interval-filtering"?

@gatk-bot
Copy link

gatk-bot commented Aug 19, 2024

Github actions tests reported job failures from actions build 10457568370
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 17.0.6+10 10457568370.11 logs
integration 17.0.6+10 10457568370.0 logs

@gatk-bot
Copy link

gatk-bot commented Aug 19, 2024

Github actions tests reported job failures from actions build 10457619605
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 17.0.6+10 10457619605.11 logs
integration 17.0.6+10 10457619605.0 logs

@gatk-bot
Copy link

Github actions tests reported job failures from actions build 10458689373
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 17.0.6+10 10458689373.11 logs

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the most part this looks good and is nice and extensible to all/most cases. I think we need to hunt down cases where the VCFWriter is being constructed outside of this codepath and try to patch them to include this if possible. Beyond that my primary gripe is that there need to be more explicit tests of the bizzaro edge cases that must exist with this code

@@ -46,7 +46,7 @@ private StandardArgumentDefinitions(){}
public static final String INVALIDATE_PREVIOUS_FILTERS_LONG_NAME = "invalidate-previous-filters";
public static final String SORT_ORDER_LONG_NAME = "sort-order";
public static final String FLOW_ORDER_FOR_ANNOTATIONS = "flow-order-for-annotations";

public static final String VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME = "variant-output-interval-filtering-mode";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just "variant-interval-filtering"?

* @return Default interval filtering mode for variant output. Subclasses may override this to set a different default.
*/
public IntervalFilteringVcfWriter.Mode getDefaultVariantOutputFilterMode(){
return null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHould this default to "ANYWHERE" instead?

validateSequenceDictionaries();
}

checkToolRequirements();

if (outputVariantIntervalFilteringMode != null && userIntervals == null){
throw new CommandLineException.MissingArgument("-L or -XL", "Intervals are required if --" + StandardArgumentDefinitions.VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME + " was specified.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a tool overrides the default here its going to spit this message if the user doesn't provide intervals at all which i think is confusing about this message.... We should probably disambiguate the default and user supplied here...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I'm not sure it's addressed by your changes though?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well pulling this down into variant walker base makes this a little more obnoxious to fix... I'm just going to add a comment clarifying but this is a head scratcher

ENDS_IN("ends within any of the given intervals"){
@Override
boolean test(final OverlapDetector<? extends Locatable> detector, final VariantContext query) {
final SimpleInterval endPosition = new SimpleInterval(query.getContig(), query.getEnd(), query.getEnd());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the simple interval conversion necessary here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is because we want to be sure we're only checking the end position for overlap.

doc = "Restrict the output variants to ones that match the specified intervals according to the specified matching mode.",
optional = true)
@Advanced
public IntervalFilteringVcfWriter.Mode outputVariantIntervalFilteringMode = getDefaultVariantOutputFilterMode();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this actually be moved into the VariantWalker codepaths so its not getting applied to EVERY GATKTool even those that are going to cause confusion if its there?

@gatk-bot
Copy link

gatk-bot commented Sep 17, 2024

Github actions tests reported job failures from actions build 10908982699
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 17.0.6+10 10908982699.11 logs
variantcalling 17.0.6+10 10908982699.2 logs
integration 17.0.6+10 10908982699.0 logs

Copy link
Member Author

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery
Adding the filter count at the end is definitely a big improvement. I'm not clear on adding a different method at the GATK level and at the VariantWalker level. It seems like just an extra layer of indirection. Also, tests are failing because I believe it needs to be at the VariantWalkerBase level instead.

I like moving the argument down to that level though. Maybe we don't need a way of specifying a default since no one actually wants to, and we just make it so we override the GATKTool method in order to pass back up the argument value. That also removes the problem of "tool specified a filter pattern but no intervals so we output an error message.

I found a few typos of course.

validateSequenceDictionaries();
}

checkToolRequirements();

if (outputVariantIntervalFilteringMode != null && userIntervals == null){
throw new CommandLineException.MissingArgument("-L or -XL", "Intervals are required if --" + StandardArgumentDefinitions.VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME + " was specified.");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I'm not sure it's addressed by your changes though?

}

@Override
public void close() {
writer.close();
underlyingWriter.close();
if (filteredCount > 0) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably move the output ahead of the close() since if there's an exception in the tool there's often an exception in the closing it but you might still want that information to debub.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough

{Arrays.asList(chr1Interval, chr2Interval), new ArrayList<>(), IntervalFilteringVcfWriter.Mode.CONTAINED, List.of(VariantEmitter.INT3, VariantEmitter.INT7)},

// Tests specifically aimed at documenting how the --interval-merging-rule argument works in conjunction with the interval filtering
{Arrays.asList(chr1IntervalLeft, chr1IntervalRight), new ArrayList<>(), IntervalFilteringVcfWriter.Mode.CONTAINED, List.of(VariantEmitter.INT2)}, // Default is to merge all
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to use INTERVAL_MERGING_RULE_LONG_NAME

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done but this is about as close to as friviolous as this gets since that argument is bedrock that won't change and it makes this block more unreadable \

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, but I really like to be able to find all the uses of an argument in tests by clicking on find uses.

@@ -949,11 +942,11 @@ public VariantContextWriter createVCFWriter(final Path outPath) {
options.toArray(new Options[0]));
}

return outputVariantIntervalFilteringMode== null ?
return getVariantFilteringOutputModeIfApplicable() == IntervalFilteringVcfWriter.Mode.ANYWHERE ?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure this rename helps that much, it just adds another layer of indirection. It's then kind of confusing because tools below variantwalker might not be sure which to override.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to getVariantOutputFilteringMode I don't really know what IS better here

ENDS_IN("ends within any of the given intervals"){
@Override
boolean test(final OverlapDetector<? extends Locatable> detector, final VariantContext query) {
final SimpleInterval endPosition = new SimpleInterval(query.getContig(), query.getEnd(), query.getEnd());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is because we want to be sure we're only checking the end position for overlap.

@@ -31,6 +33,28 @@ public abstract class VariantWalker extends VariantWalkerBase {
private FeatureDataSource<VariantContext> drivingVariants;
private FeatureInput<VariantContext> drivingVariantsFeatureInput;

@Argument(fullName = StandardArgumentDefinitions.VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure all these changes need to go in VariantWalkerBase, not VariantWalker. I think that's why all the tests exploded.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

responses

@@ -949,11 +942,11 @@ public VariantContextWriter createVCFWriter(final Path outPath) {
options.toArray(new Options[0]));
}

return outputVariantIntervalFilteringMode== null ?
return getVariantFilteringOutputModeIfApplicable() == IntervalFilteringVcfWriter.Mode.ANYWHERE ?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to getVariantOutputFilteringMode I don't really know what IS better here

{Arrays.asList(chr1Interval, chr2Interval), new ArrayList<>(), IntervalFilteringVcfWriter.Mode.CONTAINED, List.of(VariantEmitter.INT3, VariantEmitter.INT7)},

// Tests specifically aimed at documenting how the --interval-merging-rule argument works in conjunction with the interval filtering
{Arrays.asList(chr1IntervalLeft, chr1IntervalRight), new ArrayList<>(), IntervalFilteringVcfWriter.Mode.CONTAINED, List.of(VariantEmitter.INT2)}, // Default is to merge all
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done but this is about as close to as friviolous as this gets since that argument is bedrock that won't change and it makes this block more unreadable \

}

@Override
public void close() {
writer.close();
underlyingWriter.close();
if (filteredCount > 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough

validateSequenceDictionaries();
}

checkToolRequirements();

if (outputVariantIntervalFilteringMode != null && userIntervals == null){
throw new CommandLineException.MissingArgument("-L or -XL", "Intervals are required if --" + StandardArgumentDefinitions.VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME + " was specified.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well pulling this down into variant walker base makes this a little more obnoxious to fix... I'm just going to add a comment clarifying but this is a head scratcher

@gatk-bot
Copy link

Github actions tests reported job failures from actions build 10910519980
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling 17.0.6+10 10910519980.2 logs

Copy link
Member Author

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me after our discussion.

/**
* @return Default interval filtering mode for variant output. Subclasses may override this to set a different default.
*/
public IntervalFilteringVcfWriter.Mode getDefaultVariantOutputFilterMode(){
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unused now. Should we delete it?

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review done with @lbergelson

@jamesemery jamesemery merged commit 6036d67 into master Sep 18, 2024
20 checks passed
@jamesemery jamesemery deleted the lb_starting_in_intervals branch September 18, 2024 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants