-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FilterIntervals now filters out any singleton intervals #6559
Conversation
@samuelklee @mwalker174 Could you please review? Also does either of you know if ScatterIntervals task in CNVTasks WDL has a potential of producing singleton intervals? |
@asmirnov239 looks like you are getting an NPE---remember that intervals are resolved after argument validation, so you need to do the check later. Also, good point, I think you can get a singleton after scattering if you get unlucky with your shards. |
@samuelklee Thank you, that's fixed now. Regarding ScatterIntervals issue - what do you think about changing IntervalListTools logic to not produce singletons instead of filtering them out in GermlineCNVCaller? |
@asmirnov239 I think that we probably want all shard sizes to be as close to equal as possible, so that we avoid any possible issues arising from different model capacities across shards. Actually, is this even an issue at the GermlineCNVCaller step? Maybe we just need to worry about it during ploidy/postprocessing? |
… other intervals on their contigs
151fed2
to
8ac73b5
Compare
@samuelklee You're right - I tested it and it's only an issue in postprocessing , so I'm only making a check there. Neither GermlineCNVCaller or DetermineContigPloidy code fails due to this issue. It should be all set to be reviewed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor nitpicky stuff. Otherwise, thanks, looks good to me!
@@ -431,6 +432,17 @@ private SimpleIntervalCollection filterIntervals() { | |||
countNumberPassing(mask), numIntersectedIntervals)); | |||
} | |||
|
|||
//finally, filter intervals that are solitary in their corresponding contigs | |||
final Map<String, Long> contigToCountMap = IntStream.range(0, numIntersectedIntervals) | |||
.filter(i -> !mask[i]).mapToObj(i -> intersectedIntervals.getRecords().get(i)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put mapToObj
on its own line for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
final Map<String, Long> contigToCountMap = IntStream.range(0, numIntersectedIntervals) | ||
.filter(i -> !mask[i]).mapToObj(i -> intersectedIntervals.getRecords().get(i)) | ||
.collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting())); | ||
IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Break up filter
and forEach
onto their own lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
.filter(i -> !mask[i]).mapToObj(i -> intersectedIntervals.getRecords().get(i)) | ||
.collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting())); | ||
IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> { | ||
final long value = contigToCountMap.get(intersectedIntervals.getRecords().get(i).getContig()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might just name this variable count
. Perhaps even intervalCount
, and rename the map to contigToIntervalCountMap
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done and done
.collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting())); | ||
IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> { | ||
final long value = contigToCountMap.get(intersectedIntervals.getRecords().get(i).getContig()); | ||
if (value <= 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a bit weird to have <=
since it should never happen (and I can't imagine future code changes that would make it possible), but no biggie.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to ==
IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> { | ||
final long value = contigToCountMap.get(intersectedIntervals.getRecords().get(i).getContig()); | ||
if (value <= 1) { | ||
mask[i] = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a log message here indicating when a contig is dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
concatenatedIntervalList.addAll(intervalCollections.get(i).getIntervals()); | ||
}); | ||
|
||
final Map<String, Long> contigToCountMap = concatenatedIntervalList.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd make the variable names and code here consistent with the code in FilterIntervals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I did that as much as possible
@@ -654,4 +653,25 @@ private static File getCopyNumberSegmentsFile(final File pythonSegmenterOutputPa | |||
} | |||
return unsortedIntervalCollectionsFromModels; | |||
} | |||
|
|||
/** | |||
* Validate that the union of shard's interval lists does not have singleton intervals, i.e. intervals that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validate that the concatenation of the sharded interval lists does not have singleton intervals...
, perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
@@ -27,6 +27,7 @@ | |||
import org.testng.annotations.Test; | |||
|
|||
import java.io.File; | |||
import java.lang.reflect.AnnotatedElement; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stray import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not sure how that got in there. Thanks!
@@ -71,6 +72,16 @@ | |||
new AnnotationMap(Arrays.asList( | |||
Pair.of(CopyNumberAnnotations.GC_CONTENT, 0.95), | |||
Pair.of(CopyNumberAnnotations.MAPPABILITY, 0.95), | |||
Pair.of(CopyNumberAnnotations.SEGMENTAL_DUPLICATION_CONTENT, 0.5)))), | |||
new AnnotatedInterval(new SimpleInterval("20", 51, 60), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
White space. Thanks for adding these tests!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh the other ones had tabs as well.. fixed!
* other tools (e.g., {@link DetermineGermlineContigPloidy} and {@link GermlineCNVCaller}) to mask intervals from | ||
* analysis. | ||
* Annotation-based filters will be applied first, followed by count-based filters. In the end, any singleton intervals | ||
* (i.e. those being by themselves on their corresponding contigs) found after applying other filters will be filtered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comma after i.e.
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @asmirnov239, I have a few questions/comments.
IntStream.range(0, numIntersectedIntervals).filter(i -> !mask[i]).forEach(i -> { | ||
final long value = contigToCountMap.get(intersectedIntervals.getRecords().get(i).getContig()); | ||
if (value <= 1) { | ||
mask[i] = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log a warning here, something like:
Filtered singleton interval on contig: "chrX"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
@@ -246,7 +243,7 @@ | |||
public void onStartup() { | |||
super.onStartup(); | |||
/* check for successful import of gcnvkernel */ | |||
PythonScriptExecutor.checkPythonEnvironmentForPackage("gcnvkernel"); | |||
PythonScriptExecutor.checkPythonEnvironmentForPackage("numpy"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this was to get it to run in IDE. Thanks for catching this!
* Validate that the union of shard's interval lists does not have singleton intervals, i.e. intervals that | ||
* are the only ones on their corresponding contigs. | ||
*/ | ||
private void checkForSingletonIntervalAbsence(final List<SimpleIntervalCollection> intervalCollections){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue you can just call this checkForSingletonInterval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
final List<SimpleInterval> concatenatedIntervalList = new ArrayList<>(); | ||
IntStream.range(0, numShards).forEach(i -> { | ||
concatenatedIntervalList.addAll(intervalCollections.get(i).getIntervals()); | ||
}); | ||
|
||
final Map<String, Long> contigToCountMap = concatenatedIntervalList.stream() | ||
.collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting())); | ||
contigToCountMap.keySet().forEach(c -> { | ||
if (contigToCountMap.get(c) == 1) { | ||
throw new IllegalArgumentException( | ||
String.format("Records contain a singleton interval on contig (%s)." + | ||
" Please run FilterIntervals tool first.", c)); | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be simplified a bit:
intervalCollections.stream()
.flatMap(list -> list.getIntervals().stream())
.collect(Collectors.groupingBy(SimpleInterval::getContig, Collectors.counting()))
.entrySet().stream()
.forEach(entry -> {
if (entry.getValue() == 1) {
throw new IllegalArgumentException(
String.format("Records contain a singleton interval on contig (%s)." +
" Please run FilterIntervals tool first.", entry.getKey()));
}
});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I used this
{intervalsWithExtraIntervalFile, Collections.emptyList(), annotatedIntervalsFile, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9, Arrays.asList(2, 5)}, | ||
{intervalsWithExtraIntervalFile, Collections.singletonList("20:1-10"), annotatedIntervalsFile, 0., 1., 0., 1., 0., 1., Arrays.asList(1, 2, 3, 4, 5)}, | ||
{intervalsWithExtraIntervalFile, Arrays.asList("20:1-15", "20:35-45"), annotatedIntervalsFile, 0., 1., 0., 1., 0., 1., Arrays.asList(2, 5)}, | ||
{intervalsWithExtraIntervalFile, Collections.singletonList("20:25-50"), annotatedIntervalsFile, 0.1, 0.9, 0., 1., 0., 1., Arrays.asList(0, 1, 5)}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does one of these test for the singleton case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of them actually do because I added an interval from contig "21" which should be filtered out in all of these.
* FilterIntervals now filters out any singleton intervals, that have no other intervals on their contigs.
This implements suggestion #2 proposed by @samuelklee from issue #5852