Bypass FeatureReader for GenomicsDBImport #7393
Conversation
@mlathara A couple of issues in the test code need to be addressed before this can be merged -- back to you.
@Test(groups = {"bucket"}, dataProvider = "batchSizes")
public void testGenomicsDBImportGCSInputsInBatches(final int batchSize) throws IOException {
    testGenomicsDBImporterWithBatchSize(resolveLargeFilesAsCloudURIs(LOCAL_GVCFS), INTERVAL, COMBINED, batchSize);
}

@Test(groups = {"bucket"}, dataProvider = "batchSizes")
public void testGenomicsDBImportGCSInputsInBatchesNativeReader(final int batchSize) throws IOException {
    testGenomicsDBImporterWithBatchSize(resolveLargeFilesAsCloudURIs(LOCAL_GVCFS), INTERVAL, COMBINED, batchSize, true);
testGenomicsDBImporterWithBatchSize() does not propagate the useNativeReader boolean correctly into writeToGenomicsDB()
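For illustration, here is a minimal sketch of the forwarding bug being described (the class and method names are hypothetical stand-ins loosely mirroring the test helpers, not GATK's actual code):

```java
// Hypothetical sketch of an overload-forwarding bug: a test helper accepts a
// flag but does not pass it through to the method it delegates to.
public class OverloadForwardingSketch {
    // Records the flag that the innermost call actually received.
    static boolean lastUseNativeReader;

    static void writeToGenomicsDB(final int batchSize, final boolean useNativeReader) {
        lastUseNativeReader = useNativeReader;
    }

    // Buggy helper: accepts useNativeReader but silently drops it.
    static void testImporterWithBatchSizeBuggy(final int batchSize, final boolean useNativeReader) {
        writeToGenomicsDB(batchSize, false); // bug: hard-coded instead of forwarding
    }

    // Fixed helper: forwards the caller's flag.
    static void testImporterWithBatchSizeFixed(final int batchSize, final boolean useNativeReader) {
        writeToGenomicsDB(batchSize, useNativeReader);
    }
}
```

With the buggy version, every test that sets the flag still exercises only the non-native path, which is why the new test variants silently passed.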
Fixed.
@Test
public void testGenomicsDBIncrementalAndBatchSize1WithNonAdjacentIntervalsNativeReader() throws IOException {
    final String workspace = createTempDir("genomicsdb-incremental-tests").getAbsolutePath() + "/workspace";
    testIncrementalImport(2, MULTIPLE_NON_ADJACENT_INTERVALS_THAT_WORK_WITH_COMBINE_GVCFS, workspace, 1, false, true, "", 0, true);
testIncrementalImport() does not use the native reader for the first batch (i == 0) -- why is that?
Ah - not entirely sure anymore. I think I wanted to check that a given workspace could be imported into using both the FeatureReader and htslib paths. Refactored a bit to make that clearer, and added a test that does an all-htslib/native incremental import.
@@ -512,6 +530,9 @@ private void initializeHeaderAndSampleMappings() {
        final List<VCFHeader> headers = new ArrayList<>(variantPaths.size());
        for (final String variantPathString : variantPaths) {
            final Path variantPath = IOUtils.getPath(variantPathString);
            if (bypassFeatureReader) {
                assertVariantFileIsCompressedAndIndexed(variantPath);
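As a rough illustration of the kind of precondition involved, here is a purely name-based sketch (the helper below is hypothetical; GATK's actual assertVariantFileIsCompressedAndIndexed inspects the file and its index properly rather than just the name):

```java
// Hypothetical name-based sketch of the precondition the bypass path needs:
// inputs must be block-compressed (e.g. .vcf.gz) and have a tabix index.
public class IndexCheckSketch {
    static boolean looksBlockCompressed(final String fileName) {
        return fileName.endsWith(".vcf.gz") || fileName.endsWith(".vcf.bgz");
    }

    // Tabix indexes conventionally sit next to the data file with a .tbi suffix.
    static String expectedIndexName(final String fileName) {
        return fileName + ".tbi";
    }

    static void assertCompressedAndIndexedByName(final String fileName, final boolean indexExists) {
        if (!looksBlockCompressed(fileName)) {
            throw new IllegalArgumentException(fileName + " is not block-compressed");
        }
        if (!indexExists) {
            throw new IllegalArgumentException(
                fileName + " is missing its index (" + expectedIndexName(fileName) + ")");
        }
    }
}
```

The native/htslib reader seeks through the inputs directly, so it cannot fall back to streaming an uncompressed, unindexed VCF the way FeatureReader can; hence the up-front check.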
In your testing, did you find that these extra checks for whether the inputs are block-compressed and indexed added significantly to the runtime when dealing with remote files?
We haven't done a lot of remote testing -- just sanity tests to ensure that it works. In the small remote cases we've tried, the native reader is actually slower, but I haven't dug into where the bottleneck is (potentially tweaking buffer sizes, etc.). As I mentioned in the PR, that is something we were hoping to explore with Broad.
@@ -348,6 +350,13 @@
        optional = true)
    private boolean sharedPosixFSOptimizations = false;

    @Argument(fullName = BYPASS_FEATURE_READER,
        doc = "Used htslib to read input VCFs instead of FeatureReader. This will reduce memory usage and potentially speed up " +
Used -> Use
FeatureReader -> GATK's FeatureReader
Done.
@droazen Made some changes, I think the PR build failing is unrelated...?
@mlathara Back to you with a few lingering issues in the test code
for(int i=0; i<LOCAL_GVCFS.size(); i+=stepSize) {
    int upper = Math.min(i+stepSize, LOCAL_GVCFS.size());
    writeToGenomicsDB(LOCAL_GVCFS.subList(i, upper), intervals, workspace, batchSize, false, 0, 1, false, false, i!=0,
-                     chrsToPartitions, i!=0 && useNativeReader);
+                     chrsToPartitions, useNativeReaderInitial && useNativeReader);
Here useNativeReaderInitial and useNativeReader are doing the same thing -- the native reader will only be used if both are true, regardless of whether we're on the first batch or a later batch. I think the intent was for useNativeReaderInitial to control whether the native reader should be used for batch 0? In that case, we'd want something like:

(i == 0 && useNativeReaderInitial) || (i > 0 && useNativeReader)
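The suggested selection logic could be sketched as a small helper (hypothetical; the actual fix inlines the expression in the loop):

```java
// Hypothetical helper capturing the per-batch reader selection suggested above:
// batch 0 is governed by useNativeReaderInitial, later batches by useNativeReader.
public class ReaderSelectionSketch {
    static boolean useNativeReaderForBatch(final int i,
                                           final boolean useNativeReaderInitial,
                                           final boolean useNativeReader) {
        return (i == 0 && useNativeReaderInitial) || (i > 0 && useNativeReader);
    }
}
```

This makes the two flags independent: each controls its own phase of the incremental import, so mixed FeatureReader/native imports into the same workspace can be tested in both directions.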
done
@Test
public void testGenomicsDBBasicIncrementalAllNativeReader() throws IOException {
    final String workspace = createTempDir("genomicsdb-incremental-tests").getAbsolutePath() + "/workspace";
    testIncrementalImport(2, INTERVAL, workspace, 0, true, true, COMBINED_WITH_GENOTYPES, 0, false, true);
Both the useNativeReader and useNativeReaderInitial booleans should be true here if this is testing the "all native reader" case
done
Done - sorry for the 😶‍🌫️ 🤦‍♂️
Looks like the integration tests failed with an unrelated error -- I'll try re-running them.
rebase them first
The test failures in the branch build are clearly related to the recent travis key migration. The PR build (which is the one we care about) passes, so this should be safe to merge.
This PR adds the option to bypass FeatureReader for GenomicsDBImport. In our testing, this gives about a 10-15% speedup and uses roughly an order of magnitude less memory when the VCFs and GenomicsDB workspaces are both on local disk. We don't have extensive benchmarking of how this affects GenomicsDBImport in the cloud, but we would be interested in exploring that (in conjunction with some of the recent changes for native cloud support).
cc: @droazen @lbergelson @ldgauthier