-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index output VCFs for GCNV postprocessing #6330
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once we can get the tests to pass.
@@ -348,7 +350,8 @@ public Object onTraversalSuccess() { | |||
|
|||
private void generateIntervalsVCFFileFromAllShards() { | |||
logger.info("Generating intervals VCF file..."); | |||
final VariantContextWriter intervalsVCFWriter = createVCFWriter(outputIntervalsVCFFile); | |||
final VariantContextWriter intervalsVCFWriter = GATKVariantContextUtils.createVCFWriter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is strange, because createVCFWriter creates output indexes by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... it's because the sequence dictionary is weird in this tool. It might make sense to fix this tools implementation of getBestAvailableSequenceDictionary instead of changing these calls, but maybe that's harder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because of the way the tool takes inputs. It takes a parent directory that contains the files, so there's no single file the engine knows how to get a sequence dictionary from. I hate to change the input arguments at this point, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could potentially override getBestAvailableSequenceDictionary in this tool to understand what this tool uses, and then other stuff would JustWork
@lbergelson I liked your suggestion, so I did that. What do you think about this improved version? |
|
||
/* get intervals from each call and model shard in the provided (potentially arbitrary) order */ | ||
final List<SimpleIntervalCollection> unsortedIntervalCollectionsFromCalls = | ||
getIntervalCollectionsFromPaths(inputUnsortedCallsShardPaths); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could potentially be an expensive operation. It might be a good idea to promote these to fields and only initialize them once on demand. It's more complicated and annoying though so either way 👍.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntelliJ Refactor > Extract Field and > Encapsulate Field FTW! If I knew how to do the lazy initialization through refactoring this would have been easy peasy. (Instead it was just easy.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you just call these initializing methods in getBestAvailableSequenceDictionary as well? (Matilda is preventing me from doing a detailed review, so please ignore if I'm wrong!)
final List<SimpleIntervalCollection> unsortedIntervalCollectionsFromModels = | ||
getIntervalCollectionsFromPaths(inputUnsortedModelShardPaths); | ||
|
||
if (sequenceDictionary == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you have to wrap this in a null check because it's also checked in getBestAvailableSequenceDictionary()
|
||
logger.info(String.format("Writing intervals VCF file to %s...", outputIntervalsVCFFile.getAbsolutePath())); | ||
for (int shardIndex = 0; shardIndex < numShards; shardIndex++) { | ||
logger.info(String.format("Analyzing shard %d / %d...", shardIndex, numShards)); | ||
logger.info(String.format("Analyzing shard %d / %d...", shardIndex + 1, numShards)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shards are named starting at 1 I take it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought I fixed this in another PR...dunno emoji
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier One comment but looks good to me either way depending on what you want to do. It's always nice when someone likes my suggestions :)
7a278e4
to
91050c8
Compare
Resolves part of #6167 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier Awesome. Looks good to me!
Thanks for adding this, @ldgauthier! |
Tests are "failing" with the "code is too big" error on the CNN testTrainingReadModel.
I had to update my conda yml template to use a newer Tensorflow @cmnbroad found -- should I add that here too?