Index output VCFs for GCNV postprocessing #6330

Merged 2 commits on Jan 7, 2020

ldgauthier
Contributor

Tests are "failing" with the "code is too big" error on the CNN testTrainingReadModel.

I had to update my conda yml template to use a newer TensorFlow that @cmnbroad found. Should I add that here too?

@cmnbroad
Collaborator

The xbyak "code is too big" issue recently started happening on multiple branches (see #6307), but is intermittent. I'll restart that one.

The updated TF I gave you can't be checked in since it's OSX-specific and needs additional work to be integrated (see #6325), so it should be left out for now.

Collaborator

@cmnbroad cmnbroad left a comment

LGTM once we can get the tests to pass.

@gatk-bot

gatk-bot commented Dec 18, 2019

Travis reported job failures from build 28391
Failures in the following jobs:

Test Type JDK Job ID Logs
python openjdk8 28391.5 logs
python openjdk8 28391.5 logs
python openjdk8 28391.5 logs

@@ -348,7 +350,8 @@ public Object onTraversalSuccess() {

  private void generateIntervalsVCFFileFromAllShards() {
  logger.info("Generating intervals VCF file...");
- final VariantContextWriter intervalsVCFWriter = createVCFWriter(outputIntervalsVCFFile);
+ final VariantContextWriter intervalsVCFWriter = GATKVariantContextUtils.createVCFWriter(
Member

This is strange, because createVCFWriter creates output indexes by default.

Member

Oh... it's because the sequence dictionary is weird in this tool. It might make sense to fix this tool's implementation of getBestAvailableSequenceDictionary instead of changing these calls, but maybe that's harder.

Contributor Author

It's because of the way the tool takes inputs. It takes a parent directory that contains the files, so there's no single file the engine knows how to get a sequence dictionary from. I hate to change the input arguments at this point, though.

Member

We could potentially override getBestAvailableSequenceDictionary in this tool to understand what this tool uses, and then other stuff would JustWork
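
For illustration only, a minimal sketch of that override idea. `ToolBase`, `ShardDirectoryTool`, and `wouldIndexOutput` are hypothetical stand-ins, not GATK classes, and the dictionary is modeled as a plain list of contig names; the real engine types and the writer's behavior are assumptions here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for an engine base class; this is not the GATK API.
abstract class ToolBase {
    /** Default lookup: in this sketch the base class has no input it can consult. */
    protected Optional<List<String>> getBestAvailableSequenceDictionary() {
        return Optional.empty();
    }

    /** Shared helper that would only build an index when a dictionary is available. */
    final boolean wouldIndexOutput() {
        return getBestAvailableSequenceDictionary().isPresent();
    }
}

// Hypothetical tool whose inputs live in a directory the base class cannot inspect.
class ShardDirectoryTool extends ToolBase {
    private final List<String> contigsReadFromShards;

    ShardDirectoryTool(final List<String> contigsReadFromShards) {
        this.contigsReadFromShards = contigsReadFromShards;
    }

    @Override
    protected Optional<List<String>> getBestAvailableSequenceDictionary() {
        // Prefer whatever the base class finds; otherwise fall back to the tool's own inputs.
        final Optional<List<String>> fromEngine = super.getBestAvailableSequenceDictionary();
        return fromEngine.isPresent() ? fromEngine : Optional.of(contigsReadFromShards);
    }
}

class OverrideDemo {
    public static void main(final String[] args) {
        final ToolBase tool = new ShardDirectoryTool(Arrays.asList("chr1", "chr2"));
        System.out.println(tool.wouldIndexOutput()); // prints "true"
    }
}
```

Because the shared helper goes through the overridable method, callers would not need to pass the dictionary explicitly at each call site.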

@ldgauthier
Contributor Author

@lbergelson I liked your suggestion, so I did that. What do you think about this improved version?


/* get intervals from each call and model shard in the provided (potentially arbitrary) order */
final List<SimpleIntervalCollection> unsortedIntervalCollectionsFromCalls =
getIntervalCollectionsFromPaths(inputUnsortedCallsShardPaths);
Member

This could potentially be an expensive operation. It might be a good idea to promote these to fields and only initialize them once on demand. It's more complicated and annoying though so either way 👍.

Contributor Author

IntelliJ Refactor > Extract Field and > Encapsulate Field FTW! If I knew how to do the lazy initialization through refactoring this would have been easy peasy. (Instead it was just easy.)

Contributor

Could you just call these initializing methods in getBestAvailableSequenceDictionary as well? (Matilda is preventing me from doing a detailed review, so please ignore if I'm wrong!)
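
For readers following this thread, a rough sketch of the "promote to a field and initialize once on demand" pattern. It is not the code from this PR: `LazyShardIntervals`, `loadIntervals`, and the string-based intervals are hypothetical, and the sketch assumes single-threaded use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder illustrating a lazily initialized field.
class LazyShardIntervals {
    private final List<String> shardPaths;
    private List<String> intervalsFromShards; // stays null until first requested
    private int loadCount = 0;                // counts loads, for demonstration only

    LazyShardIntervals(final List<String> shardPaths) {
        this.shardPaths = shardPaths;
    }

    /** Returns the cached intervals, loading them on the first call only. */
    List<String> getIntervalsFromShards() {
        if (intervalsFromShards == null) {
            intervalsFromShards = loadIntervals();
        }
        return intervalsFromShards;
    }

    // Stand-in for a potentially expensive read of every shard path.
    private List<String> loadIntervals() {
        loadCount++;
        final List<String> result = new ArrayList<>();
        for (final String path : shardPaths) {
            result.add("intervals-from:" + path);
        }
        return result;
    }

    int getLoadCount() {
        return loadCount;
    }
}

class LazyDemo {
    public static void main(final String[] args) {
        final LazyShardIntervals holder =
                new LazyShardIntervals(Arrays.asList("shard-0", "shard-1"));
        holder.getIntervalsFromShards(); // first call triggers the load
        holder.getIntervalsFromShards(); // second call reuses the cached list
        System.out.println(holder.getLoadCount()); // prints "1"
    }
}
```

Any method that needs the intervals, including a dictionary lookup, could call the getter and share the single load.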

final List<SimpleIntervalCollection> unsortedIntervalCollectionsFromModels =
getIntervalCollectionsFromPaths(inputUnsortedModelShardPaths);

if (sequenceDictionary == null) {
Member

I don't think you have to wrap this in a null check because it's also checked in getBestAvailableSequenceDictionary()


logger.info(String.format("Writing intervals VCF file to %s...", outputIntervalsVCFFile.getAbsolutePath()));
for (int shardIndex = 0; shardIndex < numShards; shardIndex++) {
- logger.info(String.format("Analyzing shard %d / %d...", shardIndex, numShards));
+ logger.info(String.format("Analyzing shard %d / %d...", shardIndex + 1, numShards));
Member

Shards are named starting at 1 I take it?

Contributor

Thought I fixed this in another PR... 🤷
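
As a small illustration of what the changed log line does, here is a self-contained sketch. `ShardProgress` and `message` are hypothetical names; only the format string and the `shardIndex + 1` adjustment come from the diff above.

```java
// Hypothetical helper showing a 0-based loop index reported as a 1-based count.
class ShardProgress {
    static String message(final int shardIndex, final int numShards) {
        // The loop index starts at 0, so add 1 for display only.
        return String.format("Analyzing shard %d / %d...", shardIndex + 1, numShards);
    }

    public static void main(final String[] args) {
        final int numShards = 3;
        for (int shardIndex = 0; shardIndex < numShards; shardIndex++) {
            System.out.println(message(shardIndex, numShards));
        }
    }
}
```

With the adjustment, the last message reads "3 / 3" rather than "2 / 3", while the loop itself still iterates over 0-based indices.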

Member

@lbergelson lbergelson left a comment

@ldgauthier One comment but looks good to me either way depending on what you want to do. It's always nice when someone likes my suggestions :)

@ldgauthier ldgauthier force-pushed the ldg_indexPostProcessGCNV branch from 7a278e4 to 91050c8 on January 7, 2020 17:20
@ldgauthier
Contributor Author

Resolves part of #6167

Member

@lbergelson lbergelson left a comment

@ldgauthier Awesome. Looks good to me!

@samuelklee
Contributor

Thanks for adding this, @ldgauthier!

5 participants