Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1770] Genotype should only store core variant fields. #1771

Merged

Conversation

fnothaft
Copy link
Member

Resolves #1770.

@fnothaft fnothaft added this to the 0.23.0 milestone Oct 20, 2017
@fnothaft fnothaft requested a review from heuermh October 20, 2017 11:15
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2438/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1771/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 20bcc89 # timeout=10Checking out Revision 20bcc89 (origin/pr/1771/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 20bcc8972902f8517902a73ba79d822b6ae0e93eFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.2,2.11,1.6.3,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.0,centosTriggering ADAM-prb ? 2.6.2,2.10,1.6.3,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.0,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.0,centosTriggering ADAM-prb ? 2.7.3,2.10,1.6.3,centosTriggering ADAM-prb ? 2.7.3,2.11,1.6.3,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.0,centosADAM-prb ? 2.6.2,2.11,1.6.3,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.0,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,1.6.3,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.0,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.0,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,1.6.3,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,1.6.3,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh
Copy link
Member

heuermh commented Oct 20, 2017

I'm not so sure about this. For the use case where someone converts to Genotypes in Avro+Parquet format and archives the VCF to low cost storage, we don't want to lose data. This change doesn't appear to give the user any choice.

See also comment bigdatagenomics/bdg-formats#108 (comment), which refers to the thread at bigdatagenomics/bdg-formats#108 (comment).

@fnothaft
Copy link
Member Author

Yeah, the premise is that you would save the VCF to separate Parquet Variant and Genotype files and save both. This is what the Genotype converter did prior to the VariantAnnotation refactor, which was intentional. I had thought that we had kept that behavior, but was wrong.

@heuermh
Copy link
Member

heuermh commented Oct 20, 2017

Yeah, the premise is that you would save the VCF to separate Parquet Variant and Genotype files and save both.

If you read them in separately, how do you join them back together?

@fnothaft
Copy link
Member Author

Short answer is with a region join; it'd be similar to the join we used to have with DatabaseVariantAnnotations. When we clean up #1590, data will be marked as sorted on import from VCF, so the join will be fast.

@heuermh
Copy link
Member

heuermh commented Oct 23, 2017

The filter here (which is also messy in a number of different reasons) uses a combination of genotype VCF FORMAT attributes, variant VCF INFO attributes, and Variant.quality.

The genotype refactor in bigdatagenomics/bdg-formats#108 cleans up the separation between Variant and Genotype, meaning that one would have to filter VariantRDD and then join to GenotypeRDD to continue the filter. How might such a filter be implemented in the meantime?

@heuermh heuermh merged commit b84dd11 into bigdatagenomics:master Nov 6, 2017
@heuermh
Copy link
Member

heuermh commented Nov 6, 2017

Thank you, @fnothaft

fnothaft added a commit to fnothaft/adam that referenced this pull request Nov 6, 2017
fnothaft added a commit to fnothaft/adam that referenced this pull request Nov 7, 2017
heuermh pushed a commit that referenced this pull request Nov 7, 2017
fnothaft added a commit to fnothaft/adam that referenced this pull request Dec 19, 2017
…igurable.

Resolves bigdatagenomics#1838. Modifies the behavior of bigdatagenomics#1771, which disabled populating the
`variant.annotation` field in the `Genotype` record. Now, this field is not
populated by default. To enable populating it, a user can set the property
`org.bdgenomics.adam.converters.VariantContextConverter.NEST_ANN_IN_GENOTYPES`
to true.
heuermh pushed a commit that referenced this pull request Dec 20, 2017
…igurable.

Resolves #1838. Modifies the behavior of #1771, which disabled populating the
`variant.annotation` field in the `Genotype` record. Now, this field is not
populated by default. To enable populating it, a user can set the property
`org.bdgenomics.adam.converters.VariantContextConverter.NEST_ANN_IN_GENOTYPES`
to true.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants