Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Genotype and GenotypeAnnotation #108

Closed
wants to merge 7 commits into from

Conversation

heuermh
Copy link
Member

@heuermh heuermh commented Oct 24, 2016

Starting point for discussion around Genotype and GenotypeAnnotation. This is a tear-down-and-rebuild, where VariantCallingAnnotations has been removed and the bare minimum put back in.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/129/
Test PASSed.

Additional genotype attributes that do not fit into the standard fields above.
The values are stored as strings, even for flag, integer, and float types.
*/
map<string> attributes = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we require the attributes but not the genotype? I ask because this is the opposite of [https://github.com/tmoerman/adam-fx/blob/master/src/main/resources/avro/adam-fx.avdl]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't require the attributes here; if no attributes are specified, we will default to an empty map. If we wanted to require the user to set attributes, this line would read:

map<string> attributes;

Our policy is to always allow fields to be nullable. For pure fields, we do this through the nullable syntax. For maps/arrays, we do this by setting the default to an empty map/array.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see. great!

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love these changes. I agree with cutting out some of the fields, but the new Genotype record is far too austere for my preferences. Let me take a hack at a PR that is more of a gradual refactor, as a comparison point.

True if this genotype is phased. If true, the order of alleles is significant. VCF genotype field
reserved key "GT" allele separators: '|' genotype is phased; '/' genotype is unphased.
*/
union { boolean, null } phased = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phased doesn't really make sense without the phaseSetId, and I'd argue that it isn't great to have without phaseSetQuality/phasingQuality.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note before this was defined as phaseSetId != null. According to the VCF spec, one can have phased alleles without specifying a phase set ID:

"All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see phaseSetQuality in the VCF spec. phasingQuality is VCF genotype field reserved key "PQ". Is that what you meant or did you have something else in mind?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or if you meant to use phaseSetQuality instead of phasingQuality, I'd +1 that.

/**
Genotype.
*/
record Genotype {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We lost a lot of pretty important Genotype level fields here, like genotype quality, read depth, alt/reference depth, etc. What's the rationale for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below

*/
union { null, Variant } variant = null;

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're refactoring Genotype, it'd be kinda nice to remove the contigName/start/end fields, since those are nested in variant. We flattened them a while back due to a performance regression in Parquet when running predicates.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are nested fields still a problem for Parquet?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, nested fields are still a big problem, mostly in the case when we are trying to minimize latency (Mango). One of the motivations for putting in contigName/start/end fields was because of this latency issue.

/**
Genotype annotation.
*/
record GenotypeAnnotation {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer having GenotypeAnnotation nest in Genotype than the other way around. The reason I like nesting Variant in VariantAnnotation is because Variant nests in Genotype, and thus doing a join between Genotypes and VariantAnnotations makes sense to me. Historically, we had VariantCallingAnnotations as a nested record solely for convenience. I.e., for a "finished" genotype dataset like 1KG, you would probably not populate the VariantCallingAnnotations, since they're just used to determine whether a genotype call is correct or not, and the "finished" callset should be "correct".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of what I don't understand about the current API.

In conference with @akmorrow13 this afternoon I'm starting to lean more towards removing the *Annotation records altogether in favor of merging the fields onto Variant and Genotype and nulling them out via projection.

I'd like further discussion around this, if only for my own understanding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In conference with @akmorrow13 this afternoon I'm starting to lean more towards removing the *Annotation records altogether in favor of merging the fields onto Variant and Genotype and nulling them out via projection.

Actually, that's a reasonable approach. TBH, I'd be OK with that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Avro, if you don't set an array or map field, does it still occupy space in RAM as empty arrays and empty maps? Same with array and map fields that are projected away by Parquet, which I assume/hope does not explicitly set those fields to empty.

@@ -988,6 +703,97 @@ record VariantAnnotation {
}

/**
Allele encodings for genotypes.
*/
enum Allele {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on rename and capitalization

@heuermh
Copy link
Member Author

heuermh commented Oct 27, 2016

I don't love these changes. I agree with cutting out some of the fields, but the new Genotype record is far too austere for my preferences. Let me take a hack at a PR that is more of a gradual refactor, as a comparison point.

As I stated above, this pull request was just a starting point for discussion. I plan to gradually add back most if not all of the missing fields.

I started with what I thought were the most uncontroversial and non-complicated fields.

@fnothaft
Copy link
Member

As I stated above, this pull request was just a starting point for discussion. I plan to gradually add back most if not all of the missing fields.

Oops! Got it now. Sorry about the confusion. I put a (lazily named) view of what I think would be good at #109. Let me know your thoughts.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/131/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/133/
Test PASSed.

@heuermh heuermh force-pushed the genotype-annotation branch from 227019c to cd69f8d Compare November 4, 2016 13:58
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/134/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/135/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/136/
Test PASSed.

are in the same phased set. All phased genotypes that do not have a phaseSetId are assumed to
belong to the same phased set. VCF genotype field reserved key PS.
*/
union { null, string } phaseSetId = null;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that even though the VCF specification types this as Integer, and recommends "a convention of the position of the first variant in the set identifier (although this is not required)", I've seen all sorts of things used in the wild that aren't Integers. E.g. the GIAB VCF files use [position_ref_alt], 863556_G_A

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer to stick with Int, as that's what the VCF spec states, and what is IMO correct (i.e., if I were to implement this, I would use an Int "UUID"). That being said, I agree that it is pretty common to see the string phase set ID. I don't know what's the correct thing to do here. If we go with the string, then we emit headers that disagree with the VCF spec, which seems wrong. If we go with Int, then we can't handle VCFs that come in from the wild. If we went with say:

union { null, int } phaseSetId = null;
union { null, string } phaseSetStringId = null;

Then we'd need to figure out at save whether the user provided string/int/no phase set IDs.

I don't think there's a right/wrong answer, but we see this elsewhere (bigdatagenomics/adam#1213), so we should hash out our philosophy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we should go with what @heuermh argued over in #1213. :)

union { null, boolean } filtersPassed = null;

/**
Zero or more filters that failed for this variant. VCF genotype field reserved key FT.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should "..that failed for this variant" be "..that failed for this genotype call"?
Also applies to above field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! fixed it

@heuermh heuermh force-pushed the genotype-annotation branch from b37f21a to da05b5b Compare November 7, 2016 01:48
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/137/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/138/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/141/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/147/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented Jan 10, 2017

Things got hairy on the last rebase, so I'm not 100% sure this is currently the way I want it

@fnothaft
Copy link
Member

SGTM. Ping me when you'd like me to make a review pass.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/158/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/159/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented May 17, 2017

Fixes #106

@heuermh heuermh modified the milestone: 0.12.0 May 23, 2017
@heuermh heuermh force-pushed the genotype-annotation branch from d9b71f1 to be8f76b Compare June 18, 2017 01:54
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/165/
Test PASSed.

@heuermh heuermh force-pushed the genotype-annotation branch from be8f76b to 3687814 Compare June 22, 2017 04:34
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/166/
Test PASSed.

@heuermh heuermh force-pushed the genotype-annotation branch from 3687814 to acb8f41 Compare October 11, 2017 19:41
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/172/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/174/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented Oct 12, 2017

Most recent changes:

  • likelihoods and probabilities are log-scaled
  • new VCF genotype field keys priors, nonReferenceLikelihoods, and fisherStrandBiasPValue

I have reservations about log-scaling, in that users are not likely to look at the documentation, and will be confused when e.g. GL and likelihoods values do not match.

I'm still not sure what to do re: #108 (comment) which may be hidden above. We currently set the variant field and the flattened values when reading from VCF, so unless a user projects away the fields they don't want, we're using more RAM than necessary.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/175/
Test PASSed.

@heuermh
Copy link
Member Author

heuermh commented Oct 23, 2017

Resolved thread around flattening variant fields on Genotype by removing variant field and adding alternateAllele and referenceAllele.

@heuermh
Copy link
Member Author

heuermh commented Jan 14, 2019

Replaced by #176.

@heuermh heuermh closed this Jan 14, 2019
@heuermh heuermh deleted the genotype-annotation branch July 2, 2019 04:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants