Refactor Genotype and GenotypeAnnotation #108

heuermh · 2016-10-24T18:29:16Z

Starting point for discussion around Genotype and GenotypeAnnotation. This is a tear-down-and-rebuild, where VariantCallingAnnotations has been removed and the bare minimum put back in.

AmplabJenkins · 2016-10-24T18:32:18Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/129/

akmorrow13 · 2016-10-26T17:50:36Z

src/main/resources/avro/bdg.avdl

+   Additional genotype attributes that do not fit into the standard fields above.
+   The values are stored as strings, even for flag, integer, and float types.
+   */
+  map<string> attributes = {};


Why do we require the attributes but not the genotype? I ask because this is the opposite of [https://github.com/tmoerman/adam-fx/blob/master/src/main/resources/avro/adam-fx.avdl]

We don't require the attributes here; if no attributes are specified, we will default to an empty map. If we wanted to require the user to set attributes, this line would read:

map<string> attributes;

Our policy is to always allow fields to be nullable. For pure fields, we do this through the nullable syntax. For maps/arrays, we do this by setting the default to an empty map/array.
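The default-to-empty policy described above can be sketched with plain Python dictionaries standing in for the JSON schema that the IDL declarations would compile to. This is only an illustration: `apply_defaults` is a hypothetical helper that mimics how a reader fills in unset fields, and it is not part of Avro or bdg-formats.

```python
import copy

# JSON-style field declarations modeled on the IDL lines in this thread.
# The field names come from the diff; this layout is an illustrative assumption.
GENOTYPE_FIELDS = [
    {"name": "attributes", "type": {"type": "map", "values": "string"}, "default": {}},
    {"name": "phased", "type": ["boolean", "null"], "default": False},
    {"name": "phaseSetId", "type": ["null", "string"], "default": None},
]


def apply_defaults(record, fields=GENOTYPE_FIELDS):
    """Hypothetical helper: fill in every field the caller did not set."""
    filled = dict(record)
    for field in fields:
        if field["name"] not in filled:
            if "default" not in field:
                # A declaration without a default is a required field.
                raise ValueError("missing required field: " + field["name"])
            # Deep-copy so records never share one mutable default map.
            filled[field["name"]] = copy.deepcopy(field["default"])
    return filled


genotype = apply_defaults({"phased": True})
print(genotype)
```

A declaration written without a default, such as `map<string> attributes;`, corresponds to the branch that raises here.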

ah I see. great!

fnothaft

I don't love these changes. I agree with cutting out some of the fields, but the new Genotype record is far too austere for my preferences. Let me take a hack at a PR that is more of a gradual refactor, as a comparison point.

fnothaft · 2016-10-27T01:30:43Z

src/main/resources/avro/bdg.avdl

+   True if this genotype is phased.  If true, the order of alleles is significant. VCF genotype field
+   reserved key "GT" allele separators: '|' genotype is phased; '/' genotype is unphased.
+   */
+  union { boolean, null } phased = false;


phased doesn't really make sense without the phaseSetId, and I'd argue that it isn't great to have without phaseSetQuality/phasingQuality.

Note that before this change, phased was defined as phaseSetId != null. According to the VCF spec, one can have phased alleles without specifying a phase set ID:

"All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set."

I don't see phaseSetQuality in the VCF spec. phasingQuality is VCF genotype field reserved key "PQ". Is that what you meant or did you have something else in mind?

Or if you meant to use phaseSetQuality instead of phasingQuality, I'd +1 that.
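As a rough illustration of the point that phasing comes from the GT separator and not from the presence of PS, the sketch below reads a GT string on its own. `parse_gt` is a hypothetical helper, and it assumes a single separator type per genotype (mixed `/` and `|` is not handled).

```python
def parse_gt(gt):
    """Split a VCF GT value into allele indices and a phased flag."""
    phased = "|" in gt
    separator = "|" if phased else "/"
    # "." marks a missing allele call.
    alleles = [None if a == "." else int(a) for a in gt.split(separator)]
    return alleles, phased


def to_genotype(gt, ps=None):
    """Build a minimal record; phaseSetId may stay None for a phased call."""
    alleles, phased = parse_gt(gt)
    return {"alleles": alleles, "phased": phased, "phaseSetId": ps}


print(to_genotype("0|1"))
print(to_genotype("0/1"))
```

With this reading, `phased` can be true while `phaseSetId` is null, which matches the quoted sentence from the VCF spec.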

fnothaft · 2016-10-27T01:32:12Z

src/main/resources/avro/bdg.avdl

+/**
+ Genotype.
+ */
+record Genotype {


We lost a lot of pretty important Genotype level fields here, like genotype quality, read depth, alt/reference depth, etc. What's the rationale for that?

See comment below

fnothaft · 2016-10-27T01:32:46Z

src/main/resources/avro/bdg.avdl

+   */
+  union { null, Variant } variant = null;
+
+  /**


If we're refactoring Genotype, it'd be kinda nice to remove the contigName/start/end fields, since those are nested in variant. We flattened them a while back due to a performance regression in Parquet when running predicates.

Are nested fields still a problem for Parquet?

Yes, nested fields are still a big problem, mostly in the case when we are trying to minimize latency (Mango). One of the motivations for putting in contigName/start/end fields was because of this latency issue.
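To make the trade-off concrete, here is a minimal sketch of flattening, with dictionaries standing in for Avro records. The helper names are hypothetical, and the half-open coordinate convention is an assumption, not something stated in this thread.

```python
def flatten_position(genotype):
    """Copy position fields from the nested variant onto the genotype."""
    variant = genotype.get("variant") or {}
    flat = dict(genotype)
    for name in ("contigName", "start", "end"):
        flat[name] = variant.get(name)
    return flat


def in_region(genotype, contig, start, end):
    """Overlap test that reads only the flattened top-level fields.

    Assumes half-open [start, end) coordinates and populated fields.
    """
    return (genotype["contigName"] == contig
            and genotype["start"] < end
            and genotype["end"] > start)


g = flatten_position({"variant": {"contigName": "1", "start": 100, "end": 101}})
print(in_region(g, "1", 50, 150))
```

The predicate never descends into `variant`, which is the property the flattened fields were added for; the cost is that the same values are stored twice.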

fnothaft · 2016-10-27T01:54:05Z

src/main/resources/avro/bdg.avdl

+/**
+ Genotype annotation.
+ */
+record GenotypeAnnotation {


I prefer having GenotypeAnnotation nest in Genotype rather than the other way around. The reason I like nesting Variant in VariantAnnotation is that Variant nests in Genotype, and thus doing a join between Genotypes and VariantAnnotations makes sense to me. Historically, we had VariantCallingAnnotations as a nested record solely for convenience. I.e., for a "finished" genotype dataset like 1KG, you would probably not populate the VariantCallingAnnotations, since they're just used to determine whether a genotype call is correct or not, and the "finished" callset should be "correct".

This is part of what I don't understand about the current API.

In conference with @akmorrow13 this afternoon I'm starting to lean more towards removing the *Annotation records altogether in favor of merging the fields onto Variant and Genotype and nulling them out via projection.

I'd like further discussion around this, if only for my own understanding.

> In conference with @akmorrow13 this afternoon I'm starting to lean more towards removing the *Annotation records altogether in favor of merging the fields onto Variant and Genotype and nulling them out via projection.

Actually, that's a reasonable approach. TBH, I'd be OK with that.

With Avro, if you don't set an array or map field, does it still occupy space in RAM as empty arrays and empty maps? Same with array and map fields that are projected away by Parquet, which I assume/hope does not explicitly set those fields to empty.

fnothaft · 2016-10-27T01:55:05Z

src/main/resources/avro/bdg.avdl

@@ -988,6 +703,97 @@ record VariantAnnotation {
 }

 /**
+ Allele encodings for genotypes.
+ */
+enum Allele {


+1 on rename and capitalization

heuermh · 2016-10-27T02:27:16Z

> I don't love these changes. I agree with cutting out some of the fields, but the new Genotype record is far too austere for my preferences. Let me take a hack at a PR that is more of a gradual refactor, as a comparison point.

As I stated above, this pull request was just a starting point for discussion. I plan to gradually add back most if not all of the missing fields.

I started with what I thought were the most uncontroversial and non-complicated fields.

fnothaft · 2016-10-27T02:28:49Z

> As I stated above, this pull request was just a starting point for discussion. I plan to gradually add back most if not all of the missing fields.

Oops! Got it now. Sorry about the confusion. I put a (lazily named) view of what I think would be good at #109. Let me know your thoughts.

AmplabJenkins · 2016-10-27T16:47:17Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/131/

AmplabJenkins · 2016-11-04T13:57:28Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/133/

AmplabJenkins · 2016-11-04T14:02:23Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/134/

AmplabJenkins · 2016-11-04T15:32:18Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/135/

AmplabJenkins · 2016-11-04T16:17:18Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/136/

heuermh · 2016-11-05T13:29:38Z

src/main/resources/avro/bdg.avdl

+   are in the same phased set. All phased genotypes that do not have a phaseSetId are assumed to
+   belong to the same phased set. VCF genotype field reserved key PS.
+   */
+  union { null, string } phaseSetId = null;


Note that even though the VCF specification types this as Integer, and recommends "a convention of the position of the first variant in the set identifier (although this is not required)", I've seen all sorts of things used in the wild that aren't Integers. E.g., the GIAB VCF files use [position_ref_alt], such as 863556_G_A:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz

I think I'd prefer to stick with Int, as that's what the VCF spec states, and what is IMO correct (i.e., if I were to implement this, I would use an Int "UUID"). That being said, I agree that it is pretty common to see the string phase set ID. I don't know what's the correct thing to do here. If we go with the string, then we emit headers that disagree with the VCF spec, which seems wrong. If we go with Int, then we can't handle VCFs that come in from the wild. If we went with say:

union { null, int } phaseSetId = null;
union { null, string } phaseSetStringId = null;

Then we'd need to figure out at save whether the user provided string/int/no phase set IDs.

I don't think there's a right/wrong answer, but we see this elsewhere (bigdatagenomics/adam#1213), so we should hash out our philosophy.

Right, we should go with what @heuermh argued over in #1213. :)
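For illustration only, the two-field alternative floated above could be populated at read time roughly as follows. `split_phase_set` is a hypothetical helper; the field names `phaseSetId` and `phaseSetStringId` are taken from the proposal in this thread, and nothing here reflects what was eventually merged.

```python
import re

# VCF types PS as a non-negative Integer; anything else is kept as a string.
_INTEGER = re.compile(r"[0-9]+")


def split_phase_set(ps):
    """Route a raw PS value to the int field or the string field."""
    if ps is None or ps == ".":
        return {"phaseSetId": None, "phaseSetStringId": None}
    if _INTEGER.fullmatch(ps):
        return {"phaseSetId": int(ps), "phaseSetStringId": None}
    return {"phaseSetId": None, "phaseSetStringId": ps}


print(split_phase_set("863556"))
print(split_phase_set("863556_G_A"))
```

On save, a writer would then have to check which of the two fields is set before choosing the header type, which is the extra bookkeeping mentioned above.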

jpdna · 2016-11-05T20:13:48Z

src/main/resources/avro/bdg.avdl

+  union { null, boolean } filtersPassed = null;
+
+  /**
+   Zero or more filters that failed for this variant. VCF genotype field reserved key FT.


should "..that failed for this variant" be "..that failed for this genotype call"?
Also applies to above field.

thanks! fixed it
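A small sketch of how the per-genotype FT value maps onto these fields, following the VCF convention of `PASS`, a semicolon-separated list of failed filter codes, or `.` when no filters were applied. Only `filtersPassed` appears in the diff above; `filtersApplied` and `filtersFailed` are assumed names used for illustration, and the filter codes are made up.

```python
def parse_ft(ft):
    """Interpret a VCF genotype FT value for a single genotype call."""
    if ft is None or ft == ".":
        # Filters were not applied to this genotype call.
        return {"filtersApplied": False, "filtersPassed": None, "filtersFailed": []}
    if ft == "PASS":
        return {"filtersApplied": True, "filtersPassed": True, "filtersFailed": []}
    # Otherwise FT lists the codes of the filters this genotype call failed.
    return {"filtersApplied": True, "filtersPassed": False, "filtersFailed": ft.split(";")}


print(parse_ft("PASS"))
print(parse_ft("lowGQ;lowDP"))
```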

AmplabJenkins · 2016-11-07T01:52:22Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/137/

AmplabJenkins · 2016-11-07T13:47:23Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/138/

AmplabJenkins · 2016-11-08T19:12:22Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/141/

AmplabJenkins · 2017-01-10T17:33:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/147/

heuermh · 2017-01-10T17:34:34Z

Things got hairy on the last rebase, so I'm not 100% sure this is currently the way I want it

fnothaft · 2017-01-10T23:09:31Z

SGTM. Ping me when you'd like me to make a review pass.

AmplabJenkins · 2017-05-16T15:17:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/158/

AmplabJenkins · 2017-05-16T15:52:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/159/

heuermh · 2017-05-17T14:28:30Z

Fixes #106

AmplabJenkins · 2017-06-18T01:57:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/165/

AmplabJenkins · 2017-06-22T04:37:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/166/

AmplabJenkins · 2017-10-11T19:42:59Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/172/

AmplabJenkins · 2017-10-12T17:18:02Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/174/

heuermh · 2017-10-12T17:23:17Z

Most recent changes:

- likelihoods and probabilities are log-scaled
- new VCF genotype field keys priors, nonReferenceLikelihoods, and fisherStrandBiasPValue

I have reservations about log-scaling, in that users are not likely to look at the documentation, and will be confused when e.g. GL and likelihoods values do not match.
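To show why the stored values would not match GL at a glance, here is a sketch of the conversions. It assumes the record stores natural-log values; the thread does not state the base, so treat that as an assumption. VCF GL is log10-scaled and PL is phred-scaled, i.e. PL = -10 * log10(L).

```python
import math

LN10 = math.log(10)


def gl_to_ln(gl):
    """Convert VCF GL values (log10 likelihoods) to natural-log likelihoods."""
    return [x * LN10 for x in gl]


def pl_to_ln(pl):
    """Convert VCF PL values (phred-scaled) to natural-log likelihoods."""
    return [-p / 10.0 * LN10 for p in pl]


# Same call expressed as GL and as PL.
print(gl_to_ln([0.0, -1.0, -5.0]))
print(pl_to_ln([0, 10, 50]))
```

A GL of -1.0 becomes roughly -2.30 after conversion, so a user comparing the record against the original VCF line would see different numbers for the same likelihood.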

I'm still not sure what to do re: #108 (comment) which may be hidden above. We currently set the variant field and the flattened values when reading from VCF, so unless a user projects away the fields they don't want, we're using more RAM than necessary.

AmplabJenkins · 2017-10-23T18:37:18Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/175/

heuermh · 2017-10-23T18:38:58Z

Resolved thread around flattening variant fields on Genotype by removing variant field and adding alternateAllele and referenceAllele.

heuermh · 2019-01-14T23:22:29Z

Replaced by #176.

heuermh mentioned this pull request Oct 24, 2016

Upgrade some fields/documentation to improve gVCF support #107

Closed

akmorrow13 reviewed Oct 26, 2016

fnothaft requested changes Oct 27, 2016

fnothaft mentioned this pull request Oct 27, 2016

Refactor Genotype and GenotypeAnnotation #109

Closed

heuermh force-pushed the genotype-annotation branch from 227019c to cd69f8d Compare November 4, 2016 13:58

heuermh mentioned this pull request Nov 4, 2016

Add variant filters for VCF column 7 FILTER #110

Merged

heuermh force-pushed the genotype-annotation branch from c0a37ba to b37f21a Compare November 4, 2016 16:13

heuermh commented Nov 5, 2016

jpdna reviewed Nov 5, 2016

heuermh force-pushed the genotype-annotation branch from b37f21a to da05b5b Compare November 7, 2016 01:48

heuermh mentioned this pull request Nov 8, 2016

Improve doc for VariantAnnotation #111

Merged

fnothaft mentioned this pull request Nov 8, 2016

Harmonize Variant/VariantCallingAnnotations filters #112

Closed

heuermh force-pushed the genotype-annotation branch from 361d2ab to 816f29b Compare November 8, 2016 19:08

fnothaft mentioned this pull request Nov 23, 2016

Clean rewrite of VariantContextConverter bigdatagenomics/adam#1288

Merged

heuermh mentioned this pull request Jan 10, 2017

Should Genotype allow for multiple Variants / alleles? #33

Closed

heuermh force-pushed the genotype-annotation branch from 816f29b to f2b4ada Compare January 10, 2017 17:32

fnothaft mentioned this pull request May 12, 2017

Bump to bdg-formats 0.11.0 bigdatagenomics/adam#1520

Closed

heuermh force-pushed the genotype-annotation branch from f045d23 to 40c4793 Compare May 16, 2017 14:09

heuermh modified the milestone: 0.12.0 May 23, 2017

heuermh force-pushed the genotype-annotation branch from d9b71f1 to be8f76b Compare June 18, 2017 01:54

heuermh force-pushed the genotype-annotation branch from be8f76b to 3687814 Compare June 22, 2017 04:34

heuermh added 5 commits October 11, 2017 14:40

Refactor Genotype and GenotypeAnnotation

2a93cd3

Flesh out VCF genotype field reserved keys, doc remaining fields as comments

0f8b573

Addressing review comments.

a7147aa

Adding gVCF related fields.

8e58371

Reapply float to double changes from commit 96951c2.

acb8f41

heuermh force-pushed the genotype-annotation branch from 3687814 to acb8f41 Compare October 11, 2017 19:41

Further addressing review comments.

3483f53

heuermh mentioned this pull request Oct 20, 2017

[ADAM-1770] Genotype should only store core variant fields. bigdatagenomics/adam#1771

Merged

Remove variant field and add ref and alt alleles.

15345dc

heuermh mentioned this pull request Sep 18, 2018

[FORMATS-170] Add sampleId to Feature record #167

Merged

heuermh mentioned this pull request Jan 14, 2019

[FORMATS-106] Refactor Genotype and GenotypeAnnotation. #176

Closed

heuermh closed this Jan 14, 2019

heuermh deleted the genotype-annotation branch July 2, 2019 04:50
