-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Genotype and GenotypeAnnotation #108
Conversation
Test PASSed. |
src/main/resources/avro/bdg.avdl
Outdated
Additional genotype attributes that do not fit into the standard fields above. | ||
The values are stored as strings, even for flag, integer, and float types. | ||
*/ | ||
map<string> attributes = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we require the attributes but not the genotype? I ask because this is the opposite of [https://github.com/tmoerman/adam-fx/blob/master/src/main/resources/avro/adam-fx.avdl]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't require the attributes here; if no attributes are specified, we will default to an empty map. If we wanted to require the user to set attributes, this line would read:
map<string> attributes;
Our policy is to always allow fields to be nullable. For pure fields, we do this through the nullable syntax. For maps/arrays, we do this by setting the default to an empty map/array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I see. great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love these changes. I agree with cutting out some of the fields, but the new Genotype
record is far too austere for my preferences. Let me take a hack at a PR that is more of a gradual refactor, as a comparison point.
src/main/resources/avro/bdg.avdl
Outdated
True if this genotype is phased. If true, the order of alleles is significant. VCF genotype field | ||
reserved key "GT" allele separators: '|' genotype is phased; '/' genotype is unphased. | ||
*/ | ||
union { boolean, null } phased = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
phased
doesn't really make sense without the phaseSetId
, and I'd argue that it isn't great to have without phaseSetQuality
/phasingQuality
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note before this was defined as phaseSetId != null
. According to the VCF spec, one can have phased alleles without specifying a phase set ID:
"All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see phaseSetQuality
in the VCF spec. phasingQuality
is VCF genotype field reserved key "PQ". Is that what you meant or did you have something else in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or if you meant to use phaseSetQuality
instead of phasingQuality
, I'd +1 that.
src/main/resources/avro/bdg.avdl
Outdated
/** | ||
Genotype. | ||
*/ | ||
record Genotype { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We lost a lot of pretty important Genotype
level fields here, like genotype quality, read depth, alt/reference depth, etc. What's the rationale for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below
src/main/resources/avro/bdg.avdl
Outdated
*/ | ||
union { null, Variant } variant = null; | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're refactoring Genotype
, it'd be kinda nice to remove the contigName
/start
/end
fields, since those are nested in variant
. We flattened them a while back due to a performance regression in Parquet when running predicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are nested fields still a problem for Parquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, nested fields are still a big problem, mostly in the case when we are trying to minimize latency (Mango). One of the motivations for putting in contigName/start/end
fields was because of this latency issue.
src/main/resources/avro/bdg.avdl
Outdated
/** | ||
Genotype annotation. | ||
*/ | ||
record GenotypeAnnotation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer having GenotypeAnnotation
nest in Genotype
than the other way around. The reason I like nesting Variant
in VariantAnnotation
is because Variant
nests in Genotype
, and thus doing a join between Genotype
s and VariantAnnotation
s makes sense to me. Historically, we had VariantCallingAnnotations
as a nested record solely for convenience. I.e., for a "finished" genotype dataset like 1KG, you would probably not populate the VariantCallingAnnotations
, since they're just used to determine whether a genotype call is correct or not, and the "finished" callset should be "correct".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of what I don't understand about the current API.
In conference with @akmorrow13 this afternoon I'm starting to lean more towards removing the *Annotation records altogether in favor of merging the fields onto Variant
and Genotype
and nulling them out via projection.
I'd like further discussion around this, if only for my own understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In conference with @akmorrow13 this afternoon I'm starting to lean more towards removing the *Annotation records altogether in favor of merging the fields onto Variant and Genotype and nulling them out via projection.
Actually, that's a reasonable approach. TBH, I'd be OK with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With Avro, if you don't set an array or map field, does it still occupy space in RAM as empty arrays and empty maps? Same with array and map fields that are projected away by Parquet, which I assume/hope does not explicitly set those fields to empty.
src/main/resources/avro/bdg.avdl
Outdated
@@ -988,6 +703,97 @@ record VariantAnnotation { | |||
} | |||
|
|||
/** | |||
Allele encodings for genotypes. | |||
*/ | |||
enum Allele { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on rename and capitalization
As I stated above, this pull request was just a starting point for discussion. I plan to gradually add back most if not all of the missing fields. I started with what I thought were the most uncontroversial and non-complicated fields. |
Oops! Got it now. Sorry about the confusion. I put a (lazily named) view of what I think would be good at #109. Let me know your thoughts. |
Test PASSed. |
Test PASSed. |
227019c
to
cd69f8d
Compare
Test PASSed. |
Test PASSed. |
c0a37ba
to
b37f21a
Compare
Test PASSed. |
src/main/resources/avro/bdg.avdl
Outdated
are in the same phased set. All phased genotypes that do not have a phaseSetId are assumed to | ||
belong to the same phased set. VCF genotype field reserved key PS. | ||
*/ | ||
union { null, string } phaseSetId = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that even though the VCF specification types this as Integer, and recommends "a convention of the position of the first variant in the set identifier (although this is not required)", I've seen all sorts of things used in the wild that aren't Integers. E.g. the GIAB VCF files use [position_ref_alt], 863556_G_A
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer to stick with Int, as that's what the VCF spec states, and what is IMO correct (i.e., if I were to implement this, I would use an Int "UUID"). That being said, I agree that it is pretty common to see the string phase set ID. I don't know what's the correct thing to do here. If we go with the string, then we emit headers that disagree with the VCF spec, which seems wrong. If we go with Int, then we can't handle VCFs that come in from the wild. If we went with say:
union { null, int } phaseSetId = null;
union { null, string } phaseSetStringId = null;
Then we'd need to figure out at save whether the user provided string/int/no phase set IDs.
I don't think there's a right/wrong answer, but we see this elsewhere (bigdatagenomics/adam#1213), so we should hash out our philosophy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, we should go with what @heuermh argued over in #1213. :)
src/main/resources/avro/bdg.avdl
Outdated
union { null, boolean } filtersPassed = null; | ||
|
||
/** | ||
Zero or more filters that failed for this variant. VCF genotype field reserved key FT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should "..that failed for this variant" be "..that failed for this genotype call"?
Also applies to above field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! fixed it
b37f21a
to
da05b5b
Compare
Test PASSed. |
Test PASSed. |
361d2ab
to
816f29b
Compare
Test PASSed. |
816f29b
to
f2b4ada
Compare
Test PASSed. |
Things got hairy on the last rebase, so I'm not 100% sure this is currently the way I want it |
SGTM. Ping me when you'd like me to make a review pass. |
f045d23
to
40c4793
Compare
Test PASSed. |
Test PASSed. |
Fixes #106 |
d9b71f1
to
be8f76b
Compare
Test PASSed. |
be8f76b
to
3687814
Compare
Test PASSed. |
3687814
to
acb8f41
Compare
Test PASSed. |
Test PASSed. |
Most recent changes:
I have reservations about log-scaling, in that users are not likely to look at the documentation, and will be confused when e.g. GL and I'm still not sure what to do re: #108 (comment) which may be hidden above. We currently set the |
Test PASSed. |
Resolved thread around flattening variant fields on |
Replaced by #176. |
Starting point for discussion around
Genotype
andGenotypeAnnotation
. This is a tear-down-and-rebuild, whereVariantCallingAnnotations
has been removed and the bare minimum put back in.