-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update feature and related records to support GFF3 #82
Conversation
Test PASSed. |
Test PASSed. |
I needed #81 to build this locally (thanks); may have additional doc changes. For example, what is going on here in the javadocs?
|
Never mind, fixed it |
Test PASSed. |
Test PASSed. |
Pushed work-in-progress commit here bigdatagenomics/adam@8c212da |
Okay, so two high-level comments:
|
|
||
/** | ||
The type of feature this is (aka, "track"). | ||
Display name for this feature. Name tag in GFF3, optional column 4 "name" | ||
in BED format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add example like in the comments above for OntologyTerm
etc?
Per the GFF3 spec: I figure if the data take advantage of the distinction, representing some links as Dbxref and some as Ontology_term, we should not lose that information.
Yes, this pull request models GFF3 most closely but also allows for things specific to other formats (i.e. The way I understand it, with Parquet nulled fields won't increase storage requirements on disk, so having wider but sparse Feature records shouldn't impact performance. If overhead in RAM is a concern, one can limit by projection. |
Overall I'm fine with this patch, though I also am not crazy about tying ourselves too closely to a particular file format. |
A minimal version of this patch would update Strand as above, add |
Test PASSed. |
Test PASSed. |
Hold on this one, avro |
Broadly, I concur with @laserson -- I think tying ourselves to a particular file format for the "Feature" record isn't great. I'd prefer a more stripped down Feature or, failing that, renaming Feature to be more clear what it is and what it represents, either But we've been over both of these objections before, so if neither of them is acceptable to the broader group then I will accept this PR as is. |
I've taken care to document how other formats fit the fields in the schema, so I wouldn't say that it is GFF3-specific. Rather it was driven by a design principle of preferring nullable fields to attributes in a map. Copying a comment I made in another thread: From what I understand, storing data in attributes is inefficient given how ADAM and Spark operate on schema records (projections, filters, column compression on disk, SparkSQL queries, etc.). For example, it is not possible to project only a single key from the attributes field. This inefficiency, plus the note above about |
This is the classic tradeoff of how strongly typed we want to be. It seems to me that one issue is that the read/variant types are more "mature" than the feature type, since I don't think as much software is cranking on the feature types. Would it be possible to mark it somehow as less mature and more subject to change? |
I strongly advocate for this design principle. We've promoted many fields in
Sounds reasonable to me; maybe just add a sentence in the docs "The Feature schema is experimental and subject to change"? |
In any case, attributes with duplicate keys are going to get clobbered. Is this a reasonable workaround?
And is there an equivalent to |
In the latest commit to ADAM bigdatagenomics/adam@8e2d786 I've added special handling for |
Thanks @heuermh! Merged. |
Thank you! |
Maximally disruptive proposal for refactoring feature and related records to support GFF3.
If desirable, I could also propose a minimally disruptive refactoring, but where's the fun in that?
I included a
Contig
→contigName
change as elsewhere.