Update feature and related records to support GFF3 #82

heuermh · 2016-05-20T17:41:27Z

Maximally disruptive proposal for refactoring feature and related records to support GFF3.

If desirable, I could also propose a minimally disruptive refactoring, but where's the fun in that?

I also included the Contig → contigName rename, as elsewhere.

AmplabJenkins · 2016-05-20T17:42:32Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/93/

fnothaft · 2016-05-20T17:44:03Z

This doesn't seem too disruptive to me. I +1. CC @laserson @tdanford for review.

AmplabJenkins · 2016-05-20T17:47:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/94/

heuermh · 2016-05-20T17:51:19Z

I needed #81 to build this locally (thanks); may have additional doc changes.

For example, what is going on here in the javadocs?

@Deprecated
public String contigName

Deprecated. 
Source of this feature. Column 2 "source" in GFF3. union { null, string } source; /**
Feature type. Column 3 "type" in GFF3. union { null, string } type; /** Contig this
feature is located on. Column 1 "seqid" in GFF3, column 1 "chrom" in BED format.

heuermh · 2016-05-20T17:55:02Z

Never mind, fixed it

AmplabJenkins · 2016-05-20T17:57:22Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/95/

AmplabJenkins · 2016-05-20T23:02:26Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/96/

heuermh · 2016-05-20T23:24:49Z

Pushed a work-in-progress commit here: bigdatagenomics/adam@8c212da

tdanford · 2016-05-24T12:51:48Z

Okay, so two high-level comments:

1. OntologyTerm and Dbxref seem duplicative? Why not just a Dbxref?
2. Feature seems pretty GFF3-specific, is that intended? Most of those fields aren't present in most feature file formats (and ad hoc formats).

laserson · 2016-05-24T14:09:32Z

src/main/resources/avro/bdg.avdl


 /**
- The type of feature this is (aka, "track").
+ Display name for this feature. Name tag in GFF3, optional column 4 "name"
+ in BED format.


Add example like in the comments above for OntologyTerm etc?

heuermh · 2016-05-24T14:11:01Z

> OntologyTerm and Dbxref seem duplicative? Why not just a Dbxref?

Per the GFF3 spec:
"Two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database. Ontology_term is reserved for associations to ontologies, such as the Gene Ontology. Dbxref is used for all other cross references. While there is no firm boundary line between these two concepts, curators tend to treat ontology associations differently and hence ontology terms have been given their own reserved attribute label."

I figure if the data take advantage of the distinction, representing some links as Dbxref and some as Ontology_term, we should not lose that information.
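For illustration only, here is a minimal Python sketch of keeping the two link types apart when reading the GFF3 attributes column. The function name `parse_gff3_attributes` and the sample attribute values are assumptions made up for this example; this is not ADAM's parser.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3 attributes column into {tag: [values]}."""
    parsed = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, raw = pair.partition("=")
        # GFF3 separates multiple values for one tag with commas
        # and percent-encodes reserved characters inside values.
        parsed.setdefault(tag, []).extend(unquote(v) for v in raw.split(","))
    return parsed


# Illustrative attribute column (values chosen only for this example).
attrs = parse_gff3_attributes(
    "ID=gene0001;Dbxref=EMBL:AA816246,NCBI_gi:10727410;Ontology_term=GO:0046703"
)

# Keep the two kinds of links in separate fields rather than merging them.
dbxrefs = attrs.get("Dbxref", [])
ontology_terms = attrs.get("Ontology_term", [])
print(dbxrefs)
print(ontology_terms)
```

If both lists were folded into a single field, a writer could no longer tell which links to emit under which reserved tag.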

> Feature seems pretty GFF3-specific, is that intended? Most of those fields aren't present in most feature file formats (and ad hoc formats)

Yes, this pull request models GFF3 most closely but also allows for things specific to other formats (e.g. frame for GFF2/GTF). I considered adding geneId and transcriptId for the required gene_id and transcript_id attribute tags in GFF2/GTF. (edit: Now that I've written that, I'm leaning toward including them, and possibly also exonId. Isn't it better to have fields that are easily queried than to have to pull things out of attributes?)

As I understand it, null fields don't increase storage requirements on disk with Parquet, so having wider but sparse Feature records shouldn't impact performance. If overhead in RAM is a concern, one can limit the fields that are loaded by projection.
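To make the "fields rather than attributes" idea concrete, here is a rough Python sketch that promotes gene_id, transcript_id, and exon_id out of a GTF-style attributes column. `FeatureSketch` and `from_gtf_attributes` are hypothetical names for this example, not the generated Avro classes, and the parsing is deliberately simplified: it only handles quoted values and assumes no `;` inside a value.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

# Matches one GTF-style pair such as: gene_id "G1"
_GTF_PAIR = re.compile(r'\s*(\S+)\s+"([^"]*)"')

# Attribute tags that get their own nullable field in this sketch.
PROMOTED = {"gene_id", "transcript_id", "exon_id"}


@dataclass
class FeatureSketch:
    gene_id: Optional[str] = None
    transcript_id: Optional[str] = None
    exon_id: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)


def from_gtf_attributes(column9):
    feature = FeatureSketch()
    for chunk in column9.split(";"):
        match = _GTF_PAIR.match(chunk)
        if match is None:
            continue  # skip empty or unquoted chunks
        key, value = match.groups()
        if key in PROMOTED:
            setattr(feature, key, value)
        else:
            feature.attributes[key] = value
    return feature


feature = from_gtf_attributes('gene_id "G1"; transcript_id "T1"; exon_number "2";')
print(feature)
```

A query such as "all features for gene G1" can then test `gene_id` directly; fields that are absent simply stay null.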

laserson · 2016-05-24T14:16:06Z

Overall I'm fine with this patch, though I also am not crazy about tying ourselves too closely to a particular file format.

heuermh · 2016-05-24T14:25:10Z

A minimal version of this patch would update Strand as above, add frame as a field so that it isn't lost in GFF2/GTF parsing, and remove the dbxrefs field since it is not used consistently across the formats. That would keep all the interesting bits in the attributes field.

AmplabJenkins · 2016-05-24T16:37:18Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/97/

AmplabJenkins · 2016-06-03T18:17:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/100/

heuermh · 2016-06-07T01:28:26Z

Hold off on this one: an Avro map<string> for attributes does not preserve cardinality for repeated keys. Need to think on it...
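A small Python sketch of the problem, using a plain dict as a stand-in for the Avro map (the helper name `to_map` and the sample pairs are made up for illustration):

```python
def to_map(pairs):
    """Load key/value pairs into a map, as a map-typed field would."""
    result = {}
    for key, value in pairs:
        result[key] = value  # a later value silently replaces an earlier one
    return result


# Three attribute pairs, two of which share the key "Parent".
pairs = [("Parent", "mRNA00001"), ("Parent", "mRNA00002"), ("Note", "example")]
as_map = to_map(pairs)
print(len(pairs), len(as_map))
print(as_map)
```

Only one of the two Parent values survives, so the original record cannot be reconstructed from the map.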

tdanford · 2016-06-07T11:42:38Z

Broadly, I concur with @laserson -- I think tying ourselves to a particular file format for the "Feature" record isn't great. I'd prefer a more stripped-down Feature or, failing that, renaming Feature to make clearer what it is and what it represents: GFF3Feature, MODDbFeature, or something along those lines.

But we've been over both of these objections before, so if neither of them is acceptable to the broader group then I will accept this PR as is.

heuermh · 2016-06-07T14:48:57Z

I've taken care to document how other formats fit the fields in the schema, so I wouldn't say that it is GFF3-specific. Rather it was driven by a design principle of preferring nullable fields to attributes in a map.

Copying a comment I made in another thread:

From what I understand, storing data in attributes is inefficient given how ADAM and Spark operate on schema records (projections, filters, column compression on disk, SparkSQL queries, etc.). For example, it is not possible to project only a single key from the attributes field.

This inefficiency, plus the note above about map<string> for attributes not allowing multiple values for the same key, leads me to believe that design principle is sound. I am of course open to being proved otherwise. :)

laserson · 2016-06-07T14:56:39Z

This is the classic tradeoff of how strongly typed we want to be. It seems to me that one issue is that the read/variant types are more "mature" than the feature type, since I don't think as much software is cranking on the feature types. Would it be possible to mark it somehow as less mature and more subject to change?

fnothaft · 2016-06-07T15:37:43Z

> Rather it was driven by a design principle of preferring nullable fields to attributes in a map.

I strongly advocate for this design principle. We've promoted many fields in AlignmentRecord or Genotype according to this philosophy.

> Would it be possible to mark it somehow as less mature and more subject to change?

Sounds reasonable to me; maybe just add a sentence in the docs "The Feature schema is experimental and subject to change"?

heuermh · 2016-06-07T16:09:15Z

In any case, attributes with duplicate keys are going to get clobbered. Is this a reasonable workaround?

record Feature {
  // ...
  array<Attribute> attributes = [];
}

record Attribute {
  string key;
  string value;
}

And is there an equivalent to Multimap<K,V> in Scala?
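As a sketch of how the array-of-records layout could be consumed, written in Python rather than Scala: the `Attribute` tuple mirrors the record above, while `group_by_key` is a hypothetical helper that derives a multimap-style view on demand.

```python
from collections import defaultdict
from typing import NamedTuple


class Attribute(NamedTuple):
    key: str
    value: str


def group_by_key(attributes):
    """Build a multimap-style view: each key maps to all of its values, in order."""
    grouped = defaultdict(list)
    for attribute in attributes:
        grouped[attribute.key].append(attribute.value)
    return dict(grouped)


# Repeated keys are kept because the attributes are stored as a list of pairs.
attributes = [
    Attribute("Parent", "mRNA00001"),
    Attribute("Parent", "mRNA00002"),
    Attribute("Note", "example"),
]
multimap = group_by_key(attributes)
print(multimap)
```

The stored list keeps every pair and its order; the grouped view is only a convenience for lookups.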

heuermh · 2016-06-14T20:35:33Z

In the latest commit to ADAM, bigdatagenomics/adam@8e2d786, I've added special handling for the geneId, transcriptId, and exonId fields, and duplicate key support for attributes with reserved keys in the GFF3 specification.

fnothaft · 2016-06-15T18:37:44Z

Thanks @heuermh! Merged.

heuermh · 2016-06-15T19:38:49Z

Thank you!

heuermh force-pushed the maximal-gff3 branch from f7c6839 to 418ac14 on May 20, 2016 17:46

heuermh force-pushed the maximal-gff3 branch from 418ac14 to 76cc77c on May 20, 2016 17:54

Update feature and related records to support GFF3

7a093b3

heuermh force-pushed the maximal-gff3 branch from 76cc77c to 7a093b3 on May 20, 2016 22:57

laserson reviewed May 24, 2016

Add geneId, transcriptId, exonId fields to Feature; additional examples in docs

7bc9ee3

Add default values for nullable fields

76f2bec

heuermh mentioned this pull request Jun 3, 2016

[ADAM-710] Add saveAs methods for feature formats GTF, BED, IntervalList, and NarrowPeak bigdatagenomics/adam#998

Closed

heuermh mentioned this pull request Jun 7, 2016

Release ADAM version 0.20.0 bigdatagenomics/adam#1048

Closed

61 tasks

fnothaft merged commit 2c712bf into bigdatagenomics:master Jun 15, 2016

heuermh deleted the maximal-gff3 branch June 15, 2016 19:38
