Update feature and related records to support GFF3 #82

Merged
merged 3 commits into from
Jun 15, 2016

Conversation

heuermh
Member

@heuermh heuermh commented May 20, 2016

Maximally disruptive proposal for refactoring feature and related records to support GFF3.

If desirable, I could also propose a minimally disruptive refactoring, but where's the fun in that?

I included a Contig → contigName change as elsewhere.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/93/
Test PASSed.

@fnothaft
Member

This doesn't seem too disruptive to me. I +1. CC @laserson @tdanford for review.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/94/
Test PASSed.

@heuermh
Member Author

heuermh commented May 20, 2016

I needed #81 to build this locally (thanks); may have additional doc changes.

For example, what is going on here in the javadocs?

@Deprecated
public String contigName

Deprecated. 
Source of this feature. Column 2 "source" in GFF3. union { null, string } source; /**
Feature type. Column 3 "type" in GFF3. union { null, string } type; /** Contig this
feature is located on. Column 1 "seqid" in GFF3, column 1 "chrom" in BED format.

@heuermh
Member Author

heuermh commented May 20, 2016

Never mind, fixed it

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/95/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/96/
Test PASSed.

@heuermh
Member Author

heuermh commented May 20, 2016

Pushed a work-in-progress commit here: bigdatagenomics/adam@8c212da

@tdanford
Contributor

Okay, so two high-level comments:

  • OntologyTerm and Dbxref seem duplicative? Why not just a Dbxref?
  • Feature seems pretty GFF3-specific, is that intended? Most of those fields aren't present in most feature file formats (and ad hoc formats)


/**
The type of feature this is (aka, "track").
Display name for this feature. Name tag in GFF3, optional column 4 "name"
in BED format.
Contributor

Add example like in the comments above for OntologyTerm etc?

@heuermh
Member Author

heuermh commented May 24, 2016

> OntologyTerm and Dbxref seem duplicative? Why not just a Dbxref?

Per the GFF3 spec:
"Two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database. Ontology_term is reserved for associations to ontologies, such as the Gene Ontology. Dbxref is used for all other cross references. While there is no firm boundary line between these two concepts, curators tend to treat ontology associations differently and hence ontology terms have been given their own reserved attribute label."

I figure if the data take advantage of the distinction, representing some links as Dbxref and some as Ontology_term, we should not lose that information.
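
To make the distinction concrete, here is a minimal Java sketch of reading the attributes column (column 9) of a GFF3 line and keeping the two kinds of links apart. The names `Gff3Links` and `valuesFor` are hypothetical and not part of bdg-formats or ADAM, the sample attribute string is only illustrative, and percent-decoding of escaped characters is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of bdg-formats or ADAM.
final class Gff3Links {
    private Gff3Links() {}

    /**
     * Returns all values for one tag in a GFF3 attributes column.
     * GFF3 separates tag=value pairs with ';' and multiple values with ','.
     * Percent-decoding of escaped characters is intentionally omitted.
     */
    static List<String> valuesFor(String attributes, String tag) {
        List<String> values = new ArrayList<>();
        for (String pair : attributes.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip empty or malformed entries
            }
            if (pair.substring(0, eq).trim().equals(tag)) {
                for (String v : pair.substring(eq + 1).split(",")) {
                    if (!v.isEmpty()) {
                        values.add(v);
                    }
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String column9 =
            "ID=gene0001;Dbxref=EMBL:AA816246,NCBI_gi:10727410;Ontology_term=GO:0046703";
        // Each reserved tag is read into its own list, so the curator's choice is kept.
        System.out.println("dbxrefs       = " + valuesFor(column9, "Dbxref"));
        System.out.println("ontologyTerms = " + valuesFor(column9, "Ontology_term"));
    }
}
```

Reading each reserved tag into its own list is what lets a record keep separate fields for the two kinds of cross references instead of merging them.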

> Feature seems pretty GFF3-specific, is that intended? Most of those fields aren't present in most feature file formats (and ad hoc formats)

Yes, this pull request models GFF3 most closely but also allows for things specific to other formats (e.g. frame for GFF2/GTF). I considered adding geneId and transcriptId fields for the required gene_id and transcript_id attribute tags in GFF2/GTF. (edit: Now that I've written that, I'm leaning toward including them, and possibly also exonId. Isn't it better to have fields that are easily queried than to have to pull things out of attributes?)

The way I understand it, null fields in Parquet won't increase storage requirements on disk, so having wider but sparse Feature records shouldn't impact performance. If overhead in RAM is a concern, one can limit it by projection.

@laserson
Contributor

Overall I'm fine with this patch, though I also am not crazy about tying ourselves too closely to a particular file format.

@heuermh
Member Author

heuermh commented May 24, 2016

A minimal version of this patch would update Strand as above, add frame as a field so that it isn't lost in GFF2/GTF parsing, and remove the dbxrefs field since it is not used consistently across the formats. That would keep all the interesting bits in the attributes field.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/97/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/bdg-formats-prb/100/
Test PASSed.

@heuermh
Member Author

heuermh commented Jun 7, 2016

Hold on this one: an Avro map<string> for attributes does not preserve cardinality for repeated keys. Need to think on it...
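
A minimal Java illustration of the problem, assuming a parser that simply calls `put` for every key-value pair it reads. The class name `RepeatedKeys` and the sample keys are hypothetical. Like any `java.util.Map`, the map keeps only the last value seen for a repeated key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; the keys and values are made up for illustration.
final class RepeatedKeys {
    private RepeatedKeys() {}

    /** Loads key-value pairs into a map; a later value replaces an earlier one. */
    static Map<String, String> toMap(List<String[]> pairs) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String[] kv : pairs) {
            map.put(kv[0], kv[1]);
        }
        return map;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] {"tag", "first"});
        pairs.add(new String[] {"tag", "second"}); // same key as the previous pair
        pairs.add(new String[] {"gene_id", "g1"});

        Map<String, String> map = toMap(pairs);
        // Three pairs go in, but the repeated key collapses to a single entry.
        System.out.println(pairs.size() + " pairs in, " + map.size() + " entries out");
        System.out.println("tag = " + map.get("tag"));
    }
}
```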

@tdanford
Contributor

tdanford commented Jun 7, 2016

Broadly, I concur with @laserson -- I think tying ourselves to a particular file format for the "Feature" record isn't great. I'd prefer a more stripped down Feature or, failing that, renaming Feature to be more clear what it is and what it represents, either GFF3Feature or MODDbFeature or something along those lines.

But we've been over both of these objections before, so if neither of them is acceptable to the broader group then I will accept this PR as is.

@heuermh
Member Author

heuermh commented Jun 7, 2016

I've taken care to document how other formats fit the fields in the schema, so I wouldn't say that it is GFF3-specific. Rather it was driven by a design principle of preferring nullable fields to attributes in a map.

Copying a comment I made in another thread:

From what I understand, storing data in attributes is inefficient given how ADAM and Spark operate on schema records (projections, filters, column compression on disk, SparkSQL queries, etc.). For example, it is not possible to project only a single key from the attributes field.

This inefficiency, plus the note above about map<string> for attributes not allowing multiple values for the same key, leads me to believe that design principle is sound. I am of course open to being proved otherwise. :)

@laserson
Contributor

laserson commented Jun 7, 2016

This is the classic tradeoff of how strongly typed we want to be. It seems to me that one issue is that the read/variant types are more "mature" than the feature type, since I don't think as much software is cranking on the feature types. Would it be possible to mark it somehow as less mature and more subject to change?

@fnothaft
Member

fnothaft commented Jun 7, 2016

> Rather it was driven by a design principle of preferring nullable fields to attributes in a map.

I strongly advocate for this design principle. We've promoted many fields in AlignmentRecord or Genotype according to this philosophy.

> Would it be possible to mark it somehow as less mature and more subject to change?

Sounds reasonable to me; maybe just add a sentence in the docs "The Feature schema is experimental and subject to change"?

@heuermh
Member Author

heuermh commented Jun 7, 2016

In any case, attributes with duplicate keys are going to get clobbered. Is this a reasonable workaround?

record Feature {
  // ... other Feature fields elided ...
  array<Attribute> attributes = [];
}

record Attribute {
  string key;
  string value;
}

And is there an equivalent to Multimap<K,V> in Scala?
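
One way to get multimap-like lookups without a dedicated type is to group the list of key-value records into a map from key to list of values. The sketch below is in Java and is only an illustration: `Attribute` here is a hand-written stand-in for the Avro record proposed above, not the generated class, and `AttributeIndex` is a hypothetical helper. The same grouping idea should carry over to Scala collections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hand-written stand-in for the proposed Avro record; not the generated class.
final class Attribute {
    final String key;
    final String value;

    Attribute(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

// Hypothetical helper for illustration.
final class AttributeIndex {
    private AttributeIndex() {}

    /** Groups attributes by key, keeping every value in input order. */
    static Map<String, List<String>> group(List<Attribute> attributes) {
        Map<String, List<String>> byKey = new LinkedHashMap<>();
        for (Attribute a : attributes) {
            // Create the value list on first sight of a key, then append.
            byKey.computeIfAbsent(a.key, k -> new ArrayList<>()).add(a.value);
        }
        return byKey;
    }

    public static void main(String[] args) {
        List<Attribute> attributes = Arrays.asList(
            new Attribute("tag", "first"),
            new Attribute("tag", "second"),
            new Attribute("gene_id", "g1"));
        System.out.println(group(attributes));
    }
}
```

Storing attributes as a list keeps duplicates on disk, and the grouped view can be built on demand when lookups by key are needed.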

@heuermh
Member Author

heuermh commented Jun 14, 2016

In the latest commit to ADAM bigdatagenomics/adam@8e2d786 I've added special handling for geneId, transcriptId, and exonId fields and duplicate key support for attributes with reserved keys in the GFF3 specification.

@fnothaft fnothaft merged commit 2c712bf into bigdatagenomics:master Jun 15, 2016
@fnothaft
Member

Thanks @heuermh! Merged.

@heuermh
Member Author

heuermh commented Jun 15, 2016

Thank you!

@heuermh heuermh deleted the maximal-gff3 branch June 15, 2016 19:38
5 participants