Genome annotations aren't very generic #187

trvrb · 2018-07-23T18:57:48Z

Currently, genome_annotations looks like:

  "genome_annotations": {
    "nuc": {
      "start": 1,
      "end": 10769,
      "strand": 1
    },
    "CA": {
      "start": 90,
      "end": 456,
      "strand": 1
    },

The schema enforces that each gene has a start, end and strand. I'm assuming we will eventually deal with cases of multi-exon genes, etc... where start, end and strand isn't sufficient. Nothing we have currently demands this I believe. I don't know if necessary to address.

The text was updated successfully, but these errors were encountered:

tsibley · 2018-07-23T20:35:33Z

If we ever want to genericize genomic annotations, I advise that we borrow from standards like GFF3 and the hard thought that many people have put into it. We don't want to poorly re-invent this wheel.

In the meantime, it's reasonable not to support fully generic genome annotations; they let us ignore a lot of complexity that we don't currently care about.

trvrb · 2018-07-23T22:38:10Z

Okay. Seems reasonable to wait and address this thoroughly when we do so.

jameshadfield · 2019-07-18T02:36:20Z

Will be closed in v6 by https://github.com/nextstrain/augur/commits/gff_annotations

joverlee521 · 2019-08-13T19:15:04Z

@jameshadfield I'm a little confused on what we gain from moving to GFF annotations. It seems like we are changing to one-based coordinates in within augur and then changing back to zero-based coordinates within auspice. This seems to leave a lot of room for off-by-one errors in the future.

With @rneher 's changes within augur.translate, we are converting from zero-based coordinates returned from SeqIO.read().features to one-based coordinates (literally just adding one). Then to have minimal changes to the React code in auspice, I change the coordinates back to zero-based within src/utils/entropyCreateStateFromJsons.js

tsibley · 2019-08-13T19:31:58Z

@joverlee521 and I chatted about this a bit in person. I'm in agreement that it seems a bit counter-productive at the moment to switch to a GFF-style coordinate system. BED-style coordinates are easier to do simple calculations on are what Augur/Auspice natively want. What's the motivating reason for switching to GFF in particular?¹

I think we could keep the BED-style coordinate system we're already using and still switch strand to +/- (conforming to BED and GFF3) and add the seqid (or what BED calls chrom) and type columns.

¹ I realize I may have been the accidental precipitant of this due to my comment on this issue from a year ago. 😬 When I made that suggestion, I was thinking about not redoing the work of how to handle more complex annotations like multi-exon genes and feature hierarchies.

joverlee521 · 2019-08-13T23:46:54Z

For more detail, here's the coordinates that I've been able to decipher (please correct me if I'm wrong!):

augur NT's = currently zero-based, GFF format would change that to one-based
augur AA's = currently one-based, see augur.translate
auspice NT's = currently zero-based, with GFF format:
1. Change back to zero-based during when creating entropy state from JSONS
2. Change React code to handle one-based
auspice AA's = currently one-based

jameshadfield · 2019-08-14T03:15:38Z

@joverlee521 could you please do some more digging into this -- all of the following is using augur & auspice master branches & the zika-tutorial build.

Zika at nucleotide position 3 (the "3" is specified in the URL -- c=gt-nuc_3 -- the legend title & the entropy hover box) shows a basal state of A with one clade having C. This is drawn from zika_tree.json via "muts": ["A3C", ...], which itself is taken from the nt_muts.json (same numbering). Looking at the alignment (results/aligned.fasta), which has been trimmed to the reference sequence (therefore the co-ordinates are the same) these nucleotides represent the third position in a one-based numbering system.

Could you double check this, as you say that auspice nucleotides are 0-based.

P.S. for the purposes of this issue ignore the fact that augur is inferring the base for unknown nucleotides, which is an issue unto itself. "VEN/UF_1/2016" carries the A->C mutation.

P.P.S. I agree that for amino acid mutations, augur (aa_muts.json and zika_tree.json) and auspice all use one-based numbering.

jameshadfield · 2019-08-14T03:40:42Z

To be clear -- I don't believe we should change our syntax for mutations, which I believe to be one-based and correct (I got rather worried with your above findings!).

This proposal is to move to 1-based, GFF numbering for the annotations in the JSONs. I don't think this is controversial and makes perfect sense to me. For instance:

Currently, the refernce genbank file (1-based) used by the build has:

     source          1..10769
     ...
     CDS             961..2472
                     /product="envelope protein"
                     /gene="ENV"

But we export that (aa_muts.json and zika_meta.json) as:

  "nuc": {
   "end": 10769,
   "start": 0,
   "strand": 1
  },
  "ENV": {
   "end": 2472,
   "start": 960,
   "strand": 1
  }

I believe our syntax here should mirror the genbank numbering (v1 JSONs should still use 0-based starts for backwards compatibility, which is why we need to modify auspice's conversion code as well as the auspice feature parsing code).

rneher · 2019-08-14T07:08:37Z

My take here is:

both genbank and gff use one-based numbering.
genbank reference files require some tweaking before they work for us (CDS/gene/locustag fields). Genbank is also a very brittle format, I'd be more than happy to move away from it and gradually replace augur input by fasta + gff.
Javascript and Python use 0-based numbering. So internally we'll need to subtract one (in auspice this will become an issue the moment we provide a reference sequence for non-variable nucleotides. This will likely be an array with 0-based numbering.)
the annotation field in the json should mirror the input format as much as possible IMO. Hence the proposal to use field names and conventions as GFF (or subset thereof).

emmahodcroft · 2019-08-14T08:13:09Z

I've gotten really mixed up on the numbering of NT at various output stages of augur and auspice. (I think I wrote about this previously last time I was in Seattle - I never was able to reconcile what it seemed was supposed to be the numbering vs what was.)

One-based numbering is the standard when talking about sequences, and I'd be strongly in favour of having all input and output use this - it makes everything else much easier. We can convert within the code to zero-based (and will have to), but that's something we and users should only have to worry about if digging into the guts - not when comparing input Genbank to Fasta sequence to nut_muts JSON to export JSON to auspice....

joverlee521 · 2019-08-14T16:52:00Z

@jameshadfield Sorry, I should have been more specific. I was referring to just the gene display within the entropy panel, not the syntax for mutations. The genome_annotations in the JSON passed to auspice is zero-based, and I couldn't find where that gets converted to one-based for display.

@rneher @emmahodcroft if one-based coordinates is standard for sequences and more user friendly, then I understand the push for GFF format. We just need to document very clearly that all inputs and outputs are one-based. I'm inclined to expect things to be zero-based when it comes to software outputs 😄

tsibley · 2019-08-14T17:15:00Z

Maybe we can chat about this more on the Nextstrain call tomorrow?

I think it's useful to make a distinction between coordinates used for display/selection purposes (for which 1-based, fully closed intervals are more natural) and coordinates used internally (which more naturally uses 0-based, half-open intervals).

jameshadfield · 2019-08-14T23:29:53Z

This is really simple everyone. As it stands in the exported JSONs, nuc & aa mutations are 1-based. Gene & nuc annotations have 0-based starts. They should have 1-based starts. That's it.

trvrb · 2019-08-14T23:38:08Z

I agree that all sequence references in IO files should be 1-based.

emmahodcroft · 2019-08-15T09:54:24Z

if one-based coordinates is standard for sequences and more user friendly, then I understand the push for GFF format. We just need to document very clearly that all inputs and outputs are one-based. I'm inclined to expect things to be zero-based when it comes to software outputs 😄

I don't know anyone's biology background so this isn't meant to be patronising! Please excuse me if this is repeating what you already know 🙂

But as a bit of background, almost everything when dealing with sequences is one-based - if you open a sequence viewer, it'll by default number from one, and Genbank & GFF annotations are given from 1. Perhaps most importantly, when people talk about important mutations (A->G mutation at base 1035; M->I mutation at position 50 in env) in publications and etc, they'll be using a one-based system. So the expectation is a bit different here than if we were coming up with an original output.

We should definitely document that we switch between what we use internally vs what we output, but for most people using our software, 1-based output is exactly what they will be expecting 🙂

jameshadfield · 2019-08-15T21:05:41Z

#354

trvrb added the schema label Jul 23, 2018

trvrb changed the title ~~Genome annotations aren't generic~~ Genome annotations aren't very generic Jul 23, 2018

jameshadfield added the priority: low To be resolved after high and moderate priority issues label Jun 17, 2019

jameshadfield added v6 Issues relating to augur v6 / auspice v2 and removed priority: low To be resolved after high and moderate priority issues labels Jul 18, 2019

jameshadfield mentioned this issue Jul 30, 2019

Move to GFF-style annotations nextstrain/auspice#757

Closed

jameshadfield mentioned this issue Aug 15, 2019

Gff annotations #354

Merged

jameshadfield closed this as completed Aug 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Genome annotations aren't very generic #187

Genome annotations aren't very generic #187

trvrb commented Jul 23, 2018 •

edited

Loading

tsibley commented Jul 23, 2018

trvrb commented Jul 23, 2018

jameshadfield commented Jul 18, 2019

joverlee521 commented Aug 13, 2019 •

edited

Loading

tsibley commented Aug 13, 2019

joverlee521 commented Aug 13, 2019

jameshadfield commented Aug 14, 2019 •

edited

Loading

jameshadfield commented Aug 14, 2019 •

edited

Loading

rneher commented Aug 14, 2019

emmahodcroft commented Aug 14, 2019

joverlee521 commented Aug 14, 2019

tsibley commented Aug 14, 2019

jameshadfield commented Aug 14, 2019

trvrb commented Aug 14, 2019

emmahodcroft commented Aug 15, 2019

jameshadfield commented Aug 15, 2019

Genome annotations aren't very generic #187

Genome annotations aren't very generic #187

Comments

trvrb commented Jul 23, 2018 • edited Loading

tsibley commented Jul 23, 2018

trvrb commented Jul 23, 2018

jameshadfield commented Jul 18, 2019

joverlee521 commented Aug 13, 2019 • edited Loading

tsibley commented Aug 13, 2019

joverlee521 commented Aug 13, 2019

jameshadfield commented Aug 14, 2019 • edited Loading

jameshadfield commented Aug 14, 2019 • edited Loading

rneher commented Aug 14, 2019

emmahodcroft commented Aug 14, 2019

joverlee521 commented Aug 14, 2019

tsibley commented Aug 14, 2019

jameshadfield commented Aug 14, 2019

trvrb commented Aug 14, 2019

emmahodcroft commented Aug 15, 2019

jameshadfield commented Aug 15, 2019

trvrb commented Jul 23, 2018 •

edited

Loading

joverlee521 commented Aug 13, 2019 •

edited

Loading

jameshadfield commented Aug 14, 2019 •

edited

Loading

jameshadfield commented Aug 14, 2019 •

edited

Loading