Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genome annotations aren't very generic #187

Closed
trvrb opened this issue Jul 23, 2018 · 16 comments
Closed

Genome annotations aren't very generic #187

trvrb opened this issue Jul 23, 2018 · 16 comments
Labels
schema v6 Issues relating to augur v6 / auspice v2

Comments

@trvrb
Copy link
Member

trvrb commented Jul 23, 2018

Currently, genome_annotations looks like:

  "genome_annotations": {
    "nuc": {
      "start": 1,
      "end": 10769,
      "strand": 1
    },
    "CA": {
      "start": 90,
      "end": 456,
      "strand": 1
    },

The schema enforces that each gene has a start, end and strand. I'm assuming we will eventually deal with cases of multi-exon genes, etc... where start, end and strand isn't sufficient. Nothing we have currently demands this I believe. I don't know if necessary to address.

@trvrb trvrb added the schema label Jul 23, 2018
@trvrb trvrb changed the title Genome annotations aren't generic Genome annotations aren't very generic Jul 23, 2018
@tsibley
Copy link
Member

tsibley commented Jul 23, 2018

If we ever want to genericize genomic annotations, I advise that we borrow from standards like GFF3 and the hard thought that many people have put into it. We don't want to poorly re-invent this wheel.

In the meantime, it's reasonable not to support fully generic genome annotations; they let us ignore a lot of complexity that we don't currently care about.

@trvrb
Copy link
Member Author

trvrb commented Jul 23, 2018

Okay. Seems reasonable to wait and address this thoroughly when we do so.

@jameshadfield jameshadfield added the priority: low To be resolved after high and moderate priority issues label Jun 17, 2019
@jameshadfield jameshadfield added v6 Issues relating to augur v6 / auspice v2 and removed priority: low To be resolved after high and moderate priority issues labels Jul 18, 2019
@jameshadfield
Copy link
Member

@joverlee521
Copy link
Contributor

joverlee521 commented Aug 13, 2019

@jameshadfield I'm a little confused on what we gain from moving to GFF annotations. It seems like we are changing to one-based coordinates in within augur and then changing back to zero-based coordinates within auspice. This seems to leave a lot of room for off-by-one errors in the future.

With @rneher 's changes within augur.translate, we are converting from zero-based coordinates returned from SeqIO.read().features to one-based coordinates (literally just adding one). Then to have minimal changes to the React code in auspice, I change the coordinates back to zero-based within src/utils/entropyCreateStateFromJsons.js

@tsibley
Copy link
Member

tsibley commented Aug 13, 2019

@joverlee521 and I chatted about this a bit in person. I'm in agreement that it seems a bit counter-productive at the moment to switch to a GFF-style coordinate system. BED-style coordinates are easier to do simple calculations on are what Augur/Auspice natively want. What's the motivating reason for switching to GFF in particular?¹

I think we could keep the BED-style coordinate system we're already using and still switch strand to +/- (conforming to BED and GFF3) and add the seqid (or what BED calls chrom) and type columns.


¹ I realize I may have been the accidental precipitant of this due to my comment on this issue from a year ago. 😬 When I made that suggestion, I was thinking about not redoing the work of how to handle more complex annotations like multi-exon genes and feature hierarchies.

@joverlee521
Copy link
Contributor

For more detail, here's the coordinates that I've been able to decipher (please correct me if I'm wrong!):

  • augur NT's = currently zero-based, GFF format would change that to one-based
  • augur AA's = currently one-based, see augur.translate
  • auspice NT's = currently zero-based, with GFF format:
    1. Change back to zero-based during when creating entropy state from JSONS
    2. Change React code to handle one-based
  • auspice AA's = currently one-based

@jameshadfield
Copy link
Member

jameshadfield commented Aug 14, 2019

@joverlee521 could you please do some more digging into this -- all of the following is using augur & auspice master branches & the zika-tutorial build.

Zika at nucleotide position 3 (the "3" is specified in the URL -- c=gt-nuc_3 -- the legend title & the entropy hover box) shows a basal state of A with one clade having C. This is drawn from zika_tree.json via "muts": ["A3C", ...], which itself is taken from the nt_muts.json (same numbering). Looking at the alignment (results/aligned.fasta), which has been trimmed to the reference sequence (therefore the co-ordinates are the same) these nucleotides represent the third position in a one-based numbering system.

Could you double check this, as you say that auspice nucleotides are 0-based.

image

image

P.S. for the purposes of this issue ignore the fact that augur is inferring the base for unknown nucleotides, which is an issue unto itself. "VEN/UF_1/2016" carries the A->C mutation.

P.P.S. I agree that for amino acid mutations, augur (aa_muts.json and zika_tree.json) and auspice all use one-based numbering.

@jameshadfield
Copy link
Member

jameshadfield commented Aug 14, 2019

To be clear -- I don't believe we should change our syntax for mutations, which I believe to be one-based and correct (I got rather worried with your above findings!).

This proposal is to move to 1-based, GFF numbering for the annotations in the JSONs. I don't think this is controversial and makes perfect sense to me. For instance:

Currently, the refernce genbank file (1-based) used by the build has:

     source          1..10769
     ...
     CDS             961..2472
                     /product="envelope protein"
                     /gene="ENV"

But we export that (aa_muts.json and zika_meta.json) as:

  "nuc": {
   "end": 10769,
   "start": 0,
   "strand": 1
  },
  "ENV": {
   "end": 2472,
   "start": 960,
   "strand": 1
  }

I believe our syntax here should mirror the genbank numbering (v1 JSONs should still use 0-based starts for backwards compatibility, which is why we need to modify auspice's conversion code as well as the auspice feature parsing code).

@rneher
Copy link
Member

rneher commented Aug 14, 2019

My take here is:

  • both genbank and gff use one-based numbering.
  • genbank reference files require some tweaking before they work for us (CDS/gene/locustag fields). Genbank is also a very brittle format, I'd be more than happy to move away from it and gradually replace augur input by fasta + gff.
  • Javascript and Python use 0-based numbering. So internally we'll need to subtract one (in auspice this will become an issue the moment we provide a reference sequence for non-variable nucleotides. This will likely be an array with 0-based numbering.)
  • the annotation field in the json should mirror the input format as much as possible IMO. Hence the proposal to use field names and conventions as GFF (or subset thereof).

@emmahodcroft
Copy link
Member

I've gotten really mixed up on the numbering of NT at various output stages of augur and auspice. (I think I wrote about this previously last time I was in Seattle - I never was able to reconcile what it seemed was supposed to be the numbering vs what was.)

One-based numbering is the standard when talking about sequences, and I'd be strongly in favour of having all input and output use this - it makes everything else much easier. We can convert within the code to zero-based (and will have to), but that's something we and users should only have to worry about if digging into the guts - not when comparing input Genbank to Fasta sequence to nut_muts JSON to export JSON to auspice....

@joverlee521
Copy link
Contributor

@jameshadfield Sorry, I should have been more specific. I was referring to just the gene display within the entropy panel, not the syntax for mutations. The genome_annotations in the JSON passed to auspice is zero-based, and I couldn't find where that gets converted to one-based for display.

@rneher @emmahodcroft if one-based coordinates is standard for sequences and more user friendly, then I understand the push for GFF format. We just need to document very clearly that all inputs and outputs are one-based. I'm inclined to expect things to be zero-based when it comes to software outputs 😄

@tsibley
Copy link
Member

tsibley commented Aug 14, 2019

Maybe we can chat about this more on the Nextstrain call tomorrow?

I think it's useful to make a distinction between coordinates used for display/selection purposes (for which 1-based, fully closed intervals are more natural) and coordinates used internally (which more naturally uses 0-based, half-open intervals).

@jameshadfield
Copy link
Member

This is really simple everyone. As it stands in the exported JSONs, nuc & aa mutations are 1-based. Gene & nuc annotations have 0-based starts. They should have 1-based starts. That's it.

@trvrb
Copy link
Member Author

trvrb commented Aug 14, 2019

I agree that all sequence references in IO files should be 1-based.

@emmahodcroft
Copy link
Member

if one-based coordinates is standard for sequences and more user friendly, then I understand the push for GFF format. We just need to document very clearly that all inputs and outputs are one-based. I'm inclined to expect things to be zero-based when it comes to software outputs 😄

I don't know anyone's biology background so this isn't meant to be patronising! Please excuse me if this is repeating what you already know 🙂

But as a bit of background, almost everything when dealing with sequences is one-based - if you open a sequence viewer, it'll by default number from one, and Genbank & GFF annotations are given from 1. Perhaps most importantly, when people talk about important mutations (A->G mutation at base 1035; M->I mutation at position 50 in env) in publications and etc, they'll be using a one-based system. So the expectation is a bit different here than if we were coming up with an original output.

We should definitely document that we switch between what we use internally vs what we output, but for most people using our software, 1-based output is exactly what they will be expecting 🙂

@jameshadfield
Copy link
Member

#354

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
schema v6 Issues relating to augur v6 / auspice v2
Projects
None yet
Development

No branches or pull requests

6 participants