Genome annotations aren't very generic #187
Comments
If we ever want to genericize genomic annotations, I advise that we borrow from standards like GFF3 and the hard thought that many people have put into it. We don't want to poorly re-invent this wheel. In the meantime, it's reasonable not to support fully generic genome annotations; skipping them lets us ignore a lot of complexity that we don't currently care about.
Okay. Seems reasonable to wait and address this thoroughly when we do so.
Will be closed in v6 by https://github.com/nextstrain/augur/commits/gff_annotations
@jameshadfield I'm a little confused about what we gain from moving to GFF annotations. It seems like we are changing to one-based coordinates within augur and then changing back to zero-based coordinates within auspice. This seems to leave a lot of room for off-by-one errors in the future. With @rneher's changes within
@joverlee521 and I chatted about this a bit in person. I agree that it seems a bit counter-productive at the moment to switch to a GFF-style coordinate system. BED-style coordinates are easier to do simple calculations on and are what Augur/Auspice natively want. What's the motivating reason for switching to GFF in particular?¹ I think we could keep the BED-style coordinate system we're already using and still switch
¹ I realize I may have been the accidental precipitant of this due to my comment on this issue from a year ago. 😬 When I made that suggestion, I was thinking about not redoing the work of how to handle more complex annotations like multi-exon genes and feature hierarchies.
For more detail, here are the coordinates that I've been able to decipher (please correct me if I'm wrong!):
@joverlee521 could you please do some more digging into this -- all of the following is using augur & auspice master branches & the zika-tutorial build. Zika at nucleotide position 3 (the "3" is specified in the URL -- Could you double check this, as you say that auspice nucleotides are 0-based. P.S. for the purposes of this issue ignore the fact that augur is inferring the base for unknown nucleotides, which is an issue unto itself. "VEN/UF_1/2016" carries the A->C mutation. P.P.S. I agree that for amino acid mutations, augur (
To be clear -- I don't believe we should change our syntax for mutations, which I believe to be one-based and correct (I got rather worried with your above findings!). This proposal is to move to 1-based, GFF numbering for the annotations in the JSONs. I don't think this is controversial, and it makes perfect sense to me. For instance: Currently, the reference genbank file (1-based) used by the build has:
But we export that (
I believe our syntax here should mirror the genbank numbering (v1 JSONs should still use 0-based starts for backwards compatibility, which is why we need to modify auspice's conversion code as well as the auspice feature parsing code).
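To make the proposed change concrete, here is a minimal sketch of converting one annotation between the two conventions. The function names and the example gene are hypothetical, not taken from augur or auspice. It also assumes that the current `end` is the exclusive end of a 0-based, half-open interval, which is numerically the same as a 1-based inclusive end, so only `start` changes.

```python
def annotation_to_one_based(annotation):
    """Convert a 0-based-start annotation to a 1-based, inclusive start.

    Assumption: "end" is exclusive in the 0-based scheme, so it already
    equals the inclusive end in the 1-based scheme and is left unchanged.
    """
    converted = dict(annotation)  # copy so the caller's dict is not modified
    converted["start"] = annotation["start"] + 1
    return converted


def annotation_to_zero_based(annotation):
    """Inverse of annotation_to_one_based, e.g. for writing v1-style JSONs."""
    converted = dict(annotation)
    converted["start"] = annotation["start"] - 1
    return converted


# Hypothetical gene that a GenBank file would list as 108..473
v1_style = {"start": 107, "end": 473, "strand": "+"}
v2_style = annotation_to_one_based(v1_style)
print(v2_style)  # {'start': 108, 'end': 473, 'strand': '+'}
```

Keeping both directions as small, named helpers would put the off-by-one arithmetic in one place instead of scattering `+ 1` and `- 1` through the export and parsing code.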
My take here is:
I've gotten really mixed up on the numbering of NT at various output stages of augur and auspice. (I think I wrote about this previously last time I was in Seattle - I never was able to reconcile what it seemed was supposed to be the numbering vs what was.) One-based numbering is the standard when talking about sequences, and I'd be strongly in favour of having all input and output use this - it makes everything else much easier. We can convert within the code to zero-based (and will have to), but that's something we and users should only have to worry about if digging into the guts - not when comparing input Genbank to Fasta sequence to nt_muts JSON to export JSON to auspice....
@jameshadfield Sorry, I should have been more specific. I was referring to just the gene display within the entropy panel, not the syntax for mutations. The @rneher @emmahodcroft if one-based coordinates are standard for sequences and more user friendly, then I understand the push for GFF format. We just need to document very clearly that all inputs and outputs are one-based. I'm inclined to expect things to be zero-based when it comes to software outputs 😄
Maybe we can chat about this more on the Nextstrain call tomorrow? I think it's useful to make a distinction between coordinates used for display/selection purposes (for which 1-based, fully closed intervals are more natural) and coordinates used internally (which more naturally use 0-based, half-open intervals).
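The distinction between display and internal coordinates can be illustrated with a short sketch. The helper names and the toy sequence are assumptions made for this example only. A 1-based closed interval maps onto a Python slice by subtracting one from the start and leaving the end alone.

```python
def display_to_internal(start, end):
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def internal_to_display(start, end):
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end


genome = "ACGTACGTAC"  # toy sequence, 10 bases

# A user selects bases 3 through 6 (1-based, both ends included)
start0, end0 = display_to_internal(3, 6)
region = genome[start0:end0]  # half-open coordinates slice directly
length = end0 - start0        # no "+ 1" needed in the half-open form

print(region, length)  # GTAC 4
```

In the 1-based closed form the same length is `end - start + 1`, which is the kind of arithmetic that half-open intervals avoid internally.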
This is really simple everyone. As it stands in the exported JSONs, nuc & aa mutations are 1-based. Gene & nuc annotations have 0-based starts. They should have 1-based starts. That's it.
I agree that all sequence references in IO files should be 1-based.
I don't know anyone's biology background so this isn't meant to be patronising! Please excuse me if this is repeating what you already know 🙂 But as a bit of background, almost everything when dealing with sequences is one-based - if you open a sequence viewer, it'll by default number from one, and Genbank & GFF annotations are given from 1. Perhaps most importantly, when people talk about important mutations (A->G mutation at base 1035; M->I mutation at position 50 in env) in publications etc., they'll be using a one-based system. So the expectation is a bit different here than if we were coming up with an original output. We should definitely document that we switch between what we use internally vs what we output, but for most people using our software, 1-based output is exactly what they will be expecting 🙂
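As an illustration of keeping mutation labels 1-based while indexing 0-based inside the code, here is a small sketch. It assumes mutations are written as reference state, position, new state (for example `A1035G`). The pattern, the function names and the toy sequence are hypothetical and not taken from augur.

```python
import re

# Assumed notation: <ref><1-based position><alt>, e.g. "A1035G"
MUTATION_PATTERN = re.compile(r"^([A-Za-z*-])(\d+)([A-Za-z*-])$")


def parse_mutation(mutation):
    """Split a mutation label into (ref, 1-based position, alt)."""
    match = MUTATION_PATTERN.match(mutation)
    if match is None:
        raise ValueError(f"Unrecognised mutation: {mutation!r}")
    ref, position, alt = match.groups()
    return ref, int(position), alt


def apply_mutation(sequence, mutation):
    """Apply a 1-based mutation label to a Python (0-based) string."""
    ref, position, alt = parse_mutation(mutation)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"Position {position} is outside the sequence")
    index = position - 1  # the only place the 1-based -> 0-based shift happens
    if sequence[index] != ref:
        raise ValueError(f"Expected {ref} at position {position}, found {sequence[index]}")
    return sequence[:index] + alt + sequence[index + 1:]


print(parse_mutation("A1035G"))        # ('A', 1035, 'G')
print(apply_mutation("GTACG", "A3C"))  # GTCCG
```

Users only ever see the 1-based label; the shift to a 0-based index is confined to a single line.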
Currently, `genome_annotations` looks like:
The schema enforces that each gene has a `start`, `end` and `strand`. I'm assuming we will eventually deal with cases of multi-exon genes, etc... where `start`, `end` and `strand` isn't sufficient. Nothing we have currently demands this I believe. I don't know if necessary to address.
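For discussion, one possible shape for an annotation that a single `start`/`end` pair cannot describe is sketched below. It is purely hypothetical: the class, its field names and the example gene are not part of any existing schema, and it assumes 1-based inclusive coordinates for each segment.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SegmentedAnnotation:
    """Hypothetical annotation made of several segments (e.g. exons)."""
    name: str
    strand: str
    segments: List[Tuple[int, int]]  # assumed 1-based, inclusive (start, end) pairs

    @property
    def start(self) -> int:
        # Outer start, so code that expects a single start keeps working
        return min(seg_start for seg_start, _ in self.segments)

    @property
    def end(self) -> int:
        # Outer end, the counterpart of start above
        return max(seg_end for _, seg_end in self.segments)

    def length(self) -> int:
        # Sum of segment lengths; gaps between segments are excluded
        return sum(seg_end - seg_start + 1 for seg_start, seg_end in self.segments)


gene = SegmentedAnnotation(name="geneX", strand="+", segments=[(100, 199), (300, 349)])
print(gene.start, gene.end, gene.length())  # 100 349 150
```

A single-segment gene is then just the special case of one entry in `segments`, so the current `start`, `end` and `strand` fields could be derived rather than replaced.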