Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gff3 file: differences between nextclade datasets and augur ancestral #1655

Open
anna-parker opened this issue Oct 21, 2024 · 3 comments
Open
Labels
documentation Improvements or additions to documentation

Comments

@anna-parker
Copy link
Collaborator

anna-parker commented Oct 21, 2024

Hi! I am not sure if this is a question for nextclade or augur.

I am using augur ancestral as part of a pipeline to create a tree for use in a nextclade dataset: https://github.com/anna-parker/marburg-virus-tree/tree/main. I decided to simplify matters and use the same gff3 file I use in my nextclade dataset - with the goal of having CDS-regions named the same in the nextclade tree and alignment.

However, I realized that nextclade and augur ancestral appear to read the gff3 file differently. For example this is my annotation for the NP CDS (full file here: https://github.com/GenSpectrum/nextclade-datasets/blob/add_marburg/data/marburg/unreleased/genome_annotation.gff3 - it is the same as the gff3 file from genbank expect that I have renamed the CDS by adding a Name=NP field to the start of the annotations)

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_001608.3 1 19111
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3052505
NC_001608.3	RefSeq	region	1	19111	.	+	.	ID=NC_001608.3:1..19111;Dbxref=taxon:3052505;country=Kenya;gbkey=Src;genome=genomic;isolate=Marburg virus/H.sapiens-tc/KEN/1980/Mt. Elgon-Musoke;mol_type=viral cRNA;old-name=Lake Victoria marburgvirus
NC_001608.3	RefSeq	gene	49	2844	.	+	.	ID=gene-MARV_gp1;Dbxref=GeneID:920944;Name=NP;gbkey=Gene;gene=NP;gene_biotype=protein_coding;locus_tag=MARV_gp1
NC_001608.3	RefSeq	mRNA	49	2844	.	+	.	ID=rna-MARV_gp1;Parent=gene-MARV_gp1;Dbxref=GeneID:920944;gbkey=mRNA;gene=NP;locus_tag=MARV_gp1;product=nucleoprotein
NC_001608.3	RefSeq	exon	49	2844	.	+	.	ID=exon-MARV_gp1-1;Parent=rna-MARV_gp1;Dbxref=GeneID:920944;gbkey=mRNA;gene=NP;locus_tag=MARV_gp1;product=nucleoprotein
NC_001608.3	RefSeq	CDS	104	2191	.	+	0	Name=NP;ID=cds-YP_001531153.1;Parent=rna-MARV_gp1;Dbxref=GenBank:YP_001531153.1,GeneID:920944;Name=YP_001531153.1;Note=encapsidates RNA genome;gbkey=CDS;gene=NP;locus_tag=MARV_gp1;product=nucleoprotein;protein_id=YP_001531153.1
...

When creating a nextclade dataset only the CDS field is used and is given the name NP. Which makes sense according to the nextclade docs as:
When a linked gene and CDS are present (CDSs specify their parents by listing the gene’s ID in the Parent attribute), the gene is effectively ignored for all purposes but display in the web UI. CDS segments are joined if they have the same ID, otherwise they are treated as independent.

However, when using this same file in augur ancestral the gene and not the CDS region is used (I can tell because the gene is longer and I get a lot more mutations). I then removed the gene and left only the CDS field, then augur ancestral did no ancestral reconstruction. Only when I renamed the CDS field to gene (see https://github.com/anna-parker/marburg-virus-tree/blob/main/config/reference.gff3) did augur ancestral reconstruct the CDS the same way as in nextclade.

Is this expected behavior? Does augur ancestral only perform ancestral reconstruction on genes? I couldn't find any docs on the way augur ancestral expects gff3 files to be formatted.

Side-note: I was previously using a genbank file and not a gff3 file, and there augur ancestral used the CDS and not the gene, my main reason for changing to a gff3 file was to rename the CDS.

@anna-parker anna-parker added the documentation Improvements or additions to documentation label Oct 21, 2024
@jameshadfield
Copy link
Member

Yes the way Nextclade and augur process GFFs differ. Ivan's done a lot of good work on the nextclade side and we've had a bunch of discussions about how to leverage that work in Augur rather than trying to maintain our own version in parallel, but I don't think any changes are imminent here. Currently here's how Augur does it:

augur/augur/utils.py

Lines 264 to 270 in 862aa37

Read a GFF file. We only read GFF IDs 'gene' or 'source' (the latter may not technically
be a valid GFF field, but is used widely within the Nextstrain ecosystem).
Only the first entry in the GFF file is parsed.
We create a "feature name" via:
- for 'source' IDs use 'nuc'
- for 'gene' IDs use the 'gene', 'gene_name' or 'locus_tag'.
If none are specified, the intention is to silently ignore but there are bugs here.

Side-note: I was previously using a genbank file and not a gff3 file, and there augur ancestral used the CDS and not the gene

Yes, this is (unfortunately) different. The reasons are historical and not really possible to sort out in a backwards compatible way :(

augur/augur/utils.py

Lines 388 to 391 in 862aa37

Read a GenBank file. We only read GenBank feature keys 'CDS' or 'source'.
We create a "feature name" via:
- for 'source' features use 'nuc'
- for 'CDS' features use the locus_tag or the gene. If neither, then silently ignore.

@anna-parker
Copy link
Collaborator Author

@jameshadfield Thanks so much for the explanation! I will then continue with my workaround :-)

@corneliusroemer
Copy link
Member

corneliusroemer commented Nov 9, 2024

Ivan's done a lot of good work on the nextclade side and we've had a bunch of discussions about how to leverage that work in Augur rather than trying to maintain our own version in parallel, but I don't think any changes are imminent here

I got so annoyed by having to create parallel genbank files that I ended up writing our own version, it wasn't so hard in the end and will end up saving me lots of frustration (and other users). I don't think maintenance is going to be a big issue in the end, the extra code is unlikely to change in the future.

The reasons are historical and not really possible to sort out in a backwards compatible way :(

Adding a new special Nextclade compatibility mode thankfully is backwards compatible!

See #1664

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

3 participants