Include genome (+ annotation) #17

Closed
edkerk opened this issue Aug 13, 2020 · 7 comments · Fixed by #21 or #24
Labels
dependencies dependency issues relevant to GEM-type repos

Comments

@edkerk
Collaborator

edkerk commented Aug 13, 2020

As raised by @cshenry:

One thing I would consider to be of utmost importance in such a site is to properly represent the genomes linked to the models. Ideally, I would prefer to see the site maintain its own internal compressed copies of GFF and FASTA files for genomes associated with any models stored there. People routinely use genome IDs… but these IDs go away or genes get recalled and it makes things difficult. I would argue a model is nearly useless without its associated genome, and finding the exact correct genome that should be mapped to a particular published model is one of my greatest pain points in trying to use these models in my own research. You could store protein sequences in the model, which would help, but without the genome, you’re still losing some provenance on where the protein came from.

Seems like a valid point. I'm not convinced about the compressed copy, though; I'm always happier to avoid binary files in git.

@Midnighter
Collaborator

This goes a lot further than my intention with #13. Is there no stable genome identifier at all?

@cshenry

cshenry commented Aug 13, 2020 via email

@haowang-bioinfo haowang-bioinfo added the dependencies dependency issues relevant to GEM-type repos label Aug 14, 2020
@mihai-sysbio
Member

Following the pointers above from @cshenry (thank you), I have found that only PATRIC and GenBank have publicly available genome identifiers; one can only get so far in KBase without having to log in. Personally, I do not consider any of these to follow FAIR principles, but having a genome ID in the template is an improvement nevertheless.

@Midnighter
Collaborator

Midnighter commented Sep 8, 2020

What about assemblies at NCBI? For example, https://www.ncbi.nlm.nih.gov/assembly/GCF_000007565.2/ lists:

GenBank assembly accession:
    GCA_000007565.2 (latest)
RefSeq assembly accession:
    GCF_000007565.2 (latest)

I would hope that those are stable?

They do exist at identifiers.org so that's a plus https://registry.identifiers.org/registry/insdc.gca and https://registry.identifiers.org/registry/refseq.
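To make the accession format above concrete, here is a minimal sketch of how a template check might validate and split an assembly accession. The regular expression (a `GCA` or `GCF` prefix, nine digits, a dot, and a version) is an assumption based on the two accessions quoted above, and `parse_assembly_accession` is a hypothetical helper, not part of any existing tool.

```python
import re

# Assumed shape of an NCBI assembly accession, inferred from the examples above:
# "GCA" (GenBank) or "GCF" (RefSeq), underscore, nine digits, dot, version number.
ASSEMBLY_RE = re.compile(r"(GCA|GCF)_(\d{9})\.(\d+)")


def parse_assembly_accession(accession):
    """Split an assembly accession into source, number, and version.

    Hypothetical helper for illustration; raises ValueError on anything
    that does not match the assumed pattern.
    """
    match = ASSEMBLY_RE.fullmatch(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return {
        "source": "GenBank" if prefix == "GCA" else "RefSeq",
        "number": number,
        "version": int(version),
    }


if __name__ == "__main__":
    print(parse_assembly_accession("GCF_000007565.2"))
```

Keeping the version as part of the stored ID matters here, since an unversioned accession would not pin down the exact genome a model was built on.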

@mihai-sysbio
Member

mihai-sysbio commented Sep 8, 2020

Good idea about RefSeq! Here is the comparison to GenBank:

The GenBank archival sequence database includes publicly available DNA sequences submitted from individual laboratories and large-scale sequencing projects. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC) along with the European Nucleotide Archive and the DNA Data Bank of Japan (DDBJ). Submitted sequence data is exchanged daily between the three collaborators to achieve comprehensive worldwide coverage. As an archival database, GenBank can be very redundant for some loci. GenBank sequence records are owned by the original submitter and cannot be altered by a third party.
RefSeq sequences are not part of the INSDC but are derived from INSDC sequences to provide non-redundant curated data representing our current knowledge of known genes. Some records include sequence information gathered from more than one INSDC record. Records may include sequence, descriptive information, publications, or feature annotation that is not available from any single INSDC record. RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional information. Also see the appendix provided in the NCBI Handbook, GenBank chapter.
Another distinction is that transcripts and proteins annotated on RefSeq genomic records are instantiated as separate records; in contrast, GenBank only instantiates the proteins annotated on genomic sequence records.

Moreover, RefSeq has an identifiers.org profile, which makes the compact identifier both human-readable and useful (e.g. refseq:NP_012345). Never mind, that is only for protein IDs.

@mihai-sysbio
Member

race condition over there with the edits :)
insdc.gca:GCF_000007565.2 seems to work nicely
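As a rough illustration of the compact identifier above, the sketch below joins a registry prefix and an accession and turns the result into a resolver URL. It assumes the usual `https://identifiers.org/<prefix>:<accession>` resolution scheme; both function names are hypothetical and chosen only for this example.

```python
def compact_identifier(prefix, accession):
    """Join a registry prefix and an accession into a compact identifier."""
    return f"{prefix}:{accession}"


def resolver_url(compact_id, resolver="https://identifiers.org/"):
    """Build a resolver URL for a compact identifier.

    The default resolver base is an assumption about how identifiers.org
    resolves compact identifiers; no network request is made here.
    """
    return resolver + compact_id


if __name__ == "__main__":
    cid = compact_identifier("insdc.gca", "GCF_000007565.2")
    print(cid)
    print(resolver_url(cid))
```

Storing the compact identifier rather than a full URL in the template keeps the entry short while still being resolvable.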

@mihai-sysbio
Member

I believe this issue is resolved in the linked PRs - please reopen this issue if needed.

This was referenced Oct 1, 2020
5 participants