add additional species names to gene labels #116

Open
balhoff opened this issue Apr 30, 2023 · 24 comments

@balhoff
Member

balhoff commented Apr 30, 2023

See geneontology/go-site#1955 (comment)

@pgaudet

pgaudet commented Dec 18, 2023

Species names are still not being displayed in Noctua.

For example, Taxon:42789 & model: http://noctua.geneontology.org/workbench/noctua-visual-pathway-editor/?model_id=gomodel%3A6494e2e900000134

image

@balhoff
Member Author

balhoff commented Dec 18, 2023

This seems to be controlled by the script https://github.com/geneontology/neo/blob/master/gpi2obo.pl. When it is called by the Makefile, a species code needs to be passed to be appended to the gene product name. There seems to be some unfinished work related to handling virus names here:

neo/Makefile

Lines 41 to 68 in b1a1039

# BUG: temporary hardcode until https://github.com/geneontology/go-site/issues/1431 is resolved and stable GPI URL is established
mirror/goa_sars-cov-2.gpi.gz:
	wget --no-check-certificate https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi -O mirror/goa_sars-cov-2.gpi && gzip mirror/goa_sars-cov-2.gpi
target/neo-goa_sars-cov-2.obo: mirror/goa_sars-cov-2.gpi.gz
	gzip -dc $< | ./gpi2obo.pl -s Scov2 -n sars-cov-2 > $@.tmp && mv $@.tmp $@
# ## In support of including viruses and bacteria
# ## (https://github.com/geneontology/neo/issues/77).
# ## http://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_virus_bacteria.gpi.gz
# mirror/uniprot_reviewed_virus_bacteria.gpi.gz:
# wget --no-check-certificate http://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_virus_bacteria.gpi.gz -O mirror/uniprot_reviewed_virus_bacteria.gpi.gz
# target/neo-uniprot_reviewed_virus_bacteria.obo: mirror/uniprot_reviewed_virus_bacteria.gpi.gz
# gzip -dc $< | ./gpi2obo.pl -F -n reviewed_virus_bacteria > $@.tmp && mv $@.tmp $@
## In support of including all swissprot reviewed.
## Download and /filter out by species/.
## (https://github.com/geneontology/neo/issues/82).
## http://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed.gpi.gz
## The filter_list.txt (and option) should not be needed in the future
## as we should be drawing exclusively from datasets.json.
mirror/uniprot_reviewed.gpi.gz: datasets.json
	wget --no-check-certificate http://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed.gpi.gz -O mirror/uniprot_reviewed.gpi.gz.tmp
	gzip -dc mirror/uniprot_reviewed.gpi.gz.tmp > mirror/uniprot_reviewed.gpi.tmp
	perl filter.pl -v --metadata datasets.json --filter filter_list.txt --input mirror/uniprot_reviewed.gpi.tmp > mirror/filtered_uniprot_reviewed.gpi.tmp
	gzip -c mirror/filtered_uniprot_reviewed.gpi.tmp > mirror/filtered_uniprot_reviewed.gpi.gz.tmp
	mv mirror/filtered_uniprot_reviewed.gpi.gz.tmp mirror/uniprot_reviewed.gpi.gz
target/neo-uniprot_reviewed.obo: mirror/uniprot_reviewed.gpi.gz
	gzip -dc $< | ./gpi2obo.pl -F -n reviewed > $@.tmp && mv $@.tmp $@

@kltm
Member

kltm commented Dec 18, 2023

(Quiet shoutout to geneontology/project-management#52)

@pgaudet

pgaudet commented Jan 22, 2024

Trying to quote @kltm

How the data gets in:

  • the GPI file is taken, converted into an OBO file, and loaded
  • the GPI files are species-specific, and the species 4-letter code is inserted when the GPI file is processed into .obo
  • therefore, GPI files with more than one species cannot include a species code and include the taxon ID instead
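
The labelling rule summarized above can be sketched in a few lines of Python. This is only an illustration, not the logic of gpi2obo.pl: the helper name `gene_label` is hypothetical, and the assumption is that the label is the gene product symbol followed by the species code when one is supplied for the file, and by the taxon ID otherwise.

```python
from typing import Optional


def gene_label(symbol: str, taxon_id: str, species_code: Optional[str] = None) -> str:
    """Build a display label for a gene product (hypothetical helper).

    Assumption: the suffix is the species code when one is known for the
    whole file, and the taxon ID when no single code applies.
    """
    suffix = species_code if species_code else taxon_id
    return f"{symbol} {suffix}"


# Single-species file: one code is supplied for every row.
print(gene_label("SHH", "NCBITaxon:9031", species_code="Ggal"))
# Multi-species file: no code can be supplied, so the taxon ID is used.
print(gene_label("P03305", "NCBITaxon:73482"))
```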

@kltm
Member

kltm commented Jan 22, 2024

Noting that this seems to be the "next step" referred to in #77. Essentially, we stopped there before getting to this point.

@pgaudet

pgaudet commented Sep 9, 2024

Right now there is a single file per file - so this is complicated to load

@kltm
Member

kltm commented Nov 21, 2024

Pondering from the meeting earlier today when talking to Patrick and @vanaukenk.

There are a few ways to deal with this. It seems that the codes mostly come from, by way of a JSON derivative, the metadata/datasets YAML files. I'm not sure there's much to do there for adding a bunch of additional species.

It would be nice if we could just modify build-neo-makefile.py with an optional override file, but that too seems to operate on a file-per-species basis.

The most direct way, without redoing a bunch of what we're doing, might be to add GPI files for the species that we want and add the metadata for them in datasets.
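
To make the file-per-species idea concrete, here is a rough Python sketch of rendering one conversion rule per metadata entry, modelled on the existing sars-cov-2 rule in the Makefile. It is not how build-neo-makefile.py actually works: the function `make_rules` and the entry keys `file_stem` and `name` are hypothetical, and only `species_code` is a field name mentioned in this thread.

```python
from typing import Dict, List


def make_rules(datasets: List[Dict[str, str]]) -> str:
    """Render one GPI-to-OBO Makefile rule per dataset entry.

    Assumed keys (hypothetical): "file_stem" for the mirror/target file
    names, "name" for the -n option, and "species_code" for the -s option.
    """
    rules = []
    for ds in datasets:
        stem, name, code = ds["file_stem"], ds["name"], ds["species_code"]
        rules.append(
            f"target/neo-{stem}.obo: mirror/{stem}.gpi.gz\n"
            # Recipe lines in a Makefile must start with a tab.
            f"\tgzip -dc $< | ./gpi2obo.pl -s {code} -n {name} > $@.tmp && mv $@.tmp $@\n"
        )
    return "\n".join(rules)


print(make_rules([
    {"file_stem": "goa_sars-cov-2", "name": "sars-cov-2", "species_code": "Scov2"},
]))
```

With the single entry shown, the output reproduces the existing `target/neo-goa_sars-cov-2.obo` rule, which is the point of the sketch: each additional species would need its own GPI file and its own metadata entry.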

vanaukenk moved this from Todo to In Progress in Pathway Viewer improvements and bug fixes on Jan 16, 2025
@pgaudet

pgaudet commented Jan 27, 2025

GPIs for the organisms requested by Patrick Masson and Paul Denny are now being generated by GOA at each release:
https://ftp.ebi.ac.uk/pub/contrib/goa/virus_bacteria_gpi4neo/

@vanaukenk

Thanks @kltm @pgaudet @alexsign

@kltm - on today's workbenches call I'd just like to confirm how/when/who should test this.

@vanaukenk

From 2025-01-30 workbenches call:

Once the new species are added to neo, we'll test on the next Noctua maintenance outage.

@kltm
Member

kltm commented Feb 1, 2025

The data needed for this is now being populated to the GO mirror of GOA data for our pipeline.

@kltm
Member

kltm commented Feb 4, 2025

Okay, I did a little testing of this NEO data load on amigo-staging (soon reverting), and it looks like there is a little more work to be done:

  • it looks like the data did not automatically propagate as I expected; I will need to explore the NEO build a little more to see what went wrong
  • something that we will have to fix either way is adding the proper species names to the metadata files; specifically, it would be adding the species_code field (e.g., species_code: Ggal) to each of the entries; @pgaudet should I just use the standard shortening here, adding from NCBITaxon, or does someone want something else there for nomenclature?
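
A quick check for metadata entries that still lack the field could look like the sketch below. It assumes the entries have already been loaded into plain dictionaries; the `id` key and the example ids are hypothetical, and only the `species_code` field name comes from the comment above.

```python
from typing import Dict, List


def missing_species_code(entries: List[Dict[str, str]]) -> List[str]:
    """Return the ids of entries with no (or an empty) species_code."""
    return [e["id"] for e in entries if not e.get("species_code")]


entries = [
    {"id": "example_a", "species_code": "Ggal"},
    {"id": "example_b"},  # species_code not added yet
]
print(missing_species_code(entries))  # ['example_b']
```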

@pgaudet

pgaudet commented Feb 4, 2025

Hi @kltm

Are you limited to 4 characters? Ideally we would align to the UniProt 5-character species_code.

@Pauldenny

If it's possible, @kltm @pgaudet, the UniProt 5-character code would be preferable.

@kltm
Member

kltm commented Feb 4, 2025

@Pauldenny For clarification, would this be the canonical source for the 5-character code?
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/speclist.txt

@cmungall
Member

cmungall commented Feb 5, 2025

5 letter uniprot code is good!

@Pauldenny

I believe that's the source you should use, @kltm

@kltm
Member

kltm commented Feb 6, 2025

@Pauldenny The above file mostly works, but is missing two entries for our purposes:

https://www.ncbi.nlm.nih.gov/datasets/taxonomy/36352/
https://www.ncbi.nlm.nih.gov/datasets/taxonomy/128958/

I've used placeholders for the time being.
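
For reference, a lookup from taxon ID to species code against speclist.txt might be sketched as follows. The entry-line layout encoded in the regular expression (code, a one-letter kingdom flag, taxon ID, then `N=` and the scientific name) is an assumption about the file format, and the function name `codes_by_taxon` is hypothetical. A taxon that is absent from the parsed lines simply has no key, which is the case where a placeholder would be needed.

```python
import re
from typing import Dict, Iterable

# Assumed layout of an entry line, e.g. "HUMAN E    9606: N=Homo sapiens".
ENTRY = re.compile(r"^([A-Z0-9]{3,5})\s+([ABEVO])\s+(\d+):\s+N=(.+)$")


def codes_by_taxon(lines: Iterable[str]) -> Dict[str, str]:
    """Map taxon ID -> species code for every line that matches ENTRY."""
    table: Dict[str, str] = {}
    for line in lines:
        match = ENTRY.match(line.rstrip())
        if match:
            code, _kingdom, taxon, _name = match.groups()
            table[taxon] = code
    return table


sample = [
    "HUMAN E    9606: N=Homo sapiens",
    "                 C=Human",  # continuation line, ignored
    "CHICK E    9031: N=Gallus gallus",
]
table = codes_by_taxon(sample)
print(table.get("9031"))   # CHICK
print(table.get("36352"))  # None: not in this sample, so a placeholder is needed
```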

kltm added a commit to geneontology/go-site that referenced this issue Feb 6, 2025
kltm added a commit that referenced this issue Feb 6, 2025
@vanaukenk

vanaukenk commented Feb 6, 2025

@pgaudet @cmungall

Just wanted to clarify about the UniProt five-letter code.

Do we want to apply this to all species in Noctua or just these new ones?

Currently, we just have the five-letter codes applied to the new species.

@kltm
Member

kltm commented Feb 6, 2025

Talking to @tmushayahama, we will need to have a release of the Pathway Viewer that is more clever at removing the "species" part of the label. This will need to be done before we release the new data.
To restate: geneontology/wc-gocam-viz#73 is possibly a blocking issue for proceeding.

@kltm
Member

kltm commented Feb 6, 2025

To properly lay out the issues and options here, it turns out that the Pathway Viewer widget uses a statically compiled file to filter out the species part of the label. The species part of the label is introduced by the NEO data load taken by minerva and propagates to the API and other locations by way of the model JSON.

  • If we proceed without a widget release (which would allow us to get the new species in quickly) and mix the 4-letter and 5-letter code, things would continue to work as they do now, but models with new species' gps would have the 5-letter code exposed in the main part of the display. While this is annoying for some users in some cases, this could be considered as not a blocking issue.
  • If we proceed without a widget release (which would allow us to get the new species in quickly) and standardize on the 5-letter codes, all models would have the 5-letter code exposed in the main part of the display for gps. This is likely somewhat annoying to many users in most cases. I'm not sure if this is a blocking issue--it probably depends on priorities.
  • If we wait until the code is fixed to be a little more flexible and dynamic, we would also want to make sure all of our users have the new widget before we make this data update. This could be a slower process, and we would likely not be aiming to get this done at the next outage. In this case, geneontology/wc-gocam-viz#73 ("Need to update species trimming code for pathway viewer") would be a blocking issue.
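
To illustrate what "more flexible and dynamic" trimming could mean, here is a Python sketch that strips a trailing species token by pattern instead of by a statically compiled list. This is a swapped-in technique for illustration, not the widget's code (which is not shown in this thread); the function name `trim_species` and the three suffix shapes are assumptions based on the label examples mentioned in this issue.

```python
import re

# Assumed suffix shapes (all hypothetical):
#   - a short code such as "Ggal" or "Scov2"
#   - a UniProt-style code of 3 to 5 characters such as "FMDVO"
#   - a taxon ID such as "NCBITaxon:73482"
SPECIES_SUFFIX = re.compile(
    r"\s+(?:[A-Z][a-z]{3}\d?|[A-Z0-9]{3,5}|NCBITaxon:\d+)$"
)


def trim_species(label: str) -> str:
    """Remove a trailing species token from a label, if one is present."""
    return SPECIES_SUFFIX.sub("", label)


for label in ("SHH Ggal", "P03305 FMDVO", "P03305 NCBITaxon:73482", "SHH"):
    print(repr(trim_species(label)))
```

The trade-off is that a pattern can also match the last word of a multi-word symbol that happens to look like a code, so a real fix would probably still want a list of known codes, just one that is not compiled into the widget.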

@Pauldenny

Hi @kltm, thanks for the explainer. I would prefer to get the new species in quickly, if possible, and work around the gene naming.

@kltm
Member

kltm commented Feb 8, 2025

There is a collision between uniprot_reviewed.gpi.gz and taxon_12118.gpi.gz; it looks like they have different names for the same identifier (from @balhoff), likely around "name( P03305 FMDVO)" and "name( P03305 NCBITaxon:73482)". For expediency of testing, I'm going to remove taxon_12118.gpi.gz from the build for the moment to see if we can get more progress.

@kltm
Member

kltm commented Feb 9, 2025

Okay, there seem to be multiple issues. Unfortunately, it stops when hitting the first rather than continuing, so we'll have to take a few passes at this. I will keep a list of issues as I find them here:

  • UniProtKB:P03305 in taxon_12118.gpi.gz
  • UniProtKB:P0DPR4 in taxon_470.gpi.gz

I will be eliminating the files as I go, and then examining the problematic files individually.
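
Since the build stops at the first collision, a pre-check that lists all of them in one pass might save a few rounds. The Python sketch below assumes the labels have already been extracted into `(source, identifier, label)` tuples; the function name `find_collisions` and the generic source names are hypothetical, and which file carries which of the two P03305 labels is not stated in this thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str, str]  # (source file, identifier, label)


def find_collisions(records: Iterable[Record]) -> Dict[str, Set[Tuple[str, str]]]:
    """Return identifiers that carry more than one distinct label.

    Each value is the set of (source, label) pairs seen for that identifier.
    """
    seen: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for source, ident, label in records:
        seen[ident].add((source, label))
    return {
        ident: pairs
        for ident, pairs in seen.items()
        if len({label for _source, label in pairs}) > 1
    }


records = [
    ("file_a", "UniProtKB:P03305", "P03305 FMDVO"),
    ("file_b", "UniProtKB:P03305", "P03305 NCBITaxon:73482"),
    ("file_a", "EX:0001", "geneA"),  # hypothetical, no collision
    ("file_b", "EX:0001", "geneA"),
]
for ident, pairs in find_collisions(records).items():
    print(ident, sorted(pairs))
```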
