Update the Gene Ontology Annotations in 2024 #6

Open
dhimmel opened this issue Aug 22, 2024 · 16 comments

@dhimmel
Owner

dhimmel commented Aug 22, 2024

Motivated by #5

@NegarJanani to attempt to update this repo, you can:

  1. install the conda environment
  2. follow the readme execution command

You will likely hit snags, but we can figure it out when you do

@NegarJanani
Contributor

NegarJanani commented Aug 26, 2024

I installed the Conda environment and executed the command. I encountered some minor errors:

  1. NetworkX Compatibility Issue:
  • networkx==2.6 doesn't support graph.node[go_id] in process.ipynb. I resolved this by changing it to graph.nodes[go_id].
  2. Python Version Compatibility Issue:
  • In Python 3.9 and later, the gcd function is no longer in the fractions module; it is available in the math module instead. Older NetworkX code imported gcd from fractions, which is what triggered the error.
[Screenshot 2024-08-26 at 1 06 53 PM]

As a workaround, I downgraded to Python 3.8. The code ran fine initially, but I encountered a kernel error that I couldn’t resolve:

raise DeadKernelError("Kernel died") from None
nbclient.exceptions.DeadKernelError: Kernel died

I would appreciate any guidance on resolving the kernel issue.
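
One possible stopgap for the gcd failure, sketched under the assumption that the error comes from an older NetworkX release running `from fractions import gcd` at import time: alias `math.gcd` onto the `fractions` module before the legacy package is imported. Upgrading NetworkX is likely the cleaner fix; this only illustrates where the function moved.

```python
import fractions
import math

# fractions.gcd was removed in Python 3.9; math.gcd has existed since 3.5.
# Restoring the old name lets legacy code that runs
# `from fractions import gcd` import without modification.
if not hasattr(fractions, "gcd"):
    fractions.gcd = math.gcd

# Run the lines above before importing the legacy package, for example:
# import networkx  # assumption: an old release that still uses fractions.gcd

print(fractions.gcd(12, 18))
```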

@dhimmel
Owner Author

dhimmel commented Aug 26, 2024

It looks like your networkx and python versions are newer than those pinned in environment.yml:

- conda-forge::networkx=2.2
- conda-forge::numpy=1.15.3
- conda-forge::pandas=0.23.4
- conda-forge::python=3.6.7
- conda-forge::requests=2.20.0
- conda-forge::notebook=5.7.0
- conda-forge::nbconvert=5.4.0
- conda-forge::ipykernel=5.1.0

How did you install the conda environment? Did you try:

conda env create --file=environment.yml

It might be good to upgrade the conda environment and code to work with newer versions, but let's see if we can get the old environment to install first.

The dead kernel could be an out of memory error. How much memory do you have available? I was likely running this on a pretty beefy machine.
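
To test the out-of-memory hypothesis, a cell along these lines could be run after each heavy step in process.ipynb, so that the last printed value shows how much memory the run had used before the kernel died. This is a sketch that relies on the Unix-only `resource` module; `peak_memory_mb` is a hypothetical helper, not part of the repository.

```python
import resource
import sys


def peak_memory_mb() -> float:
    """Return the peak resident set size of this process in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Allocate something sizeable, then report the high-water mark.
data = [0] * 5_000_000
print(f"peak memory: {peak_memory_mb():.0f} MB")
```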

@NegarJanani
Contributor

I didn't use the conda env create --file=environment.yml command to install the Conda environment initially. Instead, I created the environment with:

conda create --name myenv

I have 32 GB of physical memory, but after doing some calculations on the memory used and cached files, I’m left with only about 6 GB of available memory. This might be contributing to the issue.

I’ll go ahead and install the Conda environment using:

conda env create --file=environment.yml

I’ll also try running it on another machine to see if that resolves the problem.

@NegarJanani
Contributor

I tried running:

conda env create --file=environment.yml

However, I encountered the following error:

[Screenshot 2024-08-26 at 4 30 30 PM]

It appears that Conda couldn't find these specific package versions in the conda-forge channel.

@dhimmel
Owner Author

dhimmel commented Aug 27, 2024

Okay, you could unpin everything to get the latest versions and then re-add the pins with whatever versions resolve. This will require more code updates, but might be the best choice if these old conda package binaries no longer exist.
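
As a rough illustration of the unpinning step, the version constraints can be stripped from the dependency specs programmatically. The `unpin` helper is hypothetical; the specs are a subset of those pinned in environment.yml. After the environment solves, the resolved versions (for instance from `conda list`) can be written back as new pins.

```python
import re

PINNED = [
    "conda-forge::networkx=2.2",
    "conda-forge::numpy=1.15.3",
    "conda-forge::python=3.6.7",
]


def unpin(spec: str) -> str:
    """Strip the version constraint from a conda or pip dependency spec."""
    # Split at the first comparison character and keep the package name.
    return re.split(r"[=<>]", spec, maxsplit=1)[0].strip()


unpinned = [unpin(spec) for spec in PINNED]
print(unpinned)
```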

Another option would be to switch to Poetry for managing the environment. An example of what Poetry looks like in a repo is here. Poetry is nice because it creates a lock file that includes the versions of implicit dependencies.

@dhimmel
Owner Author

dhimmel commented Aug 28, 2024

Or instead of conda or poetry, you could try the newest and snazziest option of https://docs.astral.sh/uv/.

@NegarJanani
Contributor

NegarJanani commented Aug 29, 2024

I updated the pinned versions in the environment.yml file:
gene-ontology/environment.yml
Lines 5 to 14 in ae04e74:

  - conda-forge::networkx=2.6
  - conda-forge::numpy=1.24.3
  - conda-forge::pandas=2.0.3
  - conda-forge::python=3.8.19
  - conda-forge::requests=2.32.3
  - conda-forge::notebook=7.2.1
  - conda-forge::nbconvert=7.16.4
  - conda-forge::ipykernel=6.29.5
  - pip:
    - obonet==1.1.0

Additionally, minor changes in process.ipynb can be seen here:

commit ae04e74, specifically lines 390, 404, and 409.

-    "        graph.node[go_id][key].add(gene)\n",
+    "        graph.nodes[go_id][key].add(gene)\n",

These changes should help update the web interface and annotations. I plan to run the command on an HPC to address the memory issue. If it fails, I will consider using Poetry or uv.

@dhimmel
Owner Author

dhimmel commented Aug 31, 2024

Nice work @NegarJanani.

Feel free to open a draft pull request if you'd like more feedback while working on these changes.

For development you could do something like the following to limit memory usage:

gene_df = utilities.read_gene_info(download_dir).head(10_000)
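
If `utilities.read_gene_info` loads the whole file before `.head(10_000)` trims it (an assumption about its implementation), the truncation could instead happen at read time. A minimal standard-library sketch, with `read_first_rows` as a hypothetical helper:

```python
import csv
import gzip
import itertools


def read_first_rows(path, n_rows):
    """Read the header plus at most n_rows data rows from a gzipped TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # islice stops after n_rows, so the rest of the file is never read.
        rows = list(itertools.islice(reader, n_rows))
    return header, rows


# Example call (the path is an assumption about the download layout):
# header, rows = read_first_rows("download/gene_info.gz", 10_000)
```

With pandas, the `nrows` argument of `read_csv` serves the same purpose.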

Have I mentioned that I'd eventually love if we get this to run on a scheduled basis on CI?

@NegarJanani
Contributor

I’ve opened a pull request for the two changes I’ve made so far. I may need to make additional updates to get everything working correctly.

I’ve been running the code on an HPC for two days now. When I checked the results, I noticed a discrepancy: the web version shows 45 taxids, but my file from the HPC contains 1,997 taxids. I also reviewed some files from the last version in 2018 and saw that the numbers have changed. The process is still running, and I’m currently assessing how long it will take and what further changes may happen.

By the way, the idea of running this on a scheduled basis using CI is excellent!

@dhimmel
Owner Author

dhimmel commented Sep 1, 2024

> the web version shows 45 taxids, but my file from the HPC contains 1,997 taxids

To clarify, you are rerunning with newer/current data rather than reusing the old data?

The increase from 45 species to 1,997 is a lot. For development, I'd limit to a couple of species like human and rat.

@NegarJanani
Contributor

I believe the links in run.sh are pointing to the latest version of the data. I can either limit the number of taxons in the utilities.py script and rerun it, or I can keep only the 45 taxons and remove the others before pushing the changes back to the repository.

@dhimmel
Owner Author

dhimmel commented Sep 3, 2024

> I can either limit the number of taxons in the utilities.py script and rerun it, or I can keep only the 45 taxons and remove the others before pushing the changes

I would limit the taxons somewhere in the code, possibly to the 45 that were already supported. You will want the benefits of filtering taxons as early in the processing pipeline as possible to save computation.
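
A sketch of what early filtering could look like: stream the gene2go rows and drop unwanted taxa before anything is loaded into memory. It assumes the first column is named `#tax_id`, as in NCBI's gene2go header; `filter_by_taxid`, the abbreviated header, and the sample rows are illustrative only.

```python
import csv

# Hypothetical development subset: human (9606) and rat (10116).
KEEP_TAXIDS = {"9606", "10116"}


def filter_by_taxid(lines, keep_taxids):
    """Yield gene2go rows (as dicts) whose tax_id is in keep_taxids."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # The gene2go header names the first column '#tax_id'.
        if row["#tax_id"] in keep_taxids:
            yield row


# Made-up rows with an abbreviated header, for illustration only.
sample = [
    "#tax_id\tGeneID\tGO_ID\n",
    "9606\t1\tGO:0005576\n",
    "3702\t814629\tGO:0005634\n",
    "10116\t24152\tGO:0005615\n",
]
kept = list(filter_by_taxid(sample, KEEP_TAXIDS))
print([row["GeneID"] for row in kept])
```

For the real file, `lines` could be the handle returned by `gzip.open(path, "rt")`, so rows for other taxa are discarded as they are decompressed.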

@NegarJanani
Contributor

Following our last conversation, I limited the number of taxon IDs (species) in process.ipynb to 45 taxa to align with the latest update of the website, but I ended up with only 20 updated taxa.

Upon further investigation of the gene2go.gz file from Entrez Gene, I found that 25 of the initial 45 taxon IDs have been removed, and a significant number of new taxon IDs have been added since the last update of the Gene Ontology annotations. The reason appears to be that the annotation counts for each taxon have changed, with some taxa now having too few annotations to be retained.

The removed taxa are as follows:

Shewanella oneidensis MR-1
Sus scrofa
Pseudomonas syringae pv. tomato str. DC3000
Geobacter sulfurreducens PCA
Clostridium perfringens ATCC 13124
Coxiella burnetii RSA 493
Oryza sativa Indica Group
Oryza sativa f. spontanea
Oryza australiensis
Oryza officinalis
Oryza punctata
Oryza glumipatula
Oryza rhizomatis
Oryza latifolia
Oryza grandiglumis
Oryza barthii
Oryza longistaminata
Oryza glaberrima
Oryza eichingeri
Oryza meridionalis
Oryza alta
Oryza minuta
Oryza meyeriana
Oryza ridleyi
Oryza longiglumis

Most of these are from the rice genus (Oryza).
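
The comparison behind this list can be sketched with set operations on the tax_id columns of the two releases. `compare_taxa` is a hypothetical helper, and the IDs below are toy values rather than the real lists.

```python
def compare_taxa(previous, current):
    """Return (removed, retained, added) tax_id sets between two releases."""
    previous, current = set(previous), set(current)
    return previous - current, previous & current, current - previous


# Toy tax_ids, not the real lists.
old_release = {9606, 10116, 9823, 39946}
new_release = {9606, 10116, 7955}
removed, retained, added = compare_taxa(old_release, new_release)
print(sorted(removed), sorted(retained), sorted(added))
```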

@cgreene suggested updating only the 20 taxa that remain from the last update by limiting the number of taxa in the process.ipynb file. Additionally, we could consider adding as many taxa as possible from the current list, which now includes 2,019 taxa (as of today).

@dhimmel
Owner Author

dhimmel commented Oct 15, 2024

> suggested updating only 20 taxa from the last update by limiting the number of taxa in the process.ipynb file

+1 that sounds like a great approach

> we could consider adding as many taxa as possible from the current list, which now includes 2,019 taxa (as of today).

I don't think we should do this yet as our current data storage method is likely inadequate to store so much data. Most users will be focused on the major model organisms and humans.

@NegarJanani
Contributor

NegarJanani commented Oct 21, 2024

I updated the repository based on the latest version of Gene Ontology from Entrez Gene, using the files gene2go.gz and gene_info.gz. These files are quite large: the first is 1.1 GB, and the second is 1.2 GB. I removed gene_info.gz because it was not present in the download directory.

However, gene2go.gz is still too large to be pushed back to the forked repository. I tried using Git LFS, but it didn’t work with the forked repository.

At this point, I can either remove the file and push the update back, or I can create a new repository, although I'd prefer not to create a new one. Could you advise on the best course of action?

@dhimmel
Owner Author

dhimmel commented Oct 23, 2024

Let's not track gene2go.gz and gene_info.gz with git given how massive they've become.

> At this point, I can either remove the file and push the update back

Yes that's a good approach.
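
Before committing, a small check like the following could flag which files are too large to push, so they can be listed in `.gitignore` instead of tracked. It is a sketch: `oversized_files` is a hypothetical helper, and the directory name in the usage comment is an assumption about the repository layout.

```python
from pathlib import Path

# GitHub rejects pushes containing files larger than 100 MiB.
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(directory, limit_bytes=GITHUB_LIMIT_BYTES):
    """Return paths under directory whose size exceeds limit_bytes, sorted."""
    return sorted(
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_size > limit_bytes
    )


# Example call (the directory name is an assumption):
# for path in oversized_files("download"):
#     print(path)
```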
