Update the Gene Ontology Annotations in 2024 #6

dhimmel · 2024-08-22T20:29:22Z

Motivated by #5

@NegarJanani to attempt to update this repo, you can:

install the conda environment
follow the readme execution command

You will likely hit snags, but we can figure it out when you do

NegarJanani · 2024-08-26T19:19:46Z

I installed the Conda environment and executed the command. I encountered some minor errors:

NetworkX Compatibility Issue:

networkx==2.6 doesn't support graph.node[go_id] in process.ipynb. I resolved this by changing it to graph.nodes[go_id].

Python Version Compatibility Issue:

In Python 3.9 and later, the gcd function is no longer available from the fractions module; it lives in the math module instead. Older NetworkX releases imported gcd from fractions, so they fail to import on Python 3.9+.
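For code you control, a version-tolerant import is one way to sidestep this without downgrading Python. The following is only a minimal sketch, not the change made in this repo; when the failing import sits inside an installed NetworkX release, upgrading NetworkX is the practical fix.

```python
# Minimal sketch: import gcd in a way that works before and after Python 3.9.
try:
    from math import gcd  # available since Python 3.5
except ImportError:  # only reached on very old interpreters
    from fractions import gcd  # this location was removed in Python 3.9

print(gcd(12, 18))  # greatest common divisor of 12 and 18
```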

As a workaround, I downgraded to Python 3.8. The code ran fine initially, but I encountered a kernel error that I couldn’t resolve:

 ***raise DeadKernelError("Kernel died") from None***
***nbclient.exceptions.DeadKernelError: Kernel died***

I would appreciate any guidance on resolving the kernel issue.

dhimmel · 2024-08-26T20:29:42Z

It looks like your networkx and python versions are newer than those pinned in environment.yml:

gene-ontology/environment.yml
Lines 5 to 12 in d57fd93:

  - conda-forge::networkx=2.2
  - conda-forge::numpy=1.15.3
  - conda-forge::pandas=0.23.4
  - conda-forge::python=3.6.7
  - conda-forge::requests=2.20.0
  - conda-forge::notebook=5.7.0
  - conda-forge::nbconvert=5.4.0
  - conda-forge::ipykernel=5.1.0

How did you install the conda environment? Did you try:

conda env create --file=environment.yml

It might be good to upgrade the conda environment and code to work with newer versions, but let's see if we can get the old environment to install first.

The dead kernel could be an out of memory error. How much memory do you have available? I was likely running this on a pretty beefy machine.

NegarJanani · 2024-08-26T21:12:19Z

I didn't use the conda env create --file=environment.yml command to install the Conda environment initially. Instead, I created the environment with:

conda create --name myenv

I have 32 GB of physical memory, but after doing some calculations on the memory used and cached files, I’m left with only about 6 GB of available memory. This might be contributing to the issue.

I’ll go ahead and install the Conda environment using:

conda env create --file=environment.yml

I’ll also try running it on another machine to see if that resolves the problem.

NegarJanani · 2024-08-26T22:33:55Z

I tried running:

conda env create --file=environment.yml

However, I encountered the following error:

It appears that Conda couldn't find these specific package versions in the conda-forge channel.

dhimmel · 2024-08-27T16:15:16Z

Okay, you could unpin everything to get the latest versions and then re-add the pins with whatever versions resolve. This will require more code updates, but might be the best choice if these old conda package binaries no longer exist.

Another option would be to switch to poetry for managing the environment. Example of what poetry in repo looks like here. Poetry is nice because it creates a lock file that includes versions of implicit dependencies.

dhimmel · 2024-08-28T20:00:41Z

Or instead of conda or poetry, you could try the newest and snazziest option of https://docs.astral.sh/uv/.

NegarJanani · 2024-08-29T18:10:13Z

I updated the pinned versions in the environment.yml file:
gene-ontology/environment.yml
Lines 5 to 14 in ae04e74:

  - conda-forge::networkx=2.6
  - conda-forge::numpy=1.24.3
  - conda-forge::pandas=2.0.3
  - conda-forge::python=3.8.19
  - conda-forge::requests=2.32.3
  - conda-forge::notebook=7.2.1
  - conda-forge::nbconvert=7.16.4
  - conda-forge::ipykernel=6.29.5
  - pip:
    - obonet==1.1.0

Additionally, minor changes in process.ipynb can be seen in commit ae04e74, specifically at lines 390, 404, and 409, where graph.node was replaced with graph.nodes:

    -    "        graph.node[go_id][key].add(gene)\n",
    +    "        graph.nodes[go_id][key].add(gene)\n",

These changes should help update the web interface and annotations. I plan to run the command on an HPC to address the memory issue. If it fails, I will consider using poetry or uv.

dhimmel · 2024-08-31T13:15:49Z

Nice work @NegarJanani.

Feel free to open a draft pull request if you'd like more feedback while working on these changes.

For development you could do something like the following to limit memory usage:

gene_df = utilities.read_gene_info(download_dir).head(10_000)

Have I mentioned that I'd eventually love if we get this to run on a scheduled basis on CI?

NegarJanani · 2024-09-01T19:20:23Z

I’ve opened a pull request for the two changes I’ve made so far. I may need to make additional updates to get everything working correctly.

I’ve been running the code on an HPC for two days now. When I checked the results, I noticed a discrepancy: the web version shows 45 taxids, but my file from the HPC contains 1,997 taxids. I also reviewed some files from the last version in 2018 and saw that the numbers have changed. The process is still running, and I’m currently assessing how long it will take and what further changes may happen.

By the way, the idea of running this on a scheduled basis using CI is excellent!

dhimmel · 2024-09-01T21:32:38Z

> the web version shows 45 taxids, but my file from the HPC contains 1,997 taxids

To clarify, you are rerunning with newer/current data rather than reusing the old data?

The increase from 45 species to 1,997 is a lot. For development, I'd limit to a couple of species like human and rat.

NegarJanani · 2024-09-02T15:23:05Z

I believe the links in run.sh are pointing to the latest version of the data. I can either limit the number of taxons in the utilities.py script and rerun it, or I can keep only the 45 taxons and remove the others before pushing the changes back to the repository.

dhimmel · 2024-09-03T00:33:16Z

> I can either limit the number of taxons in the utilities.py script and rerun it, or I can keep only the 45 taxons and remove the others before pushing the changes

I would limit the taxons somewhere in the code, possibly to the 45 that were already supported. You will want the benefits of filtering taxons as early in the processing pipeline as possible to save computation.
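Filtering early could look roughly like the sketch below, which streams a gzipped tab-separated file and keeps only rows for selected taxa before anything is loaded into memory. The function name `iter_rows_for_taxa` is hypothetical and not part of utilities.py, and the code assumes the header's first column is `#tax_id` (as in NCBI's gene2go.gz); adjust it if the layout differs.

```python
import csv
import gzip


def iter_rows_for_taxa(path, keep_tax_ids):
    """Yield rows (as dicts) from a gzipped gene2go-style TSV for selected taxa.

    Assumption: the first line is a header whose first column is '#tax_id'.
    """
    # Compare as strings, since csv yields every field as text.
    keep = {str(tax_id) for tax_id in keep_tax_ids}
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["#tax_id"] in keep:
                yield row
```

For development, something like `list(iter_rows_for_taxa(path, {9606, 10116}))` would keep only human (9606) and rat (10116) rows, where `path` points at the downloaded gene2go.gz.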

NegarJanani · 2024-10-07T15:57:25Z

Following our last conversation, I limited the number of taxon IDs (species) in process.ipynb to 45 taxa to align with the latest update of the website, but I ended up with only 20 updated taxa.

Upon further investigation of the gene2go.gz file from Entrez Gene, I found that 25 of the initial 45 taxon IDs have been removed, and a significant number of new taxon IDs have been added since the last update of the Gene Ontology annotations. The reason for these changes appears to be that the number of annotations for each taxon has been updated, with some having too few annotations to be retained.

The removed taxa are as follows:

Shewanella oneidensis MR-1
Sus scrofa
Pseudomonas syringae pv. tomato str. DC3000
Geobacter sulfurreducens PCA
Clostridium perfringens ATCC 13124
Coxiella burnetii RSA 493
Oryza sativa Indica Group
Oryza sativa f. spontanea
Oryza australiensis
Oryza officinalis
Oryza punctata
Oryza glumipatula
Oryza rhizomatis
Oryza latifolia
Oryza grandiglumis
Oryza barthii
Oryza longistaminata
Oryza glaberrima
Oryza eichingeri
Oryza meridionalis
Oryza alta
Oryza minuta
Oryza meyeriana
Oryza ridleyi
Oryza longiglumis

Most of these are from the rice genus (Oryza).
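A comparison like the one above can be reproduced with plain set operations. This is a minimal sketch with a hypothetical helper name, `compare_taxa`, and toy taxon IDs chosen only for illustration (they are not the real 45-taxon list).

```python
def compare_taxa(previous_tax_ids, current_tax_ids):
    """Split taxon IDs into retained, removed, and added groups."""
    previous, current = set(previous_tax_ids), set(current_tax_ids)
    return {
        "retained": sorted(previous & current),  # in both releases
        "removed": sorted(previous - current),   # only in the old release
        "added": sorted(current - previous),     # only in the new release
    }


# Toy IDs for illustration only.
result = compare_taxa([9606, 10116, 9823], [9606, 10116, 7955])
print(result)
```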

@cgreene suggested updating only 20 taxa from the last update by limiting the number of taxa in the process.ipynb file. Additionally, we could consider adding as many taxa as possible from the current list, which now includes 2,019 taxa (as of today).

dhimmel · 2024-10-15T15:20:51Z

> suggested updating only 20 taxa from the last update by limiting the number of taxa in the process.ipynb file

+1, that sounds like a great approach.

> we could consider adding as many taxa as possible from the current list, which now includes 2,019 taxa (as of today).

I don't think we should do this yet as our current data storage method is likely inadequate to store so much data. Most users will be focused on the major model organisms and humans.

NegarJanani · 2024-10-21T20:21:21Z

I updated the repository based on the latest version of Gene Ontology from Entrez Gene, using the files gene2go.gz and gene_info.gz. These files are quite large: the first is 1.1 GB, and the second is 1.2 GB. I removed gene_info.gz because it was not present in the download directory.

However, gene2go.gz is still too large to be pushed back to the forked repository. I tried using Git LFS, but it didn’t work with the forked repository.

At this point, I can either remove the file and push the update back, or I can create a new repository, although I'd prefer not to create a new one. Could you advise on the best course of action?

dhimmel · 2024-10-23T12:40:01Z

Let's not track gene2go.gz and gene_info.gz with git given how massive they've become.

> At this point, I can either remove the file and push the update back

Yes that's a good approach.
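With the large files untracked, they have to be fetched at run time instead. The thread indicates run.sh already holds the download links; the sketch below only illustrates the same idea in Python, skipping the download when the file is already present. The function name `download_if_missing` is hypothetical and not part of this repo.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if dest.exists():
        return dest  # reuse the local copy instead of re-downloading
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest
```

It would be called with the gene2go.gz URL from run.sh and a path inside the download directory, with that directory's large files excluded from version control.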

dhimmel mentioned this issue Sep 1, 2024

Upgrade conda environment and update 'process.ipynb' for compatibility with newer Networkx version #7

Merged
