Update the Gene Ontology Annotations in 2024 #6
I installed the Conda environment and executed the command, but encountered some minor errors:
As a workaround, I downgraded to Python 3.8. The code ran fine initially, but I encountered a kernel error that I couldn’t resolve:
I would appreciate any guidance on resolving the kernel issue.
It looks like your networkx and Python versions are newer than those pinned in environment.yml (lines 5 to 12 at commit d57fd93).
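For diagnosing the mismatch, a quick in-notebook check of the installed versions could look like the sketch below (it only prints the interpreter and networkx versions for comparison against the pins):

```python
import sys
import networkx

# Print the Python interpreter version and the installed networkx version
# to compare against the versions pinned in environment.yml.
print("python:", sys.version)
print("networkx:", networkx.__version__)
```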
How did you install the conda environment? Did you try `conda env create --file=environment.yml`?
It might be good to upgrade the conda environment and code to work with newer versions, but let's see if we can get the old environment to install first. The dead kernel could be an out-of-memory error. How much memory do you have available? I was likely running this on a pretty beefy machine.
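If it helps to confirm the out-of-memory theory, available memory can be checked from inside the notebook with something like the following (a minimal sketch; it assumes the psutil package is installed, which is not necessarily part of this repo's environment):

```python
# Quick memory check from within the notebook.
# Assumes psutil is installed (e.g., pip install psutil).
import psutil

mem = psutil.virtual_memory()
print(f"total: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")
```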
I didn't use the `conda env create --file=environment.yml` command to install the Conda environment initially. Instead, I created the environment with:
I have 32 GB of physical memory, but after accounting for memory already in use and cached files, only about 6 GB is available. This might be contributing to the issue. I'll go ahead and install the Conda environment using `conda env create --file=environment.yml`.
I'll also try running it on another machine to see if that resolves the problem.
Okay, you could unpin everything to get the latest versions and then re-add the pins with whatever versions resolve. This will require more code updates, but it might be the best choice if those old conda package binaries no longer exist. Another option would be to switch to poetry.
Or, instead of poetry, uv.
It seems the pinned versions in environment.yml needed to be updated.
Additionally, minor changes to process.ipynb can be seen in commit ae04e74, specifically lines 390, 404, and 409.
These changes should help update the web interface and annotations. I plan to run the command on an HPC to address the memory issue. If that fails, I will consider using poetry or uv.
Nice work @NegarJanani. Feel free to open a draft pull request if you'd like more feedback while working on these changes. For development, you could do something like the following to limit memory usage:

```python
gene_df = utilities.read_gene_info(download_dir).head(10_000)
```

Have I mentioned that I'd eventually love to get this running on a scheduled basis on CI?
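To make that concrete, a development toggle along the lines of the sketch below could gate the subsetting. `DEV_MODE` and the row cap are illustrative assumptions, not part of the repo; `utilities.read_gene_info` and `download_dir` are the names used in the comment above.

```python
# Hypothetical development toggle: cap the number of genes processed so the
# notebook fits in limited memory. DEV_MODE and the 10,000-row cap are
# illustrative; utilities.read_gene_info is the repo's existing reader.
DEV_MODE = True

gene_df = utilities.read_gene_info(download_dir)
if DEV_MODE:
    gene_df = gene_df.head(10_000)
```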
I've opened a pull request for the two changes I've made so far. I may need to make additional updates to get everything working correctly. I've been running the code on an HPC for two days now. When I checked the results, I noticed a discrepancy: the web version shows 45 taxids, but my file from the HPC contains 1,997 taxids. I also reviewed some files from the last version in 2018 and saw that the numbers have changed. The process is still running, and I'm currently assessing how long it will take and what further changes may be needed. By the way, the idea of running this on a scheduled basis using CI is excellent!
To clarify, you are rerunning with newer/current data rather than reusing the old data? The increase from 25 species to 1,997 is a lot. For development, I'd limit to a couple species like human and rat. |
I believe the links in run.sh are pointing to the latest version of the data. I can either limit the number of taxons in the utilities.py script and rerun it, or I can keep only the 45 taxons and remove the others before pushing the changes back to the repository. |
I would limit the taxons somewhere in the code, possibly to the 45 that were already supported. You will want to filter taxons as early in the processing pipeline as possible to save computation.
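As a sketch of what filtering early might look like, assuming gene2go.gz follows NCBI's tab-separated format with a `#tax_id` first column (the chunked read and the two example taxids are illustrative, not the repo's actual code):

```python
import pandas as pd

# Example taxids to keep: human (9606) and rat (10116).
SUPPORTED_TAXIDS = {9606, 10116}

# Read gene2go.gz in chunks and drop unsupported taxa immediately, so the
# full ~1 GB file never has to be held in memory at once.
chunks = pd.read_csv("gene2go.gz", sep="\t", chunksize=500_000)
gene2go_df = pd.concat(
    chunk[chunk["#tax_id"].isin(SUPPORTED_TAXIDS)] for chunk in chunks
)
```

Filtering at read time keeps every downstream step proportional to the supported taxa rather than to the full file.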
Following our last conversation, I limited the number of taxon IDs (species) in process.ipynb to the 45 taxa from the latest update of the website, but I ended up with only 20 updated taxa. Upon further investigation of the gene2go.gz file from Entrez Gene, I found that 25 of the initial 45 taxon IDs have been removed, and a significant number of new taxon IDs have been added since the last update of the Gene Ontology annotations. The reason appears to be that the number of annotations per taxon has changed, with some taxa now having too few annotations to be retained. The removed taxa include Shewanella oneidensis MR-1, among others.

@cgreene suggested updating only the 20 taxa retained from the last update by limiting the number of taxa in the process.ipynb file. Additionally, we could consider adding as many taxa as possible from the current list, which now includes 2,019 taxa (as of today).
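For reference, one way to see exactly which taxa were dropped or added between releases is a set comparison over the `#tax_id` column (a sketch; the file names are hypothetical placeholders and the column name follows the gene2go format):

```python
import pandas as pd

# Load only the tax_id column from an old and a new gene2go release.
# The file names here are hypothetical placeholders.
old = pd.read_csv("gene2go_2018.gz", sep="\t", usecols=["#tax_id"])
new = pd.read_csv("gene2go_2024.gz", sep="\t", usecols=["#tax_id"])

old_taxids, new_taxids = set(old["#tax_id"]), set(new["#tax_id"])
print("removed since last update:", sorted(old_taxids - new_taxids))
print("added since last update:", sorted(new_taxids - old_taxids))
```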
+1 that sounds like a great approach.
I don't think we should do this yet, as our current data storage method is likely inadequate to store so much data. Most users will be focused on the major model organisms and humans.
I updated the repository based on the latest version of Gene Ontology from Entrez Gene, using the files gene2go.gz and gene_info.gz. These files are quite large: the first is 1.1 GB, and the second is 1.2 GB. I removed the old data files; however, the new downloads are too large to push to GitHub. At this point, I can either remove the file and push the update back, or I can create a new repository, although I'd prefer not to create a new one. Could you advise on the best course of action?
Let's not track these large data files in git.
Yes, that's a good approach.
Motivated by #5
@NegarJanani, to attempt to update this repo, you can:
You will likely hit snags, but we can figure it out when you do.