Metabolism estimation for gene cluster bins in a pangenome #2177
Merged
They enable smaller gene cluster function summary dict to be initialized
detect contigs db rather than exclude all other input options
Tested and works beautifully!
The metabolism self tests worked, so I'm merging this :)
This is awesome. I love how
Hmm, now that you mention it @meren, that seems like an interesting direction ;)
As requested by @FlorianTrigodet, this PR adds a new input option to `anvi-estimate-metabolism`: gene cluster bin collections from pangenomes. The idea behind this is that people can see which pathways are complete within the core and accessory genomes as inferred from the pangenome.

Here is the implementation summary. For the estimation process, we need to extract gene annotations from each bin. Since we treat each gene cluster as a 'gene', we summarize the functional annotations across all genes within the cluster and extract the most popular annotation from each source. The function that does this behind the scenes is `dbops.PanSuperclass.init_gene_clusters_functions_summary_dict()`. Once we have the list of annotations for each gene cluster bin, we run metabolism estimation as normal. There are two new functions in `kegg.py` that drive the estimation process for pangenome bins, `init_hits_for_pangenome()` and `estimate_metabolism_for_pangenome_bins()`.

Note that we don't allow copy number estimation for pangenomes. Since we can use multiple annotation sources for user-defined pathways, this means that each gene cluster can still be associated with more than one function, and multiple synonymous enzyme annotations could overinflate the copy number values -- an issue I just fixed in PR #2176, and don't want to re-introduce. I cannot use the same fix for gene clusters because it depends on gene ID information, which gets complicated for gene clusters. However, if someone downstream really wants to estimate copy number for pangenomes, they should let me know and we can consider the best way to implement that :)
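To illustrate the "most popular annotation from each source" idea in the implementation summary, here is a minimal sketch of a per-cluster majority vote. It is not the code in `dbops.PanSuperclass.init_gene_clusters_functions_summary_dict()`; the function name `summarize_cluster_functions`, the input layout (one dict per gene, mapping annotation source to accession), and the example accessions are all assumptions made for illustration.

```python
from collections import Counter


def summarize_cluster_functions(gene_annotations):
    """Pick the most frequent accession per annotation source for one gene cluster.

    `gene_annotations` is a hypothetical structure: a list with one dict per gene
    in the cluster, mapping annotation source -> accession (or None if the gene
    has no hit from that source).
    """
    votes = {}
    for annotation in gene_annotations:
        for source, accession in annotation.items():
            if accession is None:
                continue  # unannotated genes do not vote
            votes.setdefault(source, Counter())[accession] += 1

    # On ties, Counter.most_common() keeps the accession that was seen first.
    return {source: counts.most_common(1)[0][0] for source, counts in votes.items()}


# Three genes in one cluster; accessions are illustrative only.
cluster = [
    {"KOfam": "K00001", "COG20_FUNCTION": "COG1064"},
    {"KOfam": "K00001", "COG20_FUNCTION": None},
    {"KOfam": "K13953"},
]
summary = summarize_cluster_functions(cluster)
```

In this sketch each gene cluster ends up with at most one accession per source, but still one per source overall, which is why a cluster can carry more than one function when several annotation sources are in play.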
One thing that I did NOT include (yet?) for pangenome input is a check for the modules db hash value, like we have for contigs databases. This is not something that gets stored in the genomes storage DB, and theoretically the hash could be different for each genome in the pangenome, so implementing such a sanity check would get complicated. I also wasn't sure if it was worth adding, considering our planned future updates to pathway definitions that will move away from using the modules db. What this means is that it is possible to estimate metabolism using different KEGG versions than were used to annotate the component genomes in the pangenome.
There is no test case yet for this input mode in `anvi-self-test --suite metabolism` (though I am currently running that to make sure I didn't break anything else with these changes), since it would require us to create a pangenome first within that script. I considered adding a test in the pangenomics suite instead, but in that case there is no guarantee that KEGG data is set up (we could do it with a small test set of user-defined pathways perhaps). I wonder if it is better to wait until the pathway definition updates are over to revamp the testing.

Feedback welcome :)