Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metabolism estimation for gene cluster bins in a pangenome #2177

Merged
merged 25 commits into from
Nov 20, 2023

Conversation

ivagljiva
Copy link
Contributor

As requested by @FlorianTrigodet, this PR adds a new input option to anvi-estimate-metabolism: gene cluster bin collections from pangenomes. The idea behind this is that people can see which pathways are complete within the core and accessory genomes as inferred from the pangenome.

Here is the implementation summary. For the estimation process, we need to extract gene annotations from each bin. Since we treat each gene cluster as a 'gene', we summarize the functional annotations across all genes within the cluster and extract the most popular annotation from each source. The function that does this behind the scenes is dbops.PanSuperclass.init_gene_clusters_functions_summary_dict(). Once we have the list of annotations for each gene cluster bin, we run metabolism estimation as normal. There are two new functions in kegg.py that drive the estimation process for pangenome bins, init_hits_for_pangenome() and estimate_metabolism_for_pangenome_bins().

Note that we don't allow copy number estimation for pangenomes. Since we can use multiple annotation sources for user-defined pathways, this means that each gene cluster can still be associated with more than one function, and multiple synonymous enzyme annotations could overinflate the copy number values -- an issue I just fixed in PR #2176, and don't want to re-introduce. I cannot use the same fix for gene clusters because it depends on gene ID information, which gets complicated for gene clusters. However, if someone downstream really wants to estimate copy number for pangenomes, they should let me know and we can consider the best way to implement that :)

One thing that I did NOT include (yet?) for pangenome input is a check for the modules db hash value, like we have for contigs databases. This is not something that gets stored in the genomes storage DB, and theoretically the hash could be different for each genome in the pangenome, so implementing such a sanity check would get complicated. I also wasn't sure if it was worth adding, considering our planned future updates to pathway definitions that will move away from using the modules db. What this means is that it is possible to estimate metabolism using different KEGG versions than were used to annotate the component genomes in the pangenome.

There is no test case yet for this input mode in anvi-self-test --suite metabolism (though I am currently running that to make sure I didn't break anything else with these changes), since it would require us to create a pangenome first within that script. I considered adding a test in the pangenomics suite instead, but in that case there is no guarantee that KEGG data is set up (we could do it with a small test set of user-defined pathways perhaps). I wonder if it is better to wait until the pathway definition updates are over to revamp the testing.

Feedback welcome :)

@FlorianTrigodet
Copy link
Contributor

Tested and works beautifully!

@ivagljiva
Copy link
Contributor Author

The metabolism self tests worked, so I'm merging this :)

@ivagljiva ivagljiva merged commit 1242098 into master Nov 20, 2023
@ivagljiva ivagljiva deleted the gene_cluster_metabolism branch November 20, 2023 19:12
@meren
Copy link
Member

meren commented Nov 20, 2023

This is awesome. I love how anvi-estimate-metabolism is branching into every part of the codebase. WHAT IS NEXT??! Estimating conserved metabolic modules as a function of branching patterns in phylogenomic trees??!!11

@ivagljiva
Copy link
Contributor Author

Hmm, now that you mention it @meren , that seems like an interesting direction ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants