Metabolism estimation for gene cluster bins in a pangenome #2177

ivagljiva · 2023-11-20T16:30:57Z

As requested by @FlorianTrigodet, this PR adds a new input option to anvi-estimate-metabolism: gene cluster bin collections from pangenomes. The idea is to let people see which pathways are complete within the core and accessory genomes inferred from the pangenome.

Here is a summary of the implementation. To estimate metabolism, we need to extract gene annotations from each bin. Since we treat each gene cluster as a 'gene', we summarize the functional annotations across all genes within the cluster and keep the most common annotation from each source. The function that does this behind the scenes is dbops.PanSuperclass.init_gene_clusters_functions_summary_dict(). Once we have the list of annotations for each gene cluster bin, we run metabolism estimation as normal. Two new functions in kegg.py drive the estimation process for pangenome bins: init_hits_for_pangenome() and estimate_metabolism_for_pangenome_bins().
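To make the "most common annotation per source" idea concrete, here is a minimal standalone sketch. It is not the anvi'o implementation: the function name summarize_cluster_annotations, the input layout, and the example accessions are assumptions made for illustration only.

```python
from collections import Counter


def summarize_cluster_annotations(gene_annotations):
    """Pick the most common accession per annotation source for one gene cluster.

    gene_annotations: list with one dict per gene in the cluster, mapping
    annotation source -> accession (or None if the gene has no hit).
    This layout is hypothetical, not the structure anvi'o uses internally.
    """
    per_source = {}
    for annotations in gene_annotations:
        for source, accession in annotations.items():
            if accession is None:
                continue  # genes without a hit do not vote
            per_source.setdefault(source, Counter())[accession] += 1
    # most_common(1) returns a list holding the top (accession, count) pair
    return {source: counts.most_common(1)[0][0] for source, counts in per_source.items()}


# Illustrative data: three genes in one gene cluster, two annotation sources
cluster = [
    {"KOfam": "K00001", "COG20_FUNCTION": "COG1064"},
    {"KOfam": "K00001", "COG20_FUNCTION": None},
    {"KOfam": "K13953", "COG20_FUNCTION": "COG1064"},
]
summary = summarize_cluster_annotations(cluster)
print(summary)
```

The result holds one representative accession per source for the gene cluster, which is what lets the cluster stand in for a single 'gene' during estimation.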

Note that we don't allow copy number estimation for pangenomes. Because user-defined pathways can draw on multiple annotation sources, each gene cluster can still be associated with more than one function, and multiple synonymous enzyme annotations could inflate the copy number values. That is an issue I just fixed in PR #2176 and don't want to re-introduce. I cannot use the same fix for gene clusters because it depends on gene ID information, which gets complicated for gene clusters. However, if someone downstream really wants to estimate copy number for pangenomes, they should let me know and we can consider the best way to implement that :)
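A toy illustration of the inflation problem, under stated assumptions: the helper names, the MY_ENZYMES source, and the accessions below are hypothetical, and this is neither the fix from PR #2176 nor anything in kegg.py. It only shows why counting annotation hits overstates how many gene clusters carry an enzyme.

```python
def count_matching_annotations(bin_annotations, step_enzymes):
    """Naive count: every annotation that matches the pathway step adds one."""
    return sum(
        accession in step_enzymes
        for annotations in bin_annotations.values()
        for accession in annotations.values()
    )


def clusters_with_match(bin_annotations, step_enzymes):
    """Gene clusters that carry at least one annotation matching the step."""
    return sorted(
        cluster_id
        for cluster_id, annotations in bin_annotations.items()
        if step_enzymes & set(annotations.values())
    )


# Hypothetical pathway step satisfied by either of two synonymous accessions
step_enzymes = {"K00001", "ENZ_ADH"}

# Hypothetical bin: one cluster annotated by both sources, one unrelated cluster
bin_annotations = {
    "GC_00000001": {"KOfam": "K00001", "MY_ENZYMES": "ENZ_ADH"},
    "GC_00000002": {"KOfam": "K_OTHER", "MY_ENZYMES": "ENZ_OTHER"},
}

naive_count = count_matching_annotations(bin_annotations, step_enzymes)
matching = clusters_with_match(bin_annotations, step_enzymes)
print(naive_count, matching)
```

Here the naive count is 2 even though only one gene cluster carries the function, which is the kind of overcount that disabling copy number estimation avoids.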

One thing that I did NOT include (yet?) for pangenome input is a check of the modules db hash value, like the one we have for contigs databases. The hash is not stored in the genomes storage DB, and in theory it could be different for each genome in the pangenome, so implementing such a sanity check would get complicated. I also wasn't sure it was worth adding, considering our planned updates to pathway definitions that will move away from using the modules db. In practice, this means it is possible to estimate metabolism using a different KEGG version than the one used to annotate the component genomes in the pangenome.
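For reference, a per-genome version of that check could look roughly like the sketch below. It is purely hypothetical: it assumes a genome-to-hash mapping is available, which is exactly the information the genomes storage DB does not store today, and find_hash_mismatches is an invented name.

```python
def find_hash_mismatches(genome_hashes, modules_db_hash):
    """Return the genomes whose recorded hash differs from the modules db hash.

    genome_hashes: hypothetical dict mapping genome name -> the modules db
    hash recorded when that genome was annotated.
    """
    return sorted(
        genome
        for genome, recorded_hash in genome_hashes.items()
        if recorded_hash != modules_db_hash
    )


# Made-up hashes: genome_C was annotated with a different KEGG snapshot
genome_hashes = {
    "genome_A": "abc123",
    "genome_B": "abc123",
    "genome_C": "def456",
}
mismatched = find_hash_mismatches(genome_hashes, "abc123")
print(mismatched)
```

Because each genome can carry its own hash, the check has to report a list of offending genomes rather than a single pass/fail comparison, which is part of what makes it more involved than the contigs database case.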

There is no test case yet for this input mode in anvi-self-test --suite metabolism (though I am currently running that suite to make sure I didn't break anything else with these changes), since it would require creating a pangenome within that script first. I considered adding a test to the pangenomics suite instead, but there is no guarantee that KEGG data is set up in that case (perhaps we could do it with a small test set of user-defined pathways). I wonder if it is better to wait until the pathway definition updates are done before revamping the testing.

Feedback welcome :)

FlorianTrigodet · 2023-11-20T17:38:33Z

Tested and works beautifully!

ivagljiva · 2023-11-20T19:12:04Z

The metabolism self tests worked, so I'm merging this :)

meren · 2023-11-20T20:21:56Z

This is awesome. I love how anvi-estimate-metabolism is branching into every part of the codebase. WHAT IS NEXT??! Estimating conserved metabolic modules as a function of branching patterns in phylogenomic trees??!!11

ivagljiva · 2023-11-21T13:23:16Z

Hmm, now that you mention it @meren , that seems like an interesting direction ;)

ivagljiva added 25 commits November 16, 2023 17:33

new params for init_gene_clusters_functions_summary_dict.

8a75c20

They enable smaller gene cluster function summary dict to be initialized

add arg group for pangenomes

fb46c00

args, better help for pangenome input

c126e95

accept new args and sanity check them

368a60a

reporting on input options including pan dbs

4598177

better if statements

26579a4

detect contigs db rather than exclude all other input options

new function to check if pan db and genomes storage db are compatible

1870158

function to return func annotation sources (if any) from DBInfo

34212fc

change if statement order for loading user pathway data

43e053e

sanity check for sources in genome storage db

ac18410

start option for pangenomes

1449e79

load collection with sanity check for its existence

6b2021c

get list of all gene clusters in collection

42db729

function to load summary of gene cluster annotations

8978272

estimation function for pan bins

a1cd9ee

better header for gene cluster ids

c005349

use the new functions to estimate

91ff8dc

a better tutorial for metabolism code

fc9707b

add pan and gs as accepted artifacts

4d2f245

clarify that genome mode can be used with metagenomes

ebb0247

new help page section on pangenomes

69d58c4

note on annotation for pangenome input

736150f

note on full pangenome

5053bba

fix wrong attribute for collection

eefc80c

don't add annotation if accession is None

3934b39

ivagljiva merged commit 1242098 into master Nov 20, 2023

ivagljiva deleted the gene_cluster_metabolism branch November 20, 2023 19:12
