Can lca summarize kmers counts be transformed to a corresponding fraction of the community? #1011

Closed
cesarcardonau opened this issue Jun 2, 2020 · 16 comments · Fixed by #1022

@cesarcardonau

Hi,
I am looking forward to using lca summarize to see the taxonomic composition of my metagenome.
From my first test runs and your documentation [1], it looks like I'm only seeing the number of k-mers that match each taxonomic level. I was hoping to see what relative fraction of the community is associated with each taxonomic level, similar to what I get from the f_unique_weighted column in the gather command; that column adds up to 1 (or very close to it) for each gather run.

My ultimate question: is there a way to go from lca summarize k-mer counts to the fraction of the community?
Thanks! Cesar

[1] https://sourmash.readthedocs.io/en/latest/command-line.html
"The output is space-separated and consists of three columns: the percentage of total k-mers that have this classification; the number of k-mers that have this classification; and the lineage classification"
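For reference, the property mentioned above (f_unique_weighted adding up to roughly 1 over a gather run) can be checked with a short sketch. The CSV text below is invented for illustration and real gather output has more columns; only the column names name and f_unique_weighted come from this thread.

```python
import csv
import io

# Hypothetical gather CSV excerpt (made-up values, not real results).
EXAMPLE_GATHER_CSV = """name,f_unique_weighted
genome_A,0.50
genome_B,0.30
genome_C,0.15
"""


def total_f_unique_weighted(csv_text):
    """Sum the f_unique_weighted column of a gather CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row["f_unique_weighted"]) for row in reader)


total = total_f_unique_weighted(EXAMPLE_GATHER_CSV)
# Whatever is left over was not assigned to any match.
print(f"assigned fraction: {total:.3f}; unassigned: {1 - total:.3f}")
```

With a real file, the same function could be fed the text of the CSV written by gather.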

@ctb
Contributor

ctb commented Jun 3, 2020

hi, I'm afraid we don't have that functionality implemented yet!

It's briefly mentioned here as one of the future directions for the LCA databases:

#581 (comment)

but I don't have a timeline for that work just yet. Let's leave this issue open and I'll update you when I know more.

@ctb
Contributor

ctb commented Jun 3, 2020

Hmm, you know... I think I could give you some code that takes your gather output and a lineage spreadsheet, and summarizes the f_unique_weighted column. The only tricky bit there is the lineage loading, and I have lots of code for that in various projects. It fits pretty well with stuff I want to add to sourmash anyway. I'll whip that up when I get a chance.

@luizirber
Member

I also have something that kind of does that =P

This is the relevant block of code that generates a dataframe with lineage information in gather2opal: https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py#L281L300

@cesarcardonau
Author

Thanks @luizirber and @ctb for your messages.
From your comments and the code, it seems that you are talking about reformatting the output of gather to display all taxonomic levels. That is not exactly what I need. In addition to getting the gather matches (those that are resolved to the species or strain level), I would like to see matches at higher taxonomic levels as well, for reads where the signatures are not enough to identify a specific strain or species but might still be informative at a higher taxonomic level, right? That is why I thought lca summarize was closer to providing this information, if we can infer the proportional read abundance for each group.

@ctb
Contributor

ctb commented Jun 5, 2020

hi @cesarcardonau yep, understood! I have a few thoughts --

first, very few nucleotide k-mers are informative beyond genus, as far as we can tell - see http://ivory.idyll.org/blog/2017-how-specific-kmers.html. A fair number of those that appear to be may in fact be due to contamination or taxonomic mistakes (see http://ivory.idyll.org/blog/2020-sourmash-gtdb-oddities.html for the source of this intuition).

second, now that we are using more complete / larger databases such as GTDB, I suspect that gather is going to be about as sensitive as LCA methods, and a lot more specific. we certainly do find that gather outperforms most taxonomic assignment software (I'm not sure which links @luizirber wants to point to; that work will be available soon).

third, some in-depth work we are doing in our charcoal project suggests that LCA-based identification is not super reliable.

fourth, and I guess the most practical point, is that we might not support the specific LCA feature you want for a bit :)

@ctb
Contributor

ctb commented Jun 5, 2020

I've implemented a simple script to do this reporting for gather results, attached as a .txt file. The key bit of code is this function, which:

  1. for each match, pulls the lineage back to the desired rank
  2. sums f_unique_weighted from the gather results
  3. reports.

from collections import defaultdict
import sourmash.lca

# get_ident and pop_to_rank are helpers that are not shown in this excerpt.
def summarize_gather_at(rank, tax_assign, gather_results):
    # collect: sum f_unique_weighted per lineage, truncated to the requested rank
    sum_uniq_weighted = defaultdict(float)
    for row in gather_results:
        match_ident = get_ident(row['name'])
        lineage = tax_assign[match_ident]
        lineage = pop_to_rank(lineage, rank)
        assert lineage[-1].rank == rank, lineage[-1]

        f_uniq_weighted = float(row['f_unique_weighted'])
        sum_uniq_weighted[lineage] += f_uniq_weighted

    # sort lineages by summed fraction, largest first
    items = sorted(sum_uniq_weighted.items(), key=lambda x: -x[1])

    # report: rank, summed fraction, lineage
    for k, v in items:
        print(rank, f'{v:.3f}', sourmash.lca.display_lineage(k))

To run the attached script, you'll need gather results, and a lineage spreadsheet in the style that sourmash lca index takes. We have such lineage files for a lot of NCBI available on request, as well as GTDB here. Code to build your own for your own subset of NCBI is here.

gather-to-tax.py.txt
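Since get_ident and pop_to_rank are not shown in the excerpt, here is a self-contained toy version of the same idea that runs without sourmash. Everything in it is an assumption made for illustration: the LineagePair stand-in, both helper implementations, the identifiers, and the numbers. The attached script may well do these steps differently.

```python
from collections import defaultdict, namedtuple

# Stand-in for a (rank, name) lineage entry; an assumption, not sourmash's class.
LineagePair = namedtuple("LineagePair", ["rank", "name"])


def get_ident(name):
    """Hypothetical helper: use the first space-separated token as the identifier."""
    return name.split(" ")[0]


def pop_to_rank(lineage, rank):
    """Hypothetical helper: truncate a lineage tuple right after the given rank."""
    for i, pair in enumerate(lineage):
        if pair.rank == rank:
            return tuple(lineage[: i + 1])
    raise ValueError(f"rank {rank!r} not found in lineage")


def summarize_at(rank, tax_assign, gather_rows):
    """Sum f_unique_weighted per lineage truncated to `rank`, largest first."""
    sums = defaultdict(float)
    for row in gather_rows:
        lineage = pop_to_rank(tax_assign[get_ident(row["name"])], rank)
        sums[lineage] += float(row["f_unique_weighted"])
    return sorted(sums.items(), key=lambda item: -item[1])


# Made-up lineages and gather rows.
tax_assign = {
    "id1": (LineagePair("superkingdom", "Bacteria"), LineagePair("phylum", "Firmicutes")),
    "id2": (LineagePair("superkingdom", "Bacteria"), LineagePair("phylum", "Proteobacteria")),
    "id3": (LineagePair("superkingdom", "Archaea"), LineagePair("phylum", "Euryarchaeota")),
}
gather_rows = [
    {"name": "id1 some genome", "f_unique_weighted": "0.25"},
    {"name": "id2 another genome", "f_unique_weighted": "0.125"},
    {"name": "id3 third genome", "f_unique_weighted": "0.0625"},
]

for lineage, value in summarize_at("superkingdom", tax_assign, gather_rows):
    print("superkingdom", f"{value:.3f}", ";".join(pair.name for pair in lineage))
```

Because lineages are truncated before summing, the two bacterial matches collapse into a single Bacteria row at the superkingdom level, while they would stay separate at the phylum level.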

@ctb
Contributor

ctb commented Jun 5, 2020

some example results here:

superkingdom 0.298 Bacteria
superkingdom 0.060 Archaea
phylum 0.099 Bacteria;Proteobacteria
phylum 0.042 Archaea;Euryarchaeota
phylum 0.025 Bacteria;Bacteroidetes
phylum 0.024 Bacteria;Firmicutes
phylum 0.024 Bacteria;Chlorobi
phylum 0.021 Bacteria;Chloroflexi
phylum 0.019 Bacteria;Actinobacteria
phylum 0.019 Archaea;Crenarchaeota
phylum 0.015 Bacteria;Cyanobacteria
phylum 0.013 Bacteria;Planctomycetes
phylum 0.012 Bacteria;Aquificae
phylum 0.009 Bacteria;Deinococcus-Thermus
phylum 0.008 Bacteria;Gemmatimonadetes
phylum 0.008 Bacteria;Thermotogae
phylum 0.008 Bacteria;Acidobacteria
phylum 0.005 Bacteria;Spirochaetes
phylum 0.005 Bacteria;Verrucomicrobia
phylum 0.004 Bacteria;Dictyoglomi
class 0.040 Bacteria;Proteobacteria;Betaproteobacteria
class 0.025 Bacteria;Bacteroidetes;Bacteroidia
class 0.024 Bacteria;Chlorobi;Chlorobia
class 0.023 Bacteria;Proteobacteria;Alphaproteobacteria
class 0.021 Bacteria;Chloroflexi;Chloroflexia
class 0.021 Bacteria;Firmicutes;Clostridia
class 0.019 Bacteria;Actinobacteria;Actinobacteria
class 0.019 Archaea;Crenarchaeota;Thermoprotei
class 0.017 Bacteria;Proteobacteria;Deltaproteobacteria
class 0.015 Bacteria;Proteobacteria;Gammaproteobacteria
class 0.015 Bacteria;Cyanobacteria;unassigned
class 0.013 Bacteria;Planctomycetes;Planctomycetia
class 0.012 Bacteria;Aquificae;Aquificae
class 0.010 Archaea;Euryarchaeota;Methanomicrobia
class 0.009 Archaea;Euryarchaeota;Methanococci
class 0.009 Bacteria;Deinococcus-Thermus;Deinococci
class 0.008 Bacteria;Gemmatimonadetes;Gemmatimonadetes
class 0.008 Bacteria;Thermotogae;Thermotogae
class 0.008 Bacteria;Acidobacteria;Acidobacteriia
class 0.007 Archaea;Euryarchaeota;Halobacteria
class 0.006 Archaea;Euryarchaeota;Thermococci
class 0.005 Bacteria;Spirochaetes;Spirochaetia
class 0.005 Bacteria;Verrucomicrobia;Verrucomicrobiae
class 0.004 Archaea;Euryarchaeota;Archaeoglobi
class 0.004 Bacteria;Proteobacteria;Epsilonproteobacteria
class 0.004 Bacteria;Dictyoglomi;Dictyoglomia
class 0.003 Archaea;Euryarchaeota;Methanopyri
class 0.003 Bacteria;Firmicutes;Bacilli
class 0.003 Archaea;Euryarchaeota;unassigned
order 0.036 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales
order 0.025 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales
order 0.024 Bacteria;Chlorobi;Chlorobia;Chlorobiales
order 0.019 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales
order 0.019 Bacteria;Actinobacteria;Actinobacteria;Micromonosporales
order 0.015 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales
order 0.015 Bacteria;Cyanobacteria;unassigned;Nostocales
order 0.014 Bacteria;Firmicutes;Clostridia;Thermoanaerobacterales
order 0.013 Bacteria;Planctomycetes;Planctomycetia;Planctomycetales
order 0.012 Bacteria;Aquificae;Aquificae;Aquificales
order 0.012 Bacteria;Chloroflexi;Chloroflexia;Herpetosiphonales
order 0.011 Archaea;Crenarchaeota;Thermoprotei;Thermoproteales
order 0.010 Bacteria;Proteobacteria;Deltaproteobacteria;Desulfovibrionales
order 0.010 Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales
order 0.010 Bacteria;Chloroflexi;Chloroflexia;Chloroflexales
order 0.009 Archaea;Euryarchaeota;Methanococci;Methanococcales
order 0.008 Bacteria;Gemmatimonadetes;Gemmatimonadetes;Gemmatimonadales
order 0.008 Bacteria;Thermotogae;Thermotogae;Thermotogales
order 0.008 Bacteria;Acidobacteria;Acidobacteriia;Acidobacteriales
order 0.007 Bacteria;Firmicutes;Clostridia;Clostridiales
order 0.007 Archaea;Euryarchaeota;Halobacteria;Haloferacales
order 0.007 Bacteria;Proteobacteria;Deltaproteobacteria;Desulfuromonadales
order 0.006 Archaea;Euryarchaeota;Thermococci;Thermococcales
order 0.006 Bacteria;Deinococcus-Thermus;Deinococci;Deinococcales
order 0.005 Bacteria;Spirochaetes;Spirochaetia;Spirochaetales
order 0.005 Archaea;Crenarchaeota;Thermoprotei;Sulfolobales
order 0.005 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales
order 0.005 Bacteria;Proteobacteria;Betaproteobacteria;Nitrosomonadales
order 0.004 Archaea;Euryarchaeota;Archaeoglobi;Archaeoglobales
order 0.004 Bacteria;Proteobacteria;Epsilonproteobacteria;Campylobacterales
order 0.004 Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales
order 0.004 Bacteria;Dictyoglomi;Dictyoglomia;Dictyoglomales
order 0.003 Archaea;Euryarchaeota;Methanopyri;Methanopyrales
order 0.003 Bacteria;Firmicutes;Bacilli;Lactobacillales
order 0.003 Bacteria;Deinococcus-Thermus;Deinococci;Thermales
order 0.003 Archaea;Euryarchaeota;unassigned;unassigned
order 0.002 Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales
family 0.024 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae
family 0.021 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae
family 0.019 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae
family 0.019 Bacteria;Actinobacteria;Actinobacteria;Micromonosporales;Micromonosporaceae
family 0.017 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Burkholderiaceae
family 0.015 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae
family 0.015 Bacteria;Cyanobacteria;unassigned;Nostocales;Nostocaceae
family 0.013 Bacteria;Planctomycetes;Planctomycetia;Planctomycetales;Planctomycetaceae
family 0.012 Bacteria;Chloroflexi;Chloroflexia;Herpetosiphonales;Herpetosiphonaceae
family 0.011 Archaea;Crenarchaeota;Thermoprotei;Thermoproteales;Thermoproteaceae
family 0.010 Bacteria;Proteobacteria;Deltaproteobacteria;Desulfovibrionales;Desulfovibrionaceae
family 0.010 Bacteria;Firmicutes;Clostridia;Thermoanaerobacterales;Thermoanaerobacterales Family III. Incertae Sedis
family 0.010 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Alcaligenaceae
family 0.010 Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae
family 0.010 Bacteria;Chloroflexi;Chloroflexia;Chloroflexales;Chloroflexaceae
family 0.009 Bacteria;Aquificae;Aquificae;Aquificales;Hydrogenothermaceae
family 0.009 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;unassigned
family 0.008 Bacteria;Gemmatimonadetes;Gemmatimonadetes;Gemmatimonadales;Gemmatimonadaceae
family 0.008 Bacteria;Thermotogae;Thermotogae;Thermotogales;Thermotogaceae
family 0.008 Bacteria;Acidobacteria;Acidobacteriia;Acidobacteriales;Acidobacteriaceae
family 0.007 Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae
family 0.007 Archaea;Euryarchaeota;Halobacteria;Haloferacales;Haloferacaceae
family 0.007 Bacteria;Proteobacteria;Deltaproteobacteria;Desulfuromonadales;Geobacteraceae
family 0.006 Archaea;Euryarchaeota;Thermococci;Thermococcales;Thermococcaceae
family 0.006 Bacteria;Deinococcus-Thermus;Deinococci;Deinococcales;Deinococcaceae
family 0.006 Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanococcaceae
family 0.005 Bacteria;Spirochaetes;Spirochaetia;Spirochaetales;Spirochaetaceae
family 0.005 Archaea;Crenarchaeota;Thermoprotei;Sulfolobales;Sulfolobaceae
family 0.005 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae
family 0.005 Bacteria;Proteobacteria;Betaproteobacteria;Nitrosomonadales;Nitrosomonadaceae
family 0.004 Archaea;Euryarchaeota;Archaeoglobi;Archaeoglobales;Archaeoglobaceae
family 0.004 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae
family 0.004 Bacteria;Proteobacteria;Epsilonproteobacteria;Campylobacterales;Helicobacteraceae
family 0.004 Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae
family 0.004 Bacteria;Dictyoglomi;Dictyoglomia;Dictyoglomales;Dictyoglomaceae
family 0.004 Bacteria;Firmicutes;Clostridia;Thermoanaerobacterales;Thermoanaerobacteraceae
family 0.003 Archaea;Euryarchaeota;Methanopyri;Methanopyrales;Methanopyraceae
family 0.003 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae
family 0.003 Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanocaldococcaceae
family 0.003 Bacteria;Aquificae;Aquificae;Aquificales;Aquificaceae
family 0.003 Bacteria;Deinococcus-Thermus;Deinococci;Thermales;Thermaceae
family 0.003 Archaea;Euryarchaeota;unassigned;unassigned;unassigned
family 0.002 Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae
genus 0.021 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides
genus 0.019 Bacteria;Actinobacteria;Actinobacteria;Micromonosporales;Micromonosporaceae;Salinispora
genus 0.017 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Burkholderiaceae;Paraburkholderia
genus 0.015 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Chlorobium
genus 0.015 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella
genus 0.015 Bacteria;Cyanobacteria;unassigned;Nostocales;Nostocaceae;Nostoc
genus 0.013 Bacteria;Planctomycetes;Planctomycetia;Planctomycetales;Planctomycetaceae;Rhodopirellula
genus 0.012 Bacteria;Chloroflexi;Chloroflexia;Herpetosiphonales;Herpetosiphonaceae;Herpetosiphon
genus 0.011 Archaea;Crenarchaeota;Thermoprotei;Thermoproteales;Thermoproteaceae;Pyrobaculum
genus 0.011 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae;Sulfitobacter
genus 0.010 Bacteria;Proteobacteria;Deltaproteobacteria;Desulfovibrionales;Desulfovibrionaceae;Desulfovibrio
genus 0.010 Bacteria;Firmicutes;Clostridia;Thermoanaerobacterales;Thermoanaerobacterales Family III. Incertae Sedis;Caldicellulosiruptor
genus 0.010 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Alcaligenaceae;Bordetella
genus 0.010 Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanosarcina
genus 0.010 Bacteria;Chloroflexi;Chloroflexia;Chloroflexales;Chloroflexaceae;Chloroflexus
genus 0.009 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;unassigned;Leptothrix
genus 0.009 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae;Ruegeria
genus 0.008 Bacteria;Gemmatimonadetes;Gemmatimonadetes;Gemmatimonadales;Gemmatimonadaceae;Gemmatimonas
genus 0.008 Bacteria;Thermotogae;Thermotogae;Thermotogales;Thermotogaceae;Thermotoga
genus 0.008 Bacteria;Acidobacteria;Acidobacteriia;Acidobacteriales;Acidobacteriaceae;Acidobacterium
genus 0.007 Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Ruminiclostridium
genus 0.007 Archaea;Euryarchaeota;Halobacteria;Haloferacales;Haloferacaceae;Haloferax
genus 0.007 Bacteria;Proteobacteria;Deltaproteobacteria;Desulfuromonadales;Geobacteraceae;Geobacter
genus 0.006 Archaea;Euryarchaeota;Thermococci;Thermococcales;Thermococcaceae;Pyrococcus
genus 0.006 Bacteria;Aquificae;Aquificae;Aquificales;Hydrogenothermaceae;Sulfurihydrogenibium
genus 0.006 Bacteria;Deinococcus-Thermus;Deinococci;Deinococcales;Deinococcaceae;Deinococcus
genus 0.006 Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanococcaceae;Methanococcus
genus 0.005 Bacteria;Spirochaetes;Spirochaetia;Spirochaetales;Spirochaetaceae;Treponema
genus 0.005 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Pelodictyon
genus 0.005 Archaea;Crenarchaeota;Thermoprotei;Sulfolobales;Sulfolobaceae;Sulfolobus
genus 0.005 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia
genus 0.005 Bacteria;Proteobacteria;Betaproteobacteria;Nitrosomonadales;Nitrosomonadaceae;Nitrosomonas
genus 0.004 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Chlorobaculum
genus 0.004 Archaea;Euryarchaeota;Archaeoglobi;Archaeoglobales;Archaeoglobaceae;Archaeoglobus
genus 0.004 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Porphyromonas
genus 0.004 Bacteria;Proteobacteria;Epsilonproteobacteria;Campylobacterales;Helicobacteraceae;Wolinella
genus 0.004 Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Zymomonas
genus 0.004 Bacteria;Dictyoglomi;Dictyoglomia;Dictyoglomales;Dictyoglomaceae;Dictyoglomus
genus 0.004 Bacteria;Firmicutes;Clostridia;Thermoanaerobacterales;Thermoanaerobacteraceae;Thermoanaerobacter
genus 0.003 Bacteria;Aquificae;Aquificae;Aquificales;Hydrogenothermaceae;Persephonella
genus 0.003 Archaea;Euryarchaeota;Methanopyri;Methanopyrales;Methanopyraceae;Methanopyrus
genus 0.003 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus
genus 0.003 Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanocaldococcaceae;Methanocaldococcus
genus 0.003 Bacteria;Aquificae;Aquificae;Aquificales;Aquificaceae;Hydrogenobaculum
genus 0.003 Bacteria;Deinococcus-Thermus;Deinococci;Thermales;Thermaceae;Thermus
genus 0.003 Archaea;Euryarchaeota;unassigned;unassigned;unassigned;Aciduliprofundum
genus 0.002 Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Ignicoccus
species 0.017 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Burkholderiaceae;Paraburkholderia;Paraburkholderia xenovorans
species 0.015 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica
species 0.015 Bacteria;Cyanobacteria;unassigned;Nostocales;Nostocaceae;Nostoc;Nostoc sp. PCC 7120
species 0.013 Bacteria;Planctomycetes;Planctomycetia;Planctomycetales;Planctomycetaceae;Rhodopirellula;Rhodopirellula baltica
species 0.012 Bacteria;Chloroflexi;Chloroflexia;Herpetosiphonales;Herpetosiphonaceae;Herpetosiphon;Herpetosiphon aurantiacus
species 0.011 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides thetaiotaomicron
species 0.010 Bacteria;Actinobacteria;Actinobacteria;Micromonosporales;Micromonosporaceae;Salinispora;Salinispora arenicola
species 0.010 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Alcaligenaceae;Bordetella;Bordetella bronchiseptica
species 0.010 Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanosarcina;Methanosarcina acetivorans
species 0.010 Bacteria;Chloroflexi;Chloroflexia;Chloroflexales;Chloroflexaceae;Chloroflexus;Chloroflexus aurantiacus
species 0.009 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides vulgatus
species 0.009 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;unassigned;Leptothrix;Leptothrix cholodnii
species 0.009 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae;Ruegeria;Ruegeria pomeroyi
species 0.009 Bacteria;Actinobacteria;Actinobacteria;Micromonosporales;Micromonosporaceae;Salinispora;Salinispora tropica
species 0.008 Bacteria;Gemmatimonadetes;Gemmatimonadetes;Gemmatimonadales;Gemmatimonadaceae;Gemmatimonas;Gemmatimonas aurantiaca
species 0.008 Bacteria;Acidobacteria;Acidobacteriia;Acidobacteriales;Acidobacteriaceae;Acidobacterium;Acidobacterium capsulatum
species 0.007 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae;Sulfitobacter;Sulfitobacter sp. NAS-14.1
species 0.007 Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Ruminiclostridium;Ruminiclostridium thermocellum
species 0.007 Archaea;Euryarchaeota;Halobacteria;Haloferacales;Haloferacaceae;Haloferax;Haloferax volcanii
species 0.007 Bacteria;Proteobacteria;Deltaproteobacteria;Desulfuromonadales;Geobacteraceae;Geobacter;Geobacter sulfurreducens
species 0.006 Bacteria;Deinococcus-Thermus;Deinococci;Deinococcales;Deinococcaceae;Deinococcus;Deinococcus radiodurans
species 0.006 Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanococcaceae;Methanococcus;Methanococcus maripaludis
species 0.006 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Chlorobium;Chlorobium phaeobacteroides
species 0.005 Bacteria;Firmicutes;Clostridia;Thermoanaerobacterales;Thermoanaerobacterales Family III. Incertae Sedis;Caldicellulosiruptor;Caldicellulosiruptor bescii
species 0.005 Bacteria;Spirochaetes;Spirochaetia;Spirochaetales;Spirochaetaceae;Treponema;Treponema denticola
species 0.005 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Chlorobium;Chlorobium limicola
species 0.005 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Pelodictyon;Pelodictyon phaeoclathratiforme
species 0.005 Bacteria;Proteobacteria;Deltaproteobacteria;Desulfovibrionales;Desulfovibrionaceae;Desulfovibrio;Desulfovibrio vulgaris
species 0.005 Archaea;Crenarchaeota;Thermoprotei;Sulfolobales;Sulfolobaceae;Sulfolobus;Sulfolobus tokodaii
species 0.005 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
species 0.005 Bacteria;Proteobacteria;Deltaproteobacteria;Desulfovibrionales;Desulfovibrionaceae;Desulfovibrio;Desulfovibrio piger
species 0.005 Bacteria;Firmicutes;Clostridia;Thermoanaerobacterales;Thermoanaerobacterales Family III. Incertae Sedis;Caldicellulosiruptor;Caldicellulosiruptor saccharolyticus
species 0.005 Bacteria;Proteobacteria;Betaproteobacteria;Nitrosomonadales;Nitrosomonadaceae;Nitrosomonas;Nitrosomonas europaea
species 0.004 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Chlorobaculum;Chlorobaculum tepidum
species 0.004 Archaea;Euryarchaeota;Archaeoglobi;Archaeoglobales;Archaeoglobaceae;Archaeoglobus;Archaeoglobus fulgidus
species 0.004 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Porphyromonas;Porphyromonas gingivalis
species 0.004 Bacteria;Proteobacteria;Epsilonproteobacteria;Campylobacterales;Helicobacteraceae;Wolinella;Wolinella succinogenes
species 0.004 Archaea;Crenarchaeota;Thermoprotei;Thermoproteales;Thermoproteaceae;Pyrobaculum;Pyrobaculum aerophilum
species 0.004 Archaea;Crenarchaeota;Thermoprotei;Thermoproteales;Thermoproteaceae;Pyrobaculum;Pyrobaculum arsenaticum
species 0.004 Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Zymomonas;Zymomonas mobilis
species 0.004 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Chlorobium;Chlorobium phaeovibrioides
species 0.004 Bacteria;Thermotogae;Thermotogae;Thermotogales;Thermotogaceae;Thermotoga;Thermotoga neapolitana
species 0.004 Bacteria;Dictyoglomi;Dictyoglomia;Dictyoglomales;Dictyoglomaceae;Dictyoglomus;Dictyoglomus turgidum
species 0.004 Bacteria;Firmicutes;Clostridia;Thermoanaerobacterales;Thermoanaerobacteraceae;Thermoanaerobacter;Thermoanaerobacter pseudethanolicus
species 0.004 Archaea;Crenarchaeota;Thermoprotei;Thermoproteales;Thermoproteaceae;Pyrobaculum;Pyrobaculum calidifontis
species 0.003 Bacteria;Aquificae;Aquificae;Aquificales;Hydrogenothermaceae;Sulfurihydrogenibium;Sulfurihydrogenibium sp. YO3AOP1
species 0.003 Bacteria;Aquificae;Aquificae;Aquificales;Hydrogenothermaceae;Persephonella;Persephonella marina
species 0.003 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae;Sulfitobacter;Sulfitobacter sp. EE-36
species 0.003 Archaea;Euryarchaeota;Methanopyri;Methanopyrales;Methanopyraceae;Methanopyrus;Methanopyrus kandleri
species 0.003 Archaea;Euryarchaeota;Thermococci;Thermococcales;Thermococcaceae;Pyrococcus;Pyrococcus furiosus
species 0.003 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis
species 0.003 Bacteria;Thermotogae;Thermotogae;Thermotogales;Thermotogaceae;Thermotoga;Thermotoga petrophila
species 0.003 Archaea;Euryarchaeota;Thermococci;Thermococcales;Thermococcaceae;Pyrococcus;Pyrococcus horikoshii
species 0.003 Archaea;Euryarchaeota;Methanococci;Methanococcales;Methanocaldococcaceae;Methanocaldococcus;Methanocaldococcus jannaschii
species 0.003 Bacteria;Aquificae;Aquificae;Aquificales;Aquificaceae;Hydrogenobaculum;Hydrogenobaculum sp. Y04AAS1
species 0.003 Bacteria;Deinococcus-Thermus;Deinococci;Thermales;Thermaceae;Thermus;Thermus thermophilus
species 0.003 Archaea;Euryarchaeota;unassigned;unassigned;unassigned;Aciduliprofundum;Aciduliprofundum boonei
species 0.003 Bacteria;Aquificae;Aquificae;Aquificales;Hydrogenothermaceae;Sulfurihydrogenibium;Sulfurihydrogenibium yellowstonense
species 0.002 Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Ignicoccus;Ignicoccus hospitalis
species 0.001 Bacteria;Thermotogae;Thermotogae;Thermotogales;Thermotogaceae;Thermotoga;Thermotoga sp. RQ2
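One quick sanity check on a listing like this is to total the fractions within each rank; for example, the two superkingdom rows above sum to 0.358. A minimal parsing sketch follows, assuming the three-field layout shown (rank, fraction, semicolon-separated lineage). The split is limited to two cuts because lineage names such as "Nostoc sp. PCC 7120" contain spaces. The function names are made up for this example.

```python
from collections import defaultdict

# A few rows copied from the listing above.
REPORT = """\
superkingdom 0.298 Bacteria
superkingdom 0.060 Archaea
species 0.015 Bacteria;Cyanobacteria;unassigned;Nostocales;Nostocaceae;Nostoc;Nostoc sp. PCC 7120
"""


def parse_report(text):
    """Parse 'rank fraction lineage' lines into (rank, float, [names]) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice so spaces inside the lineage are preserved.
        rank, value, lineage = line.split(None, 2)
        rows.append((rank, float(value), lineage.split(";")))
    return rows


def totals_by_rank(rows):
    """Sum the reported fractions for each rank."""
    totals = defaultdict(float)
    for rank, value, _lineage in rows:
        totals[rank] += value
    return dict(totals)


rows = parse_report(REPORT)
for rank, total in totals_by_rank(rows).items():
    print(rank, f"{total:.3f}")
```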

@ctb
Contributor

ctb commented Jun 7, 2020

so... I think I totally misunderstood this issue because I was focused on something else 😂

It is relatively straightforward to support queries into LCA databases with signatures that have abundances. Totally different problem than having LCA databases support tracking of abundances, which is what we're discussing in #581 and doing in #1015

My apologies. No promises, but this should be a relatively short hack...

@cesarcardonau
Author

Thanks @ctb, this sounds promising.
Can you clarify what you meant by "It is relatively straightforward to support queries into LCA databases with signatures that have abundances"? Did you mean that the query signatures have abundances, or that the LCA database has abundances? My queries do have signatures with abundances (from sourmash compute), so if that is the case, it seems you might have a solution that provides the annotations from the LCA databases! I don't think abundances at the database level are necessary, right?

@ctb
Contributor

ctb commented Jun 8, 2020

yes, exactly! Sorry, I was just confused!

@cesarcardonau
Author

great, we are on the same page! thanks for clarifying!

@ctb
Contributor

ctb commented Jun 13, 2020

Working on this in #1022. Not quite as easy as I'd expected - there seem to be multiple places where abundances are effectively flattened! - but I've got some tests set up and will work through it.
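To illustrate what "flattened" means here, a toy sketch: if each hash in the query counts once regardless of how often it was seen, the abundance information is lost. This is not sourmash's implementation; the hash values, abundances, lineage assignments, and the summarize function are all invented for the example.

```python
from collections import Counter

# Hypothetical query signature: hash value -> abundance (times the k-mer was seen).
query_abundances = {101: 5, 102: 5, 103: 1, 104: 1, 105: 1}

# Hypothetical LCA assignment for each hash.
hash_to_lineage = {
    101: "Archaea",
    102: "Archaea",
    103: "Bacteria",
    104: "Bacteria",
    105: "Bacteria",
}


def summarize(abundances, assignments, with_abundance):
    """Return the fraction per lineage, counting each hash once or by abundance."""
    counts = Counter()
    for hashval, abund in abundances.items():
        lineage = assignments.get(hashval)
        if lineage is None:
            continue  # hash not present in the database
        counts[lineage] += abund if with_abundance else 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lineage: count / total for lineage, count in counts.items()}


flat = summarize(query_abundances, hash_to_lineage, with_abundance=False)
weighted = summarize(query_abundances, hash_to_lineage, with_abundance=True)
print("flattened:", flat)
print("weighted: ", weighted)
```

In the flattened view Archaea gets 2 of 5 hashes; weighted by abundance it gets 10 of 13, so any step that drops the abundances silently changes the answer.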

@ctb
Contributor

ctb commented Jun 15, 2020

hi @cesarcardonau I've got something that looks right to me in #1022. I have some more tests to write, but in the meantime it should be ready for use if you want to try it out; you'd need to install that branch in developer mode, per these instructions. Let me know if you are interested and need help getting that to work.

Alternatively we will probably cut a new release (3.3.1?) in the next few weeks and this should be in it by then.

@cesarcardonau
Author

Thanks so much, @ctb. I am okay waiting for the next release. I appreciate you guys prioritizing this issue and giving such a quick turnaround. Excited to try it in a few weeks or so.

@ctb
Contributor

ctb commented Jun 16, 2020

@cesarcardonau and @taylorreiter here are some attempts by me to ground-truth the calculation. Feedback welcome!

sourmash lca summarize --with-abundance

HMP signature from sample G36354

Using the GTDB database, gather reports the following genomic breakdown (computed with the custom script from the comment above, which aggregates the taxonomy of gather matches to higher levels):

class 0.323 d__Bacteria;p__Firmicutes;c__Bacilli
class 0.021 d__Bacteria;p__Firmicutes_A;c__Clostridia
class 0.018 d__Bacteria;p__Bacteroidota;c__Bacteroidia
class 0.006 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria

vs sourmash lca summarize --with-abundance:

32.1%  1080   p__Firmicutes;c__Bacilli
1.7%     58   p__Bacteroidota;c__Bacteroidia
2.1%     70   p__Firmicutes_A;c__Clostridia
0.6%     20   p__Proteobacteria;c__Gammaproteobacteria

So we see that these reports are pretty consonant in numbers, even for pretty disparate types of analyses!
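The agreement can be made explicit by putting the two reports on the same scale. The numbers below are copied from the two listings above; the dictionary and variable names are made up for this sketch.

```python
# Fractions from the gather-based summary (class level).
gather_fraction = {
    "c__Bacilli": 0.323,
    "c__Clostridia": 0.021,
    "c__Bacteroidia": 0.018,
    "c__Gammaproteobacteria": 0.006,
}

# Percentages from sourmash lca summarize --with-abundance.
lca_percent = {
    "c__Bacilli": 32.1,
    "c__Clostridia": 2.1,
    "c__Bacteroidia": 1.7,
    "c__Gammaproteobacteria": 0.6,
}

# Absolute difference per class, in percentage points.
diffs = {
    name: abs(gather_fraction[name] * 100 - lca_percent[name])
    for name in gather_fraction
}
max_diff = max(diffs.values())

for name, diff in sorted(diffs.items(), key=lambda item: -item[1]):
    print(f"{name:25s} {diff:.1f} percentage points")
```

For these four classes the largest gap is about 0.2 percentage points (Bacilli).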

Fake/simulated abundances

I built a fake signature from 5x one genome, 1x another genome, like so:

sourmash compute -k 31 --track-abundance ../podar-ref/1.fa ../podar-ref/1.fa \
    ../podar-ref/1.fa ../podar-ref/1.fa ../podar-ref/1.fa ../podar-ref/63.fa \
    --merge "query" -o xxx.sig --scaled=1000

Note here that 1.fa, an archaeon, is about 1/4 the size of 63.fa, a bacterium --

% wc ../podar-ref/{1,63}.fa
   21242   21246 1508077 ../podar-ref/1.fa
   66992   67018 5426146 ../podar-ref/63.fa
   88234   88264 6934223 total

that is, one genome (archaea) is 22% of the DNA while the other (bacteria) is 78% of the DNA.

>>> size1 = 1508077
>>> size2 = 5426146
>>> size1 / (size1 + size2)
0.21748319891067824
>>> size2 / (size1 + size2)
0.7825168010893218

This is reflected in the output of sourmash lca summarize at the superkingdom level ---

% sourmash lca summarize --db ../podar-ref.lca.json.gz --query xxx.sig 
79.6%   550   Bacteria
20.4%   141   Archaea

so that matches well, in terms of ratios of k-mers!

When you add --with-abundance, you see:

% sourmash lca summarize --db ../podar-ref.lca.json.gz --query xxx.sig  --with-abundance
43.2%   563   Bacteria
56.8%   740   Archaea

which (approximately) matches the math weighting the first genome by a factor of five --

>>> size1 = 1508077
>>> size2 = 5426146
>>> size1_weighted = size1*5
>>> total = size1_weighted + size2

>>> f_1 = size1_weighted / total
>>> f_2 = size2 / total
>>> f_1
0.5815267784421292
>>> f_2
0.4184732215578708

huzzah! This all seems to make sense to me.
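The arithmetic above generalizes to any mix of genomes and copy numbers. A small sketch, using the file sizes and k-mer counts reported above; the function name and the dictionary layout are assumptions made for this example.

```python
def expected_fractions(genome_sizes, copies):
    """Expected share of DNA per genome, given sizes and per-genome copy numbers.

    Genomes missing from `copies` are assumed to be present once.
    """
    weighted = {
        name: size * copies.get(name, 1) for name, size in genome_sizes.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


# File sizes in bytes, from the wc output above.
sizes = {"1.fa": 1508077, "63.fa": 5426146}

unweighted = expected_fractions(sizes, {})
weighted = expected_fractions(sizes, {"1.fa": 5})

# Observed Archaea share from the --with-abundance output above.
observed_archaea = 740 / (740 + 563)

print(f"expected Archaea, unweighted: {unweighted['1.fa']:.3f}")
print(f"expected Archaea, 5x:         {weighted['1.fa']:.3f}")
print(f"observed Archaea, 5x:         {observed_archaea:.3f}")
```

The observed share is close to the expectation without matching it exactly, which fits the "approximately" above.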


This is something I'll add into the docs in #1022, would love to entertain any questions you have!

@taylorreiter
Contributor

Note that the gather-to-tax.py script that @ctb attached above will fail if the gather output has matches from multiple databases and the lineage file has not been combined across those databases. It fails on line 88, at assert n_missed == 0.
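One way to spot this before running the script is to list the gather matches whose identifiers have no entry in the lineage table. The sketch below is a hypothetical pre-check, not part of the attached script; the identifier extraction and the example data are assumptions.

```python
def find_missing_idents(gather_rows, tax_assign, get_ident=None):
    """Return identifiers from gather rows that have no lineage, in first-seen order."""
    if get_ident is None:
        # Assumed convention: the identifier is the first space-separated token.
        get_ident = lambda name: name.split(" ")[0]

    missing = []
    for row in gather_rows:
        ident = get_ident(row["name"])
        if ident not in tax_assign and ident not in missing:
            missing.append(ident)
    return missing


# Made-up example: one match comes from a database not covered by the lineage file.
tax_assign = {"id1": "Bacteria;Firmicutes", "id2": "Archaea;Euryarchaeota"}
gather_rows = [
    {"name": "id1 some genome"},
    {"name": "idX genome from another database"},
    {"name": "id2 third genome"},
]

missing = find_missing_idents(gather_rows, tax_assign)
print("matches without a lineage:", missing)
```

An empty list would suggest that the lineage file covers every database used in the gather run.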
