Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata - Add taxonomic breakdown to sourmash output #195

Closed
brooksph opened this issue Apr 25, 2017 · 20 comments
Closed

Metadata - Add taxonomic breakdown to sourmash output #195

brooksph opened this issue Apr 25, 2017 · 20 comments

Comments

@brooksph
Copy link
Contributor

brooksph commented Apr 25, 2017

@halexand suggested adding taxonomic breakdowns to sbt signatures and subsequently, gather output. This could facilitate downstream analysis at different taxonomic levels. For example, kraken does d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli

@ctb
Copy link
Contributor

ctb commented Apr 25, 2017 via email

@luizirber
Copy link
Member

Something like kraken-build --download-taxonomy --db $DBNAME can be a start (from http://ccb.jhu.edu/software/kraken/MANUAL.html)

Downloads data from ftp://ftp.ncbi.nih.gov/pub/taxonomy/, might need to figure out what they do with it.

@halexand
Copy link
Contributor

@luizirber suggested piggybacking off of the Karen tree information. http://ccb.jhu.edu/software/kraken/MANUAL.html

E.g. taxonomy/nodes.dmp + taxonomy/names.dmp: Taxonomy names might have good information? I can look more deeply.

My slightly more Luddite and straight forward idea was that we might just snag the tax id information from the metadata and use that as a means of pulling out phylogeny. You can get the full lineage from this file which is apparently updated regularly. So... that might be made into a sort of db to go alongside the sbt?

Info on taxon id db: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt
Actual download: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.Z

@halexand
Copy link
Contributor

Alright, so as I understand them from reading over the files quickly...

nodes.dmp: for any given tax ID it provides the information to identify the 'parental' tax-id (in addition to other things e.g. rank level for the node id)
names.dmp: gives the actual name for the tax ID.

So, if we take the example of Prochlorococcus marinus (Tax ID 1041938). The look up in names.dmp provides the different strains of Prochlorococcus. Looking up in nodes.dmp provides the genus id 1218 which if you look up in names.dmp is Prochlorococcus.

One problem/difficulty that immediately jumps out at me is that there are multiple different entries for a given tax id sometimes. Because why choose just one???

1213 | "Prochlorales" Lewin 1977 | | synonym |
1213 | Prochloraceae | | scientific name |
1213 | Prochloraceae (ex Lewin 1977) Florenzano et al. 1986 | | authority |
1213 | Prochloraceae Lewin 1977 | | authority |
1213 | Prochlorales | | synonym |
1213 | Prochlorales (ex Lewin 1977) Florenzano et al. 1986 emend. Burger-Wiersma et al. 1989 | | authorit|
1213 | Prochlorococcaceae | | synonym |
1213 | prochlorophytes | | common name |

@ctb
Copy link
Contributor

ctb commented May 13, 2017

Asked Zach Foster about what kind of info and output would be most useful for packages like metacoder.

@ctb
Copy link
Contributor

ctb commented May 14, 2017

talking to @meren to find out what they need for anvi'o and to see if we can swipe some code, too.

@meren
Copy link

meren commented May 14, 2017

I've been using something like this to build my own database here (just isolated this function from a larger context):

import sys
import requests

from xml.etree import ElementTree

taxonomy = ['superkingdom', 'phylum', 'order', 'class', 'family', 'genus', 'species']

def get_lineage(tax_id):
    response = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=%s' % str(tax_id))

    tree = ElementTree.fromstring(response.content)

    lineage = dict(zip(taxonomy, [None] * len(taxonomy)))

    try:
        taxon = tree.findall('Taxon')[0]
    except:
        return lineage

    for e in taxon.findall('LineageEx')[0].findall('Taxon'):
        rank = e.find('Rank').text
        if rank in lineage:
            lineage[rank] = e.find('ScientificName').text.replace(' ', '_')

    return lineage

if __name__ == '__main__':
    print(get_lineage(sys.argv[1]))

Here:

$ python get_lineage.py 665953
{'superkingdom': 'Bacteria', 'family': 'Bacteroidaceae', 'class': 'Bacteroidia', 'phylum': 'Bacteroidetes', 'species': 'Bacteroides_eggerthii', 'genus': 'Bacteroides', 'order': 'Bacteroidales'}

This is just to play, of course. There are much better ways to deal with this issue in general. But requires lots of communication and logistics :/

@zachary-foster
Copy link

zachary-foster commented May 15, 2017

Hi @ctb, thanks for asking.

what kind of info and output would be most useful for packages like metacoder.

Currently, metacoder is best at parsing any text-based format that is 1) sequential and 2) encodable by regular expressions. By these terms I mean:

  1. Repeating units in the same format representing taxonomic classifications and associated information.
  2. All pieces of information that should be preserved during parsing can be defined by capture groups (parentheses) in regular expressions.

For examples of this, I suggest looking at our parsing tutorial: https://grunwaldlab.github.io/metacoder_documentation/vignettes--01--extracting_taxonomy_data.html

However, these examples are for parsing FASTA headers, not tables of information.
I dont know what kind of information you are looking to plot or how you want your output files to be formatted yet.
If they are tables, then there is a function called parse_taxonomy_table, that is just a wrapper for extract_taxonomy, that will get the taxonomic information for each row from one column and save the rest of the data in the parsed object and associate it with the taxonomy. That column could have nearly any format, but the easiest to parameterize would be something like:

Fungi;Ascomycota;Leotiomycetes;Helotiales...
or

k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales...

I expect that I will be rewriting parse_taxonomy_table in the near future to be simpler and more flexible. When I do that I plan on adding convenience wrappers for parse_taxonomy_table that parse common formats. I could easily add one for sourmash once I get an idea of what output format you decide on. I expect that whatever format you choose, metacoder will be able to parse it, but some formats will be easier for users to figure out the correct parameters than others. The simplest would be a table with one column that has classifications like Fungi;Ascomycota;Leotiomycetes;Helotiales and any number of other columns with any kind of information you like. The data from the other columns will be available in the parsed object for plotting/filtering/whatever.

Hope that helps. Feel free to ask any questions if something does not make sense. I could say more if I had a better idea of the information you want to output.

@taylorreiter
Copy link
Contributor

RE krona & NCBI taxonomy numbers:

I think supporting krona output is not necessarily the way to go. Perhaps instead outputting NCBI taxonomy number would be a better use of time, so that it's easy for people to feed their results in to a lot of downstream analyses, not just krona.

I like the Krona visualization, but it is glitchy with large-ish datasets (RNAseq, 4M reads).

Originally posted #174

@ctb
Copy link
Contributor

ctb commented May 28, 2017

ok given that we want to do this for 100,000 entries I'm going to mash up @halexand and @meren suggestions.

On MSU HPC under /mnt/home/ctb/research/tax, I have code to produce a leveldb of the access2taxid files (acc2taxid-to-leveldb.py) and under the subdir taxdump the following script seems to roughly work --

 % python -i parse.py nodes.dmp names.dmp
('Shewanella baltica OS223', '', 'scientific name')
('Shewanella baltica', '', 'scientific name')
('Shewanella', '', 'scientific name')
('Shewanellaceae', '', 'scientific name')
('Alteromonadales', '', 'scientific name')
('Gammaproteobacteria', '', 'scientific name')
('Proteobacteria', '', 'scientific name')
('Bacteria', 'Bacteria <prokaryotes>', 'scientific name')
('cellular organisms', '', 'scientific name')

My current plan is to:

  • finish producing the leveldb for the WGS and Genbank Nucleotide databases, so that we can quickly look up the taxid for a given acc;

  • make some form of db dump (maybe another leveldb?) for the much smaller nodes.dmp and names.dmp files;

  • extract all the accessions from our current genbank signatures;

  • write a CSV file that lets us take accession -> lineage like @meren's code above;

  • produce an ad hoc script that may suit our present needs for going from gather/search -> lineage;

...and then we'll at least have the code to annotate the signatures when we figure out what the right metadata approach is. Probably this will end up in the utils submodule as a set of extra commands...

@taylorreiter what's the input format for krona again? thx!

@ctb
Copy link
Contributor

ctb commented May 29, 2017

ok, I can now get output like this:

CP001056,935198,Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium botulinum
CP001072,512562,Bacteria;Proteobacteria;Epsilonproteobacteria;Campylobacterales;Helicobacteraceae;Helicobacter;Helicobacter pylori
CP000805,455434,Bacteria;Spirochaetes;Spirochaetia;Spirochaetales;Spirochaetaceae;Treponema;Treponema pallidum
CP000934,498211,Bacteria;Proteobacteria;Gammaproteobacteria;Cellvibrionales;Cellvibrionaceae;Cellvibrio;Cellvibrio japonicus
CU469464,37692,Bacteria;Tenericutes;Mollicutes;Acholeplasmatales;Acholeplasmataceae;Candidatus Phytoplasma;Candidatus Phytoplasma mali
CP001132,380394,Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans
CP000964,507522,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae

yay w00t.

@ctb
Copy link
Contributor

ctb commented May 29, 2017

too much detail but my memory is weak:

# in genbank SBT directory: extract accession info from genbank
find . \! -name \*internal\* -exec grep -h \"name\" {} \; > ../genbank-names.txt

# in /mnt/home/ctb/research/tax: connect accession info to taxid using REALLY BIG files
python acc2taxid.py ../genbank-k31-acc.txt nucl_gb.accession2taxid nucl_wgs.accession2taxid.gz 

# in /mnt/home/ctb/research/tax/taxdump: output CSV file with accession, taxid, and ';'-separated lineage
python do-the-csv.py nodes.dmp names.dmp ../../genbank-k31-acc.txt.taxid

@ctb
Copy link
Contributor

ctb commented May 29, 2017

CSV file genbank-genomes-accession+lineage-20170529.csv.gz containing 93,209 accessions + taxid + lineage in this format:

CP001132,380394,Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans
CP000964,507522,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae

now available at:

https://osf.io/28ymr/

Note that approximately 40 accessions in our genbank SBTs do not seem to appear in either nucl_wgs.accession2taxid or nucl_gb.accession2taxid and so are not represented in this CSV file. shrug

@ctb
Copy link
Contributor

ctb commented May 29, 2017

We'll work on utilities to extract and correlate this info with sourmash output in useful ways. In the meantime special thx to @meren for providing some useful code!

@ctb
Copy link
Contributor

ctb commented May 29, 2017

@ctb
Copy link
Contributor

ctb commented Jun 3, 2017

A few more thoughts on taxonomy:

in the metadata block for signatures generated from NCBI, we should definitely have a way to include accession(s) of included sequences and taxid(s) of included sequences. I do now think we should also provide an 'ncbi-tax-lineage' entry that encodes something standard like Bacteria;Proteobacteria;Acidithiobacillia;Acidithiobacillales;Acidithiobacillaceae;Acidithiobacillus;Acidithiobacillus ferrooxidans because this really is unlikely to change and is quite useful to have ready to hand.

@ctb
Copy link
Contributor

ctb commented Aug 26, 2017

We've made quite a bit of headway on this issue, most recently with the sourmash LCA stuff (see gist and ncbi_taxdump_utils specifically. Just wanted to link that in here :)

@ctb
Copy link
Contributor

ctb commented Sep 10, 2017

metacodeR viz working: http://ivory.idyll.org/blog/2017-classify-genome-bins-with-custom-db-part-1.html

@ctb
Copy link
Contributor

ctb commented Sep 28, 2017

a tutorial for metacodeR analysis on gather output: https://hackmd.io/EYMwTAhgHGwMwFoCmcAsqGoAwGMwIggE4ATBKHCVANiSxMpziA==

@ctb
Copy link
Contributor

ctb commented Feb 18, 2018

sourmash lca and https://github.com/dib-lab/2018-ncbi-lineages now handle various bits of this. sourmash lca is the way forward for taxonomy in sourmash, at least for now.

@ctb ctb closed this as completed Feb 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants