How to download all genomes that belong to a specific taxa? #68

Open
johanneswerner opened this issue Jan 28, 2021 · 8 comments

@johanneswerner

Hello,

cool package, thank you. :-)

Is it possible to download all genomes that belong to a certain genus or family? As far as I can see, I cannot use the taxid of the genus or family, because that taxid has no genome of its own and carries no information about its descendant taxa.

I first tried the genus name, but in the case of Enterococcus it also returns lots of Enterococcus phage genomes, which I do not want to download here.

Thank you for your help.

@HajkD
Member

HajkD commented Jan 28, 2021

Hi Johannes,

since this is a taxonomic classification issue, did you by any chance have a look at the taxize package and see what happens there when you insert your particular case?

Since biomartr is solely relying on internal data provided by NCBI, could you also check what kind of records they store for your example? This will help me find strategies to capture such particular cases.

Many thanks,
Hajk

@johanneswerner
Author

I have not worked with taxize yet. I also got the idea from the ncbi-genome-download package.

They have a Python script (which uses ete3) that gets the descendant taxa for a specific taxid, see here.

The script is available here: https://github.com/kblin/ncbi-genome-download/blob/master/contrib/gimme_taxa.py

In any case, I will look into taxize; maybe I will get an idea from there as well. Thank you for the suggestion.
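
For readers who want to try the ete3 route mentioned above, here is a minimal sketch. It assumes ete3 is installed and uses its `NCBITaxa.get_descendant_taxa` method; the helper name `descendant_taxids` and the example taxid 1350 (Enterococcus) are illustrative choices, not taken from gimme_taxa.py.

```python
def descendant_taxids(ncbi, taxid, leaves_only=True):
    """Return the sorted taxids below `taxid`.

    `ncbi` is expected to behave like ete3's NCBITaxa object.
    With leaves_only=True only terminal taxa are returned; otherwise
    intermediate ranks are included as well.
    """
    taxa = ncbi.get_descendant_taxa(taxid, intermediate_nodes=not leaves_only)
    return sorted(int(t) for t in taxa)


# Example usage (not executed here; the first call downloads the NCBI taxdump):
#   from ete3 import NCBITaxa
#   ncbi = NCBITaxa()
#   print(descendant_taxids(ncbi, 1350))
```

The resulting taxids could then be passed one by one to whatever download function is in use, which sidesteps the name-based search that also matches phages.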

@johanneswerner
Author

Possibly linked with #6 (comment)

@johanneswerner
Author

shenwei356/taxonkit#41

@johanneswerner
Author

From the referenced issue

I wrote Python bindings for TaxonKit recently, if you’d like to use that as a reference. I primarily used the Popen construct in Python’s subprocess library to call TaxonKit, and then loaded the output into dataframes (pandas) for handling in Python.

https://github.com/bioforensics/pytaxonkit/blob/9746225b1c0a9eff708790037e3b53e5d45ac235/pytaxonkit.py#L203-L211

It’s been a long time since I did any serious work in R, so I’m not sure what the best tools are for system calls. But I imagine R’s native dataframes would be suitable for storing most results.

I hope this helps.

I fear this is not going to be pretty. It is not that using system() in R to invoke shell commands is hard, but I seriously doubt that it will work platform-independently (especially on Windows). @HajkD I would certainly be open to suggestions (unless it is okay not to support Windows ...). If the conda recipe is still going to be built, we can add the functionality and add taxonkit as a requirement.
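
As a rough illustration of the subprocess approach described in the quoted comment, the sketch below calls `taxonkit list` and parses its output. It assumes TaxonKit is installed and its taxonomy data is set up; the function names are hypothetical and this is not the pytaxonkit implementation.

```python
import shutil
import subprocess


def parse_taxid_lines(text):
    """Extract the leading integer taxid from each non-empty output line."""
    taxids = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        token = stripped.split()[0]
        if token.isdigit():
            taxids.append(int(token))
    return taxids


def list_descendants(taxid, taxonkit="taxonkit"):
    """Run `taxonkit list --ids <taxid>` and return the taxids it prints.

    The output includes the query taxid itself, followed by its subtree.
    """
    exe = shutil.which(taxonkit)  # also resolves taxonkit.exe on Windows
    if exe is None:
        raise FileNotFoundError(f"{taxonkit} not found on PATH")
    result = subprocess.run(
        [exe, "list", "--ids", str(taxid), "--indent", ""],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_taxid_lines(result.stdout)
```

Passing the arguments as a list avoids going through a shell, which removes one source of platform differences, although the binary still has to be available on each system.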

@mr-eyes

mr-eyes commented Feb 6, 2022

This might help! I am getting accessions for the provided organism name/taxon or the closest one in the lineage.

https://gist.github.com/mr-eyes/92d6172c7a5c7d5bd35fcff6f765d48d

@zachary-foster

Hello All,

Saw this issue on Slack. Not sure if this is helpful or not, but I have done this with the Entrez E-utilities command-line tools. I imagine the same functionality exists in the rentrez package for R? Here is the script we used. For our purpose, it was in two steps: 1) get a CSV of info for all genomes available for each taxon, 2) download a subset of them. This is from a nextflow module, so the bash might look a bit odd, but it should give you the idea.

  1. Make CSV with info for all genomes for a given taxon:

     esearch -db taxonomy -query "${taxon} OR ${taxon}[subtree]" | \\
         elink -target assembly | \\
         efilter -query "latest[PROP] AND full-genome-representation[PROP] AND has-annotation[PROP] NOT excluded-from-refseq[PROP]" | \\
         efetch -format docsum | \\
         xtract -pattern DocumentSummary -def 'NA' -element \$COLS >> \\
         ${prefix}.tsv

  2. Download genome for each row in CSV:

     datasets download genome accession $id --include gff3,rna,cds,protein,genome,seq-report --filename ${prefix}.zip
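
The second step can be sketched as a small loop over the table produced in step 1. The sketch below only builds the `datasets download genome accession` command lines shown above. It assumes the accession sits in the first column (the contents of `$COLS` are not shown in the thread) and names each archive after its accession; both are illustrative assumptions, as is the function name.

```python
import csv
import io

# Same --include list as in the command above.
INCLUDE = "gff3,rna,cds,protein,genome,seq-report"


def download_commands(tsv_text, accession_column=0):
    """Build one `datasets` argument list per usable row of the TSV text."""
    commands = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) <= accession_column:
            continue  # skip empty or short rows
        accession = row[accession_column].strip()
        if not accession or accession == "NA":
            continue  # 'NA' is the placeholder used by xtract -def 'NA'
        commands.append([
            "datasets", "download", "genome", "accession", accession,
            "--include", INCLUDE,
            "--filename", f"{accession}.zip",  # assumption: one zip per accession
        ])
    return commands


# To actually download, each command could be run with, for example:
#   import subprocess
#   subprocess.run(cmd, check=True)
```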

@salix-d

salix-d commented Nov 2, 2023

The taxize package should be able to get the children, but since you mentioned:

Since biomartr is solely relying on internal data provided by NCBI

Do you get the taxdump?
If so, with names.dmp & nodes.dmp you can build an SQLite db that makes it easy to get the parents/children of an accession/tax id.
The taxonomizr package does this.
It might also be how taxize does it, I don't remember.
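
To make the SQLite idea concrete, here is a minimal sketch that loads nodes.dmp-style lines into a table and walks down the tree with a recursive query. It is not how taxonomizr or taxize implement it; the table layout and function names are assumptions, and the sample rows are a simplified toy tree (999999 is a made-up phage taxid).

```python
import sqlite3


def load_nodes(conn, lines):
    """Store (tax_id, parent_id, rank) from nodes.dmp-style lines."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes "
        "(tax_id INTEGER PRIMARY KEY, parent_id INTEGER, rank TEXT)"
    )
    rows = []
    for line in lines:
        # nodes.dmp separates fields with "\t|\t" and ends lines with "\t|"
        fields = [f.strip() for f in line.rstrip("\n").split("|")]
        if len(fields) < 3 or not fields[0]:
            continue
        rows.append((int(fields[0]), int(fields[1]), fields[2]))
    conn.executemany("INSERT OR REPLACE INTO nodes VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_parent ON nodes(parent_id)")


def descendants(conn, taxid):
    """Return all taxids below `taxid`, sorted, excluding `taxid` itself."""
    query = """
        WITH RECURSIVE sub(tax_id) AS (
            SELECT tax_id FROM nodes
            WHERE parent_id = ? AND tax_id != parent_id
            UNION
            SELECT n.tax_id FROM nodes n
            JOIN sub s ON n.parent_id = s.tax_id
            WHERE n.tax_id != n.parent_id
        )
        SELECT tax_id FROM sub ORDER BY tax_id
    """
    return [row[0] for row in conn.execute(query, (taxid,))]


# Toy sample: root, a family, a genus with two species, and a viral branch.
SAMPLE = [
    "1\t|\t1\t|\tno rank\t|\n",
    "81852\t|\t1\t|\tfamily\t|\n",
    "1350\t|\t81852\t|\tgenus\t|\n",
    "1351\t|\t1350\t|\tspecies\t|\n",
    "1352\t|\t1350\t|\tspecies\t|\n",
    "10239\t|\t1\t|\tsuperkingdom\t|\n",
    "999999\t|\t10239\t|\tspecies\t|\n",
]

conn = sqlite3.connect(":memory:")
load_nodes(conn, SAMPLE)
print(descendants(conn, 1350))  # [1351, 1352]
```

Because the lookup follows parent/child links rather than names, a phage placed under the viral branch does not show up among the descendants of the genus, which addresses the Enterococcus phage problem from the original question.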
