can we / should we integrate genomepy? #775

Closed
ctb opened this issue Nov 20, 2019 · 6 comments

ctb commented Nov 20, 2019

genomepy is a query system for finding and retrieving genomes.

https://github.com/vanheeringen-lab/genomepy

could be nice to include it in sourmash, perhaps via a plugin. Could also be tied to the IPFS implementation in some way, so that if the signature already exists for a given ksize, you can retrieve it instead of calculating it again.

luizirber added the idea label Dec 1, 2019


ctb commented Jul 18, 2020

the meta issue here is, how do we grab a sequence (genome) given its accession?

luizirber commented

Link from sourmash-bio/databases#7 (comment): https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf shows how to build the NCBI Assembly (refseq or genbank) URL from the accession.

Some time ago I also helped @taylorreiter find the URL given an accession. This was the code:

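from snakemake import shell

def find_link(accession):
    """Return the FTP URL of the genomic FASTA for an NCBI assembly accession."""
    # "GCF_000005845.2" -> db="GCF", acc="000005845.2"
    db, acc = accession.split("_")
    number, version = acc.split(".")
    # The digits are sharded into three-character directories: 000/005/845
    number = "/".join([number[pos:pos + 3] for pos in range(0, len(number), 3)])

    # No trailing slash here, so the joins below do not produce "//"
    url = f"ftp://ftp.ncbi.nlm.nih.gov/genomes/all/{db}/{number}"
    # List the directory; entries look like GCF_000005845.2_ASM584v2
    all_names = shell(f"curl -l {url}/", read=True).split('\n')

    full_name = None
    for name in all_names:
        name = name.strip()
        # maxsplit=2: the assembly name itself may contain underscores
        parts = name.split("_", 2)
        if len(parts) == 3 and parts[0] == db and parts[1] == acc:
            full_name = name
            break

    if full_name is None:
        raise ValueError(f"no assembly directory found for {accession}")

    return f"{url}/{full_name}/{full_name}_genomic.fna.gz"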

It is a bit wasteful because it needs an extra request (a directory listing via curl) to find the full directory name, which includes the assembly name, but it works. I have something similar using requests in another project; same idea.

If you have the assembly_summary.txt file, then you can use the ftp_path column to generate the URL.
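As a minimal sketch of the ftp_path approach: the function below reads an assembly_summary.txt-style file and builds the genomic FASTA URL for each accession without an extra directory listing. The function name genome_urls is hypothetical. It assumes the column names sit on the last "#" line before the data, that the columns assembly_accession and ftp_path are present, and that a missing path is written as "na".

```python
import csv
import io


def genome_urls(summary_file):
    """Map assembly accession -> genomic FASTA URL using the ftp_path column."""
    header = None
    urls = {}
    reader = csv.reader(summary_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        if row[0].startswith("#"):
            # Assumption: the last comment line before the data names the columns.
            header = [row[0].lstrip("# ")] + row[1:]
            continue
        if header is None:
            raise ValueError("no header line found before the first data row")
        record = dict(zip(header, row))
        ftp_path = record["ftp_path"].rstrip("/")
        if ftp_path in ("", "na"):
            # Assumption: "na" marks assemblies without a download directory.
            continue
        # The directory name (accession + assembly name) is also the file prefix.
        name = ftp_path.rsplit("/", 1)[-1]
        urls[record["assembly_accession"]] = f"{ftp_path}/{name}_genomic.fna.gz"
    return urls


# Abbreviated sample with only three columns, for illustration.
sample = io.StringIO(
    "#   See the README for a description of the columns\n"
    "# assembly_accession\tasm_name\tftp_path\n"
    "GCF_000005845.2\tASM584v2\t"
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2\n"
)
print(genome_urls(sample))
```

Looking columns up by name rather than by position keeps the sketch working if columns are added to the file.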

luizirber commented

If you have the URL, you can stream the data. This is how I do it in wort: https://github.com/dib-lab/wort/blob/6b6958d6898e536d3cf4da4a536167c5e2f48028/wort/blueprints/compute/tasks.py#L97L103
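A rough sketch of what streaming could look like, using only the standard library. This is not the wort code linked above; iter_fasta and stream_genome are hypothetical names, and the sketch assumes the URL points at a gzip-compressed FASTA file such as the *_genomic.fna.gz files. The response is decompressed and parsed as it arrives, so the file is never written to disk.

```python
import gzip
import io
import urllib.request


def iter_fasta(fileobj):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA byte stream."""
    # gzip.open accepts an existing binary file object; "rt" decodes to text.
    with gzip.open(fileobj, mode="rt") as text:
        header, chunks = None, []
        for line in text:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)


def stream_genome(url):
    """Open the URL and yield records while the download is in progress."""
    with urllib.request.urlopen(url) as response:
        yield from iter_fasta(response)
```

Calling stream_genome(url) with a URL built from an accession yields one record at a time, which could be fed directly into a signature computation.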

luizirber commented

> could be nice to include it in sourmash, perhaps via a plugin. Could also be tied to the IPFS implementation in some way, so that if the signature already exists for a given ksize, you can retrieve it instead of calculating it again.

Oh, and I have sigs for all the RefSeq/GenBank datasets in wort now, so we can use that to download them. (Incidentally, wort will also use IPFS to fetch the data soon, even if you're using an HTTP client to download it.)


ctb commented Apr 2, 2022

Closing. There are many ways to get genomes, and we've probably used most of them. Now we just do it en masse and release databases with manifest/picklist support, so we only need to retrieve the genomes once!

ctb closed this as completed Apr 2, 2022