can we / should we integrate genomepy? #775

ctb · 2019-11-20T14:26:06Z

genomepy is a query system for finding and retrieving genomes.

https://github.com/vanheeringen-lab/genomepy

could be nice to include it in sourmash, perhaps via a plugin. Could also be tied to the IPFS implementation in some way, so that if the signature already exists for a given ksize, you can retrieve it instead of calculating it again.

luizirber · 2019-12-06T21:12:58Z

Other options:

ctb · 2020-07-18T14:26:31Z

the meta issue here is, how do we grab a sequence (genome) given its accession?

luizirber · 2020-07-18T16:01:08Z

Link from sourmash-bio/databases#7 (comment): https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf shows how to build the NCBI Assembly (refseq or genbank) URL from the accession.

Some time ago I also helped @taylorreiter find the URL given an accession; this was the code:

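from snakemake import shell

def find_link(accession):
    # e.g. "GCF_000005845.2" -> db="GCF", acc="000005845.2"
    db, acc = accession.split("_")
    number, version = acc.split(".")
    # the digits become 3-character path components: 000/005/845
    number = "/".join([number[pos:pos + 3] for pos in range(0, len(number), 3)])

    # no trailing slash here, so the joins below don't produce "//"
    url = f"ftp://ftp.ncbi.nlm.nih.gov/genomes/all/{db}/{number}"
    # list the directory to find the full name (which includes the assembly name)
    all_names = shell(f"curl -l {url}/", read=True).split('\n')

    full_name = None
    for name in all_names:
        # assembly names can contain underscores, so split at most twice;
        # the length check also skips blank lines in the listing
        parts = name.strip().split("_", 2)
        if len(parts) == 3 and parts[0] == db and parts[1] == acc:
            full_name = name.strip()
            break

    if full_name is None:
        raise ValueError(f"no assembly directory found for {accession}")

    return f"{url}/{full_name}/{full_name}_genomic.fna.gz"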

It is a bit wasteful because it needs an extra network request (an FTP directory listing) to find the proper name (which includes the assembly name), but it works. I have something similar using requests in another project, but it's the same idea.

If you have the assembly_summary.txt file, then you can use the ftp_path column to generate the URL.
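For illustration, a minimal sketch of the assembly_summary.txt route. It assumes the file is tab-separated, that the last `#`-prefixed line before the data names the columns (including `assembly_accession` and `ftp_path`), and that a missing path is written as `na`. The function name `url_from_summary` is hypothetical, and the `_genomic.fna.gz` suffix is taken from the code above.

```python
def url_from_summary(summary_file, accession):
    """Return the genomic FASTA URL for `accession`, or None if it is not listed.

    `summary_file` is an open text file in assembly_summary.txt layout.
    """
    header = None
    for line in summary_file:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Assumption: the last '#' line before the data holds the column names.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None or not line:
            continue
        row = dict(zip(header, line.split("\t")))
        if row.get("assembly_accession") == accession:
            ftp_path = row.get("ftp_path", "").rstrip("/")
            # Assumption: "na" marks an assembly with no download directory.
            if not ftp_path or ftp_path == "na":
                return None
            # The last path component is the full name (accession + assembly name).
            full_name = ftp_path.rsplit("/", 1)[-1]
            return f"{ftp_path}/{full_name}_genomic.fna.gz"
    return None
```

Something like `with open("assembly_summary.txt") as fp: url_from_summary(fp, accession)` would then avoid the directory listing entirely, at the cost of downloading the summary file once.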

luizirber · 2020-07-18T16:02:52Z

If you have the URL, you can stream the data. This is how I do it in wort: https://github.com/dib-lab/wort/blob/6b6958d6898e536d3cf4da4a536167c5e2f48028/wort/blueprints/compute/tasks.py#L97L103
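As a rough illustration of the streaming idea (this is not the wort code linked above, just a standard-library sketch), the gzipped FASTA can be decompressed and parsed as the bytes arrive, without saving the file first. The names `iter_fasta` and `stream_genome` are hypothetical.

```python
import gzip
import urllib.request


def iter_fasta(handle):
    """Yield (name, sequence) pairs from a binary file-like object holding gzipped FASTA."""
    name, chunks = None, []
    # gzip.open accepts an existing file object; "rt" decodes to text line by line.
    with gzip.open(handle, mode="rt") as text:
        for line in text:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def stream_genome(url):
    """Open `url` and yield records while the download is still in progress."""
    with urllib.request.urlopen(url) as response:
        yield from iter_fasta(response)
```

Each record could then be passed straight to whatever computes the signature, so only one sequence needs to be held in memory at a time.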

luizirber · 2020-07-18T16:10:13Z

> could be nice to include it in sourmash, perhaps via a plugin. Could also be tied to the IPFS implementation in some way, so that if the signature already exists for a given ksize, you can retrieve it instead of calculating it again.

Oh, and I have sigs for all the refseq/genbank datasets in wort now, so we can use that to download them. (Incidentally, wort will also use IPFS to download the data soon, even if you're using an HTTP client to download it.)

ctb · 2022-04-02T13:17:21Z

closing. There are many ways to get genomes, and we've probably used most of them. Now we just do it en masse and release databases with manifest/picklist support, so we only need to retrieve the genomes once!

luizirber added the idea label Dec 1, 2019

ctb closed this as completed Apr 2, 2022
