can we / should we integrate genomepy? #775
Comments
The meta issue here is: how do we grab a sequence (genome) given its accession?
Link from sourmash-bio/databases#7 (comment): https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf shows how to build the NCBI Assembly (refseq or genbank) URL from the accession. Some time ago I also helped @taylorreiter find the URL given an accession, this was the code:
It is a bit wasteful because it needs to run an HTTP request to find the proper name (which includes the assembly name), but it works. I have something similar using …
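The code from that comment is not preserved here, so what follows is only a rough sketch of the URL layout described in the NCBI how-to, not the original snippet. The function names `assembly_parent_url` and `genomic_fasta_url` are made up for illustration. The assembly name still has to come from somewhere else (for example the directory listing mentioned above), and it is assumed to be passed in exactly as it appears in the NCBI directory name.

```python
import re

# Root of the NCBI "all assemblies" tree (covers both GenBank GCA_ and RefSeq GCF_).
NCBI_GENOMES_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"

# An assembly accession looks like GCF_000005845.2: prefix, nine digits, version.
_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def assembly_parent_url(accession):
    """Return the directory that contains the assembly's own folder.

    The nine digits are split into three groups of three, e.g.
    GCF_000005845.2 -> .../GCF/000/005/845
    """
    match = _ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, digits, _version = match.groups()
    groups = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([NCBI_GENOMES_ROOT, prefix, *groups])


def genomic_fasta_url(accession, assembly_name):
    """Build the genomic FASTA URL once the assembly name is known.

    `assembly_name` is an input here (hypothetical helper); finding it
    is the part that needs the extra HTTP request.
    """
    dirname = f"{accession}_{assembly_name}"
    return f"{assembly_parent_url(accession)}/{dirname}/{dirname}_genomic.fna.gz"
```

Listing `assembly_parent_url(accession)` over HTTP and picking the entry that starts with the accession is one way to recover the assembly name, which is where the extra request comes from.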
If you have the URL, you can stream the data. This is how I do it in …
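The linked implementation is not shown here, so as a stand-in, here is a minimal sketch of what streaming could look like with only the standard library. It is an assumption about the general approach, not the code referred to above. The helper name `iter_fasta_records` is hypothetical; it takes any binary file-like object holding gzip-compressed FASTA and yields records without loading the whole genome into memory.

```python
import gzip
import io


def iter_fasta_records(raw_stream):
    """Yield (name, sequence) pairs from a gzip-compressed FASTA byte stream.

    `raw_stream` can be any binary file-like object with a read() method,
    such as an open HTTP response or an io.BytesIO buffer.
    """
    # gzip.open accepts an existing file object; "rt" decodes to text lazily.
    with gzip.open(raw_stream, mode="rt") as text:
        name, chunks = None, []
        for line in text:
            line = line.rstrip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
        # Emit the final record after the stream ends.
        if name is not None:
            yield name, "".join(chunks)
```

With a real URL, the response object from `urllib.request.urlopen(url)` can be passed in directly as `raw_stream`, so each record can be processed as soon as it has been downloaded and decompressed.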
Oh, and I have all the refseq/genbank dataset sigs in wort now, so we can use that to download. (Incidentally, wort will also use IPFS to download the data soon, even if you're using an HTTP client to download it.)
Closing. There are many ways to get genomes and we've probably used most of them. Now we just do it en masse and release databases with manifest/picklist support, so we only need to retrieve the genomes once!
genomepy is a query system for finding and retrieving genomes.
https://github.com/vanheeringen-lab/genomepy
It could be nice to include it in sourmash, perhaps via a plugin. It could also be tied to the IPFS implementation in some way, so that if the signature already exists for a given ksize, you can retrieve it instead of calculating it again.