how should we name & version our databases moving forward? #3517
Comments
@bluegenes what other databases are we looking at providing? AllTheBacteria has version numbers. ICTV...?
Practically speaking, I think most people are going to be on
database search and retrieval ideas:
ICTV VMR has version numbers: the latest is MSL 39 (Master Species List 39). They also have what are effectively bugfix versions, e.g. v1, v2, etc. Usually there are only a few of these per MSL number.
to work for all our databases, it would help to include the moltype and ksize params in addition to version/date, though I know this extends the filename a lot. I really like the idea of letting folks use a common name for a database and finding all associated files (including the taxonomy file), if they are available in the expected location(s). Note that if we suggest keeping databases in a common location, handle downloading, and allow common names, then long names for each individual database won't be as big of a deal. So to reframe + add to your ideas above, a
If we had this plugin, we could use it within sourmash to either print a command to download/update databases + taxonomy files as needed, or automatically download as part of the command, if desired. On our end, we could either include this in the same plugin, or keep a separate database building/updating utility:
As a future note, this plugin could also let us download and use database diffs, so folks don't have to redownload whole databases for minor updates.
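To make the common-name idea above a bit more concrete, here is a minimal sketch of how a plugin might resolve a common name to database files in a shared directory. Everything in it is an assumption for illustration: the `<name>-<version>-<moltype>-k<ksize>.zip` layout, the `DatabaseFile` class, and the `parse_db_filename` / `find_databases` helpers are hypothetical and not part of sourmash.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class DatabaseFile:
    """One database file, described by the fields encoded in its name."""
    name: str      # common name, e.g. "gtdb" or "ncbi-viruses"
    version: str   # release or date string, e.g. "rs226" or "2025.01"
    moltype: str   # e.g. "dna" or "protein"
    ksize: int
    path: Path


def parse_db_filename(path) -> Optional[DatabaseFile]:
    """Parse '<name>-<version>-<moltype>-k<ksize>.zip' (hypothetical layout).

    Splitting from the right lets the common name itself contain hyphens.
    Returns None for files that do not follow the layout.
    """
    path = Path(path)
    if path.suffix != ".zip":
        return None
    fields = path.stem.rsplit("-", 3)
    if len(fields) != 4:
        return None
    name, version, moltype, kpart = fields
    if not (kpart.startswith("k") and kpart[1:].isdigit()):
        return None
    return DatabaseFile(name, version, moltype, int(kpart[1:]), path)


def find_databases(common_name: str, db_dir, moltype: str = "dna",
                   ksize: int = 31) -> List[DatabaseFile]:
    """Return all files in db_dir matching a common name, moltype and ksize."""
    matches = []
    for path in sorted(Path(db_dir).glob("*.zip")):
        info = parse_db_filename(path)
        if (info is not None and info.name == common_name
                and info.moltype == moltype and info.ksize == ksize):
            matches.append(info)
    return matches
```

With a layout like this, a user would only type the common name (say `find_databases("gtdb", "~/databases")` after expanding the path), and the long filename becomes an implementation detail. A taxonomy file could be located the same way under whatever naming convention is eventually chosen.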
🤩
over in https://github.com/sourmash-bio/2025-sourmash-databases-doc-template, I'm trying to standardize our database listings so that they can be described programmatically & templates used to generate the docs automatically. It would be nice to converge on some versioning and naming scheme that worked for all our databases.
file naming
But, we (I?) have been really inconsistent with recent database naming. We have:
gtdb databases:
eukaryote databases:
viral databases:
so they (variously) have ksize, moltype, and scaled in them. ☹
It also gets quite long.
versioning
We also have the GTDB versioning scheme, which corresponds to releases 🎉, and we are versioning the NCBI databases by date, I think, since it's hard (if not impossible) to figure out what genomes are actually in any given GenBank release.
I asked around, and @kescobo had the following to say:
The sortable-by-date is kinda a cool idea. We don't necessarily have the compatibility issue, either.
what about versioning, and then providing metadata?
We could do ncbi-viruses-v3, and then in the description say what date that corresponds to. But then you have to look that information up. So recognizable dates seem better.
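As a small illustration of the "sortable-by-date" point, the sketch below shows that zero-padded date stamps sort chronologically as plain strings, while bare version numbers do not. The `dated_name` helper and the `YYYY.MM.DD` stamp format are assumptions for the example, not an agreed convention, and the dates are made up.

```python
from datetime import date


def dated_name(base: str, release_date: date) -> str:
    """Build '<base>-YYYY.MM.DD' (hypothetical format, zero-padded)."""
    return f"{base}-{release_date:%Y.%m.%d}"


# Made-up release dates, deliberately out of order.
names = [
    dated_name("ncbi-viruses", date(2025, 1, 15)),
    dated_name("ncbi-viruses", date(2024, 11, 3)),
    dated_name("ncbi-viruses", date(2024, 2, 20)),
]

# Plain string sorting puts dated names in chronological order.
print(sorted(names))

# Bare version numbers do not sort correctly as strings once they reach
# two digits: "v10" lands before "v9".
print(sorted(["v9", "v10"]))
```

The date-stamped form also answers "how old is this database?" without a metadata lookup, which is the trade-off raised above against names like ncbi-viruses-v3.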
Hmm.
Anyway, dropping this here for further rumination.