how should we name & version our databases moving forward? #3517

Open
ctb opened this issue Feb 3, 2025 · 5 comments

Comments

@ctb
Contributor

ctb commented Feb 3, 2025

over in https://github.com/sourmash-bio/2025-sourmash-databases-doc-template, I'm trying to standardize our database listings so that they can be described programmatically & templates used to generate the docs automatically. It would be nice to converge on some versioning and naming scheme that worked for all our databases.

file naming

But, we (I?) have been really inconsistent with recent database naming. We have:

gtdb databases:

gtdb-rs220-k{ksize}.zip

eukaryote databases:

metazoa-minus-bilateria.k={ksize}.sig.zip

viral databases:

ncbi-viruses.{moltype_l}.k={ksize}.scaled={scaled}.sig.zip

so they (variously) have ksize, moltype, and scaled in them. ☹

It also gets quite long.
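
To make the inconsistency concrete, here is a minimal sketch that pulls the parameters back out of the three filename styles listed above. The regular expressions are written only against those three templates; the function name `parse_db_filename` and the returned field names are assumptions for illustration, not an existing sourmash API.

```python
import re

# One regex per filename style listed above. Anything beyond those three
# templates (function name, field names) is an assumption for illustration.
PATTERNS = [
    # gtdb-rs220-k{ksize}.zip
    re.compile(r"^(?P<name>gtdb-rs\d+)-k(?P<ksize>\d+)\.zip$"),
    # ncbi-viruses.{moltype_l}.k={ksize}.scaled={scaled}.sig.zip
    re.compile(
        r"^(?P<name>[^.]+)\.(?P<moltype>[^.=]+)\.k=(?P<ksize>\d+)"
        r"\.scaled=(?P<scaled>\d+)\.sig\.zip$"
    ),
    # metazoa-minus-bilateria.k={ksize}.sig.zip
    re.compile(r"^(?P<name>[^.]+)\.k=(?P<ksize>\d+)\.sig\.zip$"),
]


def parse_db_filename(filename):
    """Return the fields encoded in a database filename, or None if unrecognized."""
    for pattern in PATTERNS:
        match = pattern.match(filename)
        if match:
            fields = dict(match.groupdict())
            # ksize and scaled are numeric parameters; convert when present.
            for key in ("ksize", "scaled"):
                if key in fields:
                    fields[key] = int(fields[key])
            return fields
    return None
```

Needing three separate patterns to recover the same handful of fields is the problem in a nutshell: only one of the three styles records `scaled`, and only one records `moltype`.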

versioning

We also have the GTDB versioning scheme, which corresponds to releases 🎉, and we are versioning the NCBI databases by date, I think, since it's hard (if not impossible) to figure out what genomes are actually in any given GenBank release.

I asked around, and @kescobo had the following to say:

just had a long discussion about this regarding biobakery tools and databases. I don't think my opinion won the day, but an idea I had was to have a thin package wrapper around a database that can itself be managed by a package manager so that you can set things like compatibility, and have your conda environment yaml or your R lock file or whatever include the database too.
What I think was agreed to is that minimally, database versions should
• Have clear compat annotation for what software versions they do and don't work with
• Be unique, and sortable (eg right now the metaphlan databases have a 'Jun23' that gets sorted earlier than 'Oct19', which is annoying)
• Have particular software versions linked to particular database versions by default (though where possible, make it configurable). Eg, don't point to latest, since this can cause pinned environments (eg containers) to silently change versions when installing somewhere new

The sortable-by-date is kinda a cool idea. We don't necessarily have the compatibility issue, either.
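
A small sketch of the sorting problem described in the quote, and of how a year-first date tag avoids it. The tags 'Jun23' and 'Oct19' come from the comment above; 'Jan21', the helper name `to_sortable`, and the assumption that two-digit years mean 20YY are illustrative only.

```python
# 'Jun23' and 'Oct19' are from the comment above; 'Jan21' is a made-up extra.
tags = ["Oct19", "Jun23", "Jan21"]

MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def to_sortable(tag):
    """Convert a 'MonYY' tag to 'YYYY-MM' (assumes the year is 20YY)."""
    month = MONTHS[tag[:3]]
    year = 2000 + int(tag[3:])
    return f"{year:04d}-{month:02d}"


# Plain string sorting compares letters, so June 2023 lands before October 2019.
print(sorted(tags))

# Year-first tags sort chronologically even when compared as plain text.
print(sorted(to_sortable(t) for t in tags))
```

A year-first tag such as `2023-06` keeps the recognizable date while still sorting correctly in a directory listing.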

what about versioning, and then providing metadata?

We could do ncbi-viruses-v3, and then in the description say what date that corresponds to. But then you have to look that information up. So recognizable dates seem better.

Hmm.

Anyway, dropping this here for further rumination.

@ctb
Contributor Author

ctb commented Feb 3, 2025

@bluegenes what other databases are we looking at providing?

AllTheBacteria has version numbers.

ICTV...?

@kescobo

kescobo commented Feb 3, 2025

But then you have to look that information up

Practically speaking, I think most people are going to be on latest or on some fixed version. So you kinda only need to look it up once 🤷. But if you're not concerned about the length of the file name, it's only an extra ~6 characters to include the date

@ctb
Contributor Author

ctb commented Feb 3, 2025

database search and retrieval ideas:

  • we could suggest that people construct a local manifest of their various databases combined, and then just use that; appropriate sketches would be automatically selected;
  • we could also provide a combined manifest for each collection; and then maybe provide easy ways to download all, or a specific ksize/moltype/scaled combo;
  • we could provide a database loading plugin that takes common names ('gtdb-latest') and resolves them to the latest available database;
  • we could also provide a plugin that warns you when your database is out of date
  • we could support downloading databases to a common location / warning when they are out of date
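
As a rough sketch of the common-name idea, the snippet below resolves a name like 'gtdb-latest' against a small catalog. Everything here is hypothetical: the `CATALOG` structure, the assumption that versions are listed oldest first, the 'rs214' entry, and the function names. Only the 'gtdb-latest' name and the `gtdb-rs220-k{ksize}.zip` filename template come from this thread.

```python
# Hypothetical catalog: collection name -> versions, listed oldest first.
# The entries are illustrative, not an actual listing of published databases.
CATALOG = {
    "gtdb": ["rs214", "rs220"],
}


def resolve_common_name(common_name, catalog=CATALOG):
    """Resolve '<collection>-latest' or '<collection>-<version>' to a pair."""
    # Split on the last hyphen so collections with hyphens in their names
    # still work. Versions that themselves contain hyphens are not handled.
    collection, _, version = common_name.rpartition("-")
    if collection not in catalog:
        raise KeyError(f"unknown collection: {collection!r}")
    versions = catalog[collection]
    if version == "latest":
        version = versions[-1]
    elif version not in versions:
        raise KeyError(f"unknown version {version!r} for {collection!r}")
    return collection, version


def gtdb_filename(common_name, ksize):
    """Build a filename following the gtdb-rs220-k{ksize}.zip template."""
    collection, version = resolve_common_name(common_name)
    return f"{collection}-{version}-k{ksize}.zip"
```

With this catalog, `gtdb_filename("gtdb-latest", 31)` would point at the newest listed release, while a pinned name like 'gtdb-rs214' keeps resolving to the same file after new releases are added.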

@bluegenes
Contributor

bluegenes commented Feb 3, 2025

ICTV...?

ICTV VMR has version numbers -- the latest is MSL 39 -- master species list 39. They do have ~ bugfix versions as well, e.g. v1, v2, etc. Usually only a few of these per MSL number.

It would be nice to converge on some versioning and naming scheme that worked for all our databases.

to work for all our databases, it would help to include moltype, ksize params in addition to version/date. But I know this extends the filename a lot. I really like the idea of allowing folks to use a common name for a database, and finding all files associated (including the taxonomy file), if they are available in the expected location(s). Note that if we suggest keeping databases in a common location, handle downloading, and allow common names, then having long names for each individual database won't be as big of a deal.

So to reframe + add to your ideas above, a sourmash_plugin_databases could have:
User utilities:

  • build/update a database manifest, looking in e.g. local directory and SOURMASH_DATABASE_DIR or similar.
  • check online master file to see if databases are up to date
  • identify the database(s) a user wants to search, based on common names and params
  • identify the relevant taxonomy file(s) based on a common name and taxonomy type
  • download new databases & updates to existing databases
  • find/download/update the relevant taxonomy file.
  • (maybe) - helper function or version of index command that creates relevant name/location for e.g. rocksdb index of db.

If we had this plugin, we could use it within sourmash to either print a command to download/update databases + taxonomy files as needed, or automatically download as part of the command, if desired.
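
A minimal sketch of the first two user utilities: finding local database files and comparing them against a list of current filenames. It assumes an environment variable named `SOURMASH_DATABASE_DIR` (suggested above only as "or similar"), assumes databases are `.zip` files sitting directly in those directories, and represents the online master file as a plain list of filenames. None of these function names exist in sourmash today.

```python
import os
from pathlib import Path


def database_dirs(extra_dirs=()):
    """Directories to search: the current directory, then the env var, then extras."""
    dirs = [Path.cwd()]
    # Hypothetical variable name, taken from the suggestion above.
    env_dir = os.environ.get("SOURMASH_DATABASE_DIR")
    if env_dir:
        dirs.append(Path(env_dir))
    dirs.extend(Path(d) for d in extra_dirs)
    return dirs


def find_local_databases(dirs):
    """Return {filename: path} for every .zip file directly inside the given dirs."""
    found = {}
    for d in dirs:
        if d.is_dir():
            for path in sorted(d.glob("*.zip")):
                # Earlier directories win if the same filename appears twice.
                found.setdefault(path.name, path)
    return found


def missing_databases(local, current_filenames):
    """List filenames from the 'up to date' list that are not present locally."""
    return sorted(set(current_filenames) - set(local))
```

The result of `missing_databases(find_local_databases(database_dirs()), current)` could then be used either to print download commands or to trigger the download directly, as described above.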

On our end, we could either include this in the same plugin, or keep a separate database building/updating utility:

  • For e.g. genbank, build new databases quarterly. For db's with release versions, provide template for new version.
  • Publish / update the master list of "up to date" sourmash databases.

As a future note, this plugin could also allow us to download and use database diffs for folks to prevent redownloading whole databases for minor updates.

@ctb
Contributor Author

ctb commented Feb 3, 2025

As a future note, this plugin could also allow us to download and use database diffs for folks to prevent redownloading whole databases for minor updates.

🤩


3 participants