From c366ce01d884e9378458190b6fed532313223fef Mon Sep 17 00:00:00 2001
From: "C. Titus Brown"
Date: Tue, 3 May 2022 06:48:26 -0700
Subject: [PATCH 1/5] add advanced database docs

---
 doc/databases-advanced.md | 117 ++++++++++++++++++++++++++++++++++++++
 doc/index.md              |   1 +
 doc/more-info.md          |   4 +-
 3 files changed, 121 insertions(+), 1 deletion(-)
 create mode 100644 doc/databases-advanced.md

diff --git a/doc/databases-advanced.md b/doc/databases-advanced.md
new file mode 100644
index 0000000000..47a4b577c8
--- /dev/null
+++ b/doc/databases-advanced.md
@@ -0,0 +1,117 @@
+# sourmash databases - advanced usage information.
+
+closes https://github.com/sourmash-bio/sourmash/issues/1293.
+
+sourmash uses a variety of different mechanisms and formats for storing, organizing, and searching signatures. Some of these mechanisms, "collections", just store the signatures; others ("indexed" databases) provide indices on the signatures for fast content-based search. _Most_ of the mechanisms now use manifests that permit fast selection and loading of signatures based on metadata. Below we refer to "databases" generically as any on-disk storage mechanism for sourmash signatures.
+
+Which database type is best to use depends on what you're doing - which is what this document is about! In general, however, sourmash should be fast enough that database choice will only impact performance when searching 1000s of signatures, or doing many 1000s of searches.
+
+The recommended file extensions below are conventions used to signal the output format when using `-o` with `sourmash sketch` and the `sourmash sig` subcommands; so, for example, `sourmash sketch dna *.fa -o xyz.zip` will output signatures in the .zip format.
+
+sourmash will automatically detect and load the database, based on the database _content_ in most cases.
+
+Unless noted otherwise, the below database formats are supported in all releases since sourmash v3.5.
+
+## How are signatures actually stored?
+
+sourmash signatures are typically serialized into JSON for on-disk storage, with rare exceptions (SQLite and LCA databases). The internal sourmash code automatically detects and properly handles compressed (gzipped) JSON data.
+
+## Storing JSON in `.sig` and `.sig.gz` files: the original format.
+
+Multiple signatures can be stored in a single JSON file. However, this file will be loaded in its entirety by sourmash, even if you only select one for later analysis.
+
+This is the least efficient way to store multiple signatures, because all of the JSON must be loaded before any signature can be selected or searched. But it is the oldest format and so a lot of our documentation describes it!
+
+## Storing signatures in `.zip` files: the **recommended** format.
+
+**This is our recommended format for storing collections of signatures. It is supported as of sourmash v4.1.**
+
+Multiple signatures can be stored in a single .zip file. The best way to construct that zip file is from within sourmash, by specifying `-o filename.zip` when outputting signatures. Zip files created from within sourmash will automatically have manifests; this enables rapid subselection and direct loading of signatures via e.g. picklists.
+
+Zip files are not indexed by content, so they can be slow for searching. But they are small, and provide a good compromise between disk size (small), flexibility (can store any mixture of signatures), and speed (good for `gather`, not good for `search`).
+
+Zip file collections can contain any number of signatures, of any type (`num` or `scaled`, DNA/protein/dayhoff/hp).
+
+You can create your own Zip files by simply zipping any number of `.sig` or `.sig.gz` files into a .zip file, and sourmash will read this. However, since this zip file will not have a manifest, it will not be fast for certain operations that rely on manifests for speed, such as picklists and `sourmash sig summarize`.
So we recommend using sourmash to create zip file collections with manifests.
+
+### Storing signatures in SQLite databases
+
+As of sourmash 4.4, we support storing signatures directly in a [SQLite](https://www.sqlite.org/index.html) database (`-o .sqldb`). This is a fast, low-memory, on-disk format that is suitable for use with `search` and can support multiple simultaneous queries. However, the resulting file is also rather large, so we do not distribute databases in this format.
+
+SQLite databases are implemented as an [inverted index](https://en.wikipedia.org/wiki/Inverted_index), with hashes stored directly in a table.
+
+SQLite databases are limited to scaled signatures, and can only contain sketches with the same scaled value across the entire database. They *can* store multiple molecule types.
+
+While SQLite databases are a new format, they seem promising, especially when disk space is not a concern and/or when memory is limited. We particularly recommend them for use as LCA databases (see next section) where they are a considerable improvement over the legacy JSON format.
+
+### Other Indexed collections - SBTs and LCAs.
+
+We provide two other indexed collection formats, Sequence Bloom Trees (SBTs) and LCA databases.
+
+SBTs implement our version of [Sequence Bloom Trees](http://www.cs.cmu.edu/~ckingsf/software/bloomtree/), a fast tree-based index that supports rapid `search` for matches; they are particularly effective when searching for *best* matches across large databases. They are relatively low memory and typically about twice the size of .zip files on disk. They can be constructed with `sourmash index`.
+
+LCA databases are [inverted indices](https://en.wikipedia.org/wiki/Inverted_index) that support individual hash lookup. They provide fast `search` and `gather`, and also support all of the `sourmash lca` subcommands for hash-based taxonomic analysis.
There are two LCA database formats, JSON and SQLite; JSON is small on disk but JSON LCA databases consume a lot of memory when loaded, while SQLite LCA databases are large on disk but low-memory and fast. JSON LCA databases do not support multiprocess queries. LCA databases can be constructed with `sourmash lca index`.
+
+Both SBTs and LCA databases can only store homogenous collections of signature types - all signatures must have the same molecule type and scaled or num value. Furthermore, LCA databases can only store scaled signatures.
+
+We recommend SBT and LCA databasesfor use only in specific situations - e.g. SBTs are great for single-genome "best match" search for SBTs, and `sourmash lca` commands require LCA databases.
+
+### Manifests
+
+Manifests are catalogs of signature metadata - name, molecule type, k-mer size, and other information - that can be used to select specific signatures for searching or processing. Typically when using manifests the actual signatures themselves are not loaded until they are needed.
+
+As of sourmash 4.4 manifests can be *directly* loaded from the command line as standalone collections. This lets manifests serve as a catalog of signatures stored in many different locations.
+
+Standalone manifests are preferable to both directory storage and pathlists (below), because they support fast selection and direct lazy loading. They are the most effective solution for managing custom collections of thousands to millions of signatures.
+
+Manifests can be created with `sourmash sig manifest` and `sourmash sig check`. For complex situations, we recommend using custom Python scripts to manage them - for example, see [sigs-to-manifest.py in database-examples](https://github.com/sourmash-bio/database-examples/blob/main/sigs-to-manifest.py).
+
+Sourmash supports two manifest file formats - CSV and SQLite. SQLite manifests are much faster than CSV manifests in exchange for extra disk space.
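Editorial illustration (not part of the patch): the idea behind the Manifests section above, selecting on metadata first and loading signatures only afterwards, can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the column names (`location`, `name`, `moltype`, `ksize`), the sample rows, and the `load_signature` helper are all hypothetical and do not reflect sourmash's actual manifest schema or API.

```python
import csv
import io

# Hypothetical catalog: these columns are illustrative only and are NOT
# the real sourmash manifest layout.
CATALOG = """location,name,moltype,ksize
sigs/a.sig,genome A,DNA,31
sigs/b.sig,genome B,DNA,21
sigs/c.sig,proteome C,protein,10
"""


def select_rows(catalog_text, moltype, ksize):
    """Pick catalog rows by metadata alone; no signature data is touched."""
    reader = csv.DictReader(io.StringIO(catalog_text))
    return [
        row for row in reader
        if row["moltype"] == moltype and int(row["ksize"]) == ksize
    ]


def load_signature(location):
    """Hypothetical stand-in for reading one signature file from disk."""
    return f"<signature loaded from {location}>"


def lazy_load(rows):
    """Load signatures one at a time, only for rows that were selected."""
    for row in rows:
        yield load_signature(row["location"])


selected = select_rows(CATALOG, "DNA", 31)
loaded = list(lazy_load(selected))
```

Only the single matching entry is ever passed to the loader, which is the property that makes manifest-based selection cheap for large collections.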
+
+### Directories
+
+Directory hierarchies of signatures are read natively by sourmash, and can be created or extended by specifying `-o dirname/` (with a trailing slash).
+
+To read from a directory, specify the directory name on the sourmash command line. When reading from directories, the entire directory hierarchy is traversed and all `.sig` and `.sig.gz` files are loaded as signatures. If `--force` is specified, _all_ files will be read, and failures will be ignored.
+
+When directories are specified as outputs, the signatures will be saved by their complete md5sum underneath the directory.
+
+We don't particularly recommend storing signatures in directory hierarchies, since most of their use cases are now covered by other approaches.
+
+### Pathlists
+
+Pathlists are text files containing paths to one or more sourmash databases; any type of sourmash-readable collection can be listed.
+
+The paths in pathlists can be relative or absolute within the file system. If they are relative, they must resolve with respect to the current working directory of the sourmash command.
+
+We don't recommend using pathlists any more, since the original use cases are now supported with picklists, but they are still supported!
+
+Pathlists are not output by any sourmash commands.
+
+## Storing taxonomies
+
+sourmash supports taxonomic information output via the `sourmash lca` and `sourmash tax` subcommands. Both sets of commands rely on the same 7 taxonomic ranks: superkingdom, phylum, class, order, family, genus, and species (with limited support for a 'strain' rank). And both sets of subcommands take lineage spreadsheets that link specific identifiers to taxonomic lineages.
+
+Lineage spreadsheets can be provided in two on-disk formats, CSV and SQLite.
+
+CSV is the original format, and consists of separate columns for identifier and each taxonomic rank.
+
+SQLite taxonomy databases are typically built from CSV using `sourmash tax prepare`.
They contain a single table, `sourmash_taxonomy`, with columns for `ident` and each taxonomic rank. Only the `sourmash tax` command supports SQLite taxonomy databases.
+
+## Appendix: SQLite complexities
+
+The SQLite implementation of signature storage, metadata manifests, and LCA databases is all bundled into a single SQLite database. Because of this, sourmash must examine the database tables to decide what kind of sourmash structure the database is - the logic is roughly this:
+
+* does the database store both sketch information and taxonomy information? It's an LCA database!
+* if it has sketch information but no taxonomy information, it's just a regular index.
+* if it only has manifest information, it's a manifest!
+* if it only has taxonomy information, it's a taxonomy!
+
+This is complicated by several other details -
+
+* we can treat SQLite databases with sketch information as read-only manifests, but because the sketch information is tightly coupled to the manifest table, we cannot insert new manifest entries;
+* we can treat SQLite databases with sketch information as read/write taxonomy files, since the taxonomy information is not tightly coupled to the sketches;
+
+Last but not least, the hashes in SQLite are stored as signed 64-bit integers and must be converted to unsigned 64-bit numbers internally by sourmash; negative numbers in the SQLite table represent unsigned ints that are larger than 2**63 - 1. Please see [this blog post](http://ivory.idyll.org/blog/2022-storing-ulong-in-sqlite-sourmash.html) for more information.
+
+The SQLite schema itself is not very complicated and can be used for lineage and manifest querying by other scripts. However, we recommend doing hash value querying/search via the Python code.
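Editorial illustration (not part of the patch): the signed/unsigned hash conversion described in the appendix above can be sketched as follows. This is a minimal sketch, assuming a two's-complement reinterpretation of the 64-bit value; the function names and the `demo_hashes` table are hypothetical and are not sourmash's actual code or schema.

```python
import sqlite3

MAX_SIGNED_64 = 2**63 - 1  # largest value an SQLite INTEGER can hold
TWO_POW_64 = 2**64


def to_signed(hashval):
    """Map an unsigned 64-bit hash into the signed 64-bit range."""
    if not 0 <= hashval < TWO_POW_64:
        raise ValueError("hash must fit in an unsigned 64-bit integer")
    # Values above 2**63 - 1 wrap around to negative numbers.
    return hashval - TWO_POW_64 if hashval > MAX_SIGNED_64 else hashval


def to_unsigned(stored):
    """Recover the original unsigned hash from the stored signed value."""
    return stored + TWO_POW_64 if stored < 0 else stored


def roundtrip(hashes):
    """Store hashes in an in-memory SQLite table and read them back."""
    db = sqlite3.connect(":memory:")
    # Hypothetical table, for demonstration only.
    db.execute("CREATE TABLE demo_hashes (hashval INTEGER NOT NULL)")
    db.executemany(
        "INSERT INTO demo_hashes (hashval) VALUES (?)",
        [(to_signed(h),) for h in hashes],
    )
    stored = [
        row[0]
        for row in db.execute("SELECT hashval FROM demo_hashes ORDER BY rowid")
    ]
    db.close()
    return stored, [to_unsigned(v) for v in stored]


stored, recovered = roundtrip([42, 2**63 - 1, 2**63, 2**64 - 1])
```

In this sketch, hashes at or below 2**63 - 1 are stored unchanged, while larger ones show up as negative integers in the table, which is why querying hash values directly in SQL needs the same conversion applied to the query value first.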
diff --git a/doc/index.md b/doc/index.md
index 0d245d636c..1f0a48b3e5 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -187,6 +187,7 @@ developer
 :hidden:
 README.md
 legacy-databases.md
+databases-advanced.md
 plotting-compare.ipynb
 sourmash-sketch.md
 ```
diff --git a/doc/more-info.md b/doc/more-info.md
index f370fd2130..5d5b007134 100644
--- a/doc/more-info.md
+++ b/doc/more-info.md
@@ -4,10 +4,12 @@

 Read more about the [computational requirements, here.](requirements.md)

-## Prepared search database
+## Prepared search databases

 We offer a number of [prepared search databases.](databases.md)

+You can read about the supported database formats [here.](databases-advanced.md)
+
 ## Other MinHash implementations for DNA

 In addition to [mash][0], also see:

From 6a6dca8c223e2e01378241d0a325a439de804d1c Mon Sep 17 00:00:00 2001
From: "C. Titus Brown"
Date: Tue, 3 May 2022 07:06:11 -0700
Subject: [PATCH 2/5] add reference

---
 doc/command-line.md       | 4 ++++
 doc/databases-advanced.md | 2 --
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/command-line.md b/doc/command-line.md
index 664998c8a7..ea0a91bd07 100644
--- a/doc/command-line.md
+++ b/doc/command-line.md
@@ -1613,6 +1613,10 @@ All of these save formats can be loaded by sourmash commands.

 **We strongly suggest using .zip files to store signatures: they are fast, small, and fully supported by all the sourmash commands.**

+For more detailed information on database formats and performance
+tradeoffs, please see [the advanced usage information for
+databases!](databases-advanced.md)
+
 ### Loading many signatures

 #### Loading signatures within a directory hierarchy
diff --git a/doc/databases-advanced.md b/doc/databases-advanced.md
index 47a4b577c8..a651e0627e 100644
--- a/doc/databases-advanced.md
+++ b/doc/databases-advanced.md
@@ -1,7 +1,5 @@
 # sourmash databases - advanced usage information.

-closes https://github.com/sourmash-bio/sourmash/issues/1293.
-
 sourmash uses a variety of different mechanisms and formats for storing, organizing, and searching signatures. Some of these mechanisms, "collections", just store the signatures; others ("indexed" databases) provide indices on the signatures for fast content-based search. _Most_ of the mechanisms now use manifests that permit fast selection and loading of signatures based on metadata. Below we refer to "databases" generically as any on-disk storage mechanism for sourmash signatures.

 Which database type is best to use depends on what you're doing - which is what this document is about! In general, however, sourmash should be fast enough that database choice will only impact performance when searching 1000s of signatures, or doing many 1000s of searches.

From d218f6bdbf74c7b8ee345e6fcaf095957459f7ac Mon Sep 17 00:00:00 2001
From: "C. Titus Brown"
Date: Tue, 3 May 2022 07:31:47 -0700
Subject: [PATCH 3/5] retry hidden

---
 doc/index.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/doc/index.md b/doc/index.md
index 1f0a48b3e5..28bc97de88 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -185,6 +185,7 @@ developer

 ```{toctree}
 :hidden:
+
 README.md
 legacy-databases.md
 databases-advanced.md

From f209b3dc21fc365feb6808d3a1503d1eb87fb96e Mon Sep 17 00:00:00 2001
From: "C. Titus Brown"
Date: Tue, 3 May 2022 16:54:35 -0700
Subject: [PATCH 4/5] fix some myst stuff

---
 doc/index.md            | 15 +++------------
 doc/more-info.md        | 11 +++++++++++
 src/sourmash/minhash.py |  2 +-
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/doc/index.md b/doc/index.md
index 28bc97de88..afeaf29084 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -169,7 +169,9 @@ Attribution-ShareAlike 4.0 International License.
 ## Contents:

 ```{toctree}
-:maxdepth: 2
+---
+maxdepth: 2
+---

 command-line
 tutorials
@@ -182,17 +184,6 @@
 support
 developer
 ```
-
-```{toctree}
-:hidden:
-
-README.md
-legacy-databases.md
-databases-advanced.md
-plotting-compare.ipynb
-sourmash-sketch.md
-```
-
 # Indices and tables

 * {ref}`genindex`
diff --git a/doc/more-info.md b/doc/more-info.md
index 5d5b007134..59510808ff 100644
--- a/doc/more-info.md
+++ b/doc/more-info.md
@@ -118,6 +118,17 @@ or [this sourmash issue comment](https://github.com/sourmash-bio/sourmash/issues

 Newer versions of matplotlib do not seem to have this problem.

+```{toctree}
+---
+hidden:
+---
+README.md
+legacy-databases.md
+databases-advanced.md
+plotting-compare.ipynb
+sourmash-sketch.md
+```
+
 [0]:https://github.com/marbl/Mash
 [1]:https://github.com/edawson/rkmh
 [2]:https://github.com/lskatz/mashtree/blob/master/README.md
diff --git a/src/sourmash/minhash.py b/src/sourmash/minhash.py
index ba157fcbfc..a1ac06c5c3 100644
--- a/src/sourmash/minhash.py
+++ b/src/sourmash/minhash.py
@@ -1,4 +1,4 @@
-# -*- coding: UTF-8 -*-
+# -*- coding: utf-8 -*-
 """
 sourmash submodule that provides MinHash class and utility functions.

From 09d6d0a2f060d824d9ef96e5e8809c49722416d5 Mon Sep 17 00:00:00 2001
From: "C. Titus Brown"
Date: Tue, 3 May 2022 17:45:49 -0700
Subject: [PATCH 5/5] Apply suggestions from code review

thank you!

Co-authored-by: Mohamed Abuelanin
---
 doc/databases-advanced.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/databases-advanced.md b/doc/databases-advanced.md
index a651e0627e..816f22f0a9 100644
--- a/doc/databases-advanced.md
+++ b/doc/databases-advanced.md
@@ -52,7 +52,7 @@ LCA databases are [inverted indices](https://en.wikipedia.org/wiki/Inverted_inde

 Both SBTs and LCA databases can only store homogenous collections of signature types - all signatures must have the same molecule type and scaled or num value. Furthermore, LCA databases can only store scaled signatures.

-We recommend SBT and LCA databasesfor use only in specific situations - e.g. SBTs are great for single-genome "best match" search for SBTs, and `sourmash lca` commands require LCA databases.
+We recommend SBT and LCA databases for use only in specific situations - e.g. SBTs are great for single-genome "best match" search, and `sourmash lca` commands require LCA databases.

 ### Manifests