From 36ea2920a85dd67a14ea3d8ea184946c8dcad159 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 18 Jan 2021 07:47:33 -0800 Subject: [PATCH 01/24] start adjusting docs --- doc/api-example.md | 19 ++++++++----- doc/classifying-signatures.md | 8 +++--- doc/command-line.md | 51 ++++++++++++++++++----------------- 3 files changed, 43 insertions(+), 35 deletions(-) diff --git a/doc/api-example.md b/doc/api-example.md index abfec30555..03b7b4d7cb 100644 --- a/doc/api-example.md +++ b/doc/api-example.md @@ -159,7 +159,7 @@ First, load two signatures: ``` -Then, get the hashes, and (e.g.) compute the union: +Then, get the hashes, and (e.g.) calculate the union: ``` >>> hashes1 = set(sig1.minhash.hashes.keys()) @@ -222,7 +222,7 @@ looking at the `num` and `scaled` attributes on a MinHash object: The MinHash class is otherwise identical between the two types of signatures. -Note that you cannot compute Jaccard similarity or containment for +Note that you cannot calculate Jaccard similarity or containment for MinHash objects with different num or scaled values (or different ksizes): ``` @@ -382,15 +382,20 @@ downsample` if you are interested.) ## Working with fast search trees (Sequence Bloom Trees, or SBTs) -Suppose we have a number of signatures calculated with `--scaled`, like so: +Suppose we create some `scaled` signatures: ``` -sourmash compute --scaled 10000 data/GCF*.fna.gz +sourmash sketch dna -p scaled=10000 data/GCF*.fna.gz --outdir data/ ``` -and now we want to create a Sequence Bloom Tree (SBT) so that we can -search them efficiently. You can do this with `sourmash index`, but -you can also access the Python API directly. +and we want to create a Sequence Bloom Tree (SBT) so that we can +search them efficiently. You can do this with `sourmash index`, + +``` +sourmash index foo.sbt.zip data/GCF*.sig -k 31 +``` + +but you can also access the Python API directly. 
### Creating a search tree diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index de0301439e..d8d443a719 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -103,7 +103,7 @@ compares genome or metagenome signatures, it's reporting Jaccard similarity *without* abundance. However, it is possible to take into account abundance information by -computing signatures with `--track-abundance`. The abundance +computing signatures with `-p abund`. The abundance information will be used if it's present in the signature, and it can be ignored with `--ignore-abundance` in any signature comparison. @@ -120,7 +120,7 @@ containment queries against genome databases. This will give you numbers that (approximately) match what you get from counting mapped reads. -If you compute your input signatures with `--track-abundance`, both +If you create your input signatures with `-p abund`, both `sourmash gather` and `sourmash lca gather` will use that information to calculate an abundance-weighted result. This will weight each match to a hash value by the multiplicity of the hash value in @@ -157,7 +157,7 @@ For more information on the value of this kind of comparison for metagenomics, please see the simka paper, [Multiple comparative metagenomics using multiset k-mer counting](https://peerj.com/articles/cs-94/), Benoit et al., 2016. Initial comparisons of metagenome similarity -approximations computed with sourmash to the output of simka suggest a +approximations calculated with sourmash to the output of simka suggest a significant correlation. **Implementation note:** Angular similarity searches cannot be done on @@ -232,7 +232,7 @@ A few quick notes for the algorithmic folk out there -- increase database size. (Although of course it may get a lot slower...) 
-## Appendix B: sourmash gather and `--track-abundance` +## Appendix B: sourmash gather and signatures with abundance information Below is a discussion of a synthetic set of test cases using three randomly generated (fake) genomes of the same size, with two even read diff --git a/doc/command-line.md b/doc/command-line.md index 1e0a8dcc70..fc091ed31c 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -4,11 +4,12 @@ :depth: 3 ``` -From the command line, sourmash can be used to compute -[MinHash sketches][0] from DNA sequences, compare them to each other, -and plot the results; these sketches are saved into "signature files". -These signatures allow you to estimate sequence similarity quickly and -accurately in large collections, among other capabilities. +From the command line, sourmash can be used to create +[MinHash sketches][0] from DNA and protein sequences, compare them to +each other, and plot the results; these sketches are saved into +"signature files". These signatures allow you to estimate sequence +similarity quickly and accurately in large collections, among other +capabilities. 
Please see the [mash software][1] and the [mash paper (Ondov et al., 2016)][2] for background information on @@ -25,11 +26,11 @@ Grab three bacterial genomes from NCBI: ``` curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Salmonella_enterica/reference/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz -curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Sphingobacteriaceae_bacterium_DW12/latest_assembly_versions/GCF_000783305.1_ASM78330v1/GCF_000783305.1_ASM78330v1_genomic.fna.gz +curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/783/305/GCA_000783305.1_ASM78330v1/GCA_000783305.1_ASM78330v1_genomic.fna.gz ``` Compute signatures for each: ``` - sourmash compute -k 31 *.fna.gz +sourmash sketch dna -p k=31 *.fna.gz ``` This will produce three `.sig` files containing MinHash signatures at k=31. @@ -60,11 +61,11 @@ Matrix: To get a list of subcommands, run `sourmash` without any arguments. -There are five main subcommands: `compute`, `compare`, `plot`, +There are five main subcommands: `sketch`, `compare`, `plot`, `search`, and `gather`. See [the tutorial](tutorials.md) for a walkthrough of these commands. -* `compute` creates signatures. +* `sketch` creates signatures. * `compare` compares signatures and builds a distance matrix. * `plot` plots distance matrices created by `compare`. * `search` finds matches to a query signature in a collection of signatures. @@ -98,6 +99,8 @@ indexed databases (the SBT and LCA formats) as well as from signature files. ### `sourmash compute` - make sourmash signatures from sequence data +@CTB fixme + The `compute` subcommand computes and saves signatures for each sequence in one or more sequence files. 
It takes as input FASTA or FASTQ files, and these files can be uncompressed or compressed with @@ -129,8 +132,8 @@ Optional arguments: The `compare` subcommand compares one or more signatures -(created with `compute`) using estimated [Jaccard index][3] or -(if signatures are computed with `--track-abundance`) the [angular +(created with `sketch`) using estimated [Jaccard index][3] or +(if signatures are computed with `-p abund`) the [angular similarity](https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity). The default output @@ -150,7 +153,7 @@ Options: ``` --output -- save the distance matrix to this file (as a numpy binary matrix) --ksize -- do the comparisons at this k-mer size. ---containment -- compute containment instead of similarity. +--containment -- calculate containment instead of similarity. C(i, j) = size(i intersection j) / size(i). --from-file -- append the list of files in this text file to the input signatures @@ -161,7 +164,7 @@ Options: ### `sourmash plot` - cluster and visualize comparisons of many signatures The `plot` subcommand produces two plots -- a dendrogram and a -dendrogram+matrix -- from a distance matrix computed by `sourmash compare +dendrogram+matrix -- from a distance matrix created by `sourmash compare --output `. The default output is two PNG files. Usage: @@ -226,7 +229,7 @@ analysis. (See [Classifying Signatures](classifying-signatures.md) for more information on the different approaches that can be used here.) -If the input signature was computed with `--track-abundance`, output +If the input signature was created with `-p abund`, output will be abundance weighted (unless `--ignore-abundances` is specified). `-o/--output` will create a CSV file containing the matches. @@ -439,7 +442,7 @@ is specifically meant for metagenome and genome bin analysis. (See [Classifying Signatures](classifying-signatures.md) for more information on the different approaches that can be used here.) 
-If the input signature was computed with `--track-abundance`, output +If the input signature was created with `-p abund`, output will be abundance weighted (unless `--ignore-abundances` is specified). `-o/--output` will create a CSV file containing the matches. @@ -604,8 +607,8 @@ sourmash signature merge file1.sig file2.sig -o merged.sig will output the union of all the hashes in `file1.sig` and `file2.sig` to `merged.sig`. -All of the signatures passed to merge must either have been computed -with `--track-abundance`, or not. If they have `track_abundance` on, +All of the signatures passed to merge must either have been created +with `-p abund`, or not. If they have `track_abundance` on, then the merged signature will have the sum of all abundances across the individual signatures. The `--flatten` flag will override this behavior and allow merging of mixtures by removing all abundances. @@ -661,10 +664,10 @@ Downsample one or more signatures. With `downsample`, you can -- -* increase the `--scaled` value for a signature computed with `--scaled`, shrinking it in size; +* increase the `--scaled` value for a signature created with `-p scaled=SCALED`, shrinking it in size; * decrease the `num` value for a traditional num MinHash, shrinking it in size; -* try to convert a `--scaled` signature to a `num` signature; -* try to convert a `num` signature to a `--scaled` signature. +* try to convert a `scaled` signature to a `num` signature; +* try to convert a `num` signature to a `scaled` signature. For example, ``` @@ -758,7 +761,7 @@ sourmash signature export filename.sig -o filename.sig.msh.json ### `sourmash signature overlap` - detailed comparison of two signatures' overlap -Display a detailed comparison of two signatures. This computes the +Display a detailed comparison of two signatures. This calculates the Jaccard similarity (as in `sourmash compare` or `sourmash search`) and the Jaccard containment in both directions (as with `--containment`). 
It also displays the number of hash values in the union and @@ -819,7 +822,7 @@ signatures. The simplest is one signature in a single JSON file. You can also put many signatures in a single JSON file, either by building them that -way with `sourmash compute` or by using `sourmash sig cat` or other +way with `sourmash sketch` or by using `sourmash sig cat` or other commands. Searching or comparing these files involves loading them sequentially and iterating across all of the signatures - which can be slow, especially for many (100s or 1000s) of signatures. @@ -876,10 +879,10 @@ been useful. :) ### Using stdin Most commands will take stdin via the usual UNIX convention, `-`. -Moreover, `sourmash compute` and the `sourmash sig` commands will +Moreover, `sourmash sketch` and the `sourmash sig` commands will output to stdout. So, for example, -`sourmash compute ... -o - | sourmash sig describe -` will describe the +`sourmash sketch ... -o - | sourmash sig describe -` will describe the signatures that were just computed. (This is a relatively new feature as of 3.4 and our testing may need From b9c8bbaa27d395ab636146505ed8271c61e53a9f Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 18 Jan 2021 10:57:58 -0800 Subject: [PATCH 02/24] add migration links --- README.md | 16 +++++++++++----- doc/index.md | 4 ++++ doc/support.md | 3 +++ 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 9520189dc7..b43d7f4118 100644 --- a/README.md +++ b/README.md @@ -14,12 +14,18 @@ Quickly search, compare, and analyze genomic and metagenomic data sets. Usage: - sourmash compute *.fq.gz - sourmash compare *.sig -o distances + sourmash sketch dna *.fq.gz + sourmash compare *.sig -o distances -k 31 sourmash plot distances sourmash 1.0 is [published on JOSS](https://doi.org/10.21105/joss.00027); please cite that paper if you use sourmash (`doi: 10.21105/joss.00027`):. 
+The latest major release is sourmash v4, which has several +command-line and Python incompatibilities with previous +versions. Please +[visit our migration guide](https://sourmash.readthedocs.io/en/latest/support.html#migrating-from-sourmash-v3-x-to-sourmash-4-x) +to upgrade! + ---- The name is a riff off of [Mash](https://github.com/marbl/Mash), @@ -42,7 +48,7 @@ We recommend using bioconda to install sourmash: ``` conda install -c conda-forge -c bioconda sourmash ``` -This will install the latest stable version of sourmash 3. +This will install the latest stable version of sourmash 4. You can also use pip to install sourmash: @@ -70,7 +76,7 @@ you can install sourmash by running: ```bash $ conda create -n sourmash_env -c conda-forge -c bioconda sourmash python=3.7 $ source activate sourmash_env -$ sourmash compute -h +$ sourmash --help ``` which will install @@ -107,4 +113,4 @@ on getting set up with a development environment. ---- CTB -July 2020 +Jan 2021 diff --git a/doc/index.md b/doc/index.md index 9ccc05ea7f..42d8166d7c 100644 --- a/doc/index.md +++ b/doc/index.md @@ -25,6 +25,10 @@ background information on how and why MinHash works. **Questions? Thoughts?** Ask us on the [sourmash issue tracker](https://github.com/dib-lab/sourmash/issues/)! +**Want to migrate to sourmash v4?** sourmash v4 is now available, and +has a number of incompatibilities with v2 and v3. Please see +[our migration guide](support.md#migrating-from-sourmash-v3-x-to-sourmash-4-x)! + ---- To use sourmash, you must be comfortable with the UNIX command line; diff --git a/doc/support.md b/doc/support.md index 6b57ec926d..ce77d9090c 100644 --- a/doc/support.md +++ b/doc/support.md @@ -102,3 +102,6 @@ we suggest you use the following procedure to migrate: * now, run python with the argument `-W error` to turn warnings into errors. * fix all errors! * finally, upgrade to sourmash v4.0. + +@CTB add stuff here + From d33c66ae0d374954e22eb1bff14e01950a99e9a0 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Sat, 30 Jan 2021 09:00:27 -0800 Subject: [PATCH 03/24] switch compute over to sketch in most of the markdown docs --- doc/command-line.md | 7 +++++-- doc/developer.md | 1 + doc/more-info.md | 8 +++++--- doc/using-sourmash-a-guide.md | 26 +++++++++++++------------- 4 files changed, 24 insertions(+), 18 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index fc091ed31c..aa91e0657b 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -101,6 +101,9 @@ indexed databases (the SBT and LCA formats) as well as from signature files. @CTB fixme +Note: `sourmash compute` is deprecated in sourmash 4.0 and will be removed in +sourmash 5.0; please switch to using `sourmash sketch` (link). + The `compute` subcommand computes and saves signatures for each sequence in one or more sequence files. It takes as input FASTA or FASTQ files, and these files can be uncompressed or compressed with @@ -133,7 +136,7 @@ Optional arguments: The `compare` subcommand compares one or more signatures (created with `sketch`) using estimated [Jaccard index][3] or -(if signatures are computed with `-p abund`) the [angular +(if signatures are created with `-p abund`) the [angular similarity](https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity). The default output @@ -883,7 +886,7 @@ Moreover, `sourmash sketch` and the `sourmash sig` commands will output to stdout. So, for example, `sourmash sketch ... -o - | sourmash sig describe -` will describe the -signatures that were just computed. +signatures that were just created. 
(This is a relatively new feature as of 3.4 and our testing may need some work, so please diff --git a/doc/developer.md b/doc/developer.md index ef390f62cd..56b2d0a13a 100644 --- a/doc/developer.md +++ b/doc/developer.md @@ -116,6 +116,7 @@ A short description of the high-level files and dirs in the sourmash repo: src/sourmash ├── cli/ | Command-line parsing, help messages and overall infrastucture ├── command_compute.py | compute command implementation +├── command_sketch.py | sketch command implementation ├── commands.py | implementation for other CLI commands ├── compare.py | Signature comparison functions ├── _compat.py | Py2/3 compatibility functions diff --git a/doc/more-info.md b/doc/more-info.md index cbe51395dd..b98150d694 100644 --- a/doc/more-info.md +++ b/doc/more-info.md @@ -2,7 +2,7 @@ ## Computational requirements -Read more about the [compute requirements, here.](requirements.md) +Read more about the [computational requirements, here.](requirements.md) ## Prepared search database @@ -91,8 +91,10 @@ for samples. ## Interoperability with mash -The default sketches computed by sourmash and mash are comparable, but -we are still [working on ways to convert the file formats][11] +The hashing functions used by sourmash and mash are the same, but we +are still [working on ways to convert the file formats][11]. Please +keep an eye on `sourmash signature import` and `sourmash signature +export`! ## Developing sourmash diff --git a/doc/using-sourmash-a-guide.md b/doc/using-sourmash-a-guide.md index af665742ad..7e18907bc2 100644 --- a/doc/using-sourmash-a-guide.md +++ b/doc/using-sourmash-a-guide.md @@ -24,7 +24,7 @@ k=51. The general rule is that longer k-mer sizes are less prone to false positives. But you can pick your own parameters. One additional wrinkle is that we provide a number of -[precomputed databases](databases.md) at k=21, k=31, and k=51. +[precalculated databases](databases.md) at k=21, k=31, and k=51. 
It is often convenient to calculate signatures at these sizes so that you can use these databases. @@ -37,7 +37,7 @@ however, and it probably doesn't really matter. (When we have blog posts or publications providing more formal guidance, we'll link to them here!) -## What resolution should my signatures be / how should I compute them? +## What resolution should my signatures be / how should I create them? sourmash supports two ways of choosing the resolution or size of your signatures: using `-n` to specify the maximum number of hashes, @@ -74,9 +74,9 @@ rate of PacBio and Nanopore sequencing is problematic for k-mer based approaches and we have not yet explored how to tune parameters for this kind of sequencing. -On a more practical note, `sourmash compute` should autodetect FASTA, -FASTQ, whether they are uncompressed, gzipped, or bzip2-ed. Nothing -special needs to be done. +On a more practical note, `sourmash sketch` will autodetect FASTA and +FASTQ formats, whether they are uncompressed, gzipped, or bzip2-ed. +Nothing special needs to be done. ## How should I prepare my data? @@ -110,11 +110,11 @@ are always real low-abundance k-mers present. Sorry, yes! See below. -### Computing signatures for read files: +### Calculating signatures for read files: ``` trim-low-abund -C 3 -Z 18 -V -M 2e9 input-reads-1.fq input-reads-2.fq ... -sourmash compute --scaled 1000 -k 21,31,51 input-reads*.fq.abundtrim \ +sourmash sketch dna -p scaled=1000,k=21,k=31,k=51 input-reads*.fq.abundtrim \ --merge SOMENAME -o SOMENAME-reads.sig ``` @@ -123,24 +123,24 @@ reads; the second takes all the trimmed read files, subsamples k-mers from them at 1000:1, and outputs a single merged signature named 'SOMENAME' into the file `SOMENAME-reads.sig`. 
-### Computing signatures for individual genome files: +### Creating signatures for individual genome files: ``` -sourmash compute --scaled 1000 -k 21,31,51 *.fna.gz --name-from-first +sourmash sketch dna -p scaled=1000,k=21,k=31,k=51 *.fna.gz --name-from-first ``` -This command computes signatures for all `*.fna.gz` files, and names +This command creates signatures for all `*.fna.gz` files, and names each signature based on the first FASTA header in each file (that's what the option `--name-from-first` does). The signatures will be placed in `*.fna.gz.sig`. -### Computing signatures from a collection of genomes in a single file: +### Creating signatures from a collection of genomes in a single file: ``` -sourmash compute --scaled 1000 -k 21,31,51 file.fa --singleton +sourmash sketch dna -p scaled=1000,k=21,k=31,k=51 file.fa --singleton ``` -This computes signatures for all individual FASTA sequences in `file.fa`, +This creates signatures for all individual FASTA sequences in `file.fa`, names them based on their FASTA headers, and places them all in a single `.sig` file, `file.fa.sig`. (This behavior is triggered by the option `--singleton`, which tells sourmash to treat each individual sequence in From 312a8e2c8936eb5fb03c76c0094921c1dc3bcf45 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 30 Jan 2021 09:04:50 -0800 Subject: [PATCH 04/24] fix --scaled and --track-abundance thruought --- doc/command-line.md | 4 ++-- doc/using-sourmash-a-guide.md | 22 +++++++++++----------- 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index aa91e0657b..e678eec0fd 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -643,7 +643,7 @@ will subtract all of the hashes in `file2.sig` and `file3.sig` from `file1.sig`, and save the new signature to `subtracted.sig`. To use `subtract` on signatures calculated with -`--track-abundance`, you must specify `--flatten`. +`-p abund`, you must specify `--flatten`. 
### `sourmash signature intersect` - intersect two (or more) signatures @@ -667,7 +667,7 @@ Downsample one or more signatures. With `downsample`, you can -- -* increase the `--scaled` value for a signature created with `-p scaled=SCALED`, shrinking it in size; +* increase the `scaled` value for a signature created with `-p scaled=SCALED`, shrinking it in size; * decrease the `num` value for a traditional num MinHash, shrinking it in size; * try to convert a `scaled` signature to a `num` signature; * try to convert a `num` signature to a `scaled` signature. diff --git a/doc/using-sourmash-a-guide.md b/doc/using-sourmash-a-guide.md index 7e18907bc2..99585a5a37 100644 --- a/doc/using-sourmash-a-guide.md +++ b/doc/using-sourmash-a-guide.md @@ -40,27 +40,27 @@ guidance, we'll link to them here!) ## What resolution should my signatures be / how should I create them? sourmash supports two ways of choosing the resolution or size of -your signatures: using `-n` to specify the maximum number of hashes, -or `--scaled` to specify the compression ratio. Which should you use? +your signatures: using `num` to specify the maximum number of hashes, +or `scaled` to specify the compression ratio. Which should you use? -We suggest calculating all your signatures using `--scaled -1000`. This will give you a compression ratio of 1000-to-1 while -making it possible to detect regions of similarity in the 10kb range. +We suggest calculating all your signatures using `-p scaled=1000`. +This will give you a compression ratio of 1000-to-1 while making it +possible to detect regions of similarity in the 10kb range. For comparison with more traditional MinHash approaches like `mash`, -if you have a 5 Mbp genome and use `--scaled 1000`, you will extract +if you have a 5 Mbp genome and use `-p scaled=1000`, you will extract approximately 5000 hashes. So a scaled of 1000 is equivalent to using -`-n 5000` with mash on a 5 Mbp genome. +`-p num=5000` with mash on a 5 Mbp genome. 
-The difference between using `-n` and `--scaled` is in metagenome -analysis: fixing the number of hashes with `-n` limits your ability to +The difference between using `num` and `scaled` is in metagenome +analysis: fixing the number of hashes with `num` limits your ability to detect rare organisms, or alternatively results in very large -signatures (e.g. if you use n larger than 10000). `--scaled` will scale +signatures (e.g. if you use a `num` larger than 10000). `scaled` will scale your resolution with the diversity of the metagenome. You can read more about this in this blog post from the mash folk, [Mash Screen: What's in my sequencing run?](https://genomeinformatics.github.io/mash-screen/) What -we do with sourmash and `--scaled` is similar to the 'modulo hash' +we do with sourmash and `scaled` is similar to the 'modulo hash' mentioned in that blog post. (Again, when we have formal guidance on this based on benchmarks, we'll From 45d80555c0df0837854a5e94fd635816fe2547e1 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 30 Jan 2021 09:28:37 -0800 Subject: [PATCH 05/24] formatting and wording fixes --- doc/command-line.md | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index e678eec0fd..1586766572 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -69,7 +69,7 @@ walkthrough of these commands. * `compare` compares signatures and builds a distance matrix. * `plot` plots distance matrices created by `compare`. * `search` finds matches to a query signature in a collection of signatures. -* `gather` finds non-overlapping matches to a metagenome in a collection of signatures. 
+* `gather` finds the best reference genomes for a metagenome, using the provided collection of signatures. There are also a number of commands that work with taxonomic information; these are grouped under the `sourmash lca` @@ -94,7 +94,7 @@ Finally, there are a number of utility and information commands: Please use the command line option `--help` to get more detailed usage information for each command. -Note that as of sourmash v3.4, most commands will load signatures from +Note that as of sourmash v3.4, all commands should load signatures from indexed databases (the SBT and LCA formats) as well as from signature files. ### `sourmash compute` - make sourmash signatures from sequence data @@ -226,11 +226,12 @@ similarity match ### `sourmash gather` - find metagenome members -The `gather` subcommand finds all non-overlapping matches to the -query. This is specifically meant for metagenome and genome bin -analysis. (See [Classifying Signatures](classifying-signatures.md) -for more information on the different approaches that can be used -here.) +The `gather` subcommand selects the best reference genomes to use for +a metagenome analysis, by finding the smallest set of non-overlapping +matches to the query in a database. This is specifically meant for +metagenome and genome bin analysis. (See +[Classifying Signatures](classifying-signatures.md) for more +information on the different approaches that can be used here.) If the input signature was created with `-p abund`, output will be abundance weighted (unless `--ignore-abundances` is @@ -818,7 +819,7 @@ signatures with multiple ksizes or moltypes at the same time; you need to pick the ksize and moltype to use for your search. Where possible, scaled values will be made compatible. -#### Storing (and searching) signatures +### Storing (and searching) signatures Backing up a little, there are many ways to store and search signatures. @@ -848,7 +849,7 @@ will complain. 
In contrast, signature files can contain many different types of signatures, and compatible ones will be discovered automatically. -#### Passing in lists of files +### Passing in lists of files Various sourmash commands will also take `--from-file` or `--query-from-file`, which will take a path to a text file containing @@ -856,7 +857,7 @@ a list of file paths. This can be useful for situations where you want to specify thousands of queries, or a subset of signatures produced by some other command. -#### Loading all signatures under a directory +### Loading all signatures under a directory All of the `sourmash` commands support loading signatures from directories provided on the command line. @@ -866,7 +867,7 @@ directories provided on the command line. All of the commands in sourmash operate in "online" mode, so you can combine multiple databases and signatures on the command line and get the same answer as if you built a single large database from all of -them. The only addendum to this rule is that if you have multiple +them. The only caveat to this rule is that if you have multiple identical matches, the first one to be found will differ depending on the order that the files are passed in on the command line. From 58c57afd4be554c672c3af9ae67809671255b8d1 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 30 Jan 2021 09:41:57 -0800 Subject: [PATCH 06/24] add sourmash sketch docs --- doc/command-line.md | 26 ++++++-- doc/sourmash-sketch.md | 147 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 169 insertions(+), 4 deletions(-) create mode 100644 doc/sourmash-sketch.md diff --git a/doc/command-line.md b/doc/command-line.md index 1586766572..ca00fdc375 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -97,12 +97,30 @@ information for each command. Note that as of sourmash v3.4, all commands should load signatures from indexed databases (the SBT and LCA formats) as well as from signature files. 
-### `sourmash compute` - make sourmash signatures from sequence data +### `sourmash sketch` - make sourmash signatures from sequence data + +Most of the commands in sourmash work with **signatures**, which contain information about genomic or proteomic sequences. Each signature contains one or more **sketches**, which are compressed versions of these sequences. Using sourmash, you can search, compare, and analyze these sequences in various ways. + +To create a signature with one or more sketches, you use the `sourmash sketch` command. There are three main commands: + +``` +sourmash sketch dna +sourmash sketch protein +sourmash sketch translate +``` -@CTB fixme +The `sketch dna` command reads in **DNA sequences** and outputs **DNA sketches**. + +The `sketch protein` command reads in **protein sequences** and outputs **protein sketches**. + +The `sketch translate` command reads in **DNA sequences**, translates them in all six frames, and outputs **protein sketches**. + +Please see [the `sourmash sketch` documentation page](sourmash-sketch.md) for details! + +### `sourmash compute` - make sourmash signatures from sequence data -Note: `sourmash compute` is deprecated in sourmash 4.0 and will be removed in -sourmash 5.0; please switch to using `sourmash sketch` (link). +**Note: `sourmash compute` is deprecated in sourmash 4.0 and will be removed in +sourmash 5.0; please switch to using `sourmash sketch`, above.** The `compute` subcommand computes and saves signatures for each sequence in one or more sequence files. It takes as input FASTA diff --git a/doc/sourmash-sketch.md b/doc/sourmash-sketch.md new file mode 100644 index 0000000000..1d31a2e6b9 --- /dev/null +++ b/doc/sourmash-sketch.md @@ -0,0 +1,147 @@ +# `sourmash sketch` documentation + +Most of the commands in sourmash work with **signatures**, which contain information about genomic or proteomic sequences. Each signature contains one or more **sketches**, which are compressed versions of these sequences. 
Using sourmash, you can search, compare, and analyze these sequences in various ways. + +To create a signature with one or more sketches, you use the `sourmash sketch` command. There are three main commands: + +``` +sourmash sketch dna +sourmash sketch protein +sourmash sketch translate +``` + +The `sketch dna` command reads in **DNA sequences** and outputs **DNA sketches**. + +The `sketch protein` command reads in **protein sequences** and outputs **protein sketches**. + +The `sketch translate` command reads in **DNA sequences**, translates them in all six frames, and outputs **protein sketches**. + +## Quickstart + +### DNA sketches for genomes and reads + +To compute a DNA sketch for a genome, run: +``` +sourmash sketch dna genome.fna +``` +This will create an output file `genome.fna.sig` in the current directory, containing a single DNA signature for the entire genome, calculated using the default parameters. + + +Sourmash can work with unassembled reads; run +``` +sourmash sketch dna -p k=21,k=31,k=51,abund metagenome.fq.gz +``` +to compute three abundance-weighted sketches at k=21, 31, and 51, for the given FASTQ file. + +### Protein sketches for genomes and proteomes + +Likewise, +``` +sourmash sketch translate genome.fna +``` +will output a protein sketch in `./genome.fna.sig`, calculated by translating the genome sequence in all six frames and then using the default protein sketch parameters. + +And +``` +sourmash sketch protein -p k=25,scaled=500 -p k=27,scaled=250 genome.faa +``` +outputs two protein sketches to `./genome.faa.sig`, one calculated with k=25 and scaled=500, the other calculated with k=27 and scaled=250. 
+ +If you want to use different encodings, you can specify them in a few ways; here is a parameter string that specifies a dayhoff encoding for the k-mers: +``` +sourmash sketch protein -p k=25,scaled=500,dayhoff genome.faa +``` + +## More detailed documentation + +### Input formats + +`sourmash sketch` auto-detects and reads FASTQ or FASTA files, either uncompressed or compressed with gzip or bzip2. The filename doesn't matter; `sourmash sketch` will figure out the format from the file contents. + +You can also stream any of these formats into `sourmash sketch` via stdin by using `-` as the input filename. + +### Input contents and output signatures + +By default, `sourmash sketch` will produce signatures for each input *file*. If the file contains multiple FASTA/FASTQ records, these records will be merged into the output signature. + +If you specify `--singleton`, `sourmash sketch` will produce signatures for each *record*. + +If you specify `--merge `, sourmash sketch will produce signatures for all input files combined into one. + +The output signature(s) will be saved in locations that depend on your input parameters. By default, `sourmash sketch` will put the signatures in the current directory, in a file named for the input file with a `.sig` suffix. If you specify `-o`, all of the signatures will be placed in that file. + +### Protein encodings + +`sourmash sketch protein` and `sourmash sketch translate` output protein sketches by default, but can also use the `dayhoff` and `hp` encodings. The [Dayhoff encoding](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-367/tables/1) collapses multiple amino acids into a smaller alphabet so that amino acids that share biochemical properties map to the same character. The hp encoding divides amino acids into hydrophobic and polar (hydrophilic) amino acids, collapsing amino acids with hydrophobic side chains together and doing the same for polar amino acids. 
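To make the collapsing concrete, here is a minimal Python sketch; it is not sourmash's own code. The `encode` helper and the single-letter group labels are hypothetical. The Dayhoff groups are the six classic classes, and the hydrophobic/polar split is one common assignment that is assumed here rather than taken from sourmash.

```python
# Illustrative only: reduced amino acid alphabets like those described above.
# The group labels ("a".."f", "h", "p") are arbitrary placeholders.
DAYHOFF_GROUPS = {
    "C": "a",
    "AGPST": "b",
    "DENQ": "c",
    "HKR": "d",
    "ILMV": "e",
    "FWY": "f",
}

# Assumed hydrophobic/polar split; other sources may place some residues differently.
HP_GROUPS = {
    "AFGILMPVWY": "h",
    "CDEHKNQRST": "p",
}


def encode(protein, groups):
    """Replace every amino acid with the label of the group that contains it."""
    table = {aa: label for members, label in groups.items() for aa in members}
    return "".join(table[aa] for aa in protein.upper())


print(encode("MKV", DAYHOFF_GROUPS))
print(encode("LRI", DAYHOFF_GROUPS))
print(encode("MKV", HP_GROUPS))
```

In this sketch `MKV` and `LRI` differ at every position but encode to the same Dayhoff string, which suggests why k-mers in a reduced alphabet can still match between more divergent proteins.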
+ +We are still in the process of benchmarking these encodings; ask [on the issue tracker](https://github.com/dib-lab/sourmash/issues) if you are interested in updates. + +### Parameter strings + +The `-p` argument to `sourmash sketch` provides parameter strings to sourmash, and these control what signatures and sketches are calculated and output. Zero or more parameter strings can be given to sourmash. Each parameter string produces at least one sketch. + +A parameter string is a space-delimited collection that can contain one or more fields, comma-separated. +* `k=` - compute a sketch at this k-mer size; can provide more than one time in a parameter string. Typically `ksize` is between 4 and 100. +* `scaled=` - create a scaled MinHash with k-mers sampled deterministically at 1 per `` value. This controls sketch compression rates and resolution; for example, a 5 Mbp genome sketched with a scaled of 1000 would yield approximately 5,000 k-mers. `scaled` is incompatible with `num`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-compute-them) for more information. +* `num=` - create a standard MinHash with no more than `` k-mers kept. This will produce sketches identical to [mash sketches](https://mash.readthedocs.io/en/latest/). `num` is incompatible with `scaled`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-compute-them) for more information. +* `abund` / `noabund` - create abundance-weighted (or not) sketches. See [Classify signatures: Abundance Weighting](classifying-signatures.md#abundance-weighting) for details of how this works. +* `dna`, `protein`, `dayhoff`, `hp` - create this kind of sketch. Note that `sourmash sketch dna -p protein` and `sourmash sketch protein -p dna` are invalid; please use `sourmash sketch translate` for the former. 
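The field rules can be sketched in a few lines of Python. This is a hypothetical illustration, not sourmash's parser: `expand_param_string` is an invented name, only the `k`, `scaled`, `num`, `abund`, and `noabund` fields are handled, and the DNA defaults (`k=31`, `scaled=1000`, `noabund`) are assumed when a field is missing. Repeated `k=` fields each produce a sketch, while a later value of any other field replaces an earlier one.

```python
def expand_param_string(param_string):
    """Hypothetical helper: expand one -p string into one dict per k-mer size."""
    ksizes = []
    scaled = None
    num = None
    abund = False
    for field in param_string.split(","):
        field = field.strip()
        if field.startswith("k="):
            ksizes.append(int(field[2:]))         # every k= adds another sketch
        elif field.startswith("scaled="):
            scaled = int(field[len("scaled="):])  # a later value replaces an earlier one
        elif field.startswith("num="):
            num = int(field[len("num="):])
        elif field == "abund":
            abund = True
        elif field == "noabund":
            abund = False
        else:
            raise ValueError(f"unhandled field: {field!r}")
    if scaled is not None and num is not None:
        raise ValueError("scaled and num cannot be combined")
    if not ksizes:
        ksizes = [31]      # assumed DNA default
    if scaled is None and num is None:
        scaled = 1000      # assumed DNA default
    return [{"k": k, "scaled": scaled, "num": num, "abund": abund} for k in ksizes]


for sketch in expand_param_string("k=21,k=31,k=51,abund"):
    print(sketch)
```

Under these assumptions, `abund` alone expands to a single sketch with `k=31` and `scaled=1000`, and `k=21,k=31,k=51,abund` expands to three abundance-weighted sketches.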
+ +For all field names but `k`, if multiple fields in a parameter string are provided, the last one encountered overrides the previous values. For `k`, if multiple ksizes are specified in a single parameter string, sketches for all ksizes specified are computed. + +If a field isn't specified, then the default value for that sketch type is used; so, for example, `sourmash sketch dna -p abund` would calculate a sketch with `k=31,scaled=1000,abund`. See below for the defaults. + +### Default parameters + +The default parameters for sketches are as follows: + +* dna: `k=31,scaled=1000,noabund` +* protein: `k=10,scaled=200,noabund` +* dayhoff: `k=16,scaled=200,noabund` +* hp: `k=42,scaled=200,noabund` + +These were chosen by a committee of PhDs as being good defaults for an initial analysis, so, beware :). + +More seriously, the DNA parameters were chosen based on the analyses done by Koslicki and Falush in [MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation](https://msystems.asm.org/content/1/3/e00020-16). + +The protein, dayhoff, and hp parameters were selected based on unpublished research results and/or magic formulas. We are working on publishing the results! Please ask on the [issue tracker](https://github.com/dib-lab/sourmash/issues) if you are curious. + +### More complex parameter string examples + +Below are some more complicated `sourmash sketch` command lines: + +* `sourmash sketch dna -p k=51` - default to a scaled=1000 and noabund for a k-mer size of 51 (based on moltype/command) +* `sourmash sketch dna -p k=31,k=51,k=21` - compute multiple ksizes, using the defaults otherwise +* `sourmash sketch translate -p k=20,num=500,protein -p k=19,num=400,dayhoff,abund -p k=30,scaled=200,hp` - compute multiple ksizes, moltypes, and scaled/num. + +### Locations for output files + +Signature files can contain multiple signatures and sketches.
Use `sourmash sig describe` to get details on the contents of a file. + +You can use `-o ` to specify a file output location for all the output signatures; `-o -` means stdout. This does not merge signatures unless `--merge` is provided. + +Specify `--outdir` to put all the signatures in a specific directory. + +### Downsampling and flattening signatures + +Calculating signatures is probably the most time consuming part of using sourmash, and it is the only part that requires access to the raw data. Moreover, the output signatures are generally much smaller than the input data. So, we generally suggest calculating a large set of signatures once. + +To support this, sourmash can do two kinds of signature conversion without going back to the raw data. + +First, you can downsample `num` and `scaled` signatures using `sourmash sig downsample`. For any sketch calculated with `num` parameter, you can decrease that `num`. And, for any `scaled` parameter, you can increase the `scaled`. This will decrease the size of the sketch accordingly; for example, going from a num of 5000 to a num of 1000 will decrease the sketch size by a factor of 5, and going from a scaled of 1000 to a scaled of 10000 will decrease the sketch size by a factor of 10. + +(Note that decreasing num or increasing scaled will increase calculation speed and lower the accuracy of your results.) + +Second, you can flatten abundances using `sourmash sig flatten`. For any sketch calculated with `abund`, you can convert it to a `noabund` sketch. This will decrease the sketch size, although not necessarily by a lot. + +Unfortunately, changing the k-mer size or using different DNA/protein encodings cannot be done on a sketch, and you need to calculate new signatures from the raw data for that. + +### Examining the output of `sourmash sketch` + +You can use `sourmash sig describe` to get detailed information about the contents of a signature file. 
This can help if you want to see exactly what a particular `sourmash sketch` command does! + +### Filing issues and asking for help + +We try to provide good documentation and error messages, but may not succeed in answering all your questions! So we're happy to help out! + +Please post questions [on the sourmash issue tracker](https://github.com/dib-lab/sourmash/issues). If you find something confusing or buggy about the documentation or about sourmash, we'd love to fix it -- for you *and* for everyone else! From 32ebec721a097b1132933ec8f1f25c6a5630b751 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 7 Feb 2021 18:41:40 -0800 Subject: [PATCH 07/24] substantial update for API examples --- doc/api-example.md | 184 ++++++++++++++++++++++++++++++++++++++------- 1 file changed, 155 insertions(+), 29 deletions(-) diff --git a/doc/api-example.md b/doc/api-example.md index 03b7b4d7cb..e5af76a6ee 100644 --- a/doc/api-example.md +++ b/doc/api-example.md @@ -1,7 +1,10 @@ -# `sourmash` API examples +# `sourmash` Python API examples -## A first example: two k-mers +All of sourmash's functionality is available via its [Python API](api.md). Below are both basic and advanced examples that use the API to accomplish common tasks. + +[toc] +## A first example: two k-mers Define two sequences: @@ -42,7 +45,7 @@ and of course the MinHashes match themselves: ``` -We can add sequences and query at any time -- +We can add sequences to the MinHash objects and query at any time -- ``` >>> mh1.add_sequence(seq2) @@ -52,8 +55,63 @@ We can add sequences and query at any time -- ``` -## Consuming files +## Set operations on hashes + +All of the hashes are available via the `hashes` property: + +``` +>>> list(mh1.hashes) +[1274996984489324440, 2529443451610975987, 3115010115530738562, 5059920851104263793, 5740495330885152257, 8652222673649005300, 18398176440806921933] + +``` + +and you can easily do your own set operations with `.hashes` - e.g.
+the following calculates the Jaccard similarity (intersection over union) of two MinHash objects: +``` +>>> s1 = set(mh1.hashes) +>>> s2 = set(mh2.hashes) +>>> round(len(s1 & s2) / len(s1 | s2), 3) +0.571 + +``` +However, the MinHash class also supports a number of basic operations - the following operations work directly on the hashes: +``` +>>> combined = mh1 + mh2 +>>> combined += mh1 +>>> combined.remove_many(mh1.hashes) +>>> combined.add_many(mh2.hashes) + +``` + +You can create an empty copy of a MinHash object with `copy_and_clear`: +``` +>>> new_mh = mh1.copy_and_clear() + +``` +and you can also access the various parameters of a MinHash object directly as properties -- +``` +>>> mh1.ksize +3 +>>> mh1.scaled +0 +>>> mh1.num +20 +>>> mh1.is_dna +True +>>> mh1.is_protein +False +>>> mh1.dayhoff +False +>>> mh1.hp +False +>>> mh1.moltype +'DNA' + +``` +see the "Advanced" section, below, for a more complete discussion of MinHash objects. + +## Creating MinHash sketches programmatically, from genome files Suppose we want to create MinHash sketches from genomes -- @@ -73,7 +131,7 @@ into `add_sequence` directly; here we set `force=True` in `add_sequence` to skip over k-mers containing characters other than ACTG, rather than raising an exception. -(Note, just for speed reasons, we'll truncate the sequences to 50kb in length.) +(Note, just for speed reasons, we're truncating the sequences to 50kb in length.) ``` >>> import screed @@ -86,7 +144,7 @@ raising an exception. ``` -And now the minhashes can be compared against each other: +And now the resulting MinHash objects can be compared against each other: ``` >>> import sys @@ -103,7 +161,7 @@ data/GCF_000783305.1 0.0 0.0 1.0 ``` Note that the comparisons are quite quick; most of the time is spent in -making the minhashes, which can be saved and loaded easily. +building the minhashes.
## Plotting dendrograms and matrices @@ -114,7 +172,7 @@ please see the notebook ## Saving and loading signature files Signature files encapsulate MinHashes in JSON, and provide a way to -add some metadata to MinHashes. +wrap MinHash objects with some metadata (the name and filename). To save signatures, use `save_signatures` with a list of signatures and a Python file pointer: ``` >>> from sourmash import SourmashSignature, save_signatures @@ -127,7 +185,7 @@ add some metadata to MinHashes. ``` Here, `genome1.sig` is a JSON file that can now be loaded and -compared -- first, load: +compared -- first, load it using `load_one_signature`: ``` >>> from sourmash import load_one_signature @@ -145,9 +203,24 @@ then compare: ``` -## Manipulating signatures and their hashes. +There are two primary signature loading functions - `load_one_signature`, used above, which loads exactly one signature or else raises an exception; and the powerful and more generic `load_file_as_signatures`, which takes in a filename or directory containing a collection of signatures and returns the individual signatures -- for example, you can load all of the signatures under the `tempdir` created above like so, -It is relatively straightforward to work directly with hashes. +``` +>>> loaded_sigs = list(sourmash.load_file_as_signatures(tempdir)) + +``` + +Both `load_file_as_signatures` and `load_one_signature` take molecule type and k-mer size selectors, e.g. +``` +>>> loaded_sigs = load_one_signature(tempdir + '/genome1.sig', select_moltype='DNA', ksize=31) + +``` +will load precisely one signature containing a DNA MinHash created at k-mer size of 31. + +## Going from signatures back to MinHash objects and their hashes - + +Once you load a signature, you can go back to its MinHash object with +`.minhash`; e.g. First, load two signatures: @@ -165,12 +238,12 @@ Then, get the hashes, and (e.g.) 
calculate the union: >>> hashes1 = set(sig1.minhash.hashes.keys()) >>> hashes2 = set(sig2.minhash.hashes.keys()) >>> hash_union = hashes1.union(hashes2) ->>> print('{} hashes in union of {} and {}'.format(len(hash_union), len(hashes1), len(hashes2))) +>>> print(f'{len(hash_union)} hashes in union of {len(hashes1)} and {len(hashes2)}') 1000 hashes in union of 500 and 500 ``` -## sourmash MinHash objects and manipulations +## Advanced features of sourmash MinHash objects - `scaled` and `num` sourmash supports two basic kinds of signatures, MinHash and modulo hash signatures. MinHash signatures are equivalent to mash signatures; @@ -186,9 +259,7 @@ be collected for a given input data set. Because of this parameter, below we'll call them 'num' signatures. Modulo hash (or 'scaled') signatures are specific to sourmash and they -enable an expanded range of metagenome analyses, with the downside -that they can become arbitrarily large. The key parameter for modulo -hash signatures is `scaled`, which specifies the average sampling rate +enable containment operations that are useful for metagenome analyses. The tradeoff is that unlike num MinHashes, they can become arbitrarily large. The key parameter for modulo hash signatures is `scaled`, which specifies the average sampling rate for hashes for a given input data set. A scaled factor of 1000 means that, on average, 1 in 1000 k-mers will be turned into a hash for later comparisons; this is a sort of compression factor, in that a 5 Mbp @@ -222,7 +293,7 @@ looking at the `num` and `scaled` attributes on a MinHash object: The MinHash class is otherwise identical between the two types of signatures. 
-Note that you cannot calculate Jaccard similarity or containment for +You cannot calculate Jaccard similarity or containment for MinHash objects with different num or scaled values (or different ksizes): ``` @@ -234,7 +305,7 @@ TypeError: must have same num: 500 != 1000 ``` -You can make signatures compatible by downsampling; see the next +However, you can make signatures compatible by downsampling; see the next sections. ### A brief introduction to MinHash object methods and attributes @@ -380,24 +451,34 @@ you.* (You can also take a look at the logic in `sourmash signature downsample` if you are interested.) -## Working with fast search trees (Sequence Bloom Trees, or SBTs) +## Working with indexed collections of signatures + +If you want to search large collections of signatures, sourmash provides +two different indexing strategies, together with a generic `Index` class +that supports a common API for searching the collections. -Suppose we create some `scaled` signatures: +The first indexing strategy is a Sequence Bloom Tree, which is +designed to support fast and efficient containment operations on large +collections of signatures. SBTs are an _on disk_ search structure, so +they are a low-memory way to search collections. + +To use SBTs from the command line, we first +need to create some `scaled` signatures: ``` sourmash sketch dna -p scaled=10000 data/GCF*.fna.gz --outdir data/ ``` -and we want to create a Sequence Bloom Tree (SBT) so that we can -search them efficiently. You can do this with `sourmash index`, +and then build a Sequence Bloom Tree (SBT) index with `sourmash +index`, like so: ``` sourmash index foo.sbt.zip data/GCF*.sig -k 31 ``` -but you can also access the Python API directly. +Here, sourmash is storing the entire SBT in a single portable Zip file. 
-### Creating a search tree +### Creating an on-disk SBT in Python Let's start by using 'glob' to grab some example signatures from the test data in the sourmash repository: @@ -408,11 +489,11 @@ test data in the sourmash repository: ``` -Now, create a tree: +Now, create an SBT: ``` ->>> import sourmash ->>> tree = sourmash.create_sbt_index() +>>> import sourmash.sbtmh +>>> tree = sourmash.sbtmh.create_sbt_index() ``` @@ -428,7 +509,7 @@ Load each signature, and add it to the tree: ``` (note, you'll need to make sure that all of the signatures are compatible with each other! The `sourmash index` command does all of the necessary -checks.) +checks, but the Python API doesn't.) Now, save the tree: @@ -459,7 +540,7 @@ Now, load a DNA sequence: ``` >>> filename = 'data/GCF_000005845.2_ASM584v2_genomic.fna.gz' >>> query_seq = next(iter(screed.open(filename))).sequence ->>> print('got {} DNA characters to query'.format(len(query_seq))) +>>> print(f'got {len(query_seq)} DNA characters to query') got 4641652 DNA characters to query ``` @@ -487,3 +568,48 @@ NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome ``` et voila! + +### In-memory databases: the LCA or "reverse index" database. + +The LCA database lets you work with large collections of signatures in +memory. + +The LCA database was initially designed to support individual hash +queries for taxonomic operations - hence its name, which stands for +"Lowest Common Ancestor." However, it supports all of the standard +`Index` operations, just like the SBT. + +First, let's create an LCA database programmatically. + +``` +>>> from sourmash.lca import LCA_Database +>>> db = LCA_Database(ksize=31, scaled=10000, moltype='DNA') + +``` + +Now, let's load in all of the signatures from the test directory: + +``` +>>> for sig in sourmash.load_file_as_signatures('tests/test-data/doctest-data', ksize=31): +... hashes_inserted = db.insert(sig) +... 
print(f"Inserted {hashes_inserted} hashes into db.") +Inserted 493 hashes into db. +Inserted 525 hashes into db. +Inserted 490 hashes into db. + +``` + +and now you have an `Index` class that supports all the generic index operations (below). You can save an LCA Database to disk with `db.save(filename)`, and load it with `sourmash.load_file_as_index`, below. + +### The `Index` class API. + +The `Index` class supports a generic API for SBTs, LCAs, and other collections of signatures. + +To load an SBT or an LCA database from a file, use `sourmash.load_file_as_index`: +``` +>>> sbt_db = sourmash.load_file_as_index('tests/test-data/prot/protein.sbt.zip') +>>> lca_db = sourmash.load_file_as_index('tests/test-data/prot/protein.lca.json.gz') + +``` + +`Index` objects provide `search`, `insert`, `load`, `save`, and `__len__`. The signatures can be accessed directly via the `.signatures()` method, which returns an iterable. Last but not least, `Index.select(ksize=..., moltype=...)` will return a view on the Index object that contains only signatures with the desired k-mer size/molecule type. From 9d15cdd00940cb469e3a2e05a970cc955498bdfd Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 7 Feb 2021 18:49:11 -0800 Subject: [PATCH 08/24] add ToC to api-example --- doc/api-example.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/doc/api-example.md b/doc/api-example.md index e5af76a6ee..7adb351c30 100644 --- a/doc/api-example.md +++ b/doc/api-example.md @@ -2,7 +2,9 @@ All of sourmash's functionality is available via its [Python API](api.md). Below are both basic and advanced examples that use the API to accomplish common tasks. -[toc] +```{contents} + :depth: 2 +``` ## A first example: two k-mers From 97f2cd34ba204168338123a930b01b93696a96b2 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Mon, 8 Feb 2021 09:06:21 -0800 Subject: [PATCH 09/24] fix heading for API section --- doc/api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/api.md b/doc/api.md index f14ef10271..738b3e7e20 100644 --- a/doc/api.md +++ b/doc/api.md @@ -31,7 +31,7 @@ its Python API. Please also see [examples of using the API](api-example.md). :undoc-members: ``` -# `sourmash.fig`: make plots and figures +## `sourmash.fig`: make plots and figures ```{eval-rst} .. automodule:: sourmash.fig From 92412c2ec2210e056118ed9cbe640caabf63deac Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 8 Feb 2021 09:53:59 -0800 Subject: [PATCH 10/24] bold API examples link --- doc/api.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/doc/api.md b/doc/api.md index 738b3e7e20..8c91a3533f 100644 --- a/doc/api.md +++ b/doc/api.md @@ -1,7 +1,9 @@ # `sourmash` Python API The primary programmatic way of interacting with `sourmash` is via -its Python API. Please also see [examples of using the API](api-example.md). +its Python API. + +**Please also see [examples of using the API](api-example.md).** ```{contents} :depth: 2 From 391674132e580c5d49f7b1d798395ea161108654 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Mon, 8 Feb 2021 09:54:39 -0800 Subject: [PATCH 11/24] (untested) update of tutorials to use sourmash sketch --- doc/tutorial-basic.md | 12 ++++++------ doc/tutorials-lca.md | 6 +++--- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/doc/tutorial-basic.md b/doc/tutorial-basic.md index 02d6d44589..dd80aadae1 100644 --- a/doc/tutorial-basic.md +++ b/doc/tutorial-basic.md @@ -66,7 +66,7 @@ Compute a scaled signature from our reads: mkdir ~/sourmash cd ~/sourmash -sourmash compute --scaled 10000 ~/data/ecoli_ref*.fastq.gz -o ecoli-reads.sig -k 31 +sourmash sketch dna -p scaled=10000,k=31 ~/data/ecoli_ref*.fastq.gz -o ecoli-reads.sig ``` ## Compare reads to assemblies @@ -76,14 +76,14 @@ Use case: how much of the read content is contained in the reference genome? Build a signature for an E. coli genome: ``` -sourmash compute --scaled 1000 -k 31 ~/data/ecoliMG1655.fa.gz -o ecoli-genome.sig +sourmash sketch dna -p scaled=1000,k=31 ~/data/ecoliMG1655.fa.gz -o ecoli-genome.sig ``` and now evaluate *containment*, that is, what fraction of the read content is contained in the genome: ``` -sourmash search -k 31 ecoli-reads.sig ecoli-genome.sig --containment +sourmash search ecoli-reads.sig ecoli-genome.sig --containment ``` and you should see: @@ -102,7 +102,7 @@ similarity match Try the reverse - why is it bigger? ``` -sourmash search -k 31 ecoli-genome.sig ecoli-reads.sig --containment +sourmash search ecoli-genome.sig ecoli-reads.sig --containment ``` ## Make and search a database quickly. @@ -135,7 +135,7 @@ ls ecoli_many_sigs Let's turn this into an easily-searchable database with `sourmash index` -- ``` -sourmash index -k 31 ecolidb ecoli_many_sigs/*.sig +sourmash index ecolidb ecoli_many_sigs/*.sig ``` and now we can search!
@@ -213,7 +213,7 @@ curl -L -o genbank-k31.lca.json.gz https://osf.io/4f8n3/download Next, run the 'gather' command to see what's in your ecoli genome -- ``` -sourmash gather -k 31 ecoli-genome.sig genbank-k31.lca.json.gz +sourmash gather ecoli-genome.sig genbank-k31.lca.json.gz ``` and you should get: diff --git a/doc/tutorials-lca.md b/doc/tutorials-lca.md index 62e2e30e73..384a6a2b9e 100644 --- a/doc/tutorials-lca.md +++ b/doc/tutorials-lca.md @@ -66,9 +66,9 @@ Download a random genome from genbank: curl -L -o some-genome.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/178/875/GCF_000178875.2_ASM17887v2/GCF_000178875.2_ASM17887v2_genomic.fna.gz ``` -Compute a signature for this genome: +Create a signature for this genome: ``` -sourmash compute -k 31 --scaled=1000 --name-from-first some-genome.fa.gz +sourmash sketch dna -p scaled=1000,k=31 --name-from-first some-genome.fa.gz ``` Now, classify the signature with sourmash `lca classify`, @@ -119,7 +119,7 @@ on the command line; separate them with `--db` or `--query`. (This is an abbreviated version of [this blog post](http://ivory.idyll.org/blog/2017-classify-genome-bins-with-custom-db-try-again.html), updated to use the `sourmash lca` commands.) -Download some pre-computed signatures: +Download some pre-calculated signatures: ``` curl -L https://osf.io/bw8d7/download?version=1 -o delmont-subsample-sigs.tar.gz From 5e1f92bf670bce5e155baf8ef4d4063ac606536a Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 8 Feb 2021 09:54:46 -0800 Subject: [PATCH 12/24] update link targets --- doc/release-notes/sourmash-2.0.md | 2 +- doc/sourmash-sketch.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/release-notes/sourmash-2.0.md b/doc/release-notes/sourmash-2.0.md index c7d57c69b5..8c85bda3ba 100644 --- a/doc/release-notes/sourmash-2.0.md +++ b/doc/release-notes/sourmash-2.0.md @@ -23,7 +23,7 @@ This is a list of substantial new features and functionality in sourmash 2.0.
* Created [precomputed databases](../databases.md) for most of GenBank genomes. * Added taxonomic reporting functionality in the `sourmash lca` submodule - [see command-line docs](../command-line.md#sourmash-lca-subcommands-for-taxonomic-classification). * Added signature manipulation utilities in the `sourmash signature` submodule - [see command-line docs](../command-line.md#sourmash-signature-subcommands-for-signature-manipulation) -* Introduced new modulo hash or "scaled" signatures for containment analysis; see [Using sourmash: a practical guide](../using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-compute-them) and [more details in the Python API examples](../api-example.md#sourmash-minhash-objects-and-manipulations). +* Introduced new modulo hash or "scaled" signatures for containment analysis; see [Using sourmash: a practical guide](../using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them) and [more details in the Python API examples](../api-example.md#advanced-features-of-sourmash-minhash-objects-scaled-and-num). * Switched to using JSON instead of YAML for signatures. * Many performance optimizations! * Many more tests! diff --git a/doc/sourmash-sketch.md b/doc/sourmash-sketch.md index 1d31a2e6b9..7b4dbcfdb7 100644 --- a/doc/sourmash-sketch.md +++ b/doc/sourmash-sketch.md @@ -82,8 +82,8 @@ The `-p` argument to `sourmash sketch` provides parameter strings to sourmash, a A parameter string is a space-delimited collection that can contain one or more fields, comma-separated. * `k=` - compute a sketch at this k-mer size; can provide more than one time in a parameter string. Typically `ksize` is between 4 and 100. -* `scaled=` - create a scaled MinHash with k-mers sampled deterministically at 1 per `` value. This controls sketch compression rates and resolution; for example, a 5 Mbp genome sketched with a scaled of 1000 would yield approximately 5,000 k-mers. `scaled` is incompatible with `num`. 
See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-compute-them) for more information. -* `num=` - create a standard MinHash with no more than `` k-mers kept. This will produce sketches identical to [mash sketches](https://mash.readthedocs.io/en/latest/). `num` is incompatible with `scaled`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-compute-them) for more information. +* `scaled=` - create a scaled MinHash with k-mers sampled deterministically at 1 per `` value. This controls sketch compression rates and resolution; for example, a 5 Mbp genome sketched with a scaled of 1000 would yield approximately 5,000 k-mers. `scaled` is incompatible with `num`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them) for more information. +* `num=` - create a standard MinHash with no more than `` k-mers kept. This will produce sketches identical to [mash sketches](https://mash.readthedocs.io/en/latest/). `num` is incompatible with `scaled`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them) for more information. * `abund` / `noabund` - create abundance-weighted (or not) sketches. See [Classify signatures: Abundance Weighting](classifying-signatures.md#abundance-weighting) for details of how this works. * `dna`, `protein`, `dayhoff`, `hp` - create this kind of sketch. Note that `sourmash sketch dna -p protein` and `sourmash sketch protein -p dna` are invalid; please use `sourmash sketch translate` for the former. From 40762f3010cdc1b394ec8ef62990eb33765b59d5 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Mon, 8 Feb 2021 11:43:38 -0800 Subject: [PATCH 13/24] updates of indexed databases --- doc/command-line.md | 36 +++++++++++++++++++-- doc/using-sourmash-a-guide.md | 61 +++++++++++++++++++++++++++++++++++ 2 files changed, 95 insertions(+), 2 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index 22171783e2..9e7c4d91b1 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -61,8 +61,8 @@ Matrix: To get a list of subcommands, run `sourmash` without any arguments. -There are five main subcommands: `sketch`, `compare`, `plot`, -`search`, and `gather`. See [the tutorial](tutorials.md) for a +There are six main subcommands: `sketch`, `compare`, `plot`, +`search`, `gather`, and `index`. See [the tutorial](tutorials.md) for a walkthrough of these commands. * `sketch` creates signatures. @@ -70,6 +70,7 @@ walkthrough of these commands. * `plot` plots distance matrices created by `compare`. * `search` finds matches to a query signature in a collection of signatures. * `gather` finds the best reference genomes for a metagenome, using the provided collection of signatures +* `index` builds a fast index for many (thousands of) signatures. There are also a number of commands that work with taxonomic information; these are grouped under the `sourmash lca` @@ -288,6 +289,37 @@ genomes with no (or incomplete) taxonomic information. Use `sourmash lca summarize` to classify a metagenome using a collection of genomes with taxonomic information. +### `sourmash index` - build an SBT index of signatures + +The `sourmash index` command creates a Zipped SBT database +(`.sbt.zip`) from a collection of signatures. This can be used to +create databases from private collections of genomes, and can also be +used to create databases for e.g. subsets of GenBank. + +These databases support fast search and gather on large collections +of signatures in low memory.
+ +SBTs can only be created on scaled signatures, and all signatures in +an SBT must be of compatible types (i.e. the same k-mer size and +molecule type). You can specify the usual command line selectors +(`-k`, `--scaled`, `--dna`, `--protein`, etc.) to pick out the types +of signatures to include. + +Usage: +``` +sourmash index database [ list of input signatures/directories/databases ] +``` + +This will create a `database.sbt.zip` file containing the SBT of the +input signatures. You can create an "unpacked" version by specifying +`database.sbt.json` and it will create the JSON file as well as a +subdirectory of files under `.sbt.database`. + +Note that you can use `--from-file` to pass `index` a text file +containing a list of files to index; you can also provide individual +signature files, directories full of signatures, or other sourmash +databases. + ## `sourmash lca` subcommands for taxonomic classification These commands use LCA databases (created with `lca index`, below, or diff --git a/doc/using-sourmash-a-guide.md b/doc/using-sourmash-a-guide.md index 99585a5a37..757d3d7654 100644 --- a/doc/using-sourmash-a-guide.md +++ b/doc/using-sourmash-a-guide.md @@ -1,5 +1,9 @@ # Using sourmash: a practical guide +```{contents} + :depth: 2 +``` + So! You've installed sourmash, run a few of the tutorials and commands, and now you actually want to *use* it. This guide is here to answer some of your questions, and explain why we can't answer others. @@ -145,3 +149,60 @@ names them based on their FASTA headers, and places them all in a single `.sig` file, `file.fa.sig`. (This behavior is triggered by the option `--singleton`, which tells sourmash to treat each individual sequence in the file as an independent sequence.) + +## How do I store and search collections of signatures? + +sourmash supports a variety of signature loading and storage options for +flexibility. 
If you have only a few hundred signatures, here are some +options: + +* you can put all your signature files in a directory and search them all + using the path to the directory. +* you can use `sourmash sig cat` to concatenate multiple signatures into a + single file. +* you can compress any signature file using `gzip` and sourmash will + load it. + +If you have more than a few hundred genome signatures that you +regularly search, it might be worth creating an indexed database of +them that will support faster searches. + +sourmash supports two types of indexed databases: Sequence Bloom +Trees, or SBTs; and reverse indices, or LCAs. (You can read more +detail about their implementation and design considerations +[in Chapter 2 of Dr. Luiz Irber's thesis, "Efficient indexing of collections of signatures"](https://github.com/luizirber/phd/releases/download/2020.09.28/thesis.pdf).) + +### Sequence Bloom Tree (SBT) indexed databases + +Sequence Bloom Trees (SBTs) (see +[Solomon and Kingsford, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804353/)) +are on-disk databases that support low-memory queries of tens to hundreds of +thousands of signatures. They can be created using `sourmash index`. + +SBTs are the lowest-memory way to run search or gather on a collection +of signatures. The tradeoff is that they may be quite large on disk, +because SBTs also contain intermediate nodes in the tree. The default +way to store SBTs is in a Zip file, named `.sbt.zip`, that can be +built and searched directly from the command line. + +### Reverse indexed (LCA) databases + +Reverse indexed or LCA databases are *in-memory* databases that, once +loaded from disk, support fast search and gather across tens of thousands +of signatures. They can be created using `sourmash lca index` ([docs](command-line.md#sourmash-lca-index-build-an-lca-database)). + +LCA databases are currently stored in JSON files (that can be gzipped).
+As these files get larger, the time required to load them from disk +can be substantial. + +LCA databases are also currently (sourmash 2.0-4.0) the only databases +that support the inclusion of taxonomic information in the database, +and there is an associated collection of commands +[under `sourmash lca`](command.md#sourmash-lca-subcommands-for-taxonomic-classification). +However, they can also be used as regular indexed databases for search +and gather as above. + +(These are called "LCA databases" because they originally were created +to support "lowest common ancestor" taxonomic analyses, e.g. like +Kraken; their functionality has evolved a lot since, but their name +hasn't changed to match!) From b1d988aaf898a127af674fe02b8c8030c2749479 Mon Sep 17 00:00:00 2001 From: Taylor Reiter Date: Tue, 9 Feb 2021 05:40:18 -0800 Subject: [PATCH 14/24] typos in versioning (#1314) --- doc/support.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/support.md b/doc/support.md index ce77d9090c..271f0f228c 100644 --- a/doc/support.md +++ b/doc/support.md @@ -80,13 +80,13 @@ Release notes for minor and patch versions are available on the sourmash v3.x supports Python 2.7 as well as Python 3.x, through Python 3.8. -sourmash v4.0 dropped support for version of Python before Python 3.7, +sourmash v4.0 dropped support for versions of Python before Python 3.7, and our intent is that it will support as-yet unreleased versions of Python 3.x (e.g. 3.9) moving forward. For future versions of sourmash, we plan to follow the [Numpy NEP 29](https://numpy.org/neps/nep-0029-deprecation_policy.html) -proposal for Python version support in the future. For example, this +proposal for Python version support. For example, this would mean that we would drop support for Python 3.7 on December 26, 2021. From 9da9c189a0e71e31489e48f0f8135ff07c13b3b8 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Tue, 9 Feb 2021 05:44:00 -0800 Subject: [PATCH 15/24] Apply suggestions from code review Co-authored-by: Taylor Reiter --- doc/sourmash-sketch.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/sourmash-sketch.md b/doc/sourmash-sketch.md index 7b4dbcfdb7..d2c8923946 100644 --- a/doc/sourmash-sketch.md +++ b/doc/sourmash-sketch.md @@ -66,7 +66,7 @@ By default, `sourmash sketch` will produce signatures for each input *file*. If If you specify `--singleton`, `sourmash sketch` will produce signatures for each *record*. -If you specify `--merge `, sourmash sketch will produce signatures for all input files combined into one. +If you specify `--merge `, sourmash sketch will produce signatures for all input files and combine them into one signature. The output signature(s) will be saved in locations that depend on your input parameters. By default, `sourmash sketch` will put the signatures in the current directory, in a file named for the input file with a `.sig` suffix. If you specify `-o`, all of the signatures will be placed in that file. @@ -81,7 +81,7 @@ We are still in the process of benchmarking these encodings; ask [on the issue t The `-p` argument to `sourmash sketch` provides parameter strings to sourmash, and these control what signatures and sketches are calculated and output. Zero or more parameter strings can be given to sourmash. Each parameter string produces at least one sketch. A parameter string is a space-delimited collection that can contain one or more fields, comma-separated. -* `k=` - compute a sketch at this k-mer size; can provide more than one time in a parameter string. Typically `ksize` is between 4 and 100. +* `k=` - create a sketch at this k-mer size; can provide more than one time in a parameter string. Typically `ksize` is between 4 and 100. * `scaled=` - create a scaled MinHash with k-mers sampled deterministically at 1 per `` value. 
This controls sketch compression rates and resolution; for example, a 5 Mbp genome sketched with a scaled of 1000 would yield approximately 5,000 k-mers. `scaled` is incompatible with `num`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them) for more information. * `num=` - create a standard MinHash with no more than `` k-mers kept. This will produce sketches identical to [mash sketches](https://mash.readthedocs.io/en/latest/). `num` is incompatible with `scaled`. See [our guide to signature resolution](using-sourmash-a-guide.md#what-resolution-should-my-signatures-be-how-should-i-create-them) for more information. * `abund` / `noabund` - create abundance-weighted (or not) sketches. See [Classify signatures: Abundance Weighting](classifying-signatures.md#abundance-weighting) for details of how this works. @@ -124,17 +124,17 @@ Specify `--outdir` to put all the signatures in a specific directory. ### Downsampling and flattening signatures -Calculating signatures is probably the most time consuming part of using sourmash, and it is the only part that requires access to the raw data. Moreover, the output signatures are generally much smaller than the input data. So, we generally suggest calculating a large set of signatures once. +Creating signatures is probably the most time consuming part of using sourmash, and it is the only part that requires access to the raw data. Moreover, the output signatures are generally much smaller than the input data. So, we generally suggest creating a large set of signatures once. To support this, sourmash can do two kinds of signature conversion without going back to the raw data. -First, you can downsample `num` and `scaled` signatures using `sourmash sig downsample`. For any sketch calculated with `num` parameter, you can decrease that `num`. And, for any `scaled` parameter, you can increase the `scaled`. 
This will decrease the size of the sketch accordingly; for example, going from a num of 5000 to a num of 1000 will decrease the sketch size by a factor of 5, and going from a scaled of 1000 to a scaled of 10000 will decrease the sketch size by a factor of 10. +First, you can downsample `num` and `scaled` signatures using `sourmash sig downsample`. For any sketch created with `num` parameter, you can decrease that `num`. And, for any `scaled` parameter, you can increase the `scaled`. This will decrease the size of the sketch accordingly; for example, going from a `num` of 5000 to a `num` of 1000 will decrease the sketch size by a factor of 5, and going from a `scaled` of 1000 to a `scaled` of 10000 will decrease the sketch size by a factor of 10. -(Note that decreasing num or increasing scaled will increase calculation speed and lower the accuracy of your results.) +(Note that decreasing `num` or increasing `scaled` will increase calculation speed and lower the accuracy of your results.) -Second, you can flatten abundances using `sourmash sig flatten`. For any sketch calculated with `abund`, you can convert it to a `noabund` sketch. This will decrease the sketch size, although not necessarily by a lot. +Second, you can flatten abundances using `sourmash sig flatten`. For any sketch created with `abund`, you can convert it to a `noabund` sketch. This will decrease the sketch size, although not necessarily by a lot. -Unfortunately, changing the k-mer size or using different DNA/protein encodings cannot be done on a sketch, and you need to calculate new signatures from the raw data for that. +Unfortunately, changing the k-mer size or using different DNA/protein encodings cannot be done on a sketch, and you need to create new signatures from the raw data for that. ### Examining the output of `sourmash sketch` From bf9e950693577c4e378642e88a29ad6d7887df34 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Tue, 9 Feb 2021 05:45:34 -0800 Subject: [PATCH 16/24] Update doc/api-example.md Co-authored-by: Taylor Reiter --- doc/api-example.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/api-example.md b/doc/api-example.md index 7adb351c30..ba82054437 100644 --- a/doc/api-example.md +++ b/doc/api-example.md @@ -146,7 +146,7 @@ raising an exception. ``` -And now the result MinHash objects can be compared against each other: +And now the resulting MinHash objects can be compared against each other: ``` >>> import sys From ab8656d97a6b3a3f61ea961326679a34fd7ccfc8 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 9 Feb 2021 05:51:01 -0800 Subject: [PATCH 17/24] updated with suggestions from @taylorreiter doc review --- README.md | 4 ++-- doc/classifying-signatures.md | 4 +--- doc/command-line.md | 11 ++++++++++- doc/tutorial-basic.md | 2 +- 4 files changed, 14 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 821887f08f..9b9382ae2e 100644 --- a/README.md +++ b/README.md @@ -15,8 +15,8 @@ Quickly search, compare, and analyze genomic and metagenomic data sets. Usage: sourmash sketch dna *.fq.gz - sourmash compare *.sig -o distances -k 31 - sourmash plot distances + sourmash compare *.sig -o distances.cmp -k 31 + sourmash plot distances.cmp sourmash 1.0 is [published on JOSS](https://doi.org/10.21105/joss.00027); please cite that paper if you use sourmash (`doi: 10.21105/joss.00027`):. diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index 279670788e..787c2e2d32 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -153,9 +153,7 @@ for use in clustering. For more information on the value of this kind of comparison for metagenomics, please see the simka paper, [Multiple comparative metagenomics using multiset k-mer counting](https://peerj.com/articles/cs-94/), -Benoit et al., 2016. 
Initial comparisons of metagenome similarity -approximations calculated with sourmash to the output of simka suggest a -significant correlation. +Benoit et al., 2016. **Implementation note:** Angular similarity searches cannot be done on SBT or LCA databases currently; you have to provide lists of signature diff --git a/doc/command-line.md b/doc/command-line.md index 9e7c4d91b1..b0af38dd50 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -115,7 +115,16 @@ The `sketch protein` command reads in **protein sequences** and outputs **protei The `sketch translate` command reads in **DNA sequences**, translates them in all six frames, and outputs **protein sketches**. -Please see [the `sourmash sketch` documentation page](sourmash-sketch.md) for details! +`sourmash sketch` takes FASTA or FASTQ sequences as input, and they can be +uncompressed, compressed with gzip, or compressed with bzip2. The output +will be one or more JSON signature files that can be used with the other +sourmash commands. + +Please see +[the `sourmash sketch` documentation page](sourmash-sketch.md) for +details on `sketch`, and see +[Using sourmash: a practical guide](using-sourmash-a-guide.md) for +more information on creating signatures. ### `sourmash compute` - make sourmash signatures from sequence data diff --git a/doc/tutorial-basic.md b/doc/tutorial-basic.md index dd80aadae1..f47d75c29e 100644 --- a/doc/tutorial-basic.md +++ b/doc/tutorial-basic.md @@ -83,7 +83,7 @@ and now evaluate *containment*, that is, what fraction of the read content is contained in the genome: ``` -sourmash search 31 ecoli-reads.sig ecoli-genome.sig --containment +sourmash search ecoli-reads.sig ecoli-genome.sig --containment ``` and you should see: From dda99fece734279cb3ba5b28f3f5f17b86dea218 Mon Sep 17 00:00:00 2001 From: "C. 
Titus Brown" Date: Tue, 9 Feb 2021 06:30:42 -0800 Subject: [PATCH 18/24] added section on sketch naming --- doc/sourmash-sketch.md | 36 ++++++++++++++++++++++++++++++------ 1 file changed, 30 insertions(+), 6 deletions(-) diff --git a/doc/sourmash-sketch.md b/doc/sourmash-sketch.md index d2c8923946..6f6d220192 100644 --- a/doc/sourmash-sketch.md +++ b/doc/sourmash-sketch.md @@ -20,7 +20,7 @@ The `sketch translate` command reads in **DNA sequences**, translates them in al ### DNA sketches for genomes and reads -To compute a DNA sketch for a genome, run: +To create a DNA sketch for a genome, run: ``` sourmash sketch dna genome.fna ``` @@ -31,7 +31,7 @@ Sourmash can work with unassembled reads; run ``` sourmash sketch dna -p k=21,k=31,k=51,abund metagenome.fq.gz ``` -to compute three abundance-weighted sketches at k=21, 31, and 51, for the given FASTQ file. +to create three abundance-weighted sketches at k=21, 31, and 51, for the given FASTQ file. ### Protein sketches for genomes and proteomes @@ -87,7 +87,7 @@ A parameter string is a space-delimited collection that can contain one or more * `abund` / `noabund` - create abundance-weighted (or not) sketches. See [Classify signatures: Abundance Weighting](classifying-signatures.md#abundance-weighting) for details of how this works. * `dna`, `protein`, `dayhoff`, `hp` - create this kind of sketch. Note that `sourmash sketch dna -p protein` and `sourmash sketch protein -p dna` are invalid; please use `sourmash sketch translate` for the former. -For all field names but `k`, if multiple fields in a parameter string are provided, the last one encountered overrides the previous values. For `k`, if multiple ksizes are specified a single parameter string, sketches for all ksizes specified are computed. +For all field names but `k`, if multiple fields in a parameter string are provided, the last one encountered overrides the previous values. 
For `k`, if multiple ksizes are specified in a single parameter string, sketches for all ksizes specified are created. If a field isn't specified, then the default value for that sketch type is used; so, for example, `sourmash sketch dna -p abund` would calculate a sketch with `k=31,scaled=1000,abund`. See below for the defaults. @@ -98,7 +98,7 @@ The default parameters for sketches are as follows: * dna: `k=31,scaled=1000,noabund` * protein: `k=10,scaled=200,noabund` * dayhoff: `k=16,scaled=200,noabund` -* hp=`k=42,scaled=200,noabund` +* hp: `k=42,scaled=200,noabund` These were chosen by a committee of PhDs as being good defaults for an initial analysis, so, beware :). @@ -111,8 +111,32 @@ The protein, dayhoff, and hp parameters were selected based on unpublished resea Below are some more complicated `sourmash sketch` command lines: * `sourmash sketch dna -p k=51` - default to a scaled=1000 and noabund for a k-mer size of 51 (based on moltype/command) -* `sourmash sketch dna -p k=31,k=51,k=21` - compute multiple ksizes, using the defaults otherwise -* `sourmash sketch translate -p k=20,num=500,protein -p k=19,num=400,dayhoff,abund -p k=30,scaled=200,hp` - compute multiple ksizes, moltypes, and scaled/num. +* `sourmash sketch dna -p k=31,k=51,k=21` - create one signature with multiple ksizes, using the defaults otherwise +* `sourmash sketch translate -p k=20,num=500,protein -p k=19,num=400,dayhoff,abund -p k=30,scaled=200,hp` - create three signatures with different ksizes, moltypes, and scaled/num. + +### Signature naming + +Signature names are displayed in the output for search, gather, and +compare, and can be specified in a few different ways. + +With default arguments, `sourmash sketch` does not set a name, and the +filename is used in display output. + +You can set a name using `--name`, but this has the side effect of +merging the sequence records before signature creation. 
So, for example, +`sourmash sketch dna genome1.fa genome2.fa --name genome1 -o +genome.sig` would produce one signature after combining `genome1.fa` +and `genome2.fa`. + +The option `--name-from-first` will set the signature name from the +first record header encountered in each file. When used with `--singleton`, +this will name each signature based on the record that it is created from. + +You can examine the signature name using `sourmash sig describe`. + +Individual signature renaming can be done from the command line using +`sourmash sig split` to create individual files for each signature, +and then `sourmash sig rename`. ### Locations for output files From 219e6069d781a08094fc39875a0e45e522e1bb45 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 9 Feb 2021 10:15:17 -0800 Subject: [PATCH 19/24] [WIP] add migration docs and release notes (#1316) * add migration docs and release notes * Update doc/support.md Co-authored-by: Taylor Reiter * Update doc/support.md Co-authored-by: Taylor Reiter * Update doc/release-notes/sourmash-4.0.md Co-authored-by: Taylor Reiter * Update doc/release-notes/sourmash-4.0.md Co-authored-by: Taylor Reiter * Update doc/release-notes/sourmash-4.0.md Co-authored-by: Taylor Reiter * update with last set of changes * add missing line break Co-authored-by: Taylor Reiter --- doc/release-notes/releases.md | 1 + doc/release-notes/sourmash-4.0.md | 71 ++++++++++++++++++++++++++++++ doc/support.md | 72 +++++++++++++++++++++++++------ 3 files changed, 132 insertions(+), 12 deletions(-) create mode 100644 doc/release-notes/sourmash-4.0.md diff --git a/doc/release-notes/releases.md b/doc/release-notes/releases.md index 6d5ae519aa..6a492e1c24 100644 --- a/doc/release-notes/releases.md +++ b/doc/release-notes/releases.md @@ -7,6 +7,7 @@ for detailed release notes for each version! 
```{toctree} :maxdepth: 2 +sourmash-4.0 sourmash-3.0 sourmash-2.0 ``` diff --git a/doc/release-notes/sourmash-4.0.md b/doc/release-notes/sourmash-4.0.md new file mode 100644 index 0000000000..139dd5efef --- /dev/null +++ b/doc/release-notes/sourmash-4.0.md @@ -0,0 +1,71 @@ +# sourmash v4.0 release notes + +```{contents} + :depth: 2 +``` + +We are pleased to announce release 4.0 of sourmash! This release +contains many feature improvements and new functionality, as well as +many breaking changes with sourmash 2.x and 3.x. + +Please see +[our migration guide](../support.md#migrating-from-sourmash-v3-x-to-sourmash-v4-x) +for guidance on updating to sourmash v4, and post questions about +migrating to sourmash 4.0 in the +[sourmash issue tracker](https://github.com/dib-lab/sourmash/issues/new). + +## Major changes for 4.0 + +### New or changed behavior + +* default SBT storage is now .sbt.zip (#1174, #1170) +* add `sourmash sketch` command for creating signatures (#1159) +* protein ksizes in MinHash are now divided by 3, except in `sourmash compute` (#1277) +* refactor MinHash API and implementation: add, iadd, merge, hashes, and max_hash (#1282, #1154, #1139, #1301) +* add HyperLogLog implementation (#1223) +* `SourmashSignature.name` is now a property (not a method): use `str(sig)` instead of `name()` (#1179, #1232) +* `lca summarize` no longer merges all signatures, and uses hash abundance by default (#1175) +* `index` and `lca index` (#1186, #1222) now support `--from-file` and no longer require signature files on the command line +* `--traverse-directory` is now on by default for signature loading behavior (#1178) + +### Feature removal + +* remove Python 2.7 support (& end Python 2 compatibility) (#1145, #1144) +* remove `lca gather` (#1307) +* remove 10x support from `sourmash compute` (#1229) +* remove `dump` command (#1157) + +### Feature/function deprecations +* deprecate `sourmash compute` (#1159) +* deprecate `load_signatures`, `sourmash.load_one_signature`,
`create_sbt_index`, and `load_sbt_index` (#1279, #1304) +* deprecate `import_csv` in favor of new `sourmash sig import --csv` (#1281) + +## Refactoring, improvements, and minor bug fixes: + +* accept file list in `sourmash sig cat` (#1236) +* add unique_intersect_bp and gather_result_rank to gather CSV output (#1219) +* remove deprecated minhash functions (#1149) +* fix Rust panic error in signature creation (#1172) +* cache nodes in SBT during search (#1161) +* fix two bugs in gather `--output-unassigned` (#1156) + +## Documentation updates + +* add information about versioning, migrations, etc to the docs (#1153) +* @CTB MORE GOES HERE + +## Infrastructure and CI changes: + +* update finch requirement from 0.3.0 to 0.4.1 (#1290) +* update rand for test, and activate "js" feature for getrandom (#1275) +* dev updates (configs and doc) (#1298) +* move wheel building from Travis to GitHub Actions (#1295) +* fix new clippy warnings from Rust 1.49 (#1267) +* use tox for running tests locally (#696) +* CI: small build fixes (#1252) +* CI: Fix releases in GitHub Actions (#1250) +* update build_wheel action paths +* CI: moving python tests from travis to GH actions (#1249) +* CI: move wheel building to GitHub actions (#1244) +* remove last .rst file from docs (#1185) +* update CI for latest branch name change (#1150) diff --git a/doc/support.md b/doc/support.md index 271f0f228c..2f5a3bf775 100644 --- a/doc/support.md +++ b/doc/support.md @@ -1,5 +1,9 @@ # Support, Versioning, and Migration +```{contents} + :depth: 2 +``` + ## Asking questions and filing bugs We do our best to support sourmash users! Users have found important @@ -82,7 +86,7 @@ sourmash v3.x supports Python 2.7 as well as Python 3.x, through Python 3.8. sourmash v4.0 dropped support for versions of Python before Python 3.7, and our intent is that it will support as-yet unreleased versions of Python 3.x -(e.g. 3.9) moving forward. +(e.g. 3.10) moving forward. 
For future versions of sourmash, we plan to follow the [Numpy NEP 29](https://numpy.org/neps/nep-0029-deprecation_policy.html) @@ -90,18 +94,62 @@ proposal for Python version support. For example, this would mean that we would drop support for Python 3.7 on December 26, 2021. -## Migrating from sourmash v3.x to sourmash 4.x. +## Migrating from sourmash v3.x to sourmash v4.x. + +Our intent is to provide a clear path for migration between versions for our users. We rely on *semantic versioning* and deprecation warnings to do this - +* Within each major version release (v2, v3, v4), the command-line interface and Python APIs should remain the same, with features being only *added*. +* Across major versions (e.g. v2 to v3, and v3 to v4) we provide warnings when functionality will change in the next major version. + +So: if you want to upgrade workflows and scripts from prior releases of sourmash to sourmash v4.0, we suggest doing this in two stages. + +First, upgrade to the latest version of sourmash 3.5.x (currently [v3.5.0](https://github.com/dib-lab/sourmash/releases/tag/v3.5.0)), which is compatible with all files and command lines used in previous versions of sourmash (v2.x and v3.x). After upgrading to 3.5.x, scan the sourmash output for deprecation warnings and fix those. + +Next, upgrade to the latest version of 4.x, which will introduce some backwards incompatibilities based upon the deprecation warnings. + +The major changes are detailed below; please see the [full release notes for 4.0](release-notes/sourmash-4.0.md) for all the details and links to the code changes. + +### Sourmash command line + +If you use sourmash from the command line, there are a few major changes in 4.0 that you should know about. + +First, **`sourmash compute` is deprecated in favor of [`sourmash sketch`](sourmash-sketch.md)**, which provides quite a bit more flexibility in creating signatures. 
+ +Second, **`sourmash index` will now save databases in the Zip format (`.sbt.zip`) instead of the old JSON+subdirectory format** (see [updated docs](command-line.md#sourmash-index-build-an-sbt-index-of-signatures)). You can revert to the old behavior by explicitly specifying the `.sbt.json` filename for output when running `sourmash index`. + +Third, all sourmash commands that operate on signatures should now be able to directly read from lists of signatures in signature files, SBT databases, LCA databases, directories, and files containing lists of filenames (see [updated docs](command-line.md#advanced-command-line-usage)). + +Fourth, if you use `sourmash lca` commands, **`sourmash lca gather` has been removed**. In addition, there are some **changes in how `summarize` works**: it now uses abundances by default, and no longer combines all signatures before summarizing. Specify `--ignore-abundance` and combine your signatures using `sourmash sig merge` to recover the old behavior. Note also that `lca summarize` now includes a new column, `filename`, in the CSV output. + +Finally, **k-mer sizes have changed for amino acid sequences** in v4. If you use protein, Dayhoff, or HP signatures, we now interpret k-mer sizes differently on the command line. Briefly, k-mer sizes for protein/dayhoff/hp signatures are now the size of the k-mer in amino acid space, *not* the size of the k-mer in DNA space (as previously used). In practice this means that you need to divide all your old k-mer sizes by 3 when working with k-mers in amino acid space! + +Note also that while `sourmash compute` still behaves the same way in v4.x as it did in sourmash 3.5.x, `sourmash sketch translate` and `sourmash sketch protein` both use the *new* approach to amino acid k-mer sizes, as do all of the command line options for searching, manipulation, and display. Again, in practice this means that you need to divide all your old k-mer sizes by 3 if they apply to amino acid k-mers.
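As a rough illustration of the k-mer size change described above, the sketch below converts a pre-v4 (DNA-space) k-mer size into the amino acid k-mer size that v4 expects. The helper name `amino_acid_ksize` is hypothetical and is not part of sourmash; it only encodes the divide-by-3 rule, and it assumes the old size is a positive multiple of 3.

```python
def amino_acid_ksize(old_ksize: int) -> int:
    """Convert a pre-v4 DNA-space k-mer size to an amino acid k-mer size.

    Hypothetical helper (not a sourmash function): it applies the
    "divide your old k-mer sizes by 3" rule from the migration notes.
    """
    # Assumption: old protein/dayhoff/hp k-mer sizes were given in DNA
    # space, so they should be positive multiples of 3.
    if old_ksize <= 0 or old_ksize % 3 != 0:
        raise ValueError(
            f"expected a positive multiple of 3, got {old_ksize}"
        )
    return old_ksize // 3


if __name__ == "__main__":
    # Print the converted size for a few example DNA-space k-mer sizes.
    for old in (30, 48, 126):
        print(old, "->", amino_acid_ksize(old))
```

Under this rule an old k-mer size of 30 becomes 10, so a command line that previously used `-k 30` for protein signatures would use a k-mer size of 10 with the v4 commands.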
+ +There are several minor changes where error messages should occur appropriately: +* `--traverse-directory` is no longer needed on the command line for `sourmash index` or other functions; directory traversal happens automatically. +* the command lines for `sourmash index` and `sourmash lca index` no longer require signature files to be specified, which can break existing command lines. To fix this, reorder arguments so that any signatures are specified at the end of the command line. + +### Python API + +First, all k-mer sizes for `protein`, `dayhoff`, and `hp` signatures have changed in the Python layer to be "correct", i.e., to be the size of the protein k-mer. Previously they were 3\*k, i.e. based on the size of the DNA k-mer from which the protein sequence would have been created. + +Second, the `MinHash` class API has changed significantly! +* `get_mins()` has been deprecated in favor of `.hashes`, which is a dictionary that contains abundances. +* `merge` now just modifies `MinHash` objects in-place, and no longer returns the merged object; use `__iadd__` (`+=`) for the old behavior, or `__add__` (`+`) to create a new merged object. +* `max_hash` has been deprecated in favor of `scaled`. +* instead of `downsample_scaled(s)` use `downsample(scaled=s)` +* instead of `downsample_n(m)` use `downsample(num=m)` +* `is_molecule_type` has been replaced with a property, `moltype` -- instead of `is_molecule_type(t)` use `moltype == t`. + -Prior to the release of sourmash v4, we are adding deprecation -warnings and/or future warnings to all APIs and modules in sourmash -v3.x that are being removed in v4.0. If you are using the Python API, -we suggest you use the following procedure to migrate: +Third, `SourmashSignature` objects no longer have a `name()` method but instead a `name` property, which can be assigned to. This property is now `None` when no name has been assigned. 
Note that `str(sig)` should now be used to retrieve a display name, and should replace all previous uses of `sig.name()`. -* first, install the latest version of sourmash v3, which should be v3.5.0 or later. -* then, turn on `DeprecationWarning`s in your code per [the warnings module documentation](https://docs.python.org/3/library/warnings.html#overriding-the-default-filter). -* now, run python with the argument `-W error` to turn warnings into errors. -* fix all errors! -* finally, upgrade to sourmash v4.0. +Fourth, a few top-level functions have been deprecated: `load_signatures(...)`, `load_one_signature(...)`, `create_sbt_index(...)`, and `load_sbt_index(...)`. +* `load_signatures(...)` and `load_one_signature(...)` should be replaced with `load_file_as_signatures(...)`. Note there is currently no top-level way to load signatures from strings. For now, if you need that functionality, you can use `sourmash.signature.load_signatures(...)` and `sourmash.signature.load_one_signature(...)`, but please be aware that these are not considered part of the public API that is under semantic versioning, so they may change in the next minor point release; this is tracked in https://github.com/dib-lab/sourmash/issues/1312. +* `load_sbt_index(...)` has been deprecated. Please use `load_file_as_index(...)` instead. +* `create_sbt_index(...)` has been deprecated. There is currently no replacement, although you can use it directly from `sourmash.sbtmh` if necessary. -@CTB add stuff here +Fifth, directory traversal now happens by default when loading signatures, so remove `traverse=True` arguments to several functions in `sourmash_args` - `load_dbs_and_sigs`, `load_file_as_index`, and `load_file_as_signatures`. +Please post questions and concerns to the +[sourmash issue tracker](https://github.com/dib-lab/sourmash/issues) +and we'll be happy to help! From b6901466f7dcfc5c2fa745034df4d56e6ffeba98 Mon Sep 17 00:00:00 2001 From: "C.
Titus Brown" Date: Tue, 9 Feb 2021 10:17:38 -0800 Subject: [PATCH 20/24] resolve missing link --- doc/release-notes/sourmash-4.0.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/release-notes/sourmash-4.0.md b/doc/release-notes/sourmash-4.0.md index 139dd5efef..66baf27c39 100644 --- a/doc/release-notes/sourmash-4.0.md +++ b/doc/release-notes/sourmash-4.0.md @@ -51,8 +51,8 @@ migrating to sourmash 4.0 in the ## Documentation updates -* add information about versioning, migrations, etc to the docs (#1153) -* @CTB MORE GOES HERE +* major update and cleanup of docs given new functionality; add sourmash sketch documentation (#1283) +* add information about versioning, migrations, etc to the docs (#1153, #1283) ## Infrastructure and CI changes: From cf894c48379bfb3429b204fb2acd87c2c4ff53b2 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 9 Feb 2021 15:46:20 -0800 Subject: [PATCH 21/24] update tutorials and notebooks for 4.0 --- doc/sourmash-collections.ipynb | 1482 ++++++++---------------------- doc/sourmash-examples.ipynb | 177 ++-- doc/tutorial-basic.md | 38 +- doc/tutorials-lca.md | 8 +- doc/using-LCA-database-API.ipynb | 65 +- 5 files changed, 553 insertions(+), 1217 deletions(-) diff --git a/doc/sourmash-collections.ipynb b/doc/sourmash-collections.ipynb index e0508c5bf3..fe1c15f400 100644 --- a/doc/sourmash-collections.ipynb +++ b/doc/sourmash-collections.ipynb @@ -41,8 +41,8 @@ "/Users/t/dev/sourmash/doc/big_genomes\n", " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", - "100 459 100 459 0 0 750 0 --:--:-- --:--:-- --:--:-- 750\n", - "100 61.1M 100 61.1M 0 0 2966k 0 0:00:21 0:00:21 --:--:-- 3496k\n" + "100 459 100 459 0 0 1017 0 --:--:-- --:--:-- --:--:-- 1017\n", + "100 61.1M 100 61.1M 0 0 2932k 0 0:00:21 0:00:21 --:--:-- 3468k\n" ] } ], @@ -68,217 +68,215 @@ "output_type": "stream", "text": [ "/Users/t/dev/sourmash/doc/big_genomes\n", - "\u001b[K== This is 
sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", - "\u001b[Ksetting num_hashes to 0 because --scaled is set\n", "\u001b[Kcomputing signatures for files: 0.fa, 1.fa, 10.fa, 11.fa, 12.fa, 13.fa, 14.fa, 15.fa, 16.fa, 17.fa, 18.fa, 19.fa, 2.fa, 20.fa, 21.fa, 22.fa, 23.fa, 24.fa, 25.fa, 26.fa, 27.fa, 28.fa, 29.fa, 3.fa, 30.fa, 31.fa, 32.fa, 33.fa, 34.fa, 35.fa, 36.fa, 37.fa, 38.fa, 39.fa, 4.fa, 40.fa, 41.fa, 42.fa, 43.fa, 44.fa, 45.fa, 46.fa, 47.fa, 48.fa, 49.fa, 5.fa, 50.fa, 51.fa, 52.fa, 53.fa, 54.fa, 55.fa, 56.fa, 57.fa, 58.fa, 59.fa, 6.fa, 60.fa, 61.fa, 62.fa, 63.fa, 7.fa, 8.fa, 9.fa\n", - "\u001b[KComputing signature for ksizes: [31]\n", - "\u001b[KComputing only nucleotide (and not protein) signatures.\n", "\u001b[KComputing a total of 1 signature(s).\n", "\u001b[K... reading sequences from 0.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 0.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 0.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 1.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 1.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 1.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 10.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 10.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 10.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 11.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 11.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 11.fa.sig. Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 12.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 12.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 12.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 13.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 13.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 13.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 14.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 14.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 14.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 15.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 15.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 15.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 16.fa\n", "\u001b[Kcalculated 1 signatures for 4 sequences in 16.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 16.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 17.fa\n", "\u001b[Kcalculated 1 signatures for 2 sequences in 17.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 17.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 18.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 18.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 18.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 19.fa\n", "\u001b[Kcalculated 1 signatures for 9 sequences in 19.fa\n", - "\u001b[Ksaved 1 signature(s). 
Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 19.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 2.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 2.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 2.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 20.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 20.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 20.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 21.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 21.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 21.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 22.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 22.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 22.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 23.fa\n", "\u001b[Kcalculated 1 signatures for 5 sequences in 23.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 23.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 24.fa\n", "\u001b[Kcalculated 1 signatures for 3 sequences in 24.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 24.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 25.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 25.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 25.fa.sig. Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 26.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 26.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 26.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 27.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 27.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 27.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 28.fa\n", "\u001b[Kcalculated 1 signatures for 3 sequences in 28.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 28.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 29.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 29.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 29.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 3.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 3.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 3.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 30.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 30.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 30.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 31.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 31.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 31.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 32.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 32.fa\n", - "\u001b[Ksaved 1 signature(s). 
Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 32.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 33.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 33.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 33.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 34.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 34.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 34.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 35.fa\n", "\u001b[Kcalculated 1 signatures for 7 sequences in 35.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 35.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 36.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 36.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 36.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 37.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 37.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 37.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 38.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 38.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 38.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 39.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 39.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 39.fa.sig. Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 4.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 4.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 4.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 40.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 40.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 40.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 41.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 41.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 41.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 42.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 42.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 42.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 43.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 43.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 43.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 44.fa\n", "\u001b[Kcalculated 1 signatures for 2 sequences in 44.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 44.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 45.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 45.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 45.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 46.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 46.fa\n", - "\u001b[Ksaved 1 signature(s). 
Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 46.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 47.fa\n", "\u001b[Kcalculated 1 signatures for 2 sequences in 47.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 47.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 48.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 48.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 48.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 49.fa\n", "\u001b[Kcalculated 1 signatures for 228 sequences in 49.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 49.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 5.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 5.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 5.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 50.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 50.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 50.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 51.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 51.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 51.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 52.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 52.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", - "\u001b[K... reading sequences from 53.fa\n", - "\u001b[Kcalculated 1 signatures for 1 sequences in 53.fa\n", - "\u001b[Ksaved 1 signature(s). 
Note: signature license is CC0.\n", - "\u001b[K... reading sequences from 54.fa\n", - "\u001b[Kcalculated 1 signatures for 1 sequences in 54.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", - "\u001b[K... reading sequences from 55.fa\n" + "\u001b[Ksaved signature(s) to 52.fa.sig. Note: signature license is CC0.\n", + "\u001b[K... reading sequences from 53.fa\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ + "\u001b[Kcalculated 1 signatures for 1 sequences in 53.fa\n", + "\u001b[Ksaved signature(s) to 53.fa.sig. Note: signature license is CC0.\n", + "\u001b[K... reading sequences from 54.fa\n", + "\u001b[Kcalculated 1 signatures for 1 sequences in 54.fa\n", + "\u001b[Ksaved signature(s) to 54.fa.sig. Note: signature license is CC0.\n", + "\u001b[K... reading sequences from 55.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 55.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 55.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 56.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 56.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 56.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 57.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 57.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 57.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 58.fa\n", "\u001b[Kcalculated 1 signatures for 30 sequences in 58.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 58.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 59.fa\n", "\u001b[Kcalculated 1 signatures for 5 sequences in 59.fa\n", - "\u001b[Ksaved 1 signature(s). 
Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 59.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 6.fa\n", "\u001b[Kcalculated 1 signatures for 76 sequences in 6.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 6.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 60.fa\n", "\u001b[Kcalculated 1 signatures for 11 sequences in 60.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 60.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 61.fa\n", "\u001b[Kcalculated 1 signatures for 47 sequences in 61.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 61.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 62.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 62.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 62.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 63.fa\n", "\u001b[Kcalculated 1 signatures for 4 sequences in 63.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 63.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 7.fa\n", "\u001b[Kcalculated 1 signatures for 3 sequences in 7.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 7.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from 8.fa\n", "\u001b[Kcalculated 1 signatures for 1 sequences in 8.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n", + "\u001b[Ksaved signature(s) to 8.fa.sig. Note: signature license is CC0.\n", "\u001b[K... 
reading sequences from 9.fa\n", "\u001b[Kcalculated 1 signatures for 3 sequences in 9.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n" + "\u001b[Ksaved signature(s) to 9.fa.sig. Note: signature license is CC0.\n" ] } ], "source": [ - "!cd big_genomes/ && sourmash compute -k 31 --scaled=1000 --name-from-first *.fa" + "!cd big_genomes/ && sourmash sketch dna -p k=31,scaled=1000 --name-from-first *.fa" ] }, { @@ -297,16 +295,81 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/0.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/1.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/10.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/11.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/12.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/13.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/14.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/15.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/16.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/17.fa.sig'10 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/18.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/19.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/2.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/20.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/21.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/22.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/23.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/24.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/25.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/26.fa.sig'20 sigs total\n", + "\u001b[Kloaded 1 sigs from 
'big_genomes/27.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/28.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/29.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/3.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/30.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/31.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/32.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/33.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/34.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/35.fa.sig'30 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/36.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/37.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/38.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/39.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/4.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/40.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/41.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/42.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/43.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/44.fa.sig'40 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/45.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/46.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/47.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/48.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/49.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/5.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/50.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/51.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/52.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/53.fa.sig'50 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/54.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/55.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/56.fa.sig'g'\n", + "\u001b[Kloaded 1 
sigs from 'big_genomes/57.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/58.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/59.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/6.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/60.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/61.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/62.fa.sig'60 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/63.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/7.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/8.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/9.fa.sig'g'\n", "\u001b[Kloaded 64 signatures total. \n", - "\u001b[Kdownsampling to scaled value of 1000\n", "\u001b[K\n", "min similarity in matrix: 0.000\n", "\u001b[Ksaving labels to: compare_all.mat.labels.txt\n", - "\u001b[Ksaving distance matrix to: compare_all.mat\n", - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[Ksaving comparison matrix to: compare_all.mat\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Kloading comparison matrix from compare_all.mat...\n", @@ -330,7 +393,7 @@ "outputs": [ { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "" ] @@ -361,14 +424,80 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. 
==\n", "\n", "\u001b[Kloading 64 files into SBT\n", - "\u001b[Kreading from big_genomes/9.fa.sig (63 signatures so far))\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/0.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/1.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/10.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/11.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/12.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/13.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/14.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/15.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/16.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/17.fa.sig'10 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/18.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/19.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/2.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/20.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/21.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/22.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/23.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/24.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/25.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/26.fa.sig'20 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/27.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/28.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/29.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/3.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/30.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/31.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/32.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/33.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/34.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/35.fa.sig'30 sigs total\n", + "\u001b[Kloaded 1 sigs from 
'big_genomes/36.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/37.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/38.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/39.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/4.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/40.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/41.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/42.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/43.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/44.fa.sig'40 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/45.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/46.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/47.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/48.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/49.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/5.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/50.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/51.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/52.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/53.fa.sig'50 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/54.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/55.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/56.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/57.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/58.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/59.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/6.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/60.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/61.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/62.fa.sig'60 sigs total\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/63.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/7.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'big_genomes/8.fa.sig'g'\n", + "\u001b[Kloaded 1 
sigs from 'big_genomes/9.fa.sig'g'\n", + "\u001b[K\n", "\u001b[Kloaded 64 sigs; saving SBT under \"all-genomes\"\n", - "\u001b[K127 of 127 nodes saved\n", - "Finished saving nodes, now saving SBT json file.\n" + "\u001b[KFinished saving nodes, now saving SBT index file.\n", + "\u001b[KFinished saving SBT index, available at /Users/t/dev/sourmash/doc/all-genomes.sbt.zip\n", + "\n" ] } ], @@ -392,18 +521,14 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", - "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", - "\n", - "\u001b[Kselecting default query k=31.\n", - "\u001b[Kloaded query: NC_009665.1 Shewanella baltica... (k=31, DNA)\n", - "\u001b[Kloaded 1 databases. \n", - "\n", - "2 matches:\n", - "similarity match\n", - "---------- -----\n", - " 9.5% NC_009665.1 Shewanella baltica OS185, complete genome\n", - " 4.4% NC_011663.1 Shewanella baltica OS223, complete genome\n" + "\r", + "\u001b[K\r\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\r\n", + "\r", + "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\r\n", + "\r\n", + "\r", + "\u001b[KCannot open file 'shew_os185.fa.sig'\r\n" ] } ], @@ -420,30 +545,23 @@ "name": "stdout", "output_type": "stream", "text": [ - "\r", - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\r\n", - "\r", - "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. 
==\r\n", - "\r\n", - "\r", - "\u001b[Ksetting num_hashes to 0 because --scaled is set\r\n", - "\r", - "\u001b[Kcomputing signatures for files: fake-metagenome.fa\r\n", - "\r", - "\u001b[KComputing signature for ksizes: [31]\r\n", - "\r", - "\u001b[KComputing only nucleotide (and not protein) signatures.\r\n", - "\r", - "\u001b[KComputing a total of 1 signature(s).\r\n", - "\r", - "\u001b[Kskipping fake-metagenome.fa - already done\r\n" + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", + "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", + "\n", + "\u001b[Kcomputing signatures for files: fake-metagenome.fa\n", + "\u001b[KComputing a total of 1 signature(s).\n", + "\u001b[K... reading sequences from fake-metagenome.fa\n", + "\u001b[Kcalculated 1 signatures for 3 sequences in fake-metagenome.fa\n", + "\u001b[Ksaved signature(s) to fake-metagenome.fa.sig. Note: signature license is CC0.\n" ] } ], "source": [ "# (make fake metagenome again, just in case)\n", "!cat genomes/*.fa > fake-metagenome.fa\n", - "!sourmash compute -k 31 --scaled=1000 fake-metagenome.fa" + "!rm -f fake-metagenome.fa.sig\n", + "!sourmash sketch dna -p k=31,scaled=1000 fake-metagenome.fa" ] }, { @@ -455,7 +573,8 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. 
==\n", "\n", "\u001b[Kselect query k=31 automatically.\n", @@ -593,331 +712,6 @@ " Pyrococcus furiosus DSM 3638\n", " \n", " \n", - " 5\n", - " AE009951\n", - " 190304\n", - " Bacteria\n", - " Fusobacteria\n", - " Fusobacteriia\n", - " Fusobacteriales\n", - " Fusobacteriaceae\n", - " Fusobacterium\n", - " Fusobacterium nucleatum\n", - " NaN\n", - " \n", - " \n", - " 6\n", - " AE010299\n", - " 188937\n", - " Archaea\n", - " Euryarchaeota\n", - " Methanomicrobia\n", - " Methanosarcinales\n", - " Methanosarcinaceae\n", - " Methanosarcina\n", - " Methanosarcina acetivorans\n", - " Methanosarcina acetivorans C2A\n", - " \n", - " \n", - " 7\n", - " AE009439\n", - " 190192\n", - " Archaea\n", - " Euryarchaeota\n", - " Methanopyri\n", - " Methanopyrales\n", - " Methanopyraceae\n", - " Methanopyrus\n", - " Methanopyrus kandleri\n", - " Methanopyrus kandleri AV19\n", - " \n", - " \n", - " 8\n", - " NC_003911\n", - " 246200\n", - " Bacteria\n", - " Proteobacteria\n", - " Alphaproteobacteria\n", - " Rhodobacterales\n", - " Rhodobacteraceae\n", - " Ruegeria\n", - " Ruegeria pomeroyi\n", - " Ruegeria pomeroyi DSS-3\n", - " \n", - " \n", - " 9\n", - " AE006470\n", - " 194439\n", - " Bacteria\n", - " Chlorobi\n", - " Chlorobia\n", - " Chlorobiales\n", - " Chlorobiaceae\n", - " Chlorobaculum\n", - " Chlorobaculum tepidum\n", - " Chlorobaculum tepidum TLS\n", - " \n", - " \n", - " 10\n", - " AE015928\n", - " 226186\n", - " Bacteria\n", - " Bacteroidetes\n", - " Bacteroidia\n", - " Bacteroidales\n", - " Bacteroidaceae\n", - " Bacteroides\n", - " Bacteroides thetaiotaomicron\n", - " Bacteroides thetaiotaomicron VPI-5482\n", - " \n", - " \n", - " 11\n", - " AL954747\n", - " 228410\n", - " Bacteria\n", - " Proteobacteria\n", - " Betaproteobacteria\n", - " Nitrosomonadales\n", - " Nitrosomonadaceae\n", - " Nitrosomonas\n", - " Nitrosomonas europaea\n", - " Nitrosomonas europaea ATCC 19718\n", - " \n", - " \n", - " 12\n", - " BX119912\n", - " 243090\n", - " Bacteria\n", - " 
Planctomycetes\n", - " Planctomycetia\n", - " Planctomycetales\n", - " Planctomycetaceae\n", - " Rhodopirellula\n", - " Rhodopirellula baltica\n", - " Rhodopirellula baltica SH 1\n", - " \n", - " \n", - " 13\n", - " BX571656\n", - " 273121\n", - " Bacteria\n", - " Proteobacteria\n", - " Epsilonproteobacteria\n", - " Campylobacterales\n", - " Helicobacteraceae\n", - " Wolinella\n", - " Wolinella succinogenes\n", - " Wolinella succinogenes DSM 1740\n", - " \n", - " \n", - " 14\n", - " AE017180\n", - " 243231\n", - " Bacteria\n", - " Proteobacteria\n", - " Deltaproteobacteria\n", - " Desulfuromonadales\n", - " Geobacteraceae\n", - " Geobacter\n", - " Geobacter sulfurreducens\n", - " Geobacter sulfurreducens PCA\n", - " \n", - " \n", - " 15\n", - " AE017226\n", - " 243275\n", - " Bacteria\n", - " Spirochaetes\n", - " Spirochaetia\n", - " Spirochaetales\n", - " Spirochaetaceae\n", - " Treponema\n", - " Treponema denticola\n", - " Treponema denticola ATCC 35405\n", - " \n", - " \n", - " 16\n", - " BX950229\n", - " 267377\n", - " Archaea\n", - " Euryarchaeota\n", - " Methanococci\n", - " Methanococcales\n", - " Methanococcaceae\n", - " Methanococcus\n", - " Methanococcus maripaludis\n", - " Methanococcus maripaludis S2\n", - " \n", - " \n", - " 17\n", - " AE017221\n", - " 262724\n", - " Bacteria\n", - " Deinococcus-Thermus\n", - " Deinococci\n", - " Thermales\n", - " Thermaceae\n", - " Thermus\n", - " Thermus thermophilus\n", - " Thermus thermophilus HB27\n", - " \n", - " \n", - " 18\n", - " BA000001\n", - " 70601\n", - " Archaea\n", - " Euryarchaeota\n", - " Thermococci\n", - " Thermococcales\n", - " Thermococcaceae\n", - " Pyrococcus\n", - " Pyrococcus horikoshii\n", - " Pyrococcus horikoshii OT3\n", - " \n", - " \n", - " 19\n", - " BA000023\n", - " 273063\n", - " Archaea\n", - " Crenarchaeota\n", - " Thermoprotei\n", - " Sulfolobales\n", - " Sulfolobaceae\n", - " Sulfolobus\n", - " Sulfolobus tokodaii\n", - " Sulfolobus tokodaii str. 
7\n", - " \n", - " \n", - " 20\n", - " NC_007951\n", - " 266265\n", - " Bacteria\n", - " Proteobacteria\n", - " Betaproteobacteria\n", - " Burkholderiales\n", - " Burkholderiaceae\n", - " Paraburkholderia\n", - " Paraburkholderia xenovorans\n", - " Paraburkholderia xenovorans LB400\n", - " \n", - " \n", - " 21\n", - " CP000492\n", - " 290317\n", - " Bacteria\n", - " Chlorobi\n", - " Chlorobia\n", - " Chlorobiales\n", - " Chlorobiaceae\n", - " Chlorobium\n", - " Chlorobium phaeobacteroides\n", - " Chlorobium phaeobacteroides DSM 266\n", - " \n", - " \n", - " 22\n", - " NC_008751\n", - " 391774\n", - " Bacteria\n", - " Proteobacteria\n", - " Deltaproteobacteria\n", - " Desulfovibrionales\n", - " Desulfovibrionaceae\n", - " Desulfovibrio\n", - " Desulfovibrio vulgaris\n", - " Desulfovibrio vulgaris DP4\n", - " \n", - " \n", - " 23\n", - " CP000568\n", - " 203119\n", - " Bacteria\n", - " Firmicutes\n", - " Clostridia\n", - " Clostridiales\n", - " Ruminococcaceae\n", - " Ruminiclostridium\n", - " Ruminiclostridium thermocellum\n", - " Ruminiclostridium thermocellum ATCC 27405\n", - " \n", - " \n", - " 24\n", - " CP000561\n", - " 410359\n", - " Archaea\n", - " Crenarchaeota\n", - " Thermoprotei\n", - " Thermoproteales\n", - " Thermoproteaceae\n", - " Pyrobaculum\n", - " Pyrobaculum calidifontis\n", - " Pyrobaculum calidifontis JCM 11548\n", - " \n", - " \n", - " 25\n", - " CP000609\n", - " 402880\n", - " Archaea\n", - " Euryarchaeota\n", - " Methanococci\n", - " Methanococcales\n", - " Methanococcaceae\n", - " Methanococcus\n", - " Methanococcus maripaludis\n", - " Methanococcus maripaludis C5\n", - " \n", - " \n", - " 26\n", - " CP000607\n", - " 290318\n", - " Bacteria\n", - " Chlorobi\n", - " Chlorobia\n", - " Chlorobiales\n", - " Chlorobiaceae\n", - " Chlorobium\n", - " Chlorobium phaeovibrioides\n", - " Chlorobium phaeovibrioides DSM 265\n", - " \n", - " \n", - " 27\n", - " CP000660\n", - " 340102\n", - " Archaea\n", - " Crenarchaeota\n", - " Thermoprotei\n", - " 
Thermoproteales\n", - " Thermoproteaceae\n", - " Pyrobaculum\n", - " Pyrobaculum arsenaticum\n", - " Pyrobaculum arsenaticum DSM 13514\n", - " \n", - " \n", - " 28\n", - " CP000667\n", - " 369723\n", - " Bacteria\n", - " Actinobacteria\n", - " Actinobacteria\n", - " Micromonosporales\n", - " Micromonosporaceae\n", - " Salinispora\n", - " Salinispora tropica\n", - " Salinispora tropica CNB-440\n", - " \n", - " \n", - " 29\n", - " CP000679\n", - " 351627\n", - " Bacteria\n", - " Firmicutes\n", - " Clostridia\n", - " Thermoanaerobacterales\n", - " Thermoanaerobacterales Family III. Incertae Sedis\n", - " Caldicellulosiruptor\n", - " Caldicellulosiruptor saccharolyticus\n", - " Caldicellulosiruptor saccharolyticus DSM 8903\n", - " \n", - " \n", " ...\n", " ...\n", " ...\n", @@ -931,331 +725,6 @@ " ...\n", " \n", " \n", - " 34\n", - " CP000850\n", - " 391037\n", - " Bacteria\n", - " Actinobacteria\n", - " Actinobacteria\n", - " Micromonosporales\n", - " Micromonosporaceae\n", - " Salinispora\n", - " Salinispora arenicola\n", - " Salinispora arenicola CNS-205\n", - " \n", - " \n", - " 35\n", - " CP000909\n", - " 324602\n", - " Bacteria\n", - " Chloroflexi\n", - " Chloroflexia\n", - " Chloroflexales\n", - " Chloroflexaceae\n", - " Chloroflexus\n", - " Chloroflexus aurantiacus\n", - " Chloroflexus aurantiacus J-10-fl\n", - " \n", - " \n", - " 36\n", - " CP000924\n", - " 340099\n", - " Bacteria\n", - " Firmicutes\n", - " Clostridia\n", - " Thermoanaerobacterales\n", - " Thermoanaerobacteraceae\n", - " Thermoanaerobacter\n", - " Thermoanaerobacter pseudethanolicus\n", - " Thermoanaerobacter pseudethanolicus ATCC 33223\n", - " \n", - " \n", - " 37\n", - " CP000969\n", - " 126740\n", - " Bacteria\n", - " Thermotogae\n", - " Thermotogae\n", - " Thermotogales\n", - " Thermotogaceae\n", - " Thermotoga\n", - " Thermotoga sp. 
RQ2\n", - " NaN\n", - " \n", - " \n", - " 38\n", - " CP001013\n", - " 395495\n", - " Bacteria\n", - " Proteobacteria\n", - " Betaproteobacteria\n", - " Burkholderiales\n", - " NaN\n", - " Leptothrix\n", - " Leptothrix cholodnii\n", - " Leptothrix cholodnii SP-6\n", - " \n", - " \n", - " 39\n", - " CP001071\n", - " 349741\n", - " Bacteria\n", - " Verrucomicrobia\n", - " Verrucomicrobiae\n", - " Verrucomicrobiales\n", - " Akkermansiaceae\n", - " Akkermansia\n", - " Akkermansia muciniphila\n", - " Akkermansia muciniphila ATCC BAA-835\n", - " \n", - " \n", - " 40\n", - " AP009380\n", - " 431947\n", - " Bacteria\n", - " Bacteroidetes\n", - " Bacteroidia\n", - " Bacteroidales\n", - " Porphyromonadaceae\n", - " Porphyromonas\n", - " Porphyromonas gingivalis\n", - " Porphyromonas gingivalis ATCC 33277\n", - " \n", - " \n", - " 41\n", - " NC_010730\n", - " 436114\n", - " Bacteria\n", - " Aquificae\n", - " Aquificae\n", - " Aquificales\n", - " Hydrogenothermaceae\n", - " Sulfurihydrogenibium\n", - " Sulfurihydrogenibium sp. YO3AOP1\n", - " NaN\n", - " \n", - " \n", - " 42\n", - " CP001097\n", - " 290315\n", - " Bacteria\n", - " Chlorobi\n", - " Chlorobia\n", - " Chlorobiales\n", - " Chlorobiaceae\n", - " Chlorobium\n", - " Chlorobium limicola\n", - " Chlorobium limicola DSM 245\n", - " \n", - " \n", - " 43\n", - " CP001110\n", - " 324925\n", - " Bacteria\n", - " Chlorobi\n", - " Chlorobia\n", - " Chlorobiales\n", - " Chlorobiaceae\n", - " Pelodictyon\n", - " Pelodictyon phaeoclathratiforme\n", - " Pelodictyon phaeoclathratiforme BU-1\n", - " \n", - " \n", - " 44\n", - " CP001130\n", - " 380749\n", - " Bacteria\n", - " Aquificae\n", - " Aquificae\n", - " Aquificales\n", - " Aquificaceae\n", - " Hydrogenobaculum\n", - " Hydrogenobaculum sp. 
Y04AAS1\n", - " NaN\n", - " \n", - " \n", - " 45\n", - " NZ_CH959311\n", - " 52598\n", - " Bacteria\n", - " Proteobacteria\n", - " Alphaproteobacteria\n", - " Rhodobacterales\n", - " Rhodobacteraceae\n", - " Sulfitobacter\n", - " Sulfitobacter sp. EE-36\n", - " NaN\n", - " \n", - " \n", - " 46\n", - " NZ_CH959317\n", - " 314267\n", - " Bacteria\n", - " Proteobacteria\n", - " Alphaproteobacteria\n", - " Rhodobacterales\n", - " Rhodobacteraceae\n", - " Sulfitobacter\n", - " Sulfitobacter sp. NAS-14.1\n", - " NaN\n", - " \n", - " \n", - " 47\n", - " CP001251\n", - " 515635\n", - " Bacteria\n", - " Dictyoglomi\n", - " Dictyoglomia\n", - " Dictyoglomales\n", - " Dictyoglomaceae\n", - " Dictyoglomus\n", - " Dictyoglomus turgidum\n", - " Dictyoglomus turgidum DSM 6724\n", - " \n", - " \n", - " 48\n", - " NC_011663\n", - " 407976\n", - " Bacteria\n", - " Proteobacteria\n", - " Gammaproteobacteria\n", - " Alteromonadales\n", - " Shewanellaceae\n", - " Shewanella\n", - " Shewanella baltica\n", - " Shewanella baltica OS223\n", - " \n", - " \n", - " 49\n", - " CP000916\n", - " 309803\n", - " Bacteria\n", - " Thermotogae\n", - " Thermotogae\n", - " Thermotogales\n", - " Thermotogaceae\n", - " Thermotoga\n", - " Thermotoga neapolitana\n", - " Thermotoga neapolitana DSM 4359\n", - " \n", - " \n", - " 50\n", - " NZ_DS996397\n", - " 411464\n", - " Bacteria\n", - " Proteobacteria\n", - " Deltaproteobacteria\n", - " Desulfovibrionales\n", - " Desulfovibrionaceae\n", - " Desulfovibrio\n", - " Desulfovibrio piger\n", - " Desulfovibrio piger ATCC 29098\n", - " \n", - " \n", - " 51\n", - " CP001230\n", - " 123214\n", - " Bacteria\n", - " Aquificae\n", - " Aquificae\n", - " Aquificales\n", - " Hydrogenothermaceae\n", - " Persephonella\n", - " Persephonella marina\n", - " Persephonella marina EX-H1\n", - " \n", - " \n", - " 52\n", - " CP001472\n", - " 240015\n", - " Bacteria\n", - " Acidobacteria\n", - " Acidobacteriia\n", - " Acidobacteriales\n", - " Acidobacteriaceae\n", - " 
Acidobacterium\n", - " Acidobacterium capsulatum\n", - " Acidobacterium capsulatum ATCC 51196\n", - " \n", - " \n", - " 53\n", - " AP009153\n", - " 379066\n", - " Bacteria\n", - " Gemmatimonadetes\n", - " Gemmatimonadetes\n", - " Gemmatimonadales\n", - " Gemmatimonadaceae\n", - " Gemmatimonas\n", - " Gemmatimonas aurantiaca\n", - " Gemmatimonas aurantiaca T-27\n", - " \n", - " \n", - " 54\n", - " CP001941\n", - " 439481\n", - " Archaea\n", - " Euryarchaeota\n", - " NaN\n", - " NaN\n", - " NaN\n", - " Aciduliprofundum\n", - " Aciduliprofundum boonei\n", - " Aciduliprofundum boonei T469\n", - " \n", - " \n", - " 55\n", - " NC_013968\n", - " 309800\n", - " Archaea\n", - " Euryarchaeota\n", - " Halobacteria\n", - " Haloferacales\n", - " Haloferacaceae\n", - " Haloferax\n", - " Haloferax volcanii\n", - " Haloferax volcanii DS2\n", - " \n", - " \n", - " 56\n", - " NZ_KE136524\n", - " 226185\n", - " Bacteria\n", - " Firmicutes\n", - " Bacilli\n", - " Lactobacillales\n", - " Enterococcaceae\n", - " Enterococcus\n", - " Enterococcus faecalis\n", - " Enterococcus faecalis V583\n", - " \n", - " \n", - " 57\n", - " NZ_KQ961402\n", - " 542\n", - " Bacteria\n", - " Proteobacteria\n", - " Alphaproteobacteria\n", - " Sphingomonadales\n", - " Sphingomonadaceae\n", - " Zymomonas\n", - " Zymomonas mobilis\n", - " NaN\n", - " \n", - " \n", - " 58\n", - " NZ_CP015081\n", - " 243230\n", - " Bacteria\n", - " Deinococcus-Thermus\n", - " Deinococci\n", - " Deinococcales\n", - " Deinococcaceae\n", - " Deinococcus\n", - " Deinococcus radiodurans\n", - " Deinococcus radiodurans R1\n", - " \n", - " \n", " 59\n", " NZ_ABZS01000228\n", " 432331\n", @@ -1326,320 +795,57 @@ "" ], "text/plain": [ - " accession taxid superkingdom phylum \\\n", - "0 AE000782 224325 Archaea Euryarchaeota \n", - "1 NC_000909 243232 Archaea Euryarchaeota \n", - "2 NC_003272 103690 Bacteria Cyanobacteria \n", - "3 AE009441 178306 Archaea Crenarchaeota \n", - "4 AE009950 186497 Archaea Euryarchaeota \n", - "5 AE009951 
190304 Bacteria Fusobacteria \n", - "6 AE010299 188937 Archaea Euryarchaeota \n", - "7 AE009439 190192 Archaea Euryarchaeota \n", - "8 NC_003911 246200 Bacteria Proteobacteria \n", - "9 AE006470 194439 Bacteria Chlorobi \n", - "10 AE015928 226186 Bacteria Bacteroidetes \n", - "11 AL954747 228410 Bacteria Proteobacteria \n", - "12 BX119912 243090 Bacteria Planctomycetes \n", - "13 BX571656 273121 Bacteria Proteobacteria \n", - "14 AE017180 243231 Bacteria Proteobacteria \n", - "15 AE017226 243275 Bacteria Spirochaetes \n", - "16 BX950229 267377 Archaea Euryarchaeota \n", - "17 AE017221 262724 Bacteria Deinococcus-Thermus \n", - "18 BA000001 70601 Archaea Euryarchaeota \n", - "19 BA000023 273063 Archaea Crenarchaeota \n", - "20 NC_007951 266265 Bacteria Proteobacteria \n", - "21 CP000492 290317 Bacteria Chlorobi \n", - "22 NC_008751 391774 Bacteria Proteobacteria \n", - "23 CP000568 203119 Bacteria Firmicutes \n", - "24 CP000561 410359 Archaea Crenarchaeota \n", - "25 CP000609 402880 Archaea Euryarchaeota \n", - "26 CP000607 290318 Bacteria Chlorobi \n", - "27 CP000660 340102 Archaea Crenarchaeota \n", - "28 CP000667 369723 Bacteria Actinobacteria \n", - "29 CP000679 351627 Bacteria Firmicutes \n", - ".. ... ... ... ... 
\n", - "34 CP000850 391037 Bacteria Actinobacteria \n", - "35 CP000909 324602 Bacteria Chloroflexi \n", - "36 CP000924 340099 Bacteria Firmicutes \n", - "37 CP000969 126740 Bacteria Thermotogae \n", - "38 CP001013 395495 Bacteria Proteobacteria \n", - "39 CP001071 349741 Bacteria Verrucomicrobia \n", - "40 AP009380 431947 Bacteria Bacteroidetes \n", - "41 NC_010730 436114 Bacteria Aquificae \n", - "42 CP001097 290315 Bacteria Chlorobi \n", - "43 CP001110 324925 Bacteria Chlorobi \n", - "44 CP001130 380749 Bacteria Aquificae \n", - "45 NZ_CH959311 52598 Bacteria Proteobacteria \n", - "46 NZ_CH959317 314267 Bacteria Proteobacteria \n", - "47 CP001251 515635 Bacteria Dictyoglomi \n", - "48 NC_011663 407976 Bacteria Proteobacteria \n", - "49 CP000916 309803 Bacteria Thermotogae \n", - "50 NZ_DS996397 411464 Bacteria Proteobacteria \n", - "51 CP001230 123214 Bacteria Aquificae \n", - "52 CP001472 240015 Bacteria Acidobacteria \n", - "53 AP009153 379066 Bacteria Gemmatimonadetes \n", - "54 CP001941 439481 Archaea Euryarchaeota \n", - "55 NC_013968 309800 Archaea Euryarchaeota \n", - "56 NZ_KE136524 226185 Bacteria Firmicutes \n", - "57 NZ_KQ961402 542 Bacteria Proteobacteria \n", - "58 NZ_CP015081 243230 Bacteria Deinococcus-Thermus \n", - "59 NZ_ABZS01000228 432331 Bacteria Aquificae \n", - "60 NZ_JGWU01000001 1458259 Bacteria Proteobacteria \n", - "61 NZ_FWDH01000003 31899 Bacteria Firmicutes \n", - "62 NC_009972 316274 Bacteria Chloroflexi \n", - "63 NC_005213 228908 Archaea Nanoarchaeota \n", - "\n", - " class order \\\n", - "0 Archaeoglobi Archaeoglobales \n", - "1 Methanococci Methanococcales \n", - "2 NaN Nostocales \n", - "3 Thermoprotei Thermoproteales \n", - "4 Thermococci Thermococcales \n", - "5 Fusobacteriia Fusobacteriales \n", - "6 Methanomicrobia Methanosarcinales \n", - "7 Methanopyri Methanopyrales \n", - "8 Alphaproteobacteria Rhodobacterales \n", - "9 Chlorobia Chlorobiales \n", - "10 Bacteroidia Bacteroidales \n", - "11 Betaproteobacteria 
Nitrosomonadales \n", - "12 Planctomycetia Planctomycetales \n", - "13 Epsilonproteobacteria Campylobacterales \n", - "14 Deltaproteobacteria Desulfuromonadales \n", - "15 Spirochaetia Spirochaetales \n", - "16 Methanococci Methanococcales \n", - "17 Deinococci Thermales \n", - "18 Thermococci Thermococcales \n", - "19 Thermoprotei Sulfolobales \n", - "20 Betaproteobacteria Burkholderiales \n", - "21 Chlorobia Chlorobiales \n", - "22 Deltaproteobacteria Desulfovibrionales \n", - "23 Clostridia Clostridiales \n", - "24 Thermoprotei Thermoproteales \n", - "25 Methanococci Methanococcales \n", - "26 Chlorobia Chlorobiales \n", - "27 Thermoprotei Thermoproteales \n", - "28 Actinobacteria Micromonosporales \n", - "29 Clostridia Thermoanaerobacterales \n", - ".. ... ... \n", - "34 Actinobacteria Micromonosporales \n", - "35 Chloroflexia Chloroflexales \n", - "36 Clostridia Thermoanaerobacterales \n", - "37 Thermotogae Thermotogales \n", - "38 Betaproteobacteria Burkholderiales \n", - "39 Verrucomicrobiae Verrucomicrobiales \n", - "40 Bacteroidia Bacteroidales \n", - "41 Aquificae Aquificales \n", - "42 Chlorobia Chlorobiales \n", - "43 Chlorobia Chlorobiales \n", - "44 Aquificae Aquificales \n", - "45 Alphaproteobacteria Rhodobacterales \n", - "46 Alphaproteobacteria Rhodobacterales \n", - "47 Dictyoglomia Dictyoglomales \n", - "48 Gammaproteobacteria Alteromonadales \n", - "49 Thermotogae Thermotogales \n", - "50 Deltaproteobacteria Desulfovibrionales \n", - "51 Aquificae Aquificales \n", - "52 Acidobacteriia Acidobacteriales \n", - "53 Gemmatimonadetes Gemmatimonadales \n", - "54 NaN NaN \n", - "55 Halobacteria Haloferacales \n", - "56 Bacilli Lactobacillales \n", - "57 Alphaproteobacteria Sphingomonadales \n", - "58 Deinococci Deinococcales \n", - "59 Aquificae Aquificales \n", - "60 Betaproteobacteria Burkholderiales \n", - "61 Clostridia Thermoanaerobacterales \n", - "62 Chloroflexia Herpetosiphonales \n", - "63 NaN Nanoarchaeales \n", + " accession taxid 
superkingdom phylum class \\\n", + "0 AE000782 224325 Archaea Euryarchaeota Archaeoglobi \n", + "1 NC_000909 243232 Archaea Euryarchaeota Methanococci \n", + "2 NC_003272 103690 Bacteria Cyanobacteria NaN \n", + "3 AE009441 178306 Archaea Crenarchaeota Thermoprotei \n", + "4 AE009950 186497 Archaea Euryarchaeota Thermococci \n", + ".. ... ... ... ... ... \n", + "59 NZ_ABZS01000228 432331 Bacteria Aquificae Aquificae \n", + "60 NZ_JGWU01000001 1458259 Bacteria Proteobacteria Betaproteobacteria \n", + "61 NZ_FWDH01000003 31899 Bacteria Firmicutes Clostridia \n", + "62 NC_009972 316274 Bacteria Chloroflexi Chloroflexia \n", + "63 NC_005213 228908 Archaea Nanoarchaeota NaN \n", "\n", - " family genus \\\n", - "0 Archaeoglobaceae Archaeoglobus \n", - "1 Methanocaldococcaceae Methanocaldococcus \n", - "2 Nostocaceae Nostoc \n", - "3 Thermoproteaceae Pyrobaculum \n", - "4 Thermococcaceae Pyrococcus \n", - "5 Fusobacteriaceae Fusobacterium \n", - "6 Methanosarcinaceae Methanosarcina \n", - "7 Methanopyraceae Methanopyrus \n", - "8 Rhodobacteraceae Ruegeria \n", - "9 Chlorobiaceae Chlorobaculum \n", - "10 Bacteroidaceae Bacteroides \n", - "11 Nitrosomonadaceae Nitrosomonas \n", - "12 Planctomycetaceae Rhodopirellula \n", - "13 Helicobacteraceae Wolinella \n", - "14 Geobacteraceae Geobacter \n", - "15 Spirochaetaceae Treponema \n", - "16 Methanococcaceae Methanococcus \n", - "17 Thermaceae Thermus \n", - "18 Thermococcaceae Pyrococcus \n", - "19 Sulfolobaceae Sulfolobus \n", - "20 Burkholderiaceae Paraburkholderia \n", - "21 Chlorobiaceae Chlorobium \n", - "22 Desulfovibrionaceae Desulfovibrio \n", - "23 Ruminococcaceae Ruminiclostridium \n", - "24 Thermoproteaceae Pyrobaculum \n", - "25 Methanococcaceae Methanococcus \n", - "26 Chlorobiaceae Chlorobium \n", - "27 Thermoproteaceae Pyrobaculum \n", - "28 Micromonosporaceae Salinispora \n", - "29 Thermoanaerobacterales Family III. Incertae Sedis Caldicellulosiruptor \n", - ".. ... ... 
\n", - "34 Micromonosporaceae Salinispora \n", - "35 Chloroflexaceae Chloroflexus \n", - "36 Thermoanaerobacteraceae Thermoanaerobacter \n", - "37 Thermotogaceae Thermotoga \n", - "38 NaN Leptothrix \n", - "39 Akkermansiaceae Akkermansia \n", - "40 Porphyromonadaceae Porphyromonas \n", - "41 Hydrogenothermaceae Sulfurihydrogenibium \n", - "42 Chlorobiaceae Chlorobium \n", - "43 Chlorobiaceae Pelodictyon \n", - "44 Aquificaceae Hydrogenobaculum \n", - "45 Rhodobacteraceae Sulfitobacter \n", - "46 Rhodobacteraceae Sulfitobacter \n", - "47 Dictyoglomaceae Dictyoglomus \n", - "48 Shewanellaceae Shewanella \n", - "49 Thermotogaceae Thermotoga \n", - "50 Desulfovibrionaceae Desulfovibrio \n", - "51 Hydrogenothermaceae Persephonella \n", - "52 Acidobacteriaceae Acidobacterium \n", - "53 Gemmatimonadaceae Gemmatimonas \n", - "54 NaN Aciduliprofundum \n", - "55 Haloferacaceae Haloferax \n", - "56 Enterococcaceae Enterococcus \n", - "57 Sphingomonadaceae Zymomonas \n", - "58 Deinococcaceae Deinococcus \n", - "59 Hydrogenothermaceae Sulfurihydrogenibium \n", - "60 Alcaligenaceae Bordetella \n", - "61 Thermoanaerobacterales Family III. Incertae Sedis Caldicellulosiruptor \n", - "62 Herpetosiphonaceae Herpetosiphon \n", - "63 Nanoarchaeaceae Nanoarchaeum \n", + " order family \\\n", + "0 Archaeoglobales Archaeoglobaceae \n", + "1 Methanococcales Methanocaldococcaceae \n", + "2 Nostocales Nostocaceae \n", + "3 Thermoproteales Thermoproteaceae \n", + "4 Thermococcales Thermococcaceae \n", + ".. ... ... \n", + "59 Aquificales Hydrogenothermaceae \n", + "60 Burkholderiales Alcaligenaceae \n", + "61 Thermoanaerobacterales Thermoanaerobacterales Family III. Incertae Sedis \n", + "62 Herpetosiphonales Herpetosiphonaceae \n", + "63 Nanoarchaeales Nanoarchaeaceae \n", "\n", - " species \\\n", - "0 Archaeoglobus fulgidus \n", - "1 Methanocaldococcus jannaschii \n", - "2 Nostoc sp. 
PCC 7120 \n", - "3 Pyrobaculum aerophilum \n", - "4 Pyrococcus furiosus \n", - "5 Fusobacterium nucleatum \n", - "6 Methanosarcina acetivorans \n", - "7 Methanopyrus kandleri \n", - "8 Ruegeria pomeroyi \n", - "9 Chlorobaculum tepidum \n", - "10 Bacteroides thetaiotaomicron \n", - "11 Nitrosomonas europaea \n", - "12 Rhodopirellula baltica \n", - "13 Wolinella succinogenes \n", - "14 Geobacter sulfurreducens \n", - "15 Treponema denticola \n", - "16 Methanococcus maripaludis \n", - "17 Thermus thermophilus \n", - "18 Pyrococcus horikoshii \n", - "19 Sulfolobus tokodaii \n", - "20 Paraburkholderia xenovorans \n", - "21 Chlorobium phaeobacteroides \n", - "22 Desulfovibrio vulgaris \n", - "23 Ruminiclostridium thermocellum \n", - "24 Pyrobaculum calidifontis \n", - "25 Methanococcus maripaludis \n", - "26 Chlorobium phaeovibrioides \n", - "27 Pyrobaculum arsenaticum \n", - "28 Salinispora tropica \n", - "29 Caldicellulosiruptor saccharolyticus \n", - ".. ... \n", - "34 Salinispora arenicola \n", - "35 Chloroflexus aurantiacus \n", - "36 Thermoanaerobacter pseudethanolicus \n", - "37 Thermotoga sp. RQ2 \n", - "38 Leptothrix cholodnii \n", - "39 Akkermansia muciniphila \n", - "40 Porphyromonas gingivalis \n", - "41 Sulfurihydrogenibium sp. YO3AOP1 \n", - "42 Chlorobium limicola \n", - "43 Pelodictyon phaeoclathratiforme \n", - "44 Hydrogenobaculum sp. Y04AAS1 \n", - "45 Sulfitobacter sp. EE-36 \n", - "46 Sulfitobacter sp. 
NAS-14.1 \n", - "47 Dictyoglomus turgidum \n", - "48 Shewanella baltica \n", - "49 Thermotoga neapolitana \n", - "50 Desulfovibrio piger \n", - "51 Persephonella marina \n", - "52 Acidobacterium capsulatum \n", - "53 Gemmatimonas aurantiaca \n", - "54 Aciduliprofundum boonei \n", - "55 Haloferax volcanii \n", - "56 Enterococcus faecalis \n", - "57 Zymomonas mobilis \n", - "58 Deinococcus radiodurans \n", - "59 Sulfurihydrogenibium yellowstonense \n", - "60 Bordetella bronchiseptica \n", - "61 Caldicellulosiruptor bescii \n", - "62 Herpetosiphon aurantiacus \n", - "63 Nanoarchaeum equitans \n", + " genus species \\\n", + "0 Archaeoglobus Archaeoglobus fulgidus \n", + "1 Methanocaldococcus Methanocaldococcus jannaschii \n", + "2 Nostoc Nostoc sp. PCC 7120 \n", + "3 Pyrobaculum Pyrobaculum aerophilum \n", + "4 Pyrococcus Pyrococcus furiosus \n", + ".. ... ... \n", + "59 Sulfurihydrogenibium Sulfurihydrogenibium yellowstonense \n", + "60 Bordetella Bordetella bronchiseptica \n", + "61 Caldicellulosiruptor Caldicellulosiruptor bescii \n", + "62 Herpetosiphon Herpetosiphon aurantiacus \n", + "63 Nanoarchaeum Nanoarchaeum equitans \n", "\n", - " strain \n", - "0 Archaeoglobus fulgidus DSM 4304 \n", - "1 Methanocaldococcus jannaschii DSM 2661 \n", - "2 NaN \n", - "3 Pyrobaculum aerophilum str. IM2 \n", - "4 Pyrococcus furiosus DSM 3638 \n", - "5 NaN \n", - "6 Methanosarcina acetivorans C2A \n", - "7 Methanopyrus kandleri AV19 \n", - "8 Ruegeria pomeroyi DSS-3 \n", - "9 Chlorobaculum tepidum TLS \n", - "10 Bacteroides thetaiotaomicron VPI-5482 \n", - "11 Nitrosomonas europaea ATCC 19718 \n", - "12 Rhodopirellula baltica SH 1 \n", - "13 Wolinella succinogenes DSM 1740 \n", - "14 Geobacter sulfurreducens PCA \n", - "15 Treponema denticola ATCC 35405 \n", - "16 Methanococcus maripaludis S2 \n", - "17 Thermus thermophilus HB27 \n", - "18 Pyrococcus horikoshii OT3 \n", - "19 Sulfolobus tokodaii str. 
7 \n", - "20 Paraburkholderia xenovorans LB400 \n", - "21 Chlorobium phaeobacteroides DSM 266 \n", - "22 Desulfovibrio vulgaris DP4 \n", - "23 Ruminiclostridium thermocellum ATCC 27405 \n", - "24 Pyrobaculum calidifontis JCM 11548 \n", - "25 Methanococcus maripaludis C5 \n", - "26 Chlorobium phaeovibrioides DSM 265 \n", - "27 Pyrobaculum arsenaticum DSM 13514 \n", - "28 Salinispora tropica CNB-440 \n", - "29 Caldicellulosiruptor saccharolyticus DSM 8903 \n", - ".. ... \n", - "34 Salinispora arenicola CNS-205 \n", - "35 Chloroflexus aurantiacus J-10-fl \n", - "36 Thermoanaerobacter pseudethanolicus ATCC 33223 \n", - "37 NaN \n", - "38 Leptothrix cholodnii SP-6 \n", - "39 Akkermansia muciniphila ATCC BAA-835 \n", - "40 Porphyromonas gingivalis ATCC 33277 \n", - "41 NaN \n", - "42 Chlorobium limicola DSM 245 \n", - "43 Pelodictyon phaeoclathratiforme BU-1 \n", - "44 NaN \n", - "45 NaN \n", - "46 NaN \n", - "47 Dictyoglomus turgidum DSM 6724 \n", - "48 Shewanella baltica OS223 \n", - "49 Thermotoga neapolitana DSM 4359 \n", - "50 Desulfovibrio piger ATCC 29098 \n", - "51 Persephonella marina EX-H1 \n", - "52 Acidobacterium capsulatum ATCC 51196 \n", - "53 Gemmatimonas aurantiaca T-27 \n", - "54 Aciduliprofundum boonei T469 \n", - "55 Haloferax volcanii DS2 \n", - "56 Enterococcus faecalis V583 \n", - "57 NaN \n", - "58 Deinococcus radiodurans R1 \n", - "59 Sulfurihydrogenibium yellowstonense SS-5 \n", - "60 Bordetella bronchiseptica D989 \n", - "61 NaN \n", - "62 Herpetosiphon aurantiacus DSM 785 \n", - "63 Nanoarchaeum equitans Kin4-M \n", + " strain \n", + "0 Archaeoglobus fulgidus DSM 4304 \n", + "1 Methanocaldococcus jannaschii DSM 2661 \n", + "2 NaN \n", + "3 Pyrobaculum aerophilum str. IM2 \n", + "4 Pyrococcus furiosus DSM 3638 \n", + ".. ... 
\n", + "59 Sulfurihydrogenibium yellowstonense SS-5 \n", + "60 Bordetella bronchiseptica D989 \n", + "61 NaN \n", + "62 Herpetosiphon aurantiacus DSM 785 \n", + "63 Nanoarchaeum equitans Kin4-M \n", "\n", "[64 rows x 10 columns]" ] @@ -1664,14 +870,18 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", + "\u001b[KBuilding LCA database with ksize=31 scaled=10000 moltype=DNA.\n", "\u001b[Kexamining spreadsheet headers...\n", "\u001b[K** assuming column 'accession' is identifiers in spreadsheet\n", "\u001b[K64 distinct identities in spreadsheet out of 64 rows.\n", "\u001b[K64 distinct lineages in spreadsheet out of 64 rows.\n", - "\u001b[K64 assigned lineages out of 64 distinct lineages in spreadsheet. 64)\n", + "\u001b[K... loaded 64 signatures.H01000003.1 Caldicellulo (64 of 64); skipped 0 so far\n", + "\u001b[Kloaded 19993 hashes at ksize=31 scaled=10000\n", + "\u001b[K64 assigned lineages out of 64 distinct lineages in spreadsheet.\n", "\u001b[K64 identifiers used out of 64 distinct identifiers in spreadsheet.\n", "\u001b[Ksaving to LCA DB: taxdb.lca.json\n" ] @@ -1697,23 +907,35 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", - "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", - "\n", - "\u001b[Kselect query k=31 automatically.\n", - "\u001b[Kloaded query: fake-metagenome.fa... (k=31, DNA)\n", - "\u001b[Kloaded 1 databases. 
\n", - "\n", - "\n", - "overlap p_query p_match\n", - "--------- ------- -------\n", - "0.6 Mbp 46.7% 11.6% NC_011663.1 Shewanella baltica OS223,...\n", - "0.5 Mbp 38.7% 19.3% CP001071.1 Akkermansia muciniphila AT...\n", - "0.5 Mbp 14.6% 3.9% NC_009665.1 Shewanella baltica OS185,...\n", - "\n", - "found 3 matches total;\n", - "the recovered matches hit 100.0% of the query\n", - "\n" + "\r", + "\u001b[K\r\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\r\n", + "\r", + "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\r\n", + "\r\n", + "\r", + "\u001b[Kselect query k=31 automatically.\r\n", + "\r", + "\u001b[Kloaded query: fake-metagenome.fa... (k=31, DNA)\r\n", + "\r", + "\u001b[Kloading from taxdb.lca.json...\r", + "\r", + "\u001b[Kloaded LCA taxdb.lca.json\r", + "\r", + "\u001b[K \r", + "\r", + "\u001b[Kloaded 1 databases.\r\n", + "\r\n", + "\r\n", + "overlap p_query p_match\r\n", + "--------- ------- -------\r\n", + "0.6 Mbp 46.7% 11.6% NC_011663.1 Shewanella baltica OS223,...\r\n", + "0.5 Mbp 38.7% 19.3% CP001071.1 Akkermansia muciniphila AT...\r\n", + "0.5 Mbp 14.6% 3.9% NC_009665.1 Shewanella baltica OS185,...\r\n", + "\r\n", + "found 3 matches total;\r\n", + "the recovered matches hit 100.0% of the query\r\n", + "\r\n" ] } ], @@ -1737,28 +959,46 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", - "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", - "\n", - "\u001b[Kloaded 1 LCA databases. 
ksize=31, scaled=10000\n", - "\u001b[Kfinding query signatures...\n", - "\u001b[Kloaded 1 signatures from 1 files total.of 1)\n", - "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n", - "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n", - "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia\n", - "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae\n", - "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales\n", - "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae\n", - "38.7% 53 Bacteria;Verrucomicrobia\n", - "100.0% 137 Bacteria\n", - "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica\n", - "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella\n", - "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae\n", - "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales\n", - "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria\n", - "61.3% 84 Bacteria;Proteobacteria\n", - "22.6% 31 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223\n", - "14.6% 20 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185\n" + "\r", + "\u001b[K\r\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\r\n", + "\r", + "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\r\n", + "\r\n", + "\r", + "\u001b[K\r", + "\u001b[K\r", + "\u001b[K... loading database taxdb.lca.json\r", + "\r", + "\u001b[K\r", + "\u001b[K\r", + "\u001b[Kloaded 1 LCA databases. 
ksize=31, scaled=10000 moltype=DNA\r\n", + "\r", + "\u001b[Kfinding query signatures...\r\n", + "\r", + "\u001b[K\r", + "\u001b[K\r", + "\u001b[K... loading fake-metagenome.fa (file 1 of 1)\r", + "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "38.7% 53 Bacteria;Verrucomicrobia;Verrucomicrobiae fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "38.7% 53 Bacteria;Verrucomicrobia fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "100.0% 137 Bacteria fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "61.3% 84 Bacteria;Proteobacteria;Gammaproteobacteria 
fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "61.3% 84 Bacteria;Proteobacteria fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "22.6% 31 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223 fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "14.6% 20 Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185 fake-metagenome.fa.sig:4e1ac0cf fake-metagenome.fa\r\n", + "\r", + "\u001b[K\r", + "\u001b[K\r", + "\u001b[Kloaded 1 signatures from 1 files total.\r\n" ] } ], @@ -1799,9 +1039,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python (myenv)", "language": "python", - "name": "python3" + "name": "myenv" }, "language_info": { "codemirror_mode": { @@ -1813,7 +1053,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.7.6" } }, "nbformat": 4, diff --git a/doc/sourmash-examples.ipynb b/doc/sourmash-examples.ipynb index 40b489fd83..c3fc51edfd 100644 --- a/doc/sourmash-examples.ipynb +++ b/doc/sourmash-examples.ipynb @@ -46,29 +46,27 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", - "\u001b[Ksetting num_hashes to 0 because --scaled is set\n", "\u001b[Kcomputing signatures for files: genomes/akkermansia.fa, genomes/shew_os185.fa, genomes/shew_os223.fa\n", - "\u001b[KComputing signature for ksizes: [21, 31, 51]\n", - "\u001b[KComputing only nucleotide (and not protein) signatures.\n", - "\u001b[KComputing a total of 3 signature(s).\n", + "\u001b[KComputing a total of 1 signature(s).\n", "\u001b[K... 
reading sequences from genomes/akkermansia.fa\n", - "\u001b[Kcalculated 3 signatures for 1 sequences in genomes/akkermansia.fa\n", - "\u001b[Ksaved 3 signature(s). Note: signature license is CC0.\n", + "\u001b[Kcalculated 1 signatures for 1 sequences in genomes/akkermansia.fa\n", + "\u001b[Ksaved signature(s) to akkermansia.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from genomes/shew_os185.fa\n", - "\u001b[Kcalculated 3 signatures for 1 sequences in genomes/shew_os185.fa\n", - "\u001b[Ksaved 3 signature(s). Note: signature license is CC0.\n", + "\u001b[Kcalculated 1 signatures for 1 sequences in genomes/shew_os185.fa\n", + "\u001b[Ksaved signature(s) to shew_os185.fa.sig. Note: signature license is CC0.\n", "\u001b[K... reading sequences from genomes/shew_os223.fa\n", - "\u001b[Kcalculated 3 signatures for 1 sequences in genomes/shew_os223.fa\n", - "\u001b[Ksaved 3 signature(s). Note: signature license is CC0.\n" + "\u001b[Kcalculated 1 signatures for 1 sequences in genomes/shew_os223.fa\n", + "\u001b[Ksaved signature(s) to shew_os223.fa.sig. Note: signature license is CC0.\n" ] } ], "source": [ "!rm -f *.sig\n", - "!sourmash compute -k 21,31,51 --scaled=1000 genomes/*.fa --name-from-first -f" + "!sourmash sketch dna -p k=21,k=31,k=51,scaled=1000 genomes/*.fa --name-from-first -f" ] }, { @@ -120,17 +118,38 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", - "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", - "\n", - "\u001b[Kloaded query: NC_011663.1 Shewanella baltica... (k=31, DNA)\n", - "\u001b[Kloaded 3 signatures. \n", - "\n", - "2 matches:\n", - "similarity match\n", - "---------- -----\n", - "100.0% NC_011663.1 Shewanella baltica OS223, complete genome\n", - " 22.8% NC_009665.1 Shewanella baltica OS185, complete genome\n" + "\r", + "\u001b[K\r\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. 
==\r\n", + "\r", + "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\r\n", + "\r\n", + "\r", + "\u001b[Kselecting specified query k=31\r\n", + "\r", + "\u001b[Kloaded query: NC_011663.1 Shewanella baltica... (k=31, DNA)\r\n", + "\r", + "\u001b[Kloading from akkermansia.fa.sig...\r", + "\r", + "\u001b[Kloaded 1 signatures from akkermansia.fa.sig\r", + "\r", + "\u001b[Kloading from shew_os185.fa.sig...\r", + "\r", + "\u001b[Kloaded 1 signatures from shew_os185.fa.sig\r", + "\r", + "\u001b[Kloading from shew_os223.fa.sig...\r", + "\r", + "\u001b[Kloaded 1 signatures from shew_os223.fa.sig\r", + "\r", + "\u001b[K \r", + "\r", + "\u001b[Kloaded 3 signatures.\r\n", + "\r\n", + "2 matches:\r\n", + "similarity match\r\n", + "---------- -----\r\n", + "100.0% NC_011663.1 Shewanella baltica OS223, complete genome\r\n", + " 22.8% NC_009665.1 Shewanella baltica OS185, complete genome\r\n" ] } ], @@ -154,9 +173,11 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", + "\u001b[Kselecting specified query k=31\n", "\u001b[Kloaded query: NC_011663.1 Shewanella baltica... (k=31, DNA)\n", "\u001b[Kloaded 3 signatures. \n", "\n", @@ -190,11 +211,14 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", + "\u001b[Kloaded 1 sigs from 'akkermansia.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'shew_os185.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'shew_os223.fa.sig'g'\n", "\u001b[Kloaded 3 signatures total. 
\n", - "\u001b[Kdownsampling to scaled value of 1000\n", "\u001b[K\n", "0-CP001071.1 Akke...\t[1. 0. 0.]\n", "1-NC_009665.1 She...\t[0. 1. 0.228]\n", @@ -223,18 +247,21 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", + "\u001b[Kloaded 1 sigs from 'akkermansia.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'shew_os185.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'shew_os223.fa.sig'g'\n", "\u001b[Kloaded 3 signatures total. \n", - "\u001b[Kdownsampling to scaled value of 1000\n", "\u001b[K\n", "0-CP001071.1 Akke...\t[1. 0. 0.]\n", "1-NC_009665.1 She...\t[0. 1. 0.228]\n", "2-NC_011663.1 She...\t[0. 0.228 1. ]\n", "min similarity in matrix: 0.000\n", "\u001b[Ksaving labels to: genome_compare.mat.labels.txt\n", - "\u001b[Ksaving distance matrix to: genome_compare.mat\n" + "\u001b[Ksaving comparison matrix to: genome_compare.mat\n" ] } ], @@ -251,7 +278,8 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", "\u001b[Kloading comparison matrix from genome_compare.mat...\n", @@ -267,7 +295,7 @@ }, { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "" ] @@ -300,11 +328,14 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. 
==\n", "\n", + "\u001b[Kloaded 1 sigs from 'akkermansia.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'shew_os185.fa.sig'g'\n", + "\u001b[Kloaded 1 sigs from 'shew_os223.fa.sig'g'\n", "\u001b[Kloaded 3 signatures total. \n", - "\u001b[Kdownsampling to scaled value of 1000\n", "\u001b[K\n", "0-CP001071.1 Akke...\t[1. 0. 0.]\n", "1-NC_009665.1 She...\t[0. 1. 0.228]\n", @@ -328,9 +359,12 @@ "text": [ "\"CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome\",\"NC_009665.1 Shewanella baltica OS185, complete genome\",\"NC_011663.1 Shewanella baltica OS223, complete genome\"\r", "\r\n", - "1.0,0.0,0.0\r\n", - "0.0,1.0,0.22846441947565543\r\n", - "0.0,0.22846441947565543,1.0\r\n" + "1.0,0.0,0.0\r", + "\r\n", + "0.0,1.0,0.22846441947565543\r", + "\r\n", + "0.0,0.22846441947565543,1.0\r", + "\r\n" ] } ], @@ -363,23 +397,22 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", + "\u001b[K\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\n", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", "\n", - "\u001b[Ksetting num_hashes to 0 because --scaled is set\n", "\u001b[Kcomputing signatures for files: fake-metagenome.fa\n", - "\u001b[KComputing signature for ksizes: [31]\n", - "\u001b[KComputing only nucleotide (and not protein) signatures.\n", "\u001b[KComputing a total of 1 signature(s).\n", "\u001b[K... reading sequences from fake-metagenome.fa\n", "\u001b[Kcalculated 1 signatures for 3 sequences in fake-metagenome.fa\n", - "\u001b[Ksaved 1 signature(s). Note: signature license is CC0.\n" + "\u001b[Ksaved signature(s) to fake-metagenome.fa.sig. 
Note: signature license is CC0.\n" ] } ], "source": [ + "!rm -f fake-metagenome.fa*\n", "!cat genomes/*.fa > fake-metagenome.fa\n", - "!sourmash compute -k 31 --scaled=1000 fake-metagenome.fa" + "!sourmash sketch dna -p k=31,scaled=1000 fake-metagenome.fa" ] }, { @@ -398,23 +431,43 @@ "name": "stdout", "output_type": "stream", "text": [ - "\u001b[K== This is sourmash version 2.0.0a12.dev48+ga92289b. ==\n", - "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\n", - "\n", - "\u001b[Kselect query k=31 automatically.\n", - "\u001b[Kloaded query: fake-metagenome.fa... (k=31, DNA)\n", - "\u001b[Kloaded 3 signatures. \n", - "\n", - "\n", - "overlap p_query p_match\n", - "--------- ------- -------\n", - "499.0 kbp 38.4% 100.0% CP001071.1 Akkermansia muciniphila AT...\n", - "494.0 kbp 38.0% 100.0% NC_009665.1 Shewanella baltica OS185,...\n", - "490.0 kbp 23.6% 62.7% NC_011663.1 Shewanella baltica OS223,...\n", - "\n", - "found 3 matches total;\n", - "the recovered matches hit 100.0% of the query\n", - "\n" + "\r", + "\u001b[K\r\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\r\n", + "\r", + "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==\r\n", + "\r\n", + "\r", + "\u001b[Kselect query k=31 automatically.\r\n", + "\r", + "\u001b[Kloaded query: fake-metagenome.fa... 
(k=31, DNA)\r\n", + "\r", + "\u001b[Kloading from shew_os185.fa.sig...\r", + "\r", + "\u001b[Kloaded 1 signatures from shew_os185.fa.sig\r", + "\r", + "\u001b[Kloading from shew_os223.fa.sig...\r", + "\r", + "\u001b[Kloaded 1 signatures from shew_os223.fa.sig\r", + "\r", + "\u001b[Kloading from akkermansia.fa.sig...\r", + "\r", + "\u001b[Kloaded 1 signatures from akkermansia.fa.sig\r", + "\r", + "\u001b[K \r", + "\r", + "\u001b[Kloaded 3 signatures.\r\n", + "\r\n", + "\r\n", + "overlap p_query p_match\r\n", + "--------- ------- -------\r\n", + "499.0 kbp 38.4% 100.0% CP001071.1 Akkermansia muciniphila AT...\r\n", + "494.0 kbp 38.0% 100.0% NC_009665.1 Shewanella baltica OS185,...\r\n", + "490.0 kbp 23.6% 62.7% NC_011663.1 Shewanella baltica OS223,...\r\n", + "\r\n", + "found 3 matches total;\r\n", + "the recovered matches hit 100.0% of the query\r\n", + "\r\n" ] } ], @@ -457,9 +510,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python (myenv)", "language": "python", - "name": "python3" + "name": "myenv" }, "language_info": { "codemirror_mode": { @@ -471,7 +524,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.7.6" } }, "nbformat": 4, diff --git a/doc/tutorial-basic.md b/doc/tutorial-basic.md index f47d75c29e..41311f9f96 100644 --- a/doc/tutorial-basic.md +++ b/doc/tutorial-basic.md @@ -3,9 +3,11 @@ This tutorial should run without modification on Linux or Mac OS X, under [Miniconda](https://docs.conda.io/en/latest/miniconda.html). -You'll need about 5 GB of free disk space, -and about 5 GB of RAM to search GenBank. The tutorial should take about -20 minutes total to run. +You'll need about 5 GB of free disk space, and about 5 GB of RAM to +search GenBank. The tutorial should take about 20 minutes total to +run. 
In fact, we have successfully tested it on +[binder.pangeo.io](https://binder.pangeo.io/v2/gh/binder-examples/r-conda/master?urlpath=urlpath%3Drstudio) +if you want to give it a try! ## Install miniconda @@ -56,8 +58,8 @@ Download some reads and a reference genome: ``` mkdir ~/data cd ~/data -wget https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m.fastq.gz -wget https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz +curl -L https://osf.io/ruanf/download -o ecoliMG1655.fa.gz +curl -L https://osf.io/q472x/download -o ecoli_ref-5m.fastq.gz ``` Compute a scaled signature from our reads: @@ -89,17 +91,19 @@ sourmash search ecoli-reads.sig ecoli-genome.sig --containment and you should see: ``` -# running sourmash subcommand: search -loaded query: /home/ubuntu/data/ecoli_ref-5m... (k=31, DNA) -loaded 1 signatures from ecoli-genome.sig + +select query k=31 automatically. +loaded query: /home/jovyan/data/ecoli_ref-5m... (k=31, DNA) +loaded 1 signatures. + 1 matches: similarity match ---------- ----- - 10.6% /home/ubuntu/data/ecoliMG1655.fa.gz + 31.0% /home/jovyan/data/ecoliMG1655.fa.gz ``` -Try the reverse - why is it bigger? +Try the reverse, too! ``` sourmash search ecoli-genome.sig ecoli-reads.sig --containment @@ -141,7 +145,7 @@ sourmash index ecolidb ecoli_many_sigs/*.sig and now we can search! ``` -sourmash search ecoli-genome.sig ecolidb.sbt.json -n 20 +sourmash search ecoli-genome.sig ecolidb.sbt.zip -n 20 ``` You should see output like this: @@ -226,7 +230,7 @@ loaded 1 databases. overlap p_query p_match --------- ------- ------- -4.9 Mbp 100.0% 100.0% AP009048.1 Escherichia coli str. K-12... +4.9 Mbp 100.0% 100.0% LRDF01000001.1 Escherichia coli strai... found 1 matches total; the recovered matches hit 100.0% of the query @@ -287,11 +291,11 @@ the recovered matches hit 73.1% of the query If you use the `-o` flag, gather will write out a csv that contains additional information. 
The column headers and their meanings are: -+ intersect_bp: the approximate number of base pairs in common between the query and the match -+ f_orig_query: fraction of original query; the fraction of the original query that is contained within the match -+ f_match: fraction of match; the fraction of the match that is contained within the query -+ f_unique_to_query: fraction unique to query; the fraction of the query that uniquely overlaps with the match -+ f_unique_weighted: fraction unique to query weighted by abundance; fraction unique to query, weighted by abundance in the query ++ `intersect_bp`: the approximate number of base pairs in common between the query and the match ++ `f_orig_query`: fraction of original query; the fraction of the original query that is contained within the match ++ `f_match`: fraction of match; the fraction of the match that is contained within the query ++ `f_unique_to_query`: fraction unique to query; the fraction of the query that uniquely overlaps with the match ++ `f_unique_weighted`: fraction unique to query weighted by abundance; fraction unique to query, weighted by abundance in the query It is straightforward to build your own databases for use with `search` and `gather`; see `sourmash index`, above, [the LCA tutorial][4], or diff --git a/doc/tutorials-lca.md b/doc/tutorials-lca.md index 384a6a2b9e..6fe002c29f 100644 --- a/doc/tutorials-lca.md +++ b/doc/tutorials-lca.md @@ -11,6 +11,10 @@ You'll need about 5 GB of free disk space to download the database, and about 5 GB of RAM to search it. The tutorial should take about 20 minutes total to run. +Note, we have successfully tested it on +[binder.pangeo.io](https://binder.pangeo.io/v2/gh/binder-examples/r-conda/master?urlpath=urlpath%3Drstudio) +if you want to give it a try! 
+ ## Install miniconda If you don't have the `conda` command installed, you'll need to install @@ -63,12 +67,12 @@ curl -L -o genbank-k31.lca.json.gz https://osf.io/4f8n3/download Download a random genome from genbank: ``` -curl -L -o some-genome.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/178/875/GCF_000178875.2_ASM17887v2/GCF_000178875.2_ASM17887v2_genomic.fna.gz +curl -L -o some-genome.fa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/178/875/GCF_000178875.2_ASM17887v2/GCF_000178875.2_ASM17887v2_genomic.fna.gz ``` Create a signature for this genome: ``` -sourmash sketch -p scaled=1000,k=31 --name-from-first some-genome.fa.gz +sourmash sketch dna -p scaled=1000,k=31 --name-from-first some-genome.fa.gz ``` Now, classify the signature with sourmash `lca classify`, diff --git a/doc/using-LCA-database-API.ipynb b/doc/using-LCA-database-API.ipynb index 3bfebe875c..ad8b8ac5ae 100644 --- a/doc/using-LCA-database-API.ipynb +++ b/doc/using-LCA-database-API.ipynb @@ -74,19 +74,13 @@ "text": [ "\r", "\u001b[K\r\n", - "== This is sourmash version 3.2.4.dev5+g6484e78f. ==\r\n", + "== This is sourmash version 4.0.0a4.dev12+g31c5eda2. ==\r\n", "\r", "\u001b[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. 
==\r\n", "\r\n", "\r", - "\u001b[Ksetting num_hashes to 0 because --scaled is set\r\n", - "\r", "\u001b[Kcomputing signatures for files: genomes/akkermansia.fa, genomes/shew_os185.fa, genomes/shew_os223.fa\r\n", "\r", - "\u001b[KComputing signature for ksizes: [31]\r\n", - "\r", - "\u001b[KComputing only nucleotide (and not protein) signatures.\r\n", - "\r", "\u001b[KComputing a total of 1 signature(s).\r\n", "\r", "\u001b[Kskipping genomes/akkermansia.fa - already done\r\n", @@ -98,7 +92,7 @@ } ], "source": [ - "!sourmash compute --name-from-first -k 31 --scaled=1000 genomes/*" + "!sourmash sketch dna -p k=31,scaled=1000 genomes/*" ] }, { @@ -116,7 +110,18 @@ "cell_type": "code", "execution_count": 4, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "490" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "db.insert(sig1, ident='akkermansia')\n", "db.insert(sig2, ident='shew_os185')\n", @@ -207,9 +212,9 @@ "name": "stdout", "output_type": "stream", "text": [ - "SourmashSignature('CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome', 6822e0b7)\n", - "SourmashSignature('NC_009665.1 Shewanella baltica OS185, complete genome', b47b13ef)\n", - "SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6)\n" + "CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome\n", + "NC_009665.1 Shewanella baltica OS185, complete genome\n", + "NC_011663.1 Shewanella baltica OS223, complete genome\n" ] } ], @@ -415,7 +420,18 @@ "cell_type": "code", "execution_count": 18, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "499" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)\n", "db.insert(sig1, lineage=lineage)" @@ -438,7 +454,7 @@ ], "source": [ "# by default, the identifier is the signature name --\n", 
- "ident = sig1.name()\n", + "ident = sig1.name\n", "idx = db.ident_to_idx[ident]\n", "print(\"ident '{}' has idx {}\".format(ident, idx))\n", "\n", @@ -666,7 +682,18 @@ "cell_type": "code", "execution_count": 26, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "490" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)\n", "db.insert(sig1, lineage=lineage1)\n", @@ -693,6 +720,14 @@ "text": [ "num hashvals: 494\n" ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/t/miniconda3/envs/py37/lib/python3.7/site-packages/ipykernel_launcher.py:1: DeprecatedWarning: get_mins is deprecated as of 3.5 and will be removed in 5.0. Use .hashes property instead.\n", + " \"\"\"Entry point for launching an IPython kernel.\n" + ] } ], "source": [ From c802035833b136b857cf7a903f78e49f7ff9ed59 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 15 Feb 2021 14:51:23 -0800 Subject: [PATCH 22/24] [MRG] update the migration guide for 4.0 with version pinning instructions (#1330) * update 4.0 migration docs with version pinning instructions * update migration docs --- doc/support.md | 88 ++++++++++++++++++++++++++++++++++---------------- 1 file changed, 61 insertions(+), 27 deletions(-) diff --git a/doc/support.md b/doc/support.md index 2f5a3bf775..c6315ea2da 100644 --- a/doc/support.md +++ b/doc/support.md @@ -1,7 +1,7 @@ # Support, Versioning, and Migration ```{contents} - :depth: 2 + :depth: 3 ``` ## Asking questions and filing bugs @@ -18,11 +18,25 @@ You can also ask questions of Titus on Twitter at [@ctitusbrown][1]. [0]:https://github.com/dib-lab/sourmash/issues [1]:https://twitter.com/ctitusbrown/ -## Versioning +## Versioning and stability of features and APIs + +We do our best to guarantee stability of features and APIs within +major versions - because of this, upgrading from (e.g.) 
sourmash v3.4 to +sourmash v3.5 should be a simple matter of installing the new version. + +We also recommend using _version pinning_ for software and workflows +that depend on sourmash, e.g. specifying `sourmash >=3,<4` for +software that is tested with sourmash 3.x. Read on for details! + +Upgrading major versions (to sourmash 4.0, for example) will often involve +more work; see the [next section](#upgrading-versions) for +our suggested process. + +### Semantic versioning Our goal is to support the use of sourmash in pipelines and applications by communicating clearly about bug fixes, feature -additions, and feature changes. Versions are tagged in a +additions, and feature changes in sourmash. Versions are tagged in a `vMAJOR.MINOR.PATCH` format, following the [Semantic Versioning] convention. From their definition: @@ -44,6 +58,20 @@ So, for example, We do sometimes (rarely!) alter behavior in minor versions by fixing bugs; this will be documented in release notes. +### Version pinning + +For software and workflows that depend on sourmash, we recommend +pinning versions to the current _major_ release of sourmash. + +For example, with Python toolchains such as pip, you should be able to use: + +``` +sourmash>=3,<4 +``` +to pin the version requirement to any sourmash v3.x release. + +For conda, the same syntax should work. + ### Command line stability We intend that all command-line commands, command-line options, input @@ -64,22 +92,6 @@ will contain deprecations for all top-level API changes at the time of the first major release. See below for our suggested migration procedure. -### Rust API - -The Rust API is not yet at 1.0 and should not be regarded as stable. - -### How to "pin" sourmash versions - -If you are relying on sourmash in a pipeline or application, we -suggest specifying your version requirements at the major release, -e.g. in conda you would specify `sourmash>=3,<4` to rely on sourmash -v3.x features.
- -Release notes for minor and patch versions are available on the -[GitHub releases page](https://github.com/dib-lab/sourmash/releases). - -[Semantic Versioning]: https://semver.org/ - ### Python version support sourmash v3.x supports Python 2.7 as well as Python 3.x, through Python 3.8. @@ -94,19 +106,41 @@ proposal for Python version support. For example, this would mean that we would drop support for Python 3.7 on December 26, 2021. -## Migrating from sourmash v3.x to sourmash v4.x. +### Rust API + +The Rust API is not yet at 1.0 and should not be regarded as stable. + +## Upgrading major versions + +If you depend on sourmash, we recommend using the following process: + +* pin sourmash to the major version you developed against, e.g. `sourmash >=3,<4`. +* when ready to upgrade sourmash, upgrade to the latest minor release within that major version (e.g. sourmash 3.5.x). +* scan for deprecations that affect you, check [the release notes](https://github.com/dib-lab/sourmash/releases), +and fix any major issues noted. +* upgrade to the next major version (e.g. sourmash 4.0) and run your integration tests or workflow. +* fix outstanding issues. + +In particular, we recommend upgrading major versions of sourmash in +isolation, without adding any new features to your software. -Our intent is to provide a clear path for migration between versions for our users. We rely on *semantic versioning* and deprecation warnings to do this - -* Within each major version release (v2, v3, v4), the command-line interface and Python APIs should remain the same, with features being only *added*. -* Across major versions (e.g. v2 to v3, and v3 to v4) we provide warnings when functionality will change in the next major version. +### Migrating from sourmash v3.x to sourmash v4.x. -So: if you want to upgrade workflows and scripts from prior releases of sourmash to sourmash v4.0, we suggest doing this in two stages. 
+If you want to upgrade workflows and scripts from prior releases of +sourmash to sourmash v4.0, we suggest doing this in two stages. -First, upgrade to the latest version of sourmash 3.5.x (currently [v3.5.0](https://github.com/dib-lab/sourmash/releases/tag/v3.5.0)), which is compatible with all files and command lines used in previous versions of sourmash (v2.x and v3.x). After upgrading to 3.5.x, scan the sourmash output for deprecation warnings and fix those. +First, upgrade to the latest version of sourmash 3.5.x (currently +[v3.5.0](https://github.com/dib-lab/sourmash/releases/tag/v3.5.0)), +which is compatible with all files and command lines used in previous +versions of sourmash (v2.x and v3.x). After upgrading to 3.5.x, scan +the sourmash output for deprecation warnings and fix those. -Next, upgrade to the latest version of 4.x, which will introduce some backwards incompatibilities based upon the deprecation warnings. +Next, upgrade to the latest version of 4.x, which will introduce some +backwards incompatibilities based upon the deprecation warnings. -The major changes are detailed below; please see the [full release notes for 4.0](release-notes/sourmash-4.0.md) for all the details and links to the code changes. +The major changes are detailed below; please see the +[full release notes for 4.0](release-notes/sourmash-4.0.md) for all +the details and links to the code changes. ### Sourmash command line From 01019dd8d6531196d3daf3c9b6662dc3b9f68bad Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Wed, 17 Feb 2021 11:16:12 -0800 Subject: [PATCH 23/24] Apply suggestions from code review Co-authored-by: Luiz Irber --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 9b9382ae2e..1df1b800ce 100644 --- a/README.md +++ b/README.md @@ -24,7 +24,7 @@ The latest major release is sourmash v4, which has several command-line and Python incompatibilities with previous versions. 
Please [visit our migration guide](https://sourmash.readthedocs.io/en/latest/support.html#migrating-from-sourmash-v3-x-to-sourmash-4-x) -to ugprade! +to upgrade! ---- @@ -113,4 +113,4 @@ on getting set up with a development environment. ---- CTB -Jan 2021 +Feb 2021 From 71ef18d050243a821de8ffc4ceaf713c23101478 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Wed, 17 Feb 2021 14:47:22 -0800 Subject: [PATCH 24/24] Apply suggestions from code review Co-authored-by: Luiz Irber --- doc/release-notes/sourmash-4.0.md | 1 + doc/support.md | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/doc/release-notes/sourmash-4.0.md b/doc/release-notes/sourmash-4.0.md index 66baf27c39..620af1f970 100644 --- a/doc/release-notes/sourmash-4.0.md +++ b/doc/release-notes/sourmash-4.0.md @@ -36,6 +36,7 @@ migrating to sourmash 4.0 in the * remove `dump` command (#1157) ### Feature/function deprecations + * deprecate `sourmash compute` (#1159) * deprecate `load_signatures`, `sourmash.load_one_signature`, `create_sbt_index`, and `load_sbt_index` (#1279, #1304) * deprecate `import_csv` in favor of new `sourmash sig import --csv` (#1281) diff --git a/doc/support.md b/doc/support.md index c6315ea2da..406ed81e65 100644 --- a/doc/support.md +++ b/doc/support.md @@ -130,7 +130,7 @@ If you want to upgrade workflows and scripts from prior releases of sourmash to sourmash v4.0, we suggest doing this in two stages. First, upgrade to the latest version of sourmash 3.5.x (currently -[v3.5.0](https://github.com/dib-lab/sourmash/releases/tag/v3.5.0)), +[v3.5.1](https://github.com/dib-lab/sourmash/releases/tag/v3.5.1)), which is compatible with all files and command lines used in previous versions of sourmash (v2.x and v3.x). After upgrading to 3.5.x, scan the sourmash output for deprecation warnings and fix those.