Skip to content

Commit

Permalink
add docs
Browse files Browse the repository at this point in the history
  • Loading branch information
ctb committed Jun 17, 2021
1 parent ab286cf commit 4d156e9
Showing 1 changed file with 61 additions and 38 deletions.
99 changes: 61 additions & 38 deletions doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -177,15 +177,14 @@ sourmash compare file1.sig [ file2.sig ... ]
```

Options:
```
--output -- save the distance matrix to this file (as a numpy binary matrix)
--ksize -- do the comparisons at this k-mer size.
--containment -- calculate containment instead of similarity.
C(i, j) = size(i intersection j) / size(i).
--from-file -- append the list of files in this text file to the input

* `--output` -- save the distance matrix to this file (as a numpy binary matrix)
* `--ksize` -- do the comparisons at this k-mer size.
* `--containment` -- calculate containment instead of similarity; `C(i, j) = size(i intersection j) / size(i)`
* `--from-file` -- append the list of files in this text file to the input
signatures.
--ignore-abundance -- ignore abundances in signatures.
```
* `--ignore-abundance` -- ignore abundances in signatures.
* `--picklist` -- select a subset of signatures with [a picklist](#using-picklists-to-subset-large-collections-of-signatures)

**Note:** compare by default produces a symmetric similarity matrix that can be used as an input to clustering. With `--containment`, however, this matrix is no longer symmetric and cannot formally be used for clustering.

Expand Down Expand Up @@ -249,6 +248,9 @@ similarity match
...
```

Note, as of sourmash 4.2.0, `search` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash gather` - find metagenome members

The `gather` subcommand selects the best reference genomes to use for
Expand Down Expand Up @@ -289,6 +291,9 @@ which matches are no longer reported; by default, this is set to
50kb. see the Appendix in
[Classifying Signatures](classifying-signatures.md) for details.

As of sourmash 4.2.0, `gather` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

Note:

Use `sourmash gather` to classify a metagenome against a collection of
Expand Down Expand Up @@ -350,6 +355,9 @@ containing a list of file names to index; you can also provide individual
signature files, directories full of signatures, or other sourmash
databases.

As of sourmash 4.2.0, `index` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash prefetch` - select subsets of very large databases for more processing

The `prefetch` subcommand searches a collection of scaled signatures
Expand All @@ -375,6 +383,7 @@ Other options include:
* `--threshold-bp` to require a minimum estimated bp overlap for output;
* `--scaled` for downsampling;
* `--force` to continue past survivable errors;
* `--picklist` select a subset of signatures with [a picklist](#using-picklists-to-subset-large-collections-of-signatures)

### Alternative search mode for low-memory (but slow) search: `--linear`

Expand Down Expand Up @@ -589,6 +598,9 @@ see
You can use `--from-file` to pass `lca index` a text file containing a
list of file names to index.

As of sourmash 4.2.0, `lca index` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash lca rankinfo` - examine an LCA database

The `sourmash lca rankinfo` command displays k-mer specificity
Expand Down Expand Up @@ -821,36 +833,8 @@ will extract the same signature, which has an accession number of
#### Using picklists with `sourmash sig extract`

As of sourmash 4.2.0, `extract` also supports picklists, a feature by
which you can select signatures based on values in a CSV file.

For example,
```
sourmash sig extract --picklist list.csv:md5:md5sum <signatures>
```
will extract only the signatures that have md5sums matching the
column `md5sum` in the CSV file `list.csv`.

The `--picklist` argument string must be of the format
`pickfile:colname:coltype`, where `pickfile` is the path to a CSV
file, `colname` is the name of the column to select from the CSV
file (based on the headers in the first line of the CSV file),
and `coltype` is the type of match.

The following `coltype`s are currently supported by `sourmash sig extract`:

* `name` - exact match to signature's name
* `md5` - exact match to signature's md5sum
* `md5prefix8` - match to 8-character prefix of signature's md5sum
* `md5short` - same as `md5prefix8`
* `ident` - exact match to signature's identifier
* `identprefix` - match to signature's identifier, before '.'

Identifiers are constructed by using the first space delimited word in
the signature name.

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` to construct an initial CSV file that you can
then edit further.
which you can select signatures based on values in a CSV file. See
[Using picklists to subset large collections of signatures](#using-picklists-to-subset-large-collections-of-signatures), below.

### `sourmash signature flatten` - remove abundance information from signatures

Expand Down Expand Up @@ -963,6 +947,45 @@ signatures with multiple ksizes or moltypes at the same time; you need
to pick the ksize and moltype to use for your search. Where possible,
scaled values will be made compatible.

### Using picklists to subset large collections of signatures

As of sourmash 4.2.0, many commands support *picklists*, a feature by
which you can select or "pick out" signatures based on values in a CSV
file.

For example,
```
sourmash sig extract --picklist list.csv:md5:md5sum <signatures>
```
will extract only the signatures that have md5sums matching the
column `md5sum` in the CSV file `list.csv`.

The `--picklist` argument string must be of the format
`pickfile:colname:coltype`, where `pickfile` is the path to a CSV
file, `colname` is the name of the column to select from the CSV
file (based on the headers in the first line of the CSV file),
and `coltype` is the type of match.

The following `coltype`s are currently supported by `sourmash sig extract`:

* `name` - exact match to signature's name
* `md5` - exact match to signature's md5sum
* `md5prefix8` - match to 8-character prefix of signature's md5sum
* `md5short` - same as `md5prefix8`
* `ident` - exact match to signature's identifier
* `identprefix` - match to signature's identifier, before '.'

Identifiers are constructed by using the first space delimited word in
the signature name.

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` to construct an initial CSV file that you can
then edit further.

In addition to `sig extract`, the following commands support
`--picklist` selection: `index`, `search`, `gather`, `prefetch`,
`compare`, `index`, and `lca index`.

### Storing (and searching) signatures

Backing up a little, there are many ways to store and search
Expand Down

0 comments on commit 4d156e9

Please sign in to comment.