Merge branch 'add/picklist_selectors' into add/picklist_zf_manifests
ctb committed Jun 17, 2021
2 parents bba101c + 4d156e9 commit 7937292
Showing 3 changed files with 138 additions and 39 deletions.
99 changes: 61 additions & 38 deletions doc/command-line.md
@@ -177,15 +177,14 @@ sourmash compare file1.sig [ file2.sig ... ]
```

Options:
```
--output -- save the distance matrix to this file (as a numpy binary matrix)
--ksize -- do the comparisons at this k-mer size.
--containment -- calculate containment instead of similarity.
C(i, j) = size(i intersection j) / size(i).
--from-file -- append the list of files in this text file to the input

* `--output` -- save the distance matrix to this file (as a numpy binary matrix)
* `--ksize` -- do the comparisons at this k-mer size.
* `--containment` -- calculate containment instead of similarity; `C(i, j) = size(i intersection j) / size(i)`
* `--from-file` -- append the list of files in this text file to the input
signatures.
--ignore-abundance -- ignore abundances in signatures.
```
* `--ignore-abundance` -- ignore abundances in signatures.
* `--picklist` -- select a subset of signatures with [a picklist](#using-picklists-to-subset-large-collections-of-signatures)

**Note:** compare by default produces a symmetric similarity matrix that can be used as an input to clustering. With `--containment`, however, this matrix is no longer symmetric and cannot formally be used for clustering.
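
To make the asymmetry concrete, here is a minimal sketch that uses plain Python sets as stand-ins for sketches. It assumes the default similarity is unweighted Jaccard (an assumption; abundance-weighted comparisons behave differently). The function names and hash values are hypothetical and are not part of the sourmash API.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; symmetric in a and b."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Containment C(a, b) = |a & b| / |a|; depends on argument order."""
    return len(a & b) / len(a) if a else 0.0


# Toy hash sets (made-up values, not real sketch hashes).
i = {1, 2, 3, 4}
j = {3, 4, 5, 6, 7, 8, 9, 10}

sim_ij, sim_ji = jaccard(i, j), jaccard(j, i)
c_ij, c_ji = containment(i, j), containment(j, i)

print(f"similarity:  {sim_ij} vs {sim_ji}")
print(f"containment: {c_ij} vs {c_ji}")
```

Swapping the arguments leaves the similarity unchanged but changes the containment, which is why a containment matrix is generally not symmetric.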

@@ -249,6 +248,9 @@ similarity match
...
```

Note, as of sourmash 4.2.0, `search` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash gather` - find metagenome members

The `gather` subcommand selects the best reference genomes to use for
@@ -289,6 +291,9 @@ which matches are no longer reported; by default, this is set to
50kb. See the Appendix in
[Classifying Signatures](classifying-signatures.md) for details.

As of sourmash 4.2.0, `gather` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

Note:

Use `sourmash gather` to classify a metagenome against a collection of
@@ -350,6 +355,9 @@ containing a list of file names to index; you can also provide individual
signature files, directories full of signatures, or other sourmash
databases.

As of sourmash 4.2.0, `index` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash prefetch` - select subsets of very large databases for more processing

The `prefetch` subcommand searches a collection of scaled signatures
@@ -375,6 +383,7 @@ Other options include:
* `--threshold-bp` to require a minimum estimated bp overlap for output;
* `--scaled` for downsampling;
* `--force` to continue past survivable errors;
* `--picklist` to select a subset of signatures with [a picklist](#using-picklists-to-subset-large-collections-of-signatures).

### Alternative search mode for low-memory (but slow) search: `--linear`

@@ -589,6 +598,9 @@ see
You can use `--from-file` to pass `lca index` a text file containing a
list of file names to index.

As of sourmash 4.2.0, `lca index` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash lca rankinfo` - examine an LCA database

The `sourmash lca rankinfo` command displays k-mer specificity
@@ -821,36 +833,8 @@ will extract the same signature, which has an accession number of
#### Using picklists with `sourmash sig extract`

As of sourmash 4.2.0, `extract` also supports picklists, a feature by
which you can select signatures based on values in a CSV file.

For example,
```
sourmash sig extract --picklist list.csv:md5:md5sum <signatures>
```
will extract only the signatures that have md5sums matching the
column `md5sum` in the CSV file `list.csv`.

The `--picklist` argument string must be of the format
`pickfile:colname:coltype`, where `pickfile` is the path to a CSV
file, `colname` is the name of the column to select from the CSV
file (based on the headers in the first line of the CSV file),
and `coltype` is the type of match.

The following `coltype`s are currently supported by `sourmash sig extract`:

* `name` - exact match to signature's name
* `md5` - exact match to signature's md5sum
* `md5prefix8` - match to 8-character prefix of signature's md5sum
* `md5short` - same as `md5prefix8`
* `ident` - exact match to signature's identifier
* `identprefix` - match to signature's identifier, before '.'

Identifiers are constructed by using the first space delimited word in
the signature name.

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` to construct an initial CSV file that you can
then edit further.
which you can select signatures based on values in a CSV file. See
[Using picklists to subset large collections of signatures](#using-picklists-to-subset-large-collections-of-signatures), below.

### `sourmash signature flatten` - remove abundance information from signatures

@@ -963,6 +947,45 @@ signatures with multiple ksizes or moltypes at the same time; you need
to pick the ksize and moltype to use for your search. Where possible,
scaled values will be made compatible.

### Using picklists to subset large collections of signatures

As of sourmash 4.2.0, many commands support *picklists*, a feature by
which you can select or "pick out" signatures based on values in a CSV
file.

For example,
```
sourmash sig extract --picklist list.csv:md5sum:md5 <signatures>
```
will extract only the signatures that have md5sums matching the
column `md5sum` in the CSV file `list.csv`.

The `--picklist` argument string must be of the format
`pickfile:colname:coltype`, where `pickfile` is the path to a CSV
file, `colname` is the name of the column to select from the CSV
file (based on the headers in the first line of the CSV file),
and `coltype` is the type of match.
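
As an illustration of the `pickfile:colname:coltype` format, the following minimal sketch splits such an argument string into its three fields. The names `PicklistSpec` and `parse_picklist_arg` are hypothetical and do not describe how sourmash itself parses the option. The accepted `coltype` values are the ones listed in this section. Splitting from the right is a choice made here so that a colon inside the file path does not break parsing.

```python
from typing import NamedTuple

# coltype values named in this section of the documentation
SUPPORTED_COLTYPES = {
    "name", "md5", "md5prefix8", "md5short", "ident", "identprefix",
}


class PicklistSpec(NamedTuple):
    pickfile: str  # path to the CSV file
    colname: str   # column header to read values from
    coltype: str   # how values are matched against signatures


def parse_picklist_arg(argstr: str) -> PicklistSpec:
    """Split 'pickfile:colname:coltype' into its three fields."""
    parts = argstr.rsplit(":", 2)  # split from the right, at most twice
    if len(parts) != 3:
        raise ValueError(f"expected pickfile:colname:coltype, got {argstr!r}")
    pickfile, colname, coltype = parts
    if coltype not in SUPPORTED_COLTYPES:
        raise ValueError(f"unsupported coltype {coltype!r}")
    return PicklistSpec(pickfile, colname, coltype)


spec = parse_picklist_arg("prefetch-out.csv:match_md5:md5short")
print(spec)
```

The example argument mirrors the one used in the tests in this commit: column `match_md5`, matched as `md5short`.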

The following `coltype`s are currently supported by `sourmash sig extract`:

* `name` - exact match to signature's name
* `md5` - exact match to signature's md5sum
* `md5prefix8` - match to 8-character prefix of signature's md5sum
* `md5short` - same as `md5prefix8`
* `ident` - exact match to signature's identifier
* `identprefix` - match to signature's identifier, before '.'

Identifiers are constructed by using the first space delimited word in
the signature name.
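
The matching rules above can be sketched as a function that derives, for each `coltype`, the value compared against the picklist column. This is an illustrative sketch only: `picklist_key` is a hypothetical helper, and the signature name and md5sum are made-up example values rather than output from sourmash.

```python
def picklist_key(name: str, md5: str, coltype: str) -> str:
    """Return the value of a signature that a given coltype matches on."""
    # identifier: first space-delimited word of the signature name
    ident = name.split(" ", 1)[0]
    if coltype == "name":
        return name
    if coltype == "md5":
        return md5
    if coltype in ("md5prefix8", "md5short"):
        return md5[:8]  # 8-character prefix of the md5sum
    if coltype == "ident":
        return ident
    if coltype == "identprefix":
        return ident.split(".", 1)[0]  # identifier before the first '.'
    raise ValueError(f"unsupported coltype {coltype!r}")


# Made-up example values.
sig_name = "NC_000853.1 Thermotoga maritima"
sig_md5 = "0123456789abcdef0123456789abcdef"

for coltype in ("name", "md5", "md5short", "ident", "identprefix"):
    print(coltype, "->", picklist_key(sig_name, sig_md5, coltype))
```

A signature is then selected when its key for the chosen `coltype` appears among the values in the picklist column.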

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` to construct an initial CSV file that you can
then edit further.
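
The "edit further" step can also be scripted. The sketch below filters a CSV down to the rows to keep and writes the result as a new picklist. It assumes the input CSV has `name` and `md5` columns (an assumption about the `sig describe --csv` output; check the header of your own file), and it uses in-memory text with made-up rows in place of real files. `filter_rows` is a hypothetical helper.

```python
import csv
import io

# Stand-in for a CSV written by `sourmash sig describe --csv out.csv`;
# the column names and rows here are made up for illustration.
describe_csv = io.StringIO(
    "name,md5,ksize\n"
    "NC_000853.1 Thermotoga maritima,0123456789abcdef0123456789abcdef,21\n"
    "NC_003198.1 Salmonella enterica,fedcba9876543210fedcba9876543210,21\n"
    "NC_009486.1 Thermotoga petrophila,00112233445566778899aabbccddeeff,21\n"
)


def filter_rows(infp, outfp, keep) -> int:
    """Copy rows for which keep(row) is true; return the number kept."""
    reader = csv.DictReader(infp)
    writer = csv.DictWriter(outfp, fieldnames=reader.fieldnames)
    writer.writeheader()
    n_kept = 0
    for row in reader:
        if keep(row):
            writer.writerow(row)
            n_kept += 1
    return n_kept


picklist_csv = io.StringIO()
n_kept = filter_rows(describe_csv, picklist_csv,
                     lambda row: "Thermotoga" in row["name"])
print(f"kept {n_kept} rows")
print(picklist_csv.getvalue())
```

Written to a file such as `thermo.csv` (a hypothetical name), the result could then be passed as `--picklist thermo.csv:md5:md5`.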

In addition to `sig extract`, the following commands support
`--picklist` selection: `index`, `search`, `gather`, `prefetch`,
`compare`, and `lca index`.

### Storing (and searching) signatures

Backing up a little, there are many ways to store and search
1 change: 0 additions & 1 deletion src/sourmash/index.py
@@ -399,7 +399,6 @@ def signatures(self):
        "Return the selected signatures."
        db = self.db.select(**self.selection_dict)
        for ss in db.signatures():
            print('MATCH!', ss)
            yield ss

    def signatures_with_location(self):
77 changes: 77 additions & 0 deletions tests/test_sourmash.py
@@ -4897,3 +4897,80 @@ def test_index_with_picklist(runtmp):
    assert len(siglist) == 3
    for ss in siglist:
        assert 'Thermotoga' in ss.name


def test_index_matches_search_with_picklist(runtmp):
    # test 'sourmash index' with picklists
    gcf_sig_dir = utils.get_test_data('gather/')
    gcf_sigs = glob.glob(utils.get_test_data('gather/GCF*.sig'))
    picklist = utils.get_test_data('gather/thermotoga-picklist.csv')
    metag_sig = utils.get_test_data('gather/combined.sig')

    output_db = runtmp.output('thermo.sbt.zip')

    runtmp.sourmash('index', output_db, gcf_sig_dir, '-k', '21')
    print(runtmp.last_result.out)
    print(runtmp.last_result.err)

    # verify:
    siglist = list(sourmash.load_file_as_signatures(output_db))
    assert len(siglist) > 3     # all signatures included...

    n_thermo = 0
    for ss in siglist:
        if 'Thermotoga' in ss.name:
            n_thermo += 1

    assert n_thermo == 3

    runtmp.sourmash('search', metag_sig, output_db, '--containment',
                    '-k', '21', '--picklist', f"{picklist}:md5:md5")

    err = runtmp.last_result.err
    print(err)
    assert "for given picklist, found 3 matches to 9 distinct values" in err
    # these are the different ksizes
    assert "WARNING: 6 missing picklist values." in err

    out = runtmp.last_result.out
    print(out)
    assert "3 matches:" in out
    assert "13.1% NC_000853.1 Thermotoga" in out
    assert "13.0% NC_009486.1 Thermotoga" in out
    assert "12.8% NC_011978.1 Thermotoga" in out


def test_gather_with_prefetch_picklist(runtmp, linear_gather):
    # test 'gather' using a picklist taken from 'sourmash prefetch' output
    gcf_sigs = glob.glob(utils.get_test_data('gather/GCF*.sig'))
    metag_sig = utils.get_test_data('gather/combined.sig')
    prefetch_csv = runtmp.output('prefetch-out.csv')

    runtmp.sourmash('prefetch', metag_sig, *gcf_sigs,
                    '-k', '21', '-o', prefetch_csv)

    err = runtmp.last_result.err
    print(err)

    out = runtmp.last_result.out
    print(out)

    assert "total of 12 matching signatures." in err
    assert "of 1466 distinct query hashes, 1466 were found in matches above threshold." in err

    # now, do a gather with the results
    runtmp.sourmash('gather', metag_sig, *gcf_sigs, linear_gather,
                    '-k', '21', '--picklist',
                    f'{prefetch_csv}:match_md5:md5short')

    err = runtmp.last_result.err
    print(err)

    out = runtmp.last_result.out
    print(out)

    assert "found 11 matches total;" in out
    assert "the recovered matches hit 99.9% of the query" in out

    assert "4.9 Mbp 33.2% 100.0% NC_003198.1 " in out
    assert "1.9 Mbp 13.1% 100.0% NC_000853.1 " in out
