MRG: fix multigather output by adding md5sum along with `-U/--output-add-query-md5sum` (#2722)

This PR:
- adds documentation for `multigather` to sourmash docs!
- builds on #2065 /
#2721 so that tests pass.
- adds an option `-U/--output-add-query-md5sum` to `sourmash
multigather`
- adds an option `--force-allow-overwrite-output` to `sourmash
multigather`
- **CHANGES BEHAVIOR** of multigather by treating `query.filename ==
'-'` as if `query.filename` is empty, thus replacing it with md5sum
- **CHANGES BEHAVIOR** of multigather by failing loudly and clearly if
output files are going to be overwritten
- adds `-E/--extension` to allow output to files other than `.sig`

See discussion over in [#2328: `multigather` CSV output uses signature
`filename` as
basename](#2328).

To add:
- [x] tests for `-U`;
- [x] implement and test `-E/--extension`
- [x] implement and test `--force-allow-overwrite-output`
- [x] fix for `query_filename` being None/empty in `-U` branch
- [x] documentation update for changed output behavior for multigather:
'-' => using md5sum
- [x] documentation update for changed output behavior for multigather:
fails if overwrite happens
- [x] fix multigather link in docs

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Olga Botvinnik <olga.botvinnik@gmail.com>
Co-authored-by: Keya Barve <53328492+keyabarve@users.noreply.github.com>
Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com>
Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Reiter <taylorreiter@gmail.com>
Co-authored-by: Erik Young <eeyoung@ucdavis.edu>
Co-authored-by: David Koslicki <dmkoslicki@gmail.com>
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Co-authored-by: Colton Baumler <baumlerc@farm.ucdavis.edu>
Co-authored-by: Luiz Irber <contact+github@luizirber.org>
Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com>
Co-authored-by: Peter Cock <p.j.a.cock@googlemail.com>
Co-authored-by: Francesco Beghini <francesco.beghini@yale.edu>
Co-authored-by: Jason Stajich <jason.stajich@ucr.edu>
Co-authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
18 people authored Feb 29, 2024
1 parent 0d2359d commit c7a1265
Showing 5 changed files with 302 additions and 60 deletions.
45 changes: 45 additions & 0 deletions doc/command-line.md
@@ -347,6 +347,10 @@ metagenome and genome bin analysis. (See
[Classifying Signatures](classifying-signatures.md) for more
information on the different approaches that can be used here.)

`sourmash gather` takes exactly one query and one or more
[collections of signatures](#storing-and-searching-signatures). Please see
[`sourmash multigather`](#sourmash-multigather-do-gather-with-many-queries) if you have multiple queries!

If the input signature was created with `-p abund`, output
will be abundance weighted (unless `--ignore-abundances` is
specified). `-o/--output` will create a CSV file containing the
@@ -534,6 +538,47 @@ This combination of commands ensures that the more time- and
memory-intensive `gather` step is run only on a small set of relevant
signatures, rather than all the signatures in the database.

### `sourmash multigather` - do gather with many queries

The `multigather` subcommand runs `sourmash gather` on multiple
queries. (See
[`sourmash gather` docs](#sourmash-gather-find-metagenome-members) for
specifics on what gather does, and how!)

Usage:
```
sourmash multigather --query <queries ...> --db <collections>
```

Note that multigather is single-threaded, so it offers no substantial
efficiency gains over just running gather multiple times! Nonetheless, it
is useful for situations where you have many sketches organized in a
combined file (e.g. sketches built with `sourmash sketch
... --singleton`).

#### `multigather` output files

multigather produces three output files for each
query:

* `<output_base>.csv` - gather CSV output
* `<output_base>.matches.sig` - all matching signatures
* `<output_base>.unassigned.sig` - all remaining unassigned hashes

As of sourmash v4.8.7, `<output_base>` is set as follows:
* the filename attribute of the query sketch, if it is not empty or `-`;
* the query sketch md5sum, if the query filename is empty or `-`;
* the query filename + the query sketch md5sum
(`<query_file>.<md5sum>`), if `-U/--output-add-query-md5sum` is
specified and the query filename is not empty or `-`.

By default, `multigather` will complain and exit with an error if
the same `<output_base>` is used repeatedly and an output file is
going to be overwritten. With `-U/--output-add-query-md5sum` this
should only happen when identical sketches are present in a query
database. Use `--force-allow-overwrite-output`
to allow overwriting of output files without an error.
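
The naming rules above can be sketched in a few lines of Python. This is a minimal illustration that mirrors the branch order in the `commands.py` change from this commit; the helper name `choose_output_base` and its parameters are hypothetical and are not part of the sourmash API.

```python
import os


def choose_output_base(filename, md5sum, add_query_md5sum=False, outdir=None):
    """Pick the output basename for one query (hypothetical helper)."""
    if not filename or filename == "-":
        # no usable filename on the sketch: fall back to its md5sum
        base = md5sum
    elif add_query_md5sum:
        # -U: filename plus md5sum, so sketches from one file stay distinct
        base = os.path.basename(filename) + "." + md5sum
    else:
        base = os.path.basename(filename)

    if outdir:
        # optional output directory is prepended last
        base = os.path.join(outdir, base)
    return base


for fn in ("", "-", "data/reads.fa"):
    print(repr(fn), "->", choose_output_base(fn, "abc123"))
```

Note that the md5sum fallback takes priority over `-U`: a query with an empty or `-` filename gets the bare md5sum whether or not `-U` is given.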

## `sourmash tax` subcommands for integrating taxonomic information into gather results

The `sourmash tax` subcommands support taxonomic analysis of genomes
18 changes: 18 additions & 0 deletions src/sourmash/cli/multigather.py
@@ -79,6 +79,11 @@ def subparser(subparsers):
action="store_true",
help="stop at databases that contain no compatible signatures",
)
subparser.add_argument(
"--force-allow-overwrite-output",
action="store_true",
help="allow output files to be overwritten",
)
subparser.add_argument(
"--no-fail-on-empty-database",
action="store_false",
@@ -92,6 +97,19 @@ def subparser(subparsers):
"--outdir",
help="output CSV results to this directory",
)
subparser.add_argument(
"-U",
"--output-add-query-md5sum",
action="store_true",
help="add md5sum of each query to ensure unique output file names",
)
subparser.add_argument(
"-E",
"--extension",
type=str,
default=".sig",
help="write signature files with this extension ('.sig' by default)",
)

add_ksize_arg(subparser)
add_moltype_args(subparser)
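
To see how the three new options surface on the parsed `args` object, here is a standalone sketch that uses only the standard-library `argparse` module. The parser name `multigather-sketch` is an assumption for illustration; the real subparser is built inside sourmash's CLI machinery and takes many more arguments.

```python
import argparse

# Stand-in parser carrying only the options added in this commit.
parser = argparse.ArgumentParser(prog="multigather-sketch")
parser.add_argument(
    "--force-allow-overwrite-output",
    action="store_true",
    help="allow output files to be overwritten",
)
parser.add_argument(
    "-U",
    "--output-add-query-md5sum",
    action="store_true",
    help="add md5sum of each query to ensure unique output file names",
)
parser.add_argument(
    "-E",
    "--extension",
    type=str,
    default=".sig",
    help="write signature files with this extension ('.sig' by default)",
)

args = parser.parse_args(["-U", "-E", ".zip"])
# argparse turns dashes into underscores for attribute names
print(args.output_add_query_md5sum, args.force_allow_overwrite_output, args.extension)
```

The attribute names `args.output_add_query_md5sum`, `args.force_allow_overwrite_output`, and `args.extension` are the ones the `commands.py` change reads.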
71 changes: 47 additions & 24 deletions src/sourmash/commands.py
@@ -1162,6 +1162,7 @@ def multigather(args):
# run gather on all the queries.
n = 0
size_may_be_inaccurate = False
output_base_tracking = set() # make sure we are not reusing 'output_base'
for queryfile in inp_files:
# load the query signature(s) & figure out all the things
for query in sourmash_args.load_file_as_signatures(
@@ -1228,21 +1229,42 @@ def multigather(args):
result = None

query_filename = query.filename
if not query_filename:
if not query_filename or query_filename == "-":
# use md5sum if query.filename not properly set
query_filename = query.md5sum()
output_base = query.md5sum()
elif args.output_add_query_md5sum:
# Uniquify the output file if all signatures were made from the same file (e.g. with --singleton)
assert query_filename and query_filename != "-" # first branch
output_base = os.path.basename(query_filename) + "." + query.md5sum()
else:
output_base = os.path.basename(query_filename)

output_base = os.path.basename(query_filename)
if args.output_dir:
output_base = os.path.join(args.output_dir, output_base)
output_csv = output_base + ".csv"

# track overwrites of output files!
if output_base in output_base_tracking:
error(
f"ERROR: detected overwritten outputs! '{output_base}' has already been used. Failing."
)
if args.force_allow_overwrite_output:
error("continuing because --force-allow-overwrite-output was specified")
else:
error(
"Consider using '-U/--output-add-query-md5sum' to build unique outputs"
)
error("and/or '--force-allow-overwrite-output'")
sys.exit(-1)

output_base_tracking.add(output_base)

output_matches = output_base + ".matches.sig"
save_sig_obj = SaveSignaturesToLocation(output_matches)
save_sig = save_sig_obj.__enter__()
notify(f"saving all matching signatures to '{output_matches}'")

# track matches
# write out basic CSV file
output_csv = output_base + ".csv"
notify(f'saving all CSV matches to "{output_csv}"')
csv_out_obj = FileOutputCSV(output_csv)
csv_outfp = csv_out_obj.__enter__()
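
The overwrite check in the hunk above comes down to remembering every `output_base` in a set and reacting when one repeats. Below is a minimal standalone sketch of that pattern. The function name `check_output_bases` is hypothetical, and it raises `ValueError` where the real command reports an error and calls `sys.exit(-1)`.

```python
def check_output_bases(output_bases, force_allow_overwrite=False):
    """Return the reused bases; raise on the first reuse unless forced."""
    seen = set()  # every output base used so far
    reused = []
    for base in output_bases:
        if base in seen:
            if not force_allow_overwrite:
                raise ValueError(
                    f"'{base}' has already been used; consider "
                    "-U/--output-add-query-md5sum"
                )
            reused.append(base)
        seen.add(base)
    return reused


# With the force flag, duplicates are reported rather than fatal.
print(check_output_bases(["a.fa", "b.fa", "a.fa"], force_allow_overwrite=True))
```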
@@ -1330,31 +1352,32 @@ def multigather(args):
notify("nothing found... skipping.")
continue

output_unassigned = output_base + ".unassigned.sig"
with open(output_unassigned, "w"):
remaining_query = gather_iter.query
if noident_mh:
remaining_mh = remaining_query.minhash.to_mutable()
remaining_mh += noident_mh.downsample(scaled=remaining_mh.scaled)
remaining_query.minhash = remaining_mh
output_unassigned = output_base + f".unassigned{args.extension}"
remaining_query = gather_iter.query
if noident_mh:
remaining_mh = remaining_query.minhash.to_mutable()
remaining_mh += noident_mh.downsample(scaled=remaining_mh.scaled)
remaining_query.minhash = remaining_mh

if is_abundance:
abund_query_mh = remaining_query.minhash.inflate(orig_query_mh)
remaining_query.minhash = abund_query_mh
if is_abundance:
abund_query_mh = remaining_query.minhash.inflate(orig_query_mh)
remaining_query.minhash = abund_query_mh

if found == 0:
notify("nothing found - entire query signature unassigned.")
elif not remaining_query:
notify("no unassigned hashes! not saving.")
else:
notify(f'saving unassigned hashes to "{output_unassigned}"')
if found == 0:
notify("nothing found - entire query signature unassigned.")
elif not remaining_query:
notify("no unassigned hashes! not saving.")
else:
notify(f'saving unassigned hashes to "{output_unassigned}"')

with SaveSignaturesToLocation(output_unassigned) as save_sig:
save_sig.add(remaining_query)

with SaveSignaturesToLocation(output_unassigned) as save_sig:
# CTB: note, multigather does not save abundances
save_sig.add(remaining_query)
n += 1

# fini, next query!

# done! report at end.
notify(f"\nconducted gather searches on {n} signatures")
if size_may_be_inaccurate:
notify(
5 changes: 5 additions & 0 deletions tests/conftest.py
@@ -84,6 +84,11 @@ def sig_save_extension(request):
return request.param


@pytest.fixture(params=["sig", "sig.gz", "zip", ".d/"])
def sig_save_extension_abund(request):
return request.param


# --- BEGIN - Only run tests using a particular fixture --- #
# Cribbed from: http://pythontesting.net/framework/pytest/pytest-run-tests-using-particular-fixture/
def pytest_collection_modifyitems(items, config):