[MRG] implement a simple ZipFileLinearIndex class (#1349)
* implement a simple ZipFileLinearIndex class

* fix load_file_as_signatures

* add tests for zipfile searching etc.

* add sig describe test for loading from zipfile

* fix load_file_as_index to support zipfiles

* rename force; add docstrings

* add an IndexOfIndexes class

* rename to MultiIndex

* switch to using MultiIndex for loading from a directory

* some more MultiIndex tests

* add test of MultiIndex.signatures

* add docstring for MultiIndex

* stop special-casing SIGLISTs

* fix test to match more informative error message

* switch to using LinearIndex.load for stdin, too

* add __len__ to MultiIndex

* add check_csv to check for appropriate filename loading info

* add comment

* fix databases load

* more tests needed

* add tests for incompatible signatures

* add filter to LinearIndex and MultiIndex

* clean up sourmash_args some more

* shift loading over to Index classes

* refactor, fix tests

* switch to a list of loader functions

* comments, docstrings, and tests passing

* update to use f strings throughout sourmash_args.py

* add docstrings

* update comments

* remove unnecessary changes

* revert to original test

* remove unneeded comment

* clean up a bit

* debugging update

* better exception raising and capture for signature parsing

* more specific error message

* revert change in favor of creating new issue

* add commentary => TODO

* add tests for MultiIndex.load_from_directory; fix traverse code

* switch lca summarize over to using MultiIndex

* switch to using MultiIndex in categorize

* remove LoadSingleSignatures

* test errors in lca database loading

* remove unneeded categorize code

* add testme info

* verified that this was tested

* remove testme comments

* add tests for MultiIndex.load_from_file_list

* refactor select, add scaled/num/abund

* more work

* catch ValueError from db.select

* update debug print to sys.stderr

* fix scaled check for LCA database

* add debug_literal

* break things when filter returns empty Index

* fix scaled check for SBT

* fix a few tests

* fix LCA database ksize message & test

* flag for removal

* add 'containment' to 'select'

* fix remaining tests

* update comments

* remove all the cruft, yay

* added 'is_database' flag for nicer UX

* remove overly broad exception catching

* add docstrings

* document downsampling foo

* update for additional test files

* update ZipFileLinearIndex for new selector criteria

* remove leftover code fragment

* add zipfile API tests; use .location

* update docs to include zipfile collections

* add zipfile loading tests

* add __len__ to ZipFileLinearIndex and test MultiIndex load of zipfile

* Update doc/command-line.md

Co-authored-by: Tessa Pierce <bluegenes@users.noreply.github.com>

* add test of incompatible sig search for zipfile

ctb and bluegenes authored Apr 3, 2021
1 parent 1dc9426 commit beb3d59
Showing 14 changed files with 396 additions and 49 deletions.
47 changes: 29 additions & 18 deletions doc/command-line.md
@@ -842,7 +842,9 @@ scaled values will be made compatible.
### Storing (and searching) signatures

Backing up a little, there are many ways to store and search
signatures.
signatures. `sourmash` supports storing and loading signatures from JSON
files, directories, lists of files, Zip files, and indexed databases.
These can all be used interchangeably for sourmash operations.

The simplest is one signature in a single JSON file. You can also put
many signatures in a single JSON file, either by building them that
@@ -851,7 +853,29 @@ commands. Searching or comparing these files involves loading them
sequentially and iterating across all of the signatures - which can be
slow, especially for many (100s or 1000s) of signatures.

Indexed databases can make searching signatures a lot faster. SBT
### Zip files

All of the `sourmash` commands support loading collections of
signatures from zip files. You can create a compressed collection of
signatures using `zip -r collection.zip *.sig` and then specify
`collection.zip` on the command line.
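
As an illustration only (not part of this commit), the same kind of collection could be built from Python with the standard-library `zipfile` module instead of the `zip` command. The helper name `build_sig_zip` and the choice to store paths relative to the source directory are assumptions made for this sketch.

```python
import zipfile
from pathlib import Path


def build_sig_zip(sig_dir, zip_path):
    """Write every *.sig file under sig_dir into zip_path.

    Hypothetical helper for illustration; returns the archive member names.
    """
    sig_dir = Path(sig_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # sorted() gives a deterministic member order
        for path in sorted(sig_dir.rglob("*.sig")):
            arcname = path.relative_to(sig_dir).as_posix()
            zf.write(path, arcname=arcname)
            names.append(arcname)
    return names
```

The resulting archive can then be passed on the command line in place of the individual `.sig` files, as the documentation text above describes.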

### Loading all signatures under a directory

All of the `sourmash` commands support loading signatures from
beneath directories; provide the paths on the command line.

#### Passing in lists of files

Most sourmash commands will also take `--from-file` or
`--query-from-file`, which will take a path to a text file containing
a list of file paths. This can be useful for situations where you want
to specify thousands of queries, or a subset of signatures produced by
some other command.
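
As an illustration only (not part of this commit), a file list for `--from-file` could be generated with a few lines of Python. This sketch assumes the list file holds one path per line; the helper name `write_pathlist` and the `*.sig` pattern are hypothetical.

```python
from pathlib import Path


def write_pathlist(sig_dir, out_path, pattern="*.sig"):
    """Write one matching file path per line to out_path; return the paths.

    Hypothetical helper: assumes a one-path-per-line list file format.
    """
    paths = sorted(str(p) for p in Path(sig_dir).rglob(pattern))
    Path(out_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```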

#### Indexed databases

Indexed databases can make searching signatures much faster. SBT
databases are low memory and disk-intensive databases that allow for
fast searches using a tree structure, while LCA databases are higher
memory and (after a potentially significant load time) are quite fast.
@@ -869,19 +893,6 @@ will complain. In contrast, signature files can
contain many different types of signatures, and compatible ones will
be discovered automatically.

### Passing in lists of files

Various sourmash commands will also take `--from-file` or
`--query-from-file`, which will take a path to a text file containing
a list of file paths. This can be useful for situations where you want
to specify thousands of queries, or a subset of signatures produced by
some other command.

### Loading all signatures under a directory

All of the `sourmash` commands support loading signatures from
beneath directories; provide the paths on the command line.

### Combining search databases on the command line

All of the commands in sourmash operate in "online" mode, so you can
@@ -902,9 +913,9 @@ been useful. :)

### Using stdin

Most commands will take stdin via the usual UNIX convention, `-`.
Moreover, `sourmash sketch` and the `sourmash sig` commands will
output to stdout. So, for example,
Most commands will take signature JSON data via stdin using the usual
UNIX convention, `-`. Moreover, `sourmash sketch` and the `sourmash
sig` commands will output to stdout. So, for example,

`sourmash sketch ... -o - | sourmash sig describe -` will describe the
signatures that were just created.
68 changes: 64 additions & 4 deletions src/sourmash/index.py
@@ -3,6 +3,7 @@
import sourmash
from abc import abstractmethod, ABC
from collections import namedtuple
import zipfile
import os


@@ -83,7 +84,7 @@ def search(self, query, threshold=None,
for ss in self.signatures():
score = query_match(ss)
if score >= threshold:
matches.append((score, ss, self.filename))
matches.append((score, ss, self.location))

# sort!
matches.sort(key=lambda x: -x[0])
@@ -119,7 +120,7 @@ def gather(self, query, *args, **kwargs):
for ss in self.signatures():
cont = query.minhash.contained_by(ss.minhash, True)
if cont and cont >= threshold:
results.append((cont, ss, self.filename))
results.append((cont, ss, self.location))

results.sort(reverse=True, key=lambda x: (x[0], x[1].md5sum()))

@@ -182,7 +183,7 @@ def __init__(self, _signatures=None, filename=None):
self._signatures = []
if _signatures:
self._signatures = list(_signatures)
self.filename = filename
self.location = filename

def signatures(self):
return iter(self._signatures)
@@ -219,7 +220,66 @@ def select(self, **kwargs):
if select_signature(ss, **kwargs):
siglist.append(ss)

return LinearIndex(siglist, self.filename)
return LinearIndex(siglist, self.location)


class ZipFileLinearIndex(Index):
    """\
    A read-only collection of signatures in a zip file.
    Does not support `insert` or `save`.
    """
    is_database = True

    def __init__(self, zf, selection_dict=None,
                 traverse_yield_all=False):
        self.zf = zf
        self.selection_dict = selection_dict
        self.traverse_yield_all = traverse_yield_all

    def __len__(self):
        return len(list(self.signatures()))

    @property
    def location(self):
        return self.zf.filename

    def insert(self, signature):
        raise NotImplementedError

    def save(self, path):
        raise NotImplementedError

    @classmethod
    def load(cls, location, traverse_yield_all=False):
        "Class method to load a zipfile."
        zf = zipfile.ZipFile(location, 'r')
        return cls(zf, traverse_yield_all=traverse_yield_all)

    def signatures(self):
        "Load all signatures in the zip file."
        from .signature import load_signatures
        for zipinfo in self.zf.infolist():
            # should we load this file? if it ends in .sig OR we are forcing:
            if zipinfo.filename.endswith('.sig') or \
               zipinfo.filename.endswith('.sig.gz') or \
               self.traverse_yield_all:
                fp = self.zf.open(zipinfo)

                # now load all the signatures and select on ksize/moltype:
                selection_dict = self.selection_dict
                for ss in load_signatures(fp):
                    if selection_dict:
                        if select_signature(ss, **self.selection_dict):
                            yield ss
                    else:
                        yield ss

    def select(self, **kwargs):
        "Select signatures in zip file based on ksize/moltype/etc."
        return ZipFileLinearIndex(self.zf,
                                  selection_dict=kwargs,
                                  traverse_yield_all=self.traverse_yield_all)


class MultiIndex(Index):
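
As an illustration only (not part of this commit), the lazy pattern used by `ZipFileLinearIndex.signatures()` and `select()` can be sketched without sourmash: walk the archive members, skip those without a matching suffix unless forced, and apply selection criteria while iterating rather than up front. The sketch assumes each member holds a JSON list of plain dictionaries and compares criteria by simple equality; the real code uses `load_signatures` and `select_signature`, whose behavior is not reproduced here. The function name `iter_zip_records` is hypothetical.

```python
import json
import zipfile


def iter_zip_records(zf, suffixes=(".sig",), yield_all=False, **criteria):
    """Lazily yield JSON records from matching members of an open ZipFile.

    Hypothetical stand-in for signature loading: each member is assumed to
    contain a JSON list of dicts, and criteria are matched by equality.
    """
    for info in zf.infolist():
        if info.is_dir():
            continue
        # load this member only if its suffix matches, or if we are forcing
        if not (yield_all or info.filename.endswith(suffixes)):
            continue
        with zf.open(info) as fp:
            records = json.load(fp)
        for rec in records:
            if all(rec.get(key) == value for key, value in criteria.items()):
                yield rec
```

Because filtering happens during iteration, creating a "selected" view is cheap, while every pass over the records re-reads the archive. The `__len__` in the class above has the same cost, since it iterates all signatures to count them.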
4 changes: 1 addition & 3 deletions src/sourmash/sbt.py
@@ -754,9 +754,7 @@ def load(cls, location, *, leaf_loader=None, storage=None, print_version_warning

if storage:
sbts = storage.list_sbts()
if len(sbts) != 1:
print("no SBT, or too many SBTs!")
else:
if len(sbts) == 1:
tree_data = storage.load(sbts[0])

tempfile = NamedTemporaryFile()
29 changes: 22 additions & 7 deletions src/sourmash/sourmash_args.py
@@ -17,7 +17,7 @@
from . import signature
from .logging import notify, error, debug_literal

from .index import LinearIndex, MultiIndex
from .index import (LinearIndex, ZipFileLinearIndex, MultiIndex)
from . import signature as sig
from .sbt import SBT
from .sbtmh import SigLeaf
@@ -143,7 +143,7 @@ def traverse_find_sigs(filenames, yield_all_files=False):

def load_dbs_and_sigs(filenames, query, is_similarity_query, *, cache_size=None):
"""
Load one or more SBTs, LCAs, and/or signatures.
Load one or more SBTs, LCAs, and/or collections of signatures.
Check for compatibility with query.
@@ -251,7 +251,7 @@ def _load_sbt(filename, **kwargs):

try:
db = load_sbt_index(filename, cache_size=cache_size)
except FileNotFoundError as exc:
except (FileNotFoundError, TypeError) as exc:
raise ValueError(exc)

return db
@@ -263,6 +263,16 @@ def _load_revindex(filename, **kwargs):
    return db


def _load_zipfile(filename, **kwargs):
    "Load collection from a .zip file."
    db = None
    if filename.endswith('.zip'):
        traverse_yield_all = kwargs['traverse_yield_all']
        db = ZipFileLinearIndex.load(filename,
                                     traverse_yield_all=traverse_yield_all)
    return db


# all loader functions, in order.
_loader_functions = [
    ("load from stdin", _load_stdin),
@@ -271,6 +281,7 @@ def _load_revindex(filename, **kwargs):
    ("load from file list", _multiindex_load_from_pathlist),
    ("load SBT", _load_sbt),
    ("load revindex", _load_revindex),
    ("load collection from zipfile", _load_zipfile),
]
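
As an illustration only (not part of this commit), an ordered list of `(description, loader)` pairs like the one above is typically consumed by trying each loader in turn and keeping the first result. The body of `_load_database` is not shown in this diff, so the dispatch logic below is an assumption: the names `try_loaders` and `load_zip_stub`, the choice to treat `ValueError` as "try the next loader", and the `is not None` check are all hypothetical.

```python
def load_zip_stub(filename, **kwargs):
    """Hypothetical loader: handles only names ending in .zip, else None."""
    if filename.endswith(".zip"):
        return {"kind": "zip", "location": filename}
    return None


def try_loaders(filename, loaders, **kwargs):
    """Return (db, description) from the first loader that yields a result.

    Assumed convention for this sketch: a loader returns None or raises
    ValueError when it cannot handle the file.
    """
    for description, loader in loaders:
        try:
            db = loader(filename, **kwargs)
        except ValueError:
            db = None
        # test against None explicitly: an empty collection may be falsy
        if db is not None:
            return db, description
    raise ValueError(f"could not load {filename!r} with any loader")
```

A call such as `try_loaders("collection.zip", [("load collection from zipfile", load_zip_stub)])` returns the stub's dictionary together with the description of the loader that succeeded, which is convenient for debug output.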


@@ -328,8 +339,10 @@ def _load_database(filename, traverse_yield_all, *, cache_size=None):
def load_file_as_index(filename, yield_all_files=False):
"""Load 'filename' as a database; generic database loader.
If 'filename' contains an SBT or LCA indexed database, will return
the appropriate objects.
If 'filename' contains an SBT or LCA indexed database, or a regular
Zip file, will return the appropriate objects. If a Zip file and
yield_all_files=True, will try to load all files within zip, not just
.sig files.
If 'filename' is a JSON file containing one or more signatures, will
return an Index object containing those signatures.
@@ -346,8 +359,10 @@ def load_file_as_signatures(filename, select_moltype=None, ksize=None,
progress=None):
"""Load 'filename' as a collection of signatures. Return an iterable.
If 'filename' contains an SBT or LCA indexed database, will return
a signatures() generator.
If 'filename' contains an SBT or LCA indexed database, or a regular
Zip file, will return a signatures() generator. If a Zip file and
yield_all_files=True, will try to load all files within zip, not just
.sig files.
If 'filename' is a JSON file containing one or more signatures, will
return a list of those signatures.
Binary file added tests/test-data/prot/all.zip
Binary file not shown.
Binary file added tests/test-data/prot/dayhoff.zip
Binary file not shown.
1 change: 1 addition & 0 deletions tests/test-data/prot/dna-sig.noext

Large diffs are not rendered by default.

Binary file added tests/test-data/prot/dna-sig.sig.gz
Binary file not shown.
Binary file added tests/test-data/prot/hp.zip
Binary file not shown.
Binary file added tests/test-data/prot/protein.zip
Binary file not shown.
16 changes: 16 additions & 0 deletions tests/test_api.py
@@ -49,6 +49,22 @@ def test_load_index_3():
    assert len(sigs) == 2


def test_load_index_4():
    testfile = utils.get_test_data('prot/all.zip')
    idx = sourmash.load_file_as_index(testfile)

    sigs = list(idx.signatures())
    assert len(sigs) == 7


def test_load_index_4_b():
    testfile = utils.get_test_data('prot/protein.zip')
    idx = sourmash.load_file_as_index(testfile)

    sigs = list(idx.signatures())
    assert len(sigs) == 2


def test_load_fasta_as_signature():
    # try loading a fasta file - should fail with informative exception
    testfile = utils.get_test_data('short.fa')
21 changes: 21 additions & 0 deletions tests/test_cmd_signature.py
@@ -1421,6 +1421,27 @@ def test_sig_describe_1_dir(c):
        assert line.strip() in out


@utils.in_tempdir
def test_sig_describe_1_zipfile(c):
    # get basic info on multiple signatures in a zipfile
    sigs = utils.get_test_data('prot/all.zip')
    c.run_sourmash('sig', 'describe', sigs)

    out = c.last_result.out
    print(c.last_result)

    expected_output = """\
k=19 molecule=dayhoff num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=dayhoff num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=hp num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=hp num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=protein num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=protein num=0 scaled=100 seed=42 track_abundance=0
""".splitlines()
    for line in expected_output:
        assert line.strip() in out


@utils.in_thisdir
def test_sig_describe_stdin(c):
    sig = utils.get_test_data('prot/protein/GCA_001593925.1_ASM159392v1_protein.faa.gz.sig')