Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sourmash sig extract will create empty zip file from directsketch databases #3191

Closed
ccbaumler opened this issue Jun 6, 2024 · 12 comments · Fixed by #3214
Closed

sourmash sig extract will create empty zip file from directsketch databases #3191

ccbaumler opened this issue Jun 6, 2024 · 12 comments · Fixed by #3214

Comments

@ccbaumler
Copy link
Contributor

ccbaumler commented Jun 6, 2024

tl;dr sourmash_plugin_directsketch creates sketch databases that interact poorly with extract.

I think extract should not create a empty zip file if no matches are found. Thanks for the help figuring this out @bluegenes

Here is why!

$ sourmash sig summarize genbank-20240604-vertebrate_other.zip

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'genbank-20240604-vertebrate_other.zip'
path filetype: ZipFileLinearIndex
location: /home/baumlerc/genbank-database/workflow/genbank-20240604-vertebrate_other.zip
is database? yes
has manifest? yes
num signatures: 159
** examining manifest...
total hashes: 141896441
summary of sketches:
   53 sketches with dna, k=31, scaled=1000, abund     47415819 total hashes
   53 sketches with dna, k=51, scaled=1000, abund     50340570 total hashes
   53 sketches with dna, k=21, scaled=1000, abund     44140052 total hashes

From summarize, I expected to extract 53 matches, but extract found 0.

$ sourmash sig extract genbank-20240604-vertebrate_other.zip -k 21 --dna -o genbank-20240604-vertebrate_other-k21.zip

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 0 total that matched ksize & molecule type
no matching signatures to save!

And extract still creates the output file

unzip -v genbank-20240604-vertebrate_other-k21.zip
Archive:  genbank-20240604-vertebrate_other-k21.zip
warning [genbank-20240604-vertebrate_other-k21.zip]:  zipfile is empty
which leads to this error report!
sourmash sig cat genbank-20240604-vertebrate_other.zip -k 21 -o genbank-20240604-vertebrate_other-k21.zip

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Traceback (most recent call last):
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 309, in _load_from_zf
    return zf.read(path)
           ^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1550, in read
    with self.open(name, "r", pwd) as fp:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1587, in open
    zinfo = self.getinfo(name)
            ^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1516, in getinfo
    raise KeyError(
KeyError: "There is no item named 'SOURMASH-MANIFEST.csv' in the archive"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 316, in load
    return self._load_from_zf(self.zipfile, path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 312, in _load_from_zf
    return zf.read(path)
           ^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1550, in read
    with self.open(name, "r", pwd) as fp:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1587, in open
    zinfo = self.getinfo(name)
            ^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1516, in getinfo
    raise KeyError(
KeyError: "There is no item named 'signatures/SOURMASH-MANIFEST.csv' in the archive"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 309, in _load_from_zf
    return zf.read(path)
           ^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1550, in read
    with self.open(name, "r", pwd) as fp:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1587, in open
    zinfo = self.getinfo(name)
            ^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1516, in getinfo
    raise KeyError(
KeyError: "There is no item named 'SOURMASH-MANIFEST.csv' in the archive"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/save_load.py", line 483, in open
    manifest_data = storage.load('SOURMASH-MANIFEST.csv')
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 154, in load
    return self.__inner.load(path)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 319, in load
    return self._load_from_zf(self.bufferzip, path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 312, in _load_from_zf
    return zf.read(path)
           ^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1550, in read
    with self.open(name, "r", pwd) as fp:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1587, in open
    zinfo = self.getinfo(name)
            ^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/zipfile/__init__.py", line 1516, in getinfo
    raise KeyError(
KeyError: "There is no item named 'signatures/SOURMASH-MANIFEST.csv' in the archive"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/baumlerc/miniforge3/envs/sourmash/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/__main__.py", line 19, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/cli/sig/cat.py", line 58, in main
    return sourmash.sig.__main__.cat(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sig/__main__.py", line 116, in cat
    save_sigs.open()
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/save_load.py", line 487, in open
    raise ValueError(f"Cannot add to existing zipfile '{self.location}' without a manifest")
ValueError: Cannot add to existing zipfile 'genbank-20240604-vertebrate_other-k21.zip' without a manifest

As opposed to running the same sig cat command but without an empty zip file with a matching name

$ sourmash sig cat genbank-20240604-vertebrate_other.zip -k 31 -o genbank-20240604-vertebrate_other-k31.zip

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 53 signatures total, from 1 files
loaded 53 signatures total.
output 53 signatures
@ctb
Copy link
Contributor

ctb commented Jun 6, 2024

thanks!

I'm slightly confused about the results above tho - is the problem that extract isn't finding any sketches, or that extract is creating a .sig file with no matches, or both?

If you have a simple replication with a smaller set of sketches, that would be easier for us to debug. There are some example sketches in tests/test-data in the sourmash repo that you could use :)

@ccbaumler
Copy link
Contributor Author

The initial problem is extract does not find any of the expected sketches. extract does reports that it founding 0 matches, but it creates an empty zip file without a signatures/ or SOURMASH-MANIFEST.csv. The following rust error is due to that empty zip file existing and sourmash not knowing what to do without finding a SOURMASH-MANIFEST.csv in the zip archive.

I trialed these same command on the test/test-data and they all worked.

This example was using a sourmash database created with https://github.com/sourmash-bio/sourmash_plugin_directsketch/. It may be a bug with the way extract runs with this new plugin. sourmash cat works perfectly fine with directsketch databases as far as I can tell.

I think there are four options:

  1. If extract does not find any matches, it does not create output
  2. If extract does not find any matches, it creates an output with an empty signatures/ and SOURMASH-MANIFEST.csv
  3. Run sig cat code under sig extract name
  4. Figure out why directsketch is interacting oddly with sig extract

@ctb
Copy link
Contributor

ctb commented Jun 6, 2024

k thx!

@ctb
Copy link
Contributor

ctb commented Jun 7, 2024

it is indeed directsketch specific - if I run the quickstart example, I get:

sourmash sig extract -k 10 --dna test-gbsketch.zip -o /dev/null

== This is sourmash version 4.8.9.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 0 total that matched ksize & molecule type
no matching signatures to save!

I found sourmash-bio/sourmash_plugin_directsketch#48 first, but that doesn't seem to cause problems. At least one real issue that does cause some problems is that, in the manifest, moltype is dna and not DNA, ref sourmash-bio/sourmash_plugin_directsketch#49.

There are multiple fails in the sourmash code happening here, and I'll have to spend some time disentangling them. But at least for now one major issue is that directsketch is creating files that violate (hidden, untested) assumptions about manifest contents :).

Thanks for finding and reporting this, @ccbaumler!

@ccbaumler ccbaumler changed the title sourmash sig extract will create empty zip file sourmash sig extract will create empty zip file from [directsketch](https://github.com/sourmash-bio/sourmash_plugin_directsketch) databases Jun 7, 2024
@ctb
Copy link
Contributor

ctb commented Jun 8, 2024

Note that it appears that manysketch from the branchwater plugin does the same thing as directsketch - dna vs DNA in manifests.

@ctb
Copy link
Contributor

ctb commented Jun 8, 2024

aaand looks like this change 2feb5f7 (currently over in #3193) fixes the uppercase-vs-lowercase issue, at least.

specifically, adjusting the case in this line in encodings.rs:

HashFunctions::Murmur64Dna => "dna",

@ctb ctb changed the title sourmash sig extract will create empty zip file from [directsketch](https://github.com/sourmash-bio/sourmash_plugin_directsketch) databases sourmash sig extract will create empty zip file from directsketch databases Jun 8, 2024
@ctb
Copy link
Contributor

ctb commented Jun 8, 2024

ok, slowly getting a handle on this. A bit more of a summary of where we're at:

FIRST, at least one big problem is that the Rust manifest code is creating sketches with lowercase dna instead of uppercase DNA in the moltype field. This is unexpected, and, moreover, limited moltypes are not checked anywhere in the code. Interestingly you can see it in the output of sourmash sig summarize even (lowercase vs uppercase)! Because it's in sourmash-rs core, both directsketch and manysketch generate manifests with this feature.

SECOND, when you specify the --dna selector sourmash sig cat --dna or sourmash sig extract --dnaon sketches with moltype dna (not DNA), nothing is selected. This is technically correct! But the behavior is not specific to sig extract - any script that uses the standard sourmash CLI selectors of --dna will fall afoul of this.

THIRD, sourmash sig extract and sourmash sig cat both are more than happy to produce empty zip files in this case. This suggests that the sourmash signature saving code to zips is borked when 0 signatures are saved.

Perhaps the most urgent thing to do IMO is to fix the rust manifest generation code. But I think that will be aided by putting better checks in when loading manifests.

I'll create the usual flurry of issues and PRs as I address things. Thanks again for reporting @ccbaumler!

@ccbaumler
Copy link
Contributor Author

That makes perfect sense! Thanks for the thorough walkthrough.

What are your thoughts on allowing lowercase AND uppercase DNA in the moltype field? OR allow the moltype field to be case insensitive?

Would this allow a more flexible user experience and be worth the time to write?

@ctb
Copy link
Contributor

ctb commented Jun 8, 2024

That makes perfect sense! Thanks for the thorough walkthrough.

What are your thoughts on allowing lowercase AND uppercase DNA in the moltype field? OR allow the moltype field to be case insensitive?

Would this allow a more flexible user experience and be worth the time to write?

One interesting way to think about this is Postel's Law -

Be liberal in what you accept, and conservative in what you send.

This is usually interpreted with respect to standards documents - adhere closely to the standard in what you output, but take common failures on input as long as they can be rectified.

With respect to this particular issue, where the "standard" is really "whatever we can get sourmash to do", it would correspond to "accept any string irrespective of case, but only emit the uppercase."

Realistically, though, the sourmash team is the only one writing software that produces sketches, and it would just be easier all around if non-canonical strings were not allowed. Since we control both the producer and consumer, I'm tempted to just go with that. I don't see any real advantage to supporting messy inputs right now, and I don't know who is asking for it or why they would ;)

IMO, the real failure is on the sourmash sketch loading side, where unknown moltypes aren't flagged - they should be caught by sketch loading functions and database selection/selectors, and reported to the user. This is related to the ksize issue you ran across, where non-int ksizes were happily accepted and then just didn't work:

On the other hand, there's some real interest in custom moltypes/hash functions, and I would expect us to expand both the tests and the freedom to Make Mistakes.

@ctb
Copy link
Contributor

ctb commented Jun 8, 2024

if you need a script to fix all the directsketch outputs, that I can produce quite easily ;).

@ctb
Copy link
Contributor

ctb commented Jun 8, 2024

(but if you don't mind it taking some time, it should be simple as sig cat <broken>.zip -o <fixed>.zip.)

@ctb
Copy link
Contributor

ctb commented Jun 16, 2024

trying to understand what's left -

ok, there appear to be two different fun things still happening here 😅 .

sig cat and sig extract behave differently / sig extract is broken

first:

sourmash sig extract --dna tests/test-data/prot/protein.zip -o xxx2.zip  

== This is sourmash version 4.8.10.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 0 total that matched ksize & molecule type
no matching signatures to save!

creates an empty zip file, which sourmash fails to load, e.g. with sig summarize we get:

sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called `Result::unwrap()` on an `Err` value: InvalidArchive("Too small for anything but End Of Central Directory Record")' at src/core/src/storage.rs:367

so that's weird. Turns out it's because I'm not using the context manager properly in sig extract, sigh - sixed by #3214.

sourmash borks on loading empty zip files

And, separately, sourmash is not happy with empty zip files (which is legit) and returns a confusing error (which is less so).

This second issue is being punted to #3213.

ctb added a commit that referenced this issue Jun 17, 2024
This fixes `sourmash sig extract` to properly close the
`sourmash_args.SaveSignaturesToLocation` upon error exit.

The proper thing to do is really to use the context manager tho 😭 . But
I want to fix this problem first.

Fixes #3191

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@ctb ctb closed this as completed in #3214 Jun 17, 2024
ctb added a commit that referenced this issue Jun 26, 2024
…#3212)

This PR checks the types and values of parameters to `Index.select`.
Specifically, it makes sure that:
* ksize, num, and scaled are ints
* containment is a bool
* abund is a bool
* moltype is 'DNA', 'protein', 'hp', or 'dayhoff'
* there are no other parameters other than 'picklist'

Also adds manifest column type enforcement in general, & explicit
manifest content checking to `sig manifest` and `sig summarize`.

Related issues:
* Helps debug manifest issues per
#3191 - invalid manifest
rows will be revealed by `sig manifest` and `sig summarize`
* Fixes #3107

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants