`sourmash sig extract` will create empty zip file from directsketch databases #3191
Comments
thanks! I'm slightly confused about the results above tho - is the problem that If you have a simple replication with a smaller set of sketches, that would be easier for us to debug. There are some example sketches in |
The initial problem is I trialed these same command on the This example was using a sourmash database created with https://github.com/sourmash-bio/sourmash_plugin_directsketch/. It may be a bug with the way extract runs with this new plugin. I think there are four options:
k thx! |
it is indeed
I found sourmash-bio/sourmash_plugin_directsketch#48 first, but that doesn't seem to cause problems. At least one real issue that does cause some problems is that, in the manifest, There are multiple fails in the sourmash code happening here, and I'll have to spend some time disentangling them. But at least for now one major issue is that directsketch is creating files that violate (hidden, untested) assumptions about manifest contents :). Thanks for finding and reporting this, @ccbaumler! |
Note that it appears that |
aaand looks like this change 2feb5f7 (currently over in #3193) fixes the uppercase-vs-lowercase issue, at least. specifically, adjusting the case in this line in sourmash/src/core/src/encodings.rs Line 61 in a133e68
ok, slowly getting a handle on this. A bit more of a summary of where we're at: FIRST, at least one big problem is that the Rust manifest code is creating sketches with lowercase SECOND, when you specify the THIRD, Perhaps the most urgent thing to do IMO is to fix the rust manifest generation code. But I think that will be aided by putting better checks in when loading manifests. I'll create the usual flurry of issues and PRs as I address things. Thanks again for reporting @ccbaumler! |
That makes perfect sense! Thanks for the thorough walkthrough. What are your thoughts on allowing lowercase AND uppercase DNA in the moltype field? OR allow the moltype field to be case insensitive? Would this allow a more flexible user experience and be worth the time to write? |
One interesting way to think about this is Postel's Law: be conservative in what you send, and liberal in what you accept.
This is usually interpreted with respect to standards documents - adhere closely to the standard in what you output, but take common failures on input as long as they can be rectified. With respect to this particular issue, where the "standard" is really "whatever we can get sourmash to do", it would correspond to "accept any string irrespective of case, but only emit the uppercase."
Realistically, though, the sourmash team is the only one writing software that produces sketches, and it would just be easier all around if non-canonical strings were not allowed. Since we control both the producer and consumer, I'm tempted to just go with that. I don't see any real advantage to supporting messy inputs right now, and I don't know who is asking for it or why they would ;)
IMO, the real failure is on the sourmash sketch loading side, where unknown moltypes aren't flagged - they should be caught by sketch loading functions and database selection/selectors, and reported to the user. This is related to the ksize issue you ran across, where non-int ksizes were happily accepted and then just didn't work:
On the other hand, there's some real interest in custom moltypes/hash functions, and I would expect us to expand both the tests and the freedom to Make Mistakes.
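The "accept any case, emit only the canonical spelling, and flag unknown values" policy weighed above could be sketched roughly as follows. This is an illustration only, not sourmash code: `normalize_moltype`, `CANONICAL_MOLTYPES`, and the `strict` flag are hypothetical names, and the list of valid moltypes is taken from the `Index.select` checks described elsewhere in this thread.

```python
# Hypothetical helper, not part of sourmash. It shows both options discussed:
# lenient input with canonical output (Postel's Law), or strict rejection of
# any non-canonical spelling.
CANONICAL_MOLTYPES = ("DNA", "protein", "hp", "dayhoff")
_BY_LOWER = {m.lower(): m for m in CANONICAL_MOLTYPES}


def normalize_moltype(value, strict=False):
    """Return the canonical moltype string, or raise ValueError."""
    if not isinstance(value, str):
        raise ValueError(f"moltype must be a string, got {type(value).__name__}")
    if strict:
        # Strict mode: only the canonical spelling is allowed.
        if value not in CANONICAL_MOLTYPES:
            raise ValueError(f"unknown moltype: {value!r}")
        return value
    # Lenient mode: ignore case on input, always emit the canonical form.
    try:
        return _BY_LOWER[value.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown moltype: {value!r}") from None
```

Either way, an unknown value is reported instead of silently matching nothing, which is the loading-side failure described above.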
if you need a script to fix all the directsketch outputs, that I can produce quite easily ;). |
(but if you don't mind it taking some time, it should be simple as |
trying to understand what's left - ok, there appear to be two different fun things still happening here 😅 .
sig cat and sig extract behave differently / sig extract is broken
first:
creates an empty zip file, which sourmash fails to load, e.g. with
so that's weird. Turns out it's because I'm not using the context manager properly in
sourmash borks on loading empty zip files
And, separately, sourmash is not happy with empty zip files (which is legit) and returns a confusing error (which is less so). This second issue is being punted to #3213.
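For the second problem, one way to turn the confusing error into a clear one is to check for an empty archive before trying to load anything from it. A minimal sketch using only the Python standard library; `check_zip_not_empty` is a hypothetical name and is not how sourmash actually loads zip files:

```python
import zipfile


def check_zip_not_empty(path):
    """Return the member names of the zip at `path`.

    Raises a descriptive ValueError if the archive has no members, so the
    user sees why loading failed.
    """
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
    if not names:
        raise ValueError(f"'{path}' is an empty zip file; there is nothing to load")
    return names
```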
This fixes `sourmash sig extract` to properly close the `sourmash_args.SaveSignaturesToLocation` upon error exit.
The proper thing to do is really to use the context manager tho 😭 . But I want to fix this problem first.
Fixes #3191
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
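The context-manager approach mentioned here can be illustrated with a small stand-in. The sketch below is not sourmash's `SaveSignaturesToLocation`; `save_to_zip` and its `add` callback are hypothetical, and removing the output when nothing was written is a design choice taken from the request in this issue, not from the actual fix.

```python
import os
import zipfile
from contextlib import contextmanager


@contextmanager
def save_to_zip(path):
    """Yield an `add(name, data)` function; always close the archive on exit.

    Hypothetical stand-in for a signature-saving class. If nothing was
    written, the output file is removed instead of being left behind empty.
    """
    zf = zipfile.ZipFile(path, "w")
    count = 0

    def add(name, data):
        nonlocal count
        zf.writestr(name, data)
        count += 1

    try:
        yield add
    finally:
        # Runs on normal exit and on error exit, so the file is never left open.
        zf.close()
        if count == 0:
            os.remove(path)
```

Used as `with save_to_zip("out.zip") as add: ...`, an extract that finds zero matches, or that exits early with an error before writing anything, leaves no `out.zip` on disk.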
…#3212)
This PR checks the types and values of parameters to `Index.select`. Specifically, it makes sure that:
* ksize, num, and scaled are ints
* containment is a bool
* abund is a bool
* moltype is 'DNA', 'protein', 'hp', or 'dayhoff'
* there are no other parameters other than 'picklist'
Also adds manifest column type enforcement in general, & explicit manifest content checking to `sig manifest` and `sig summarize`.
Related issues:
* Helps debug manifest issues per #3191 - invalid manifest rows will be revealed by `sig manifest` and `sig summarize`
* Fixes #3107
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
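The checks listed in this PR description can be pictured with a short standalone validator. This is a sketch under assumptions, not the code from the PR: `check_select_params` is a hypothetical function, and treating `None` as "parameter not specified" is an assumption made for the example.

```python
VALID_MOLTYPES = ("DNA", "protein", "hp", "dayhoff")
INT_PARAMS = ("ksize", "num", "scaled")
BOOL_PARAMS = ("containment", "abund")


def check_select_params(**kwargs):
    """Hypothetical validator mirroring the checks listed above.

    Returns the keyword arguments unchanged if they pass; raises ValueError
    otherwise. A value of None is treated as "not specified" (an assumption).
    """
    allowed = set(INT_PARAMS) | set(BOOL_PARAMS) | {"moltype", "picklist"}
    unknown = set(kwargs) - allowed
    if unknown:
        raise ValueError(f"unknown select parameters: {sorted(unknown)}")

    for name in INT_PARAMS:
        value = kwargs.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if value is not None and (isinstance(value, bool) or not isinstance(value, int)):
            raise ValueError(f"{name} must be an int, got {value!r}")

    for name in BOOL_PARAMS:
        value = kwargs.get(name)
        if value is not None and not isinstance(value, bool):
            raise ValueError(f"{name} must be a bool, got {value!r}")

    moltype = kwargs.get("moltype")
    if moltype is not None and moltype not in VALID_MOLTYPES:
        raise ValueError(f"moltype must be one of {VALID_MOLTYPES}, got {moltype!r}")

    return kwargs
```

With checks like these, a lowercase `'dna'` or a string ksize such as `'31'` fails loudly at selection time instead of silently selecting zero sketches.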
tl;dr sourmash_plugin_directsketch creates sketch databases that interact poorly with `extract`. I think extract should not create an empty zip file if no matches are found. Thanks for the help figuring this out @bluegenes
Here is why!
From `summarize`, I expected to extract 53 matches, but `extract` found 0.
And `extract` still creates the output file
which leads to this error report!
As opposed to running the same `sig cat` command but without an empty zip file with a matching name