CSV output for sourmash search needs upgrading #1390
Comments
Oh, and also, we're inconsistent with md5sum output per #1346 (review)
we could/should also consider including the metric used - jaccard, containment, max containment. and/or just, like, calculate all of those.
(sigh, for scaled sketches; more than jaccard not possible with regular MinHash)
CSVs are hard (impossible?) to version, but we should have some way of doing that too. Or do we just keep ever growing the CSV and never removing columns? 🙃
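For concreteness, here is a minimal sketch of what "calculate all of those" could look like for two scaled sketches, treating each one as a plain Python set of hash values. The function name `compare_hashes` and the dictionary keys are hypothetical and are not part of the sourmash API; this only illustrates the arithmetic.

```python
from typing import Dict, Set


def compare_hashes(query: Set[int], subject: Set[int]) -> Dict[str, float]:
    """Compute Jaccard, containment, and max containment from two hash sets.

    Hypothetical helper for illustration; not a sourmash function.
    """
    if not query or not subject:
        raise ValueError("both hash sets must be non-empty")

    shared = len(query & subject)
    containment_query = shared / len(query)      # fraction of query found in subject
    containment_subject = shared / len(subject)  # fraction of subject found in query

    return {
        "jaccard": shared / len(query | subject),
        "containment": containment_query,
        "max_containment": max(containment_query, containment_subject),
    }


# Toy hash sets standing in for two scaled sketches.
row = compare_hashes({1, 2, 3, 4}, {3, 4})
```

Emitting all three values per row would mean the column names no longer depend on which command-line flag was used.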
thoughts on approach in #1555? Basically, I think it's OK to pin column names to sourmash versions, with appropriate deprecation approaches and command-line upgrade flags. That fits with their use in workflows. In manifests, we are using:
but I'm pretty confident that this breaks pandas/Python header detection, sigh. IMO it was OK to do this for manifests because these are not intended to be end-user-consumable. #416 has the idea of building standard pandas/CSV loading functions for sourmash output, which is something I'm trying out over in genome-grist for gather output - dib-lab/genome-grist#176. But I'd be loath to break all CSV readers everywhere :(. I guess... we could include a "version for this CSV format" in the first column in the first row, and leave that column blank, or something? or do the same but for the last column in the first row (so, less visible, but leaving it blank is less annoying for manual inspection of the CSV). This would make it a header but that's ok.
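To make the trade-off concrete, here is a rough sketch of the "standard loading function" idea using only the Python standard library: a version line is written ahead of the normal CSV header, and a matching reader strips it before parsing. The marker string `# example-csv-version: ` and both function names are made up for this illustration and are not a sourmash format or API.

```python
import csv
import io

# Hypothetical marker for illustration only; not an actual sourmash format.
VERSION_PREFIX = "# example-csv-version: "


def write_versioned_csv(fp, version, fieldnames, rows):
    """Write a version line, then an ordinary CSV header and rows."""
    fp.write(f"{VERSION_PREFIX}{version}\n")
    writer = csv.DictWriter(fp, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def read_versioned_csv(fp):
    """Read the version line, then parse the rest as a normal CSV."""
    first = fp.readline()
    if not first.startswith(VERSION_PREFIX):
        raise ValueError("missing version line")
    version = first[len(VERSION_PREFIX):].strip()
    return version, list(csv.DictReader(fp))


buf = io.StringIO()
write_versioned_csv(
    buf, "1.0", ["similarity", "name"], [{"similarity": 0.5, "name": "genomeA"}]
)
buf.seek(0)
version, rows = read_versioned_csv(buf)
```

A reader that does not know about the extra line would treat it as the header, which is the breakage described above. With pandas, passing `comment="#"` to `read_csv` should skip a line like this, but every downstream user would have to know to do that.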
(Some of this might be 5.0 material, because they change the file format in backwards-incompatible ways)
A few issues --
MinHash class.
#1346 (review), @bluegenes notes that CSV output contains the header 'similarity' and sez "It would be nice to modify similarity to containment / max_containment for csv output" when --max-containment or --containment are specified.
related to #1247, #410, and #448.
It's not really clear what to do here. The addition of prefetch #1370 might provide a useful alternative here, and/or we could provide JSON output that has more ...flexibility per #448.
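As a starting point for discussion, here is a small sketch of the two ideas combined: naming the score column after the metric actually used, and emitting a JSON record that carries the metric name explicitly. The function names, the `metric` key, and the assumption that the two flags are mutually exclusive are all hypothetical; none of this reflects existing sourmash behavior.

```python
import json


def score_column(containment=False, max_containment=False):
    """Pick the score column name from the (assumed mutually exclusive) flags."""
    if containment and max_containment:
        raise ValueError("choose at most one of containment / max_containment")
    if max_containment:
        return "max_containment"
    if containment:
        return "containment"
    return "similarity"  # current header name, per the discussion above


def result_record(score, name, **flags):
    """Build one result record that states which metric the score represents."""
    column = score_column(**flags)
    return {"metric": column, column: score, "name": name}


# One JSON line per match; sort_keys keeps the key order stable across runs.
line = json.dumps(result_record(0.75, "genomeA", max_containment=True), sort_keys=True)
```

Because JSON consumers look up keys by name, new fields could be added later without shifting column positions, which is roughly the flexibility #448 is after.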