Add an option in the zstd CLI to verify that a given .zst file matches an uncompressed file #3287

Closed
FRex opened this issue Oct 12, 2022 · 12 comments
@FRex
Contributor

FRex commented Oct 12, 2022

Is your feature request related to a problem? Please describe.
I'd like a switch or two in the CLI to verify that a given file matches the hash and/or content of a compressed .zst file, i.e. to check whether a given .zst file is the compressed version of a given uncompressed file.

Describe the solution you'd like
A new switch or two, that'd look and work as such:

$ zstd --verify somefile.txt.zst somefile.txt
somefile.txt.zst and somefile.txt have matching xxhash.

$ zstd --verify-data somefile.txt.zst somefile.txt
somefile.txt.zst and somefile.txt have matching data.

It'd first check the filesize, and then hash/data (no point doing the latter if filesize doesn't match).
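
A minimal sketch of that two-step check in Python, assuming the decompressed size is already known and the decompressed data is available as a binary stream. The names `matches` and `read_full` are hypothetical and not part of zstd; this illustrates the idea, not how the CLI would implement it.

```python
import io
import os

CHUNK = 1 << 16  # bytes read per comparison step


def read_full(stream, n):
    """Read up to n bytes, looping over short reads (pipes may return fewer)."""
    parts = []
    while n > 0:
        chunk = stream.read(n)
        if not chunk:
            break
        parts.append(chunk)
        n -= len(chunk)
    return b"".join(parts)


def matches(plain_path, expected_size, decompressed):
    """Return True if the file at plain_path equals the decompressed stream.

    expected_size is the decompressed size recorded for the .zst file, or
    None if unknown; decompressed is any binary file-like object.
    """
    # Cheap check first: a size mismatch settles the question without reading data.
    if expected_size is not None and os.path.getsize(plain_path) != expected_size:
        return False
    with open(plain_path, "rb") as plain:
        while True:
            a = read_full(plain, CHUNK)
            b = read_full(decompressed, CHUNK)
            if a != b:
                return False
            if not a:  # both streams ended at the same point
                return True
```

The stream could be, for example, the standard output of a decompression process started with `subprocess.Popen`; an `io.BytesIO` works for small tests.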

Describe alternatives you've considered
I've considered writing my own C or Python program to do this, but I think it would fit well in the zstd CLI and be useful in general. The zstd CLI already has all the needed functionality: file I/O, zstd frame parsing, xxhash, etc. Also, zstd -l displays the frame count, the sizes (human-readable, not down to the byte), and the fact that xxhash was used, but it does not show the 4 low bytes of the 64-bit xxhash, so I can't compare against xxhash output myself either.

Additional context
My use case is that I often work with big text files that I receive as .zst, and I sometimes modify them. When I need to free up some space I delete some of the unmodified files, but I wouldn't want to delete a modified one. This option would let me check whether a given file and its .zst counterpart are the 'same', and therefore whether it's safe to delete the uncompressed one.

Another use case could be someone paranoid who wants to verify the result of compression. Maybe it could also be part of some extra --rm option for very careful people (I don't know whether --rm currently verifies that the written file is correct).

@Cyan4973
Contributor

What about:

zstd -d -c FILE.zst | cmp FILE -

@FRex
Contributor Author

FRex commented Oct 17, 2022

If FILE's size has changed (very common when editing text), that does needless work (potentially a lot, if the change is deep into the file) instead of just comparing the size of FILE on disk with the size stored in FILE.zst.

Right now -v -l reports the original file size but not the hash (it only says Check: XXH64). If it printed the hash, that would enable writing a script that does everything I'm looking for and more.

@Cyan4973
Copy link
Contributor

OK, so you are looking for zstd -lv to report the actual value of the content hash, not just the fact that it exists.
This is likely achievable.

@FRex
Contributor Author

FRex commented Oct 17, 2022 via email

@FRex
Contributor Author

FRex commented Oct 20, 2022

@Cyan4973

I've written the code to add printing checksum for single frame files with -v -l here: dev...FRex:zstd:feature-zstd-cli-print-xxhash

If you find it acceptable, I can create a PR or you can just copy-paste it. If it's not acceptable, let me know what I should fix to make it meet the requirements.

I skimmed https://github.com/facebook/zstd/blob/dev/CONTRIBUTING.md and tried to run make test but it contains (unrelated) errors in ../lib/dictBuilder/divsufsort.c that also happen in current dev branch.

I don't have any idea for how to print this hash per frame (and I personally don't think I need it, even for multi-gig files zstd seems to produce single frame file?).

Here's an example of running it (notice 618814bc in the output):

$ xxhsum.exe /e/one.txt ; echo ;  ./programs/zstd.exe -v -l /e/one.txt.zst /e/two.txt.zst
695a3bd7618814bc  E:/one.txt

*** zstd command line interface 64-bits v1.5.3, by Yann Collet ***
E:/one.txt.zst
# Zstandard Frames: 1
DictID: 0
Window Size: 194 KiB (198246 B)
Compressed Size: 38 B (38 B)
Decompressed Size: 194 KiB (198246 B)
Ratio: 5217.0000
Check: XXH64 618814bc

E:/two.txt.zst
# Zstandard Frames: 2
DictID: 0
Window Size: 194 KiB (198246 B)
Compressed Size: 76 B (76 B)
Decompressed Size: 387 KiB (396492 B)
Ratio: 5217.0000
Check: XXH64

@Cyan4973
Contributor

Cyan4973 commented Dec 2, 2022

> I've written the code to add printing checksum for single frame files with -v -l here: dev...FRex:zstd:feature-zstd-cli-print-xxhash

I like it, this is a good PR

> I skimmed https://github.com/facebook/zstd/blob/dev/CONTRIBUTING.md and tried to run make test but it contains (unrelated) errors in ../lib/dictBuilder/divsufsort.c that also happen in current dev branch.

It's obviously unrelated to your work.
And it's surprising. divsufsort.c is a hosted 3rd party library.
We generally don't touch it, except for some minor edits to pass some stringent compilation warnings.
It's tested, as part of our CIs, so we would have expected to catch these issues before they reach your side.
If necessary, we could take a look, in order to fix the issues you experienced, but that's a separate effort.

> I don't have any idea for how to print this hash per frame (and I personally don't think I need it, even for multi-gig files zstd seems to produce single frame file?).

The normal scenario is one single frame, whatever the size of input.
Multi-frame files are a more advanced scenario. Typically, they arise when the content is produced in multiple sessions, or watermarks are added, or random access capabilities are added, etc.

It's fine if your PR only solves the "1-frame" scenario,
it's the more important one,
and the one that solves your issue.
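
For the multi-frame case, here is a rough sketch of how the stored checksum of each frame could be located, based on the frame layout described in the Zstandard format specification (RFC 8878). It only walks frame and block headers and does no decompression. The function name `frame_checksums` is hypothetical, and this is not how the zstd CLI or the patch does it.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528       # magic number of a regular Zstandard frame
SKIPPABLE_MAGIC = 0x184D2A50  # skippable frames use 0x184D2A50..0x184D2A5F


def frame_checksums(data):
    """Yield the stored 32-bit checksum (or None) of each Zstandard frame.

    data is the whole .zst file as bytes. Skippable frames are stepped over.
    Truncated block data raises ValueError.
    """
    pos = 0
    while pos < len(data):
        (magic,) = struct.unpack_from("<I", data, pos)
        pos += 4
        if magic & 0xFFFFFFF0 == SKIPPABLE_MAGIC:
            (size,) = struct.unpack_from("<I", data, pos)
            pos += 4 + size
            continue
        if magic != ZSTD_MAGIC:
            raise ValueError("not a Zstandard frame")
        descriptor = data[pos]
        pos += 1
        single_segment = (descriptor >> 5) & 1
        has_checksum = (descriptor >> 2) & 1
        fcs_flag = descriptor >> 6
        did_flag = descriptor & 3
        pos += 0 if single_segment else 1            # Window_Descriptor
        pos += (0, 1, 2, 4)[did_flag]                # Dictionary_ID
        pos += (single_segment, 2, 4, 8)[fcs_flag]   # Frame_Content_Size
        while True:                                  # walk the blocks
            if pos + 3 > len(data):
                raise ValueError("truncated frame")
            header = int.from_bytes(data[pos:pos + 3], "little")
            pos += 3
            last = header & 1
            block_type = (header >> 1) & 3
            block_size = header >> 3
            if block_type == 3:
                raise ValueError("reserved block type")
            # An RLE block stores a single byte regardless of Block_Size.
            pos += 1 if block_type == 1 else block_size
            if last:
                break
        if has_checksum:
            (checksum,) = struct.unpack_from("<I", data, pos)
            pos += 4
            yield checksum
        else:
            yield None
```

Each yielded value can be printed with `format(value, "08x")` to get the same 8-hex-digit form as the single-frame output.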

@FRex
Contributor Author

FRex commented Dec 6, 2022

I'm happy to hear the PR is good. Would you like to merge it? Do I need to reassign (c) to you or Facebook for this purpose?

I noticed concatenating two zst files with cat creates a multi-frame zst file that uncompresses to original two files, concatenated. I guess it can be useful for concatenating files without recompression in between.

@Cyan4973
Contributor

Cyan4973 commented Dec 6, 2022

> I noticed concatenating two zst files with cat creates a multi-frame zst file that uncompresses to original two files, concatenated. I guess it can be useful for concatenating files without recompression in between.

Yes, a classic scenario would be an append-only database, like a log system,
where new content is added every day or every hour.
It's generally easier to simply append a new frame to the same file.

@Cyan4973
Contributor

Cyan4973 commented Dec 6, 2022

> I'm happy to hear the PR is good. Would you like to merge it? Do I need to reassign (c) to you or Facebook for this purpose?

Generally, the authors of patches push the PR themselves.

In this case though, I created #3332,
which tracks your patch from your fork.

@FRex
Contributor Author

FRex commented Dec 6, 2022

Thank you. Is there anything else I have to do?

@Cyan4973
Contributor

Cyan4973 commented Dec 6, 2022

A number of CI tests have been failing on #3332,
but they don't seem related to the patch itself.
Just give us some time to sort that out.

@Cyan4973 Cyan4973 self-assigned this Dec 20, 2022
@Cyan4973
Contributor

Patch merged
