Add an option in the zstd CLI to verify that a given .zst file matches an uncompressed file #3287

FRex · 2022-10-12T18:08:17Z

Is your feature request related to a problem? Please describe.
I'd like a switch or two in the CLI to check whether a given .zst file is the compressed version of a given uncompressed file, by comparing the stored hash and/or the content.

Describe the solution you'd like
A new switch or two, that'd look and work as such:

$ zstd --verify somefile.txt.zst somefile.txt
somefile.txt.zst and somefile.txt have matching xxhash.

$ zstd --verify-data somefile.txt.zst somefile.txt
somefile.txt.zst and somefile.txt have matching data.

It'd first check the filesize, and then hash/data (no point doing the latter if filesize doesn't match).

Describe alternatives you've considered
I've considered writing my own C or Python program to do this, but I think it would fit as part of the zstd CLI and be useful in general. The zstd CLI also already has all the functionality: file IO, parsing zstd frames, xxhash, etc. Also, zstd -l does display the frame count, the sizes (human readable, not down to bytes), and that xxhash was used, but it does not tell me the 4 low bytes of the 64-bit xxhash, so I can't use that with xxhash myself either.

Additional context
My use case is that I often work with big text files that I get as .zst, and sometimes I modify them. When I need to free up some space I go delete some of the unmodified files, but wouldn't want to delete a modified one. This option would let me check if given file and the same file + .zst are 'same', and if it's safe to delete the uncompressed one or not.

Another use case could be someone paranoid who wants to verify the compressed output. Maybe it could also be part of some extra --rm option for very careful people (I don't know whether --rm currently verifies that the written file is correct).

Cyan4973 · 2022-10-17T21:15:43Z

What about:

zstd -d -c FILE.zst | cmp FILE -

FRex · 2022-10-17T21:43:20Z

If FILE's size has changed (very common when editing text), that does needless work (potentially a lot, if the change is deep into the file) instead of just comparing the size of FILE on disk with the size stored in FILE.zst.

Right now -v -l reports the original file size but not the hash (it only says Check: XXH64). If it printed the hash, that would enable writing a script that does everything I described and more.
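
For reference, the size-first check described above could be scripted roughly as follows. This is only a sketch, not part of zstd: listed_size and matches_zst are hypothetical helper names, and the parsing assumes that zstd -lv prints a "Decompressed Size: ... (N B)" line on standard output, in the layout shown later in this thread. When no size is listed, it falls back to the full comparison.

```shell
# Extract the exact decompressed byte count from `zstd -lv` output on stdin.
# Assumes the "Decompressed Size: <human> (<N> B)" layout; prints nothing
# when no such line is present.
listed_size() {
    sed -n '/^Decompressed Size:/{s/^.*(\([0-9][0-9]*\) B).*$/\1/p;q;}'
}

# Hypothetical helper: succeed only if FILE has the same content as FILE.zst.
# Sizes are compared first; decompression happens only when they agree
# (or when the listing reports no size).
matches_zst() {
    file=$1
    zst=${2:-$1.zst}
    want=$(zstd -lv -- "$zst" 2>/dev/null | listed_size)
    have=$(wc -c < "$file" | tr -d ' ')
    if [ -n "$want" ] && [ "$want" != "$have" ]; then
        return 1    # sizes differ: no need to decompress
    fi
    zstd -d -c -- "$zst" | cmp -s "$file" -
}
```

Usage would be something like: matches_zst FILE FILE.zst && echo "same content"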

Cyan4973 · 2022-10-17T21:50:31Z

OK, so you are looking for zstd -lv to report the actual value of the content hash, not just the fact that it exists.
This is likely achievable.

FRex · 2022-10-17T21:56:05Z

That'd enable scripts to use the zstd CLI to do what I mentioned. The only problem I can imagine is a multi-frame file with a hash per frame.

FRex · 2022-10-20T19:59:51Z

@Cyan4973

I've written the code to add printing checksum for single frame files with -v -l here: dev...FRex:zstd:feature-zstd-cli-print-xxhash

If you find it acceptable, I can create a PR or you can just copy-paste it. If it's not acceptable, let me know what I can fix to make it fit the requirements.

I skimmed https://github.com/facebook/zstd/blob/dev/CONTRIBUTING.md and tried to run make test but it contains (unrelated) errors in ../lib/dictBuilder/divsufsort.c that also happen in current dev branch.

I don't have any idea for how to print this hash per frame (and I personally don't think I need it, even for multi-gig files zstd seems to produce single frame file?).

Here's an example of running it (notice 618814bc in the output):

$ xxhsum.exe /e/one.txt ; echo ;  ./programs/zstd.exe -v -l /e/one.txt.zst /e/two.txt.zst
695a3bd7618814bc  E:/one.txt

*** zstd command line interface 64-bits v1.5.3, by Yann Collet ***
E:/one.txt.zst
# Zstandard Frames: 1
DictID: 0
Window Size: 194 KiB (198246 B)
Compressed Size: 38 B (38 B)
Decompressed Size: 194 KiB (198246 B)
Ratio: 5217.0000
Check: XXH64 618814bc

E:/two.txt.zst
# Zstandard Frames: 2
DictID: 0
Window Size: 194 KiB (198246 B)
Compressed Size: 76 B (76 B)
Decompressed Size: 387 KiB (396492 B)
Ratio: 5217.0000
Check: XXH64
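
With the checksum printed as in the run above, the comparison could be scripted along these lines. Again a sketch under assumptions: listed_check, xxh64_low32 and check_matches are hypothetical names, xxhsum is assumed to print a 64-bit hash by default as it does above, and the "Check: XXH64 <hex>" layout is the one produced by this patch. A multi-frame file prints no value, so the check reports failure rather than guessing.

```shell
# Print the 8-hex-digit checksum from `zstd -lv` output on stdin, or nothing
# when the Check line carries no value (as for the multi-frame file above).
listed_check() {
    awk '$1 == "Check:" && $2 == "XXH64" && NF >= 3 { print $3; exit }'
}

# Print the low 32 bits (last 8 hex digits) of the hash on the first
# xxhsum output line read from stdin.
xxh64_low32() {
    awk 'NR == 1 { print substr($1, length($1) - 7) }'
}

# Hypothetical helper: succeed only if the checksum stored in FILE.zst
# equals the low 32 bits of FILE's XXH64.
check_matches() {
    file=$1
    zst=${2:-$1.zst}
    want=$(zstd -lv -- "$zst" 2>/dev/null | listed_check)
    have=$(xxhsum "$file" | xxh64_low32)
    [ -n "$want" ] && [ "$want" = "$have" ]
}
```

Usage would be something like: check_matches one.txt one.txt.zst && echo "checksums match"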

Cyan4973 · 2022-12-02T23:43:08Z

> I've written the code to add printing checksum for single frame files with -v -l here: dev...FRex:zstd:feature-zstd-cli-print-xxhash

I like it, this is a good PR

> I skimmed https://github.com/facebook/zstd/blob/dev/CONTRIBUTING.md and tried to run make test but it contains (unrelated) errors in ../lib/dictBuilder/divsufsort.c that also happen in current dev branch.

It's obviously unrelated to your work.
And it's surprising. divsufsort.c is a hosted 3rd party library.
We generally don't touch it, except for some minor edits to pass some stringent compilation warnings.
It's tested, as part of our CIs, so we would have expected to catch these issues before they reach your side.
If necessary, we could take a look, in order to fix the issues you experienced, but that's a separate effort.

> I don't have any idea for how to print this hash per frame (and I personally don't think I need it, even for multi-gig files zstd seems to produce single frame file?).

The normal scenario is one single frame, whatever the size of input.
Multi-frame files are a more advanced scenario. Typically, they appear when the content is produced in multiple sessions, or watermarks are added, or random access capabilities are added, etc.

It's fine if your PR only solves the "1-frame" scenario,
it's the more important one,
and the one that solves your issue.

FRex · 2022-12-06T16:56:26Z

I'm happy to hear the PR is good. Would you like to merge it? Do I need to reassign (c) to you or Facebook for this purpose?

I noticed concatenating two zst files with cat creates a multi-frame zst file that uncompresses to original two files, concatenated. I guess it can be useful for concatenating files without recompression in between.

Cyan4973 · 2022-12-06T17:30:09Z

> I noticed concatenating two zst files with cat creates a multi-frame zst file that uncompresses to original two files, concatenated. I guess it can be useful for concatenating files without recompression in between.

Yes,
a classical scenario would be an append-only database, like a log system.
In which case, new content is added every day or every hour.
It's generally easier to simply append a new frame into the same file.
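
As an illustration of that append-only pattern, a sketch along these lines should work. append_frame is a hypothetical helper name, not a zstd feature; it relies only on zstd -c writing a complete frame to standard output and on the decoder accepting concatenated frames, as noted above.

```shell
# Hypothetical helper: compress INPUT as one new frame and append it to LOG.
# Frames already in LOG are left untouched, so nothing is recompressed.
append_frame() {
    input=$1
    log=$2
    zstd -q -c -- "$input" >> "$log"
}
```

After append_frame day1.txt log.zst and append_frame day2.txt log.zst, running zstd -d -c log.zst should print the two inputs back to back, and zstd -l log.zst should report two frames.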

Cyan4973 · 2022-12-06T18:41:03Z

> I'm happy to hear the PR is good. Would you like to merge it? Do I need to reassign (c) to you or Facebook for this purpose?

Generally, the authors of the patches push the PR themselves.

For this case though, I created #3332,
which tracks your patch from your fork.

FRex · 2022-12-06T20:15:31Z

Thank you. Is there anything else I have to do?

Cyan4973 · 2022-12-06T20:27:59Z

A number of CI tests have been failing on #3332,
but they don't seem related to the patch itself,
just give us some time to sort that out.

Cyan4973 · 2022-12-20T01:35:07Z

Patch merged

Cyan4973 added the feature request label Oct 17, 2022

Cyan4973 self-assigned this Dec 20, 2022

Cyan4973 closed this as completed Dec 20, 2022
