
Compression ratio regression in dictionary + streaming API mode (src size unknown) #2442

Closed
phiresky opened this issue Dec 23, 2020 · 6 comments · Fixed by #2451

phiresky commented Dec 23, 2020

When using the streaming compression API with a dictionary, there were two regressions between 1.4.5 and 1.4.7 that make dictionary compression unusable, at least for my use case:

A 120kB file compressed with a 20kB dictionary gives the following sizes (all at level 19):

  • 1.4.5 without a dictionary: 29517 bytes
  • 1.4.5 with a dictionary: 24177 bytes
  • 1.4.8 with a dictionary when using ZSTD_compress_usingCDict api: 23866 bytes
  • 1.4.8 with a dictionary when using streaming api: 76455 bytes (!!)

In total, in my use case (a sqlite extension for transparent row-level compression, https://github.com/phiresky/sqlite-zstd), this causes a dataset of 2GB of rows of around 30kB each, compressed individually, to only compress down to 1GB instead of to ~100MB as it did before.

I did a bisect on the code base and then investigated for a while; the reason for the regression is these two commits:

After 48ef15f, the result goes up from 23000 bytes to 28904 bytes; then, after d5c688e (both by @terrelln), the size goes up to 76455 bytes.

The reason is that when the streaming API is used, the srcSize is set to ZSTD_CONTENTSIZE_UNKNOWN. Then, in ZSTD_adjustCParams_internal, the srcSize is set to 513 bytes and the dictSize is set to 0 (because the mode is ZSTD_cpm_attachDict):

    if (dictSize && srcSize == ZSTD_CONTENTSIZE_UNKNOWN)
        srcSize = minSrcSize;
    switch (mode) {
    case ZSTD_cpm_noAttachDict:
    case ZSTD_cpm_unknown:
    case ZSTD_cpm_createCDict:
        break;
    case ZSTD_cpm_attachDict:
        dictSize = 0;
        break;

Setting these values then causes the windowSize to go down to 1024, which means that the 120kB file is compressed in individual 1024 byte segments.

Removing the dictSize = 0 assignment above in the current dev branch causes the windowSize to be 20kB (exactly fitting the dictionary), which reduces the compressed size to 29kB again, but to get it down to 23819 bytes something like this is needed:

diff --git a/lib/compress/zstd_compress.c b/lib/compress/zstd_compress.c
index 386b051d..c84c0b25 100644
--- a/lib/compress/zstd_compress.c
+++ b/lib/compress/zstd_compress.c
@@ -1189,7 +1189,7 @@ ZSTD_adjustCParams_internal(ZSTD_compressionParameters cPar,
     assert(ZSTD_checkCParams(cPar)==0);
 
     if (dictSize && srcSize == ZSTD_CONTENTSIZE_UNKNOWN)
-        srcSize = minSrcSize;
+        srcSize = 100 * dictSize;
 
     switch (mode) {
     case ZSTD_cpm_noAttachDict:
@@ -1197,7 +1197,7 @@ ZSTD_adjustCParams_internal(ZSTD_compressionParameters cPar,
     case ZSTD_cpm_createCDict:
         break;
     case ZSTD_cpm_attachDict:
-        dictSize = 0;
+        // dictSize = 0;
         break;
     default:
         assert(0);

Though really, I don't see why the size of the dictionary should influence the window size at all, and I also don't see why, when a dictionary is present, the source size is assumed to be 513 bytes.

I originally reported this at gyscos/zstd-rs#100, but it seems to be unrelated to the Rust wrapper.

The CLI does not have this problem since it uses known source sizes (I guess).


Cyan4973 commented Dec 23, 2020

Thanks for the feedback @phiresky .

We have automated tests which pay attention to compression ratio evolution at each update, but it seems this specific scenario slipped through. Since we missed this regression, the scenario is probably not present in the test suite. To be fixed.

@Cyan4973 Cyan4973 changed the title Compression size regression in streaming API when src size is unknown Compression ratio regression in dictionary + streaming API mode (src size unknown) Dec 23, 2020
phiresky (Author) commented:

Here's a test file and dictionary where the problem happens:
test.tar.gz

But I think it should apply to pretty much any source file larger than 513 bytes when using the streaming API with a dictionary, with worse results the larger the input file is.

@terrelln terrelln self-assigned this Jan 4, 2021

terrelln commented Jan 4, 2021

@phiresky I am looking into a fix which will be present in the next zstd release.

In the meantime, you can work around this issue using the ZSTD_c_srcSizeHint flag. When set, and when zstd doesn't know the source size in advance (in streaming mode for example), zstd will assume the source size is the source size hint when selecting parameters.

ZSTD_CCtx_setParameter(cctx, ZSTD_c_srcSizeHint, 120 * 1024);

See the documentation in zstd/lib/zstd.h (lines 1674 to 1678 at bc0a1e4):

/* User's best guess of source size.
* Hint is not valid when srcSizeHint == 0.
* There is no guarantee that hint is close to actual source size,
* but compression ratio may regress significantly if guess considerably underestimates */
#define ZSTD_c_srcSizeHint ZSTD_c_experimentalParam7

felixhandte (Contributor) commented:

To expand on @terrelln's comment: this is a tricky scenario that's hard to get right. What is optimal for dictionary compression on small inputs is not optimal for large inputs, and vice versa. In the absence of information, we have to make a guess, and sometimes that guess is going to be wrong. We may look into whether there are more balanced guesses that would work better on average, and ship that in the future. But as @terrelln suggests, if you even approximately know the size of the input you're going to stream, giving zstd that information via srcSizeHint will allow it to make the right selection for the specific compression in question.

I hope that helps!

terrelln added a commit to terrelln/zstd that referenced this issue Jan 4, 2021
Fixes facebook#2442.

1. When creating a dictionary keep the same behavior as before.
   Assume the source size is 513 bytes when adjusting parameters.
2. When calling ZSTD_getCParams() or ZSTD_adjustCParams() keep
   the same behavior as before.
3. When attaching a dictionary keep the same behavior of ignoring
   the dictionary size. When streaming this will select the
   largest parameters and not adjust them down. But, the CDict
   will use the correctly sized parameters, which seems like the
   right tradeoff.
4. When not attaching a dictionary (either forced not to, or
   using a prefix dictionary) we select parameters based on the
   dictionary size + source size, and assume the source size is
   small, which is the same behavior as before. But, now we don't
   adjust the window log (and hash and chain log) down when the
   source size is unknown.

When the source size is unknown all cdicts should attach, except
when the user disables attaching, or `forceWindow` is used. This
means that when streaming with a CDict we end up in the good case
where we get small CDict parameters, and large source parameters.

TODO: Add a streaming + dictionary regression test case.

phiresky commented Jan 4, 2021

if you even approximately know the input size that you're going to stream

Thanks, that's what I'm doing now, though the Rust wrapper (and, I'm guessing, many other frontends for zstd) doesn't have an option for that.

Considering that using a heuristic here can always lead to worse compression, I'd say the most fitting default would be either to refuse to do anything without a size estimate (or to warn when the heuristics turn out to have been a bad idea), or, better, to use the same default as without a dictionary (keep the windowLog as given) and add a note to the documentation. Especially since, when you have small data, you probably load it fully into RAM and thus don't need the streaming API.


terrelln commented Jan 4, 2021

Thanks, that's what I'm doing now, though the Rust wrapper (and I'm guessing many other frontends for zstd) don't have an option for that.

This is a somewhat new parameter, but as of about a year ago we've settled on the advanced API that we currently provide. It makes it very easy for us to add new parameters, since they all go through the same function, ZSTD_CCtx_setParameter(). Hopefully, all wrappers will eventually provide access to this advanced API, so their users can use new functionality as we add it.

Considering using a heuristic here can always lead to worse compression I'd say the most fitting default would be to either refuse doing anything without having a size estimate (or warn when the heuristics turn out to have been a bad idea), or better use the same default as used without a dictionary (keep the windowLog as given) and add a comment to the documentation. Especially since when you have small data you probably load it fully into RAM and thus don't need the streaming API.

I'm modifying the heuristic to be much less aggressive in shrinking the parameters in #2451. It was (incorrectly) shrinking the window size down to 1KB in your case, when we believe the "correct" choice for your dictionary size (20KB) would be to use the 128KB parameters.
