Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For ZSTD_compressStream2(), let cdict take compression level priority #2303

Merged
merged 3 commits into from
Oct 9, 2020

Conversation

senhuang42
Copy link
Contributor

@senhuang42 senhuang42 commented Sep 10, 2020

This PR does resolves two issues brought up in #2300, discovered by @animalize:

  1. When using ZSTD_compressStream2(), if a cdict was used during compression, that cdict's compression level was not respected. Now the cdict will take priority in terms of compression level, but will not change cctx->requestedParams.
  2. Fixes ZSTD_createCDict_byReference() to use the overloaded compressionLevel = 0 behavior.

No unit tests included in this PR, but tested locally (and by @animalize in #2300) to confirm that using ZSTD_compressStream2() does in fact take on the cdict compression level when compressing files that would cause a dictionary reload, and that everything else works as usual.

@@ -3469,9 +3469,12 @@ ZSTD_CDict* ZSTD_createCDict(const void* dict, size_t dictSize, int compressionL
ZSTD_CDict* ZSTD_createCDict_byReference(const void* dict, size_t dictSize, int compressionLevel)
{
ZSTD_compressionParameters cParams = ZSTD_getCParams_internal(compressionLevel, ZSTD_CONTENTSIZE_UNKNOWN, dictSize);
return ZSTD_createCDict_advanced(dict, dictSize,
ZSTD_CDict* cdict = ZSTD_createCDict_advanced(dict, dictSize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor : coding style :
prefer ZSTD_CDict* const cdict =
which indicates that, once created, the cdict pointer will never change.

@@ -3989,6 +3992,8 @@ size_t ZSTD_compressStream2( ZSTD_CCtx* cctx,
FORWARD_IF_ERROR( ZSTD_initLocalDict(cctx) , ""); /* Init the local dict if present. */
ZSTD_memset(&cctx->prefixDict, 0, sizeof(cctx->prefixDict)); /* single usage */
assert(prefixDict.dict==NULL || cctx->cdict==NULL); /* only one can be set */
if (cctx->cdict)
cctx->requestedParams.compressionLevel = cctx->cdict->compressionLevel; /* let cdict take priority in terms of compression level */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This operation overwrites a parameter set by the user.
We generally try to avoid that,
it can have surprising impacts in later usages of cctx since these parameters are sticky.

That's also the reason for the existence of appliedParams,
which is where dynamic adaptations are performed,
there they do not pollute user's explicit requests, nor survive the current session.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked into this issue some more - it seems like this behavior also isn't consistent between ZSTD_compress_usingCDict() and ZSTD_compressStream2(). ZSTD_compress_usingCDict() doesn't seem to have this issue, and always will use the CDict compression level.

The reason is: in ZSTD_compress_usingCDict(), we essentially have:

ZSTD_CCtx_params params = cctx->requestedParams
[...]
params.cParams = ZSTD_getCParamsFromCDict() OR params.cParams = ZSTD_getCParams()
// where both these functions take cdict compression level into account

Whereas, in ZSTD_compressStream2(), we have

ZSTD_CCtx_params params = cctx->requestedParams
params.cParams = ZSTD_getCParamsFromCCtxParams(&cctx->requestedParams, cctx->pledgedSrcSizePlusOne-1, 0 /*dictSize*/));
// followed immediately (within ZSTD_resetCStream_internal) by
params.cParams = ZSTD_getCParamsFromCCtxParams(&params, pledgedSrcSize, dictSize);
// in neither case is cdict compression level taken into account

I guess I'm a little perplexed here - what's the point of calling ZSTD_getCParamsFromCCtxParams() twice back-to-back, with what are essentially the same arguments (with the exception of dictSize)? Doesn't the second call just overwrite?

Copy link
Contributor

@Cyan4973 Cyan4973 Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's likely redundant in this case.

But params is also used for other purposes when ZSTD_MULTITHREAD is defined :

  • It makes it possible to adjust params.nbWorkers without modifying cctx->requestedParams
  • It's a parameter of ZSTDMT_initCStream_internal(), where it's presumed correctly adjusted.

Also, consider that calling ZSTD_getCParamsFromCCtxParams() twice is not very significant in term of performance, making this issue minor.

Nevertheless, if you believe you can improve the situation, you are welcomed.

@Cyan4973
Copy link
Contributor

2 minor comments regarding the PR's presentation :

  • reference the issues that are solved by this PR (it will also automatically close them on merge)
  • name the contributor which inspired these changes

@@ -3989,10 +3992,11 @@ size_t ZSTD_compressStream2( ZSTD_CCtx* cctx,
FORWARD_IF_ERROR( ZSTD_initLocalDict(cctx) , ""); /* Init the local dict if present. */
ZSTD_memset(&cctx->prefixDict, 0, sizeof(cctx->prefixDict)); /* single usage */
assert(prefixDict.dict==NULL || cctx->cdict==NULL); /* only one can be set */
if (cctx->cdict)
Copy link
Contributor Author

@senhuang42 senhuang42 Sep 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some more digging, the reason for the cdict passing on the correct clevel for filesizes + dictsizes that do not qualify for reloading is that downstream in ZSTD_compressBegin_internal(), the CCtx is readjusted using ZSTD_resetCCtx_usingCDict() with the compression level of the cdict.

However, in the case of reloading, the function called to adjust the cctx is ZSTD_resetCCtx_internal() which doesn't take into account the cdict.

So I think the mitigation is to adjust the params upstream (which is a copy of requestedParams), and later on if we need to reload the dict and call ZSTD_resetCCtx_internal(), this very params which may have the cdict compression level, will be copied into the cctx->appliedParams during ZSTD_resetCCtx_internal(), and in this case, we don't modify the cctx->requestedParams if the cctx needs to be used in the future for some reason.

I'm not sure if this is exactly what you meant by using the appliedParams, but it does seem like the appliedParams are mainly touched in the ZSTD_resetCCtx_internal(), and in this case we are essentially modifying appliedParams by modifying params.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine.

appliedParams is the place where the final result of parameter adjustments will be stored.

Before reaching that point, appliedParams could also be used as a form of "temporary storage", where intermediate adjustment results are saved.
However, this could prove messy and confusing, hence separating the 2 topics, by creating and using a dedicated params structure for the purpose of temporary storage and intermediate adjustment is fine, probably clearer.

@senhuang42 senhuang42 linked an issue Sep 14, 2020 that may be closed by this pull request
@senhuang42
Copy link
Contributor Author

Are there any further follow-up/changes required on this PR to merge?

@Cyan4973 Cyan4973 merged commit 6fdb0cb into facebook:dev Oct 9, 2020
@ghost ghost mentioned this pull request Dec 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

compressionLevel is not respected in this case
3 participants