-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
For ZSTD_compressStream2(), let cdict take compression level priority #2303
For ZSTD_compressStream2(), let cdict take compression level priority #2303
Conversation
…hen using compressStream2
lib/compress/zstd_compress.c
Outdated
@@ -3469,9 +3469,12 @@ ZSTD_CDict* ZSTD_createCDict(const void* dict, size_t dictSize, int compressionL | |||
ZSTD_CDict* ZSTD_createCDict_byReference(const void* dict, size_t dictSize, int compressionLevel) | |||
{ | |||
ZSTD_compressionParameters cParams = ZSTD_getCParams_internal(compressionLevel, ZSTD_CONTENTSIZE_UNKNOWN, dictSize); | |||
return ZSTD_createCDict_advanced(dict, dictSize, | |||
ZSTD_CDict* cdict = ZSTD_createCDict_advanced(dict, dictSize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor : coding style :
prefer ZSTD_CDict* const cdict =
which indicates that, once created, the cdict
pointer will never change.
lib/compress/zstd_compress.c
Outdated
@@ -3989,6 +3992,8 @@ size_t ZSTD_compressStream2( ZSTD_CCtx* cctx, | |||
FORWARD_IF_ERROR( ZSTD_initLocalDict(cctx) , ""); /* Init the local dict if present. */ | |||
ZSTD_memset(&cctx->prefixDict, 0, sizeof(cctx->prefixDict)); /* single usage */ | |||
assert(prefixDict.dict==NULL || cctx->cdict==NULL); /* only one can be set */ | |||
if (cctx->cdict) | |||
cctx->requestedParams.compressionLevel = cctx->cdict->compressionLevel; /* let cdict take priority in terms of compression level */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This operation overwrites a parameter set by the user.
We generally try to avoid that,
it can have surprising impacts in later usages of cctx
since these parameters are sticky.
That's also the reason for the existence of appliedParams
,
which is where dynamic adaptations are performed,
there they do not pollute user's explicit requests, nor survive the current session.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've looked into this issue some more - it seems like this behavior also isn't consistent between ZSTD_compress_usingCDict()
and ZSTD_compressStream2()
. ZSTD_compress_usingCDict()
doesn't seem to have this issue, and always will use the CDict compression level.
The reason is: in ZSTD_compress_usingCDict()
, we essentially have:
ZSTD_CCtx_params params = cctx->requestedParams
[...]
params.cParams = ZSTD_getCParamsFromCDict() OR params.cParams = ZSTD_getCParams()
// where both these functions take cdict compression level into account
Whereas, in ZSTD_compressStream2()
, we have
ZSTD_CCtx_params params = cctx->requestedParams
params.cParams = ZSTD_getCParamsFromCCtxParams(&cctx->requestedParams, cctx->pledgedSrcSizePlusOne-1, 0 /*dictSize*/));
// followed immediately (within ZSTD_resetCStream_internal) by
params.cParams = ZSTD_getCParamsFromCCtxParams(¶ms, pledgedSrcSize, dictSize);
// in neither case is cdict compression level taken into account
I guess I'm a little perplexed here - what's the point of calling ZSTD_getCParamsFromCCtxParams()
twice back-to-back, with what are essentially the same arguments (with the exception of dictSize)? Doesn't the second call just overwrite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's likely redundant in this case.
But params
is also used for other purposes when ZSTD_MULTITHREAD
is defined :
- It makes it possible to adjust
params.nbWorkers
without modifyingcctx->requestedParams
- It's a parameter of
ZSTDMT_initCStream_internal()
, where it's presumed correctly adjusted.
Also, consider that calling ZSTD_getCParamsFromCCtxParams()
twice is not very significant in term of performance, making this issue minor.
Nevertheless, if you believe you can improve the situation, you are welcomed.
2 minor comments regarding the PR's presentation :
|
@@ -3989,10 +3992,11 @@ size_t ZSTD_compressStream2( ZSTD_CCtx* cctx, | |||
FORWARD_IF_ERROR( ZSTD_initLocalDict(cctx) , ""); /* Init the local dict if present. */ | |||
ZSTD_memset(&cctx->prefixDict, 0, sizeof(cctx->prefixDict)); /* single usage */ | |||
assert(prefixDict.dict==NULL || cctx->cdict==NULL); /* only one can be set */ | |||
if (cctx->cdict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some more digging, the reason for the cdict
passing on the correct clevel for filesizes + dictsizes that do not qualify for reloading is that downstream in ZSTD_compressBegin_internal()
, the CCtx is readjusted using ZSTD_resetCCtx_usingCDict()
with the compression level of the cdict
.
However, in the case of reloading, the function called to adjust the cctx
is ZSTD_resetCCtx_internal()
which doesn't take into account the cdict
.
So I think the mitigation is to adjust the params
upstream (which is a copy of requestedParams), and later on if we need to reload the dict and call ZSTD_resetCCtx_internal()
, this very params
which may have the cdict
compression level, will be copied into the cctx->appliedParams
during ZSTD_resetCCtx_internal()
, and in this case, we don't modify the cctx->requestedParams
if the cctx
needs to be used in the future for some reason.
I'm not sure if this is exactly what you meant by using the appliedParams
, but it does seem like the appliedParams
are mainly touched in the ZSTD_resetCCtx_internal()
, and in this case we are essentially modifying appliedParams
by modifying params
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fine.
appliedParams
is the place where the final result of parameter adjustments will be stored.
Before reaching that point, appliedParams
could also be used as a form of "temporary storage", where intermediate adjustment results are saved.
However, this could prove messy and confusing, hence separating the 2 topics, by creating and using a dedicated params
structure for the purpose of temporary storage and intermediate adjustment is fine, probably clearer.
Are there any further follow-up/changes required on this PR to merge? |
This PR does resolves two issues brought up in #2300, discovered by @animalize:
cdict
was used during compression, that cdict's compression level was not respected. Now thecdict
will take priority in terms of compression level, but will not changecctx->requestedParams
.ZSTD_createCDict_byReference()
to use the overloadedcompressionLevel = 0
behavior.No unit tests included in this PR, but tested locally (and by @animalize in #2300) to confirm that using
ZSTD_compressStream2()
does in fact take on thecdict
compression level when compressing files that would cause a dictionary reload, and that everything else works as usual.