Add prefetchCDictTables CCtxParam (+10-20% cold dict compression speed) #3177
Summary
Description of the optimization
In some situations, zstd uses CDict tables in-place rather than copying them into the working context (see the docs on ZSTD_dictAttachPref_e for details). In such situations, compression speed is seriously impacted when CDict tables are "cold" (outside CPU cache).
This PR adds a CCtxParam (prefetchCDictTables) which instructs zstd to prefetch CDict tables when they are used in-place (specifically in level 1-4 dictMatchState matchfinders). For sufficiently small inputs, the cost of the prefetch will outweigh the benefit. For sufficiently large inputs, zstd will by default memcpy() CDict tables into the working context, so there is no need to prefetch. This parameter is targeted at a middle range of input sizes, where a prefetch is cheap enough to be useful but memcpy() is too expensive.
The exact range of input sizes where this makes sense is best determined by careful experimentation (see below for measurements on one particular machine / dataset which demonstrate 10-20% wins for a particular working set size and input size). Rather than enabling this param for all inputs, the code which calls ZSTD_compress2() should use a size cutoff (tuned via experimentation) to select the best prefetch strategy for each input.
Measurements
I measured the effect of this param on the HTML dataset. I benchmarked on an Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz machine with core isolation and turbo disabled.
We can see that the param is harmful for level 3, even in the cold CDict scenario, if the inputs are small enough (0-8K). For larger inputs (8-16K) at the same level, we see up to 20% wins. This demonstrates the need for selective application of this param.
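As a rough illustration of selective application, a caller could pick the prefetch setting per input from a size window. This is a minimal sketch, not code from this PR: `prefetch_choice`, `prefetch_window`, and `choose_prefetch` are hypothetical names, and the window bounds are placeholders that would need to be tuned on the target machine and dataset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-state choice; a real caller would map this onto the
 * value it passes for the prefetchCDictTables parameter. */
typedef enum { PREFETCH_DISABLE = 0, PREFETCH_ENABLE = 1 } prefetch_choice;

/* Hypothetical tuning knobs: prefetch only when minSize <= srcSize < maxSize.
 * Both bounds are assumptions to be found by benchmarking. */
typedef struct {
    size_t minSize; /* below this, the prefetch costs more than it saves */
    size_t maxSize; /* at or above this, leave the default behavior alone */
} prefetch_window;

/* Select the prefetch setting for one input based on its size. */
static prefetch_choice choose_prefetch(size_t srcSize, prefetch_window w)
{
    assert(w.minSize <= w.maxSize);
    return (srcSize >= w.minSize && srcSize < w.maxSize)
               ? PREFETCH_ENABLE
               : PREFETCH_DISABLE;
}

/* Usage idea (assumption: the parameter is exposed through the experimental
 * API as ZSTD_c_prefetchCDictTables, taking ZSTD_ps_enable / ZSTD_ps_disable):
 *
 *   prefetch_window w = { 8 * 1024, 16 * 1024 };   // placeholder bounds
 *   int on = choose_prefetch(srcSize, w) == PREFETCH_ENABLE;
 *   ZSTD_CCtx_setParameter(cctx, ZSTD_c_prefetchCDictTables,
 *                          on ? ZSTD_ps_enable : ZSTD_ps_disable);
 *   ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
 */
```

The 8K lower bound in the usage comment mirrors the level 3 measurements above; the 16K upper bound is only the edge of the measured range, not a known limit, so both should be re-derived for other levels and hardware.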