
Add prefetchCDictTables CCtxParam (+10-20% cold dict compression speed) #3177

Merged: 5 commits into facebook:dev from embg:dms_prefetch2 on Jun 24, 2022

Conversation

@embg embg commented Jun 23, 2022

Summary

  • Adds a new CCtxParam (prefetchCDictTables)
  • Exposes it in the largeNbDicts benchmark
  • Adds it to fuzzers

Description of the optimization

In some situations, zstd uses CDict tables in-place rather than copying them into the working context. (See docs on ZSTD_dictAttachPref_e for details). In such situations, compression speed is seriously impacted when CDict tables are "cold" (outside CPU cache).
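For context, this in-place ("attach") behavior can also be forced through zstd's experimental ZSTD_c_forceAttachDict parameter. The following is a minimal sketch (not part of this PR) of compressing with attached CDict tables; error handling is omitted:

```c
/* Sketch only: force the CDict tables to be referenced in place ("attach")
 * rather than copied into the working context. ZSTD_c_forceAttachDict and
 * ZSTD_dictForceAttach are part of zstd's experimental API
 * (ZSTD_STATIC_LINKING_ONLY). */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>

static size_t compressAttached(ZSTD_CCtx* cctx, const ZSTD_CDict* cdict,
                               void* dst, size_t dstCapacity,
                               const void* src, size_t srcSize)
{
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_forceAttachDict, ZSTD_dictForceAttach);
    ZSTD_CCtx_refCDict(cctx, cdict);
    return ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
}
```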

This PR adds a CCtxParam (prefetchCDictTables) which instructs zstd to prefetch CDict tables when they are used in-place (specifically in level 1-4 dictMatchState matchfinders). For sufficiently small inputs, the cost of the prefetch will outweigh the benefit. For sufficiently large inputs, zstd will by default memcpy() CDict tables into the working context, so there is no need to prefetch. This parameter is targeted at a middle range of input sizes, where a prefetch is cheap enough to be useful but memcpy() is too expensive.

The exact range of input sizes where this makes sense is best determined by careful experimentation (see below for measurements on one particular machine / dataset which demonstrate 10-20% wins for a particular working set size and input size). Rather than enabling this param for all inputs, the code which calls ZSTD_compress2() should use a size cutoff (tuned via experimentation) to select the best prefetch strategy for each input.
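For illustration, here is a hedged sketch of that strategy, assuming the parameter is exposed as ZSTD_c_prefetchCDictTables in zstd's experimental API and using a hypothetical 8 KB cutoff (the real cutoff must be tuned per machine and dataset):

```c
/* Sketch only: select the prefetch strategy per input via a size cutoff.
 * ZSTD_c_prefetchCDictTables and ZSTD_ps_enable/ZSTD_ps_disable require
 * zstd's experimental API (ZSTD_STATIC_LINKING_ONLY). The 8 KB cutoff is a
 * hypothetical placeholder to be tuned by experimentation. */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>

static size_t compressWithTunedPrefetch(ZSTD_CCtx* cctx, const ZSTD_CDict* cdict,
                                        void* dst, size_t dstCapacity,
                                        const void* src, size_t srcSize)
{
    size_t const prefetchCutoff = 8 * 1024;  /* hypothetical; tune experimentally */
    ZSTD_CCtx_refCDict(cctx, cdict);
    /* Enable the prefetch only for inputs large enough to amortize its cost. */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_prefetchCDictTables,
                           srcSize >= prefetchCutoff ? ZSTD_ps_enable : ZSTD_ps_disable);
    return ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
}
```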

Measurements

I measured the effect of this param on the HTML dataset. I benchmarked on an Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz machine with core isolation and turbo disabled.

(benchmark plot)

We can see that the param is harmful for level 3, even in the cold CDict scenario, if the inputs are small enough (0-8K). For larger inputs (8-16K) at the same level, we see up to 20% wins. This demonstrates the need for selective application of this param.

@felixhandte felixhandte left a comment

LGTM. Just a couple nits.

Review threads (resolved):
  • lib/compress/zstd_double_fast.c
  • lib/compress/zstd_fast.c
  • lib/compress/zstd_compress.c
@felixhandte felixhandte left a comment

Assuming tests pass, ship it!

@embg embg merged commit e9d6fc8 into facebook:dev Jun 24, 2022
@embg embg deleted the dms_prefetch2 branch June 24, 2022 15:24
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023