
"Short cache" optimization for level 1-4 DMS (+5-30% compression speed) #3152

Merged · 26 commits · Jun 21, 2022

Conversation

@embg (Contributor) commented Jun 2, 2022

TLDR

This PR increases dictionary compression speed for small inputs (< 8KB at levels 1-2, < 16KB at levels 3-4) by 5-30%. The win is especially large when dictionaries are cold, i.e. in L3 cache or main memory.

Description of the optimization

Short cache is a change to fast/dfast CDict hashtable construction that allows the corresponding matchfinders to avoid unnecessary memory loads. A picture is worth 2^10 words:

[Diagram: short_cache_diagram]

The circle at the bottom of the diagram is where matchfinders use a dictionary index (loaded from the CDict hashtable) to load some position from the dictionary and compare it to the current position. Short cache allows us to prevent that load with high probability when the current position and dictionary position do not match. This is the common case in nearly all scenarios, and will usually prevent an L2 or even L3 cache miss.

How do we prevent the load? When the CDict hashtable is constructed, we insert an 8-bit independent hash (called a "tag") into the lower bits of each entry. We can do that since dictionary indices are less than 24 bits (this PR adds code to guarantee that), so we can pack an index and tag into a single U32. In the matchfinder loop, we unpack the index and tag from the CDict hashtable, compute a tag at the current position, and only load the dictionary position if the tags match.
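The packing scheme can be sketched in a few lines of C. This is a minimal illustration, not the PR's actual code: the helper names (`pack_tagged_index`, `unpack_index`, `tags_match`) are made up for clarity, while the real implementation lives in helpers like `ZSTD_writeTaggedIndex`. It assumes dictionary indices fit in 24 bits, which this PR guarantees.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_BITS 8
#define TAG_MASK ((1u << TAG_BITS) - 1u)  /* 0xFF */

/* Pack a dictionary index (guaranteed < 2^24) and an 8-bit tag
 * into a single U32 hashtable entry: [ 24-bit index | 8-bit tag ]. */
static uint32_t pack_tagged_index(uint32_t index, uint32_t tag) {
    return (index << TAG_BITS) | (tag & TAG_MASK);
}

static uint32_t unpack_index(uint32_t entry) { return entry >> TAG_BITS; }
static uint32_t unpack_tag(uint32_t entry)   { return entry & TAG_MASK; }

/* The matchfinder only dereferences the dictionary position when the
 * stored tag matches the tag computed at the current position, which
 * skips the load (and likely a cache miss) on most non-matches. */
static int tags_match(uint32_t entry, uint32_t currentTag) {
    return (entry & TAG_MASK) == (currentTag & TAG_MASK);
}
```

Since the tag is an independent 8-bit hash, two positions that do not match collide on the tag only about 1/256 of the time, so the guard filters out almost all wasted dictionary loads.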

Preliminary win measurements

Level 1:
[Graph: results0518c]

Level 3:
[Graph: newplot (2)]

Final win measurements

The final measurements are comparable to the preliminary ones. Note that fast/hot/gcc/110K goes from 0% to -1% (this is on top of my previous fast DMS pipeline which increased the speed of that scenario by more than 5%). I didn't find any other regressions relative to dev. Regressions from preliminary to final seem to roughly cancel out improvements, and I don't think it's worth digging into which are measurement noise vs real changes. These are good win graphs and I think we should land them.

Benchmarking environment (machine, core isolation, etc) was the same as for #3086 -- see that PR for details.

Level 1:
[Graph: newplot (7)]

Level 3:
[Graph: newplot (6)]

Regression

The tags added by short cache need to be removed from the CDict hashtable when it is copied into the CCtx for extDict compression. This turns a normal memcpy into a vectorized shift-and-write loop. In a microbenchmark, I found that SSE2 shift-and-write is 2x slower than memcpy, while AVX2 is 1.5x slower. Note that we only pay this cost when using extDict for dictionary compression. Streaming compression is not affected.
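The shift-and-write loop that replaces the memcpy can be sketched as follows. This is an illustrative scalar version under assumed names (`copy_table_strip_tags` is hypothetical); the actual code path, and whether it is vectorized with SSE2 or AVX2, depends on the compiler and build flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_BITS 8

/* Copy the CDict hashtable into the CCtx for extDict compression,
 * stripping the 8-bit tag from each entry so the destination holds
 * plain dictionary indices. Compilers typically auto-vectorize this
 * loop, but it remains slower than a straight memcpy because every
 * element must be shifted before it is stored. */
static void copy_table_strip_tags(uint32_t* dst, const uint32_t* src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] >> TAG_BITS;
    }
}
```

The extra shift per element is why the microbenchmark shows SSE2 at roughly 2x and AVX2 at roughly 1.5x the cost of memcpy.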

I measured the regression for level2 and level4 extDict. I chose those levels because they use at least 2x larger hashtables than the much more common levels 1 and 3. Thus, level2 and level4 upper-bound the regression (I confirmed this with measurements on level1 and level3).

The AVX2 numbers are for binaries compiled with -march=core-avx2. I will add AVX2 dynamic dispatch before our next open source release to ensure that binaries not compiled with -march=core-avx2 can still benefit from those instructions when available.

[Graph: newplot (13)]


  • Respond to initial code review
  • Update the PR description with a summary of the changes and more information on measurements.
  • Add discussion of the (negligible) ratio impact.
  • Add documentation to the code itself (see marked location in zstd_compress_internal.h).
  • Add measurements to the PR description regarding the byCopyingCDict regression.

Out of scope (will be a separate PR):

  • Add AVX2 dynamic dispatch for the (minor) performance regression in ZSTD_resetCCtx_byCopyingCDict.

@embg embg marked this pull request as draft June 2, 2022 18:44
@embg embg changed the title "Short cache" optimization for level 1-4 dictMatchState (5% to 30% speed win for a typical usecase) "Short cache" optimization for level 1-4 DMS (5% to 30% speed win for a typical usecase) Jun 2, 2022
@felixhandte (Contributor) left a comment:

Ok I only got a chance to look at the dfast version for now. I have a few questions. It might be worth finding a time to chat about them? Monday maybe?

You should also benchmark a couple of negative levels, since this affects them as well.

@terrelln (Contributor) commented Jun 3, 2022

Before merging you'll want to squash this into one commit, so we don't end up with many non-building commits in the repo.

You can either do that manually, or use one of the "squash" merge methods.

@embg force-pushed the dms_short_cache2 branch from 7f133dc to a040b1d on June 13, 2022
@embg (Contributor, Author) commented Jun 13, 2022

See changes since last review by @felixhandte and @terrelln here. Responded to all feedback:

  • Cherry-pick add short cache to ip+1 long search
  • Truncate dictionaries over 16 MB in ZSTD_loadDictionaryContent
  • Move tag size into hashBits variables
  • Added ZSTD_ prefix to writeTaggedIndex
  • Added assert hBits <= 32 to ZSTD_hashPtr
  • Scale hashLog down to 24 in ZSTD_adjustCParams_internal
  • Split memory and arithmetic in ZSTD_resetCCtx_byCopyingCDict
  • Moved ZSTD_tableFillPurpose_e next to ZSTD_dictTableLoadMethod_e
  • Pull out tag comparison logic into ZSTD_comparePackedTags
  • Convert dictTagsMatch boolean variables from size_t to int

See PR summary for remaining blockers.

@embg embg changed the title "Short cache" optimization for level 1-4 DMS (5% to 30% speed win for a typical usecase) "Short cache" optimization for level 1-4 DMS (5% to 30% typical speed win) Jun 13, 2022
@embg embg changed the title "Short cache" optimization for level 1-4 DMS (5% to 30% typical speed win) "Short cache" optimization for level 1-4 DMS (5-30% typical compression speed win) Jun 13, 2022
@embg embg changed the title "Short cache" optimization for level 1-4 DMS (5-30% typical compression speed win) "Short cache" optimization for level 1-4 DMS (+5-30% compression speed) Jun 13, 2022
@terrelln (Contributor) left a comment:

This is looking good!

@embg (Contributor, Author) commented Jun 16, 2022

Responded to all feedback from @terrelln. Will respond to feedback from @felixhandte tomorrow (just have to measure the proposed change to fast_DMS and fix a nit). See PR summary for remaining blockers.

@embg embg marked this pull request as ready for review June 17, 2022 19:01
@embg (Contributor, Author) commented Jun 17, 2022

Responded to all feedback. Will start fuzzers on my devserver and update PR summary with extDict regression graphs.

@felixhandte (Contributor) left a comment:

Ship it!

@embg (Contributor, Author) commented Jun 21, 2022

Fuzzers didn't find anything (100 seconds * 10 workers for each target). Merging!

@embg embg merged commit f6ef143 into facebook:dev Jun 21, 2022
@nadavrot commented Sep 7, 2022

Cool optimization. I love the chart.

@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023