"Short cache" optimization for level 1-4 DMS (+5-30% compression speed) #3152
Conversation
Ok I only got a chance to look at the dfast version for now. I have a few questions. It might be worth finding a time to chat about them? Monday maybe?
You should also benchmark a couple of negative levels, since this affects them as well.
Before merging you'll want to squash this into one commit, so we don't end up with many non-building commits in the repo. You can either do that manually, or use one of the "squash" merge methods.
See changes since last review by @felixhandte and @terrelln here. Responded to all feedback; see PR summary for remaining blockers.
This is looking good!
Responded to all feedback from @terrelln. Will respond to feedback from @felixhandte tomorrow (just have to measure the proposed change to fast_DMS and fix a nit). See PR summary for remaining blockers.
Responded to all feedback. Will start fuzzers on my devserver and update the PR summary with extDict regression graphs.
Ship it!
Fuzzers didn't find anything (100 seconds * 10 workers for each target). Merging!
Cool optimization. I love the chart.
TLDR
This PR increases dictionary compression speed for small inputs (< 8KB at levels 1-2, < 16KB at levels 3-4) by 5-30%. The win is especially large when dictionaries are cold, i.e. in L3 cache or main memory.
Description of the optimization
Short cache is a change to fast/dfast CDict hashtable construction which allows the corresponding matchfinders to avoid unnecessary memory loads. A picture is worth 2^10 words:
The circle at the bottom of the diagram is where matchfinders use a dictionary index (loaded from the CDict hashtable) to load some position from the dictionary and compare it to the current position. Short cache allows us to prevent that load with high probability when the current position and dictionary position do not match. This is the common case in nearly all scenarios, and will usually prevent an L2 or even L3 cache miss.
How do we prevent the load? When the CDict hashtable is constructed, we insert an 8-bit independent hash (called a "tag") into the lower bits of each entry. We can do that since dictionary indices are less than 24 bits (this PR adds code to guarantee that), so we can pack an index and a tag into a single `U32`. In the matchfinder loop, we unpack the index and tag from the CDict hashtable, compute a tag at the current position, and only load the dictionary position if the tags match.
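To make the mechanism concrete, here is a minimal C sketch of the packing and of the tag check in the matchfinder. The constant and function names are illustrative only, not the actual zstd identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative names; the real constants live elsewhere in the codebase. */
#define TAG_BITS 8u
#define TAG_MASK ((1u << TAG_BITS) - 1u)

/* CDict hashtable construction: pack a (<24-bit) dictionary index together
 * with an 8-bit tag taken from an independent hash of that position. */
static uint32_t packEntry(uint32_t dictIndex, uint32_t tag)
{
    return (dictIndex << TAG_BITS) | (tag & TAG_MASK);
}

/* Matchfinder side: unpack the entry and only pay for the (likely cold)
 * dictionary load when the tags agree. */
static const uint8_t* maybeLoadDictMatch(uint32_t packedEntry,
                                         uint32_t currentTag,
                                         const uint8_t* dictBase)
{
    uint32_t const dictIndex = packedEntry >> TAG_BITS;
    if ((packedEntry & TAG_MASK) != (currentTag & TAG_MASK))
        return NULL;                 /* tags differ: skip the dictionary load */
    return dictBase + dictIndex;     /* tags match: load and compare bytes */
}
```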
Preliminary win measurements
Level 1:
![results0518c](https://user-images.githubusercontent.com/12179121/171701211-be08f830-a048-45dd-a8b1-6263dc6f19f4.png)
Level 3:
![newplot (2)](https://user-images.githubusercontent.com/12179121/171701439-d283afd0-e874-4710-b818-d9c817a1bd3a.png)
Final win measurements
The final measurements are comparable to the preliminary ones. Note that fast/hot/gcc/110K goes from 0% to -1% (this is on top of my previous fast DMS pipeline which increased the speed of that scenario by more than 5%). I didn't find any other regressions relative to dev. Regressions from preliminary to final seem to roughly cancel out improvements, and I don't think it's worth digging into which are measurement noise vs real changes. These are good win graphs and I think we should land them.
Benchmarking environment (machine, core isolation, etc.) was the same as for #3086 -- see that PR for details.
Level 1:
![newplot (7)](https://user-images.githubusercontent.com/12179121/173674863-644d3cac-f63e-47c5-a96e-6ca3bc43c3c3.png)
Level 3:
![newplot (6)](https://user-images.githubusercontent.com/12179121/173671619-ca5c3e6c-b571-42e1-982d-d665f67f3359.png)
Regression
The tags added by short cache need to be removed from the CDict hashtable when it is copied into the CCtx for extDict compression. This turns a normal memcpy into a vectorized shift-and-write loop. In a microbenchmark, I found that SSE2 shift-and-write is 2x slower than memcpy, while AVX2 is 1.5x slower. Note that we only pay this cost when using extDict for dictionary compression. Streaming compression is not affected.
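For illustration only (this is not the actual zstd code), the tag-stripping copy is roughly the following scalar loop, which compilers vectorize with SSE2 or AVX2 when allowed to:

```c
#include <stddef.h>
#include <stdint.h>

#define TAG_BITS 8u   /* illustrative name for the tag width */

/* Copy the CDict hashtable into the CCtx, dropping the 8-bit tag from each
 * entry so that only the dictionary index remains. This replaces what would
 * otherwise be a plain memcpy. */
static void copyTableStrippingTags(uint32_t* dst, const uint32_t* src, size_t nbEntries)
{
    size_t i;
    for (i = 0; i < nbEntries; i++) {
        dst[i] = src[i] >> TAG_BITS;
    }
}
```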
I measured the regression for levels 2 and 4 extDict. I chose those levels because they use at least 2x larger hashtables than the much more common levels 1 and 3. Thus, levels 2 and 4 upper-bound the regression (I confirmed this with measurements on levels 1 and 3).
The AVX2 numbers are for binaries compiled with `-march=core-avx2`. I will add AVX2 dynamic dispatch before our next open source release to ensure that binaries not compiled with `-march=core-avx2` can still benefit from those instructions when available.
Out of scope (will be a separate PR):
- Add AVX2 dynamic dispatch for the (minor) performance regression in `ZSTD_resetCCtx_byCopyingCDict`.
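For reference, one way such dynamic dispatch could be sketched with GCC/Clang builtins; the function names here are hypothetical and not the zstd API:

```c
#include <stddef.h>
#include <stdint.h>

#define TAG_BITS 8u

/* Baseline tag-stripping copy; auto-vectorized with whatever ISA the
 * translation unit is compiled for (SSE2 by default on x86-64). */
static void stripTags_scalar(uint32_t* dst, const uint32_t* src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) dst[i] = src[i] >> TAG_BITS;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Same loop, but this single function is compiled as if -mavx2 were passed,
 * so the compiler can emit 256-bit shifts and stores. */
__attribute__((target("avx2")))
static void stripTags_avx2(uint32_t* dst, const uint32_t* src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) dst[i] = src[i] >> TAG_BITS;
}
#endif

/* Runtime dispatch: use the AVX2 path only when the CPU supports it. */
static void stripTags(uint32_t* dst, const uint32_t* src, size_t n)
{
#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) { stripTags_avx2(dst, src, n); return; }
#endif
    stripTags_scalar(dst, src, n);
}
```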