"Short cache" optimization for level 1-4 DMS (+5-30% compression speed) #3152
Conversation
Ok I only got a chance to look at the dfast version for now. I have a few questions. It might be worth finding a time to chat about them? Monday maybe?
You should also benchmark a couple of negative levels, since this affects them as well.
Before merging you'll want to squash this into one commit, so we don't end up with many non-building commits in the repo. You can either do that manually, or use one of the "squash" merge methods.
See changes since last review by @felixhandte and @terrelln here. Responded to all feedback; see PR summary for remaining blockers.
This is looking good!
Responded to all feedback from @terrelln. Will respond to feedback from @felixhandte tomorrow (just have to measure the proposed change to fast_DMS and fix a nit). See PR summary for remaining blockers.
Responded to all feedback. Will start fuzzers on my devserver and update the PR summary with extDict regression graphs.
Ship it!
Fuzzers didn't find anything (100 seconds * 10 workers for each target). Merging!
Cool optimization. I love the chart.
TLDR
This PR increases dictionary compression speed for small inputs (< 8KB at levels 1-2, < 16KB at levels 3-4) by 5-30%. The win is especially large when dictionaries are cold, i.e. in L3 cache or main memory.
Description of the optimization
Short cache is a change to fast/dfast CDict hashtable construction which allows the corresponding matchfinders to avoid unnecessary memory loads. A picture is worth 2^10 words:
The circle at the bottom of the diagram is where matchfinders use a dictionary index (loaded from the CDict hashtable) to load some position from the dictionary and compare it to the current position. Short cache allows us to prevent that load with high probability when the current position and dictionary position do not match. This is the common case in nearly all scenarios, and will usually prevent an L2 or even L3 cache miss.
How do we prevent the load? When the CDict hashtable is constructed, we insert an 8-bit independent hash (called a "tag") into the lower bits of each entry. We can do that since dictionary indices are less than 24 bits (this PR adds code to guarantee that), so we can pack an index and a tag into a single `U32`. In the matchfinder loop, we unpack the index and tag from the CDict hashtable, compute a tag at the current position, and only load the dictionary position if the tags match.
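To make the mechanism concrete, here is a minimal C sketch of the packing and of the tag check in the matchfinder. The constant and function names are illustrative only, not the actual zstd identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative names; the real constants live elsewhere in the codebase. */
#define TAG_BITS 8u
#define TAG_MASK ((1u << TAG_BITS) - 1u)

/* CDict hashtable construction: pack a (<24-bit) dictionary index together
 * with an 8-bit tag taken from an independent hash of that position. */
static uint32_t packEntry(uint32_t dictIndex, uint32_t tag)
{
    return (dictIndex << TAG_BITS) | (tag & TAG_MASK);
}

/* Matchfinder side: unpack the entry and only pay for the (likely cold)
 * dictionary load when the tags agree. */
static const uint8_t* maybeLoadDictMatch(uint32_t packedEntry,
                                         uint32_t currentTag,
                                         const uint8_t* dictBase)
{
    uint32_t const dictIndex = packedEntry >> TAG_BITS;
    if ((packedEntry & TAG_MASK) != (currentTag & TAG_MASK))
        return NULL;                 /* tags differ: skip the dictionary load */
    return dictBase + dictIndex;     /* tags match: load and compare bytes */
}
```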
Preliminary win measurements
Level 1:
![results0518c](https://user-images.githubusercontent.com/12179121/171701211-be08f830-a048-45dd-a8b1-6263dc6f19f4.png)
Level 3:
![newplot (2)](https://user-images.githubusercontent.com/12179121/171701439-d283afd0-e874-4710-b818-d9c817a1bd3a.png)
Final win measurements
The final measurements are comparable to the preliminary ones. Note that fast/hot/gcc/110K goes from 0% to -1% (this is on top of my previous fast DMS pipeline which increased the speed of that scenario by more than 5%). I didn't find any other regressions relative to dev. Regressions from preliminary to final seem to roughly cancel out improvements, and I don't think it's worth digging into which are measurement noise vs real changes. These are good win graphs and I think we should land them.
Benchmarking environment (machine, core isolation, etc.) was the same as for #3086 -- see that PR for details.
Level 1:
![newplot (7)](https://user-images.githubusercontent.com/12179121/173674863-644d3cac-f63e-47c5-a96e-6ca3bc43c3c3.png)
Level 3:
![newplot (6)](https://user-images.githubusercontent.com/12179121/173671619-ca5c3e6c-b571-42e1-982d-d665f67f3359.png)
Regression
The tags added by short cache need to be removed from the CDict hashtable when it is copied into the CCtx for extDict compression. This turns a normal memcpy into a vectorized shift-and-write loop. In a microbenchmark, I found that SSE2 shift-and-write is 2x slower than memcpy, while AVX2 is 1.5x slower. Note that we only pay this cost when using extDict for dictionary compression. Streaming compression is not affected.
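For illustration only (this is not the actual zstd code), the tag-stripping copy is roughly the following scalar loop, which compilers vectorize with SSE2 or AVX2 when allowed to:

```c
#include <stddef.h>
#include <stdint.h>

#define TAG_BITS 8u   /* illustrative name for the tag width */

/* Copy the CDict hashtable into the CCtx, dropping the 8-bit tag from each
 * entry so that only the dictionary index remains. This replaces what would
 * otherwise be a plain memcpy. */
static void copyTableStrippingTags(uint32_t* dst, const uint32_t* src, size_t nbEntries)
{
    size_t i;
    for (i = 0; i < nbEntries; i++) {
        dst[i] = src[i] >> TAG_BITS;
    }
}
```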
I measured the regression for levels 2 and 4 extDict. I chose those levels because they use at least 2x larger hashtables than the much more common levels 1 and 3. Thus, levels 2 and 4 upper-bound the regression (I confirmed this with measurements on levels 1 and 3).
The AVX2 numbers are for binaries compiled with `-march=core-avx2`. I will add AVX2 dynamic dispatch before our next open source release to ensure that binaries not compiled with `-march=core-avx2` can still benefit from those instructions when available.
Out of scope (will be a separate PR):
- Add AVX2 dynamic dispatch for the (minor) performance regression in `ZSTD_resetCCtx_byCopyingCDict`.
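For reference, one way such dynamic dispatch could be sketched with GCC/Clang builtins; the function names here are hypothetical and not the zstd API:

```c
#include <stddef.h>
#include <stdint.h>

#define TAG_BITS 8u

/* Baseline tag-stripping copy; auto-vectorized with whatever ISA the
 * translation unit is compiled for (SSE2 by default on x86-64). */
static void stripTags_scalar(uint32_t* dst, const uint32_t* src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) dst[i] = src[i] >> TAG_BITS;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Same loop, but this single function is compiled as if -mavx2 were passed,
 * so the compiler can emit 256-bit shifts and stores. */
__attribute__((target("avx2")))
static void stripTags_avx2(uint32_t* dst, const uint32_t* src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) dst[i] = src[i] >> TAG_BITS;
}
#endif

/* Runtime dispatch: use the AVX2 path only when the CPU supports it. */
static void stripTags(uint32_t* dst, const uint32_t* src, size_t n)
{
#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) { stripTags_avx2(dst, src, n); return; }
#endif
    stripTags_scalar(dst, src, n);
}
```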