[lazy] Skip over incompressible data #3552

terrelln · 2023-03-14T22:42:43Z

Every 256 bytes the lazy match finders process without finding a match, they will increase their step size by 1. So for bytes [0, 256) they search every position, for bytes [256, 512) they search every other position, and so on. However, they currently still insert every position into their hash tables. This is different from fast & dfast, which only insert the positions they search.
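
The step schedule described above can be illustrated with a toy model. This is a sketch rather than zstd's actual code: the names `STEP_LOG`, `step_size`, and `searched_positions` are hypothetical, and the model assumes that the step depends only on the number of bytes processed since the last match and that no match is ever found.

```python
# Toy model of the search step schedule (not zstd's implementation).
# Assumption: the step grows by 1 for every 2**STEP_LOG = 256 unmatched bytes.
STEP_LOG = 8


def step_size(unmatched_bytes: int) -> int:
    """Step used after `unmatched_bytes` bytes were processed without a match."""
    return 1 + (unmatched_bytes >> STEP_LOG)


def searched_positions(length: int) -> list:
    """Positions that get searched in a run of `length` bytes with no matches."""
    positions = []
    pos = 0
    while pos < length:
        positions.append(pos)
        pos += step_size(pos)
    return positions


if __name__ == "__main__":
    # Print the step in effect at the start of each 256-byte window.
    for offset in range(0, 2048 + 1, 256):
        print(offset, step_size(offset))
```

In this model every position in [0, 256) is searched, every other position in [256, 512), and the step reaches 9 once 2KB have gone by without a match.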

This PR changes that. Once we've searched 2KB without finding any matches, at which point we're only searching one in 9 positions, we stop inserting every position and only insert the positions we search. The exact cutoff of 2KB isn't terribly important; I've just selected a cutoff that is reasonably large, to minimize the impact on "normal" data.
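
To make the effect of the cutoff concrete, here is a small simulation that counts searches and hash-table insertions over a single run with no matches, with and without insertion skipping. It is only a sketch under stated assumptions: `count_work`, `STEP_LOG`, and `SKIP_CUTOFF` are hypothetical names, the step rule is the one described in the PR text, and the real match finders track more state than this.

```python
# Toy simulation of insertion skipping (not zstd's implementation).
STEP_LOG = 8        # assumed: step grows by 1 per 256 unmatched bytes
SKIP_CUTOFF = 2048  # the 2KB cutoff mentioned in the PR description


def count_work(length: int, skip_inserts: bool):
    """Return (searches, inserts) for a run of `length` bytes with no matches."""
    pos = 0
    searches = 0
    inserts = 0
    next_to_insert = 0  # first position not yet inserted into the table
    while pos < length:
        searches += 1
        if skip_inserts and pos >= SKIP_CUTOFF:
            # New behavior past the cutoff: insert only the searched position.
            inserts += 1
        else:
            # Old behavior: catch up on every position up to the searched one.
            inserts += pos + 1 - next_to_insert
        next_to_insert = pos + 1
        pos += 1 + (pos >> STEP_LOG)
    return searches, inserts


if __name__ == "__main__":
    for n in (2048, 1 << 16, 1 << 20):
        print(n, count_work(n, skip_inserts=False), count_work(n, skip_inserts=True))
```

Under this model, a run of at most 2KB does identical work in both modes, while on a longer incompressible run the number of insertions drops from roughly one per byte to one per searched position.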

This PR only adds skipping to greedy, lazy, and lazy2, but does not touch btlazy2.

| Dataset | Level | Compiler     | CSize ∆ | Speed ∆ |
|---------|-------|--------------|---------|---------|
| Random  | 5     | clang-14.0.6 | 0.0%    | +704%   |
| Random  | 5     | gcc-12.2.0   | 0.0%    | +670%   |
| Random  | 7     | clang-14.0.6 | 0.0%    | +679%   |
| Random  | 7     | gcc-12.2.0   | 0.0%    | +657%   |
| Random  | 12    | clang-14.0.6 | 0.0%    | +1355%  |
| Random  | 12    | gcc-12.2.0   | 0.0%    | +1331%  |
| Silesia | 5     | clang-14.0.6 | +0.002% | +0.35%  |
| Silesia | 5     | gcc-12.2.0   | +0.002% | +2.45%  |
| Silesia | 7     | clang-14.0.6 | +0.001% | -1.40%  |
| Silesia | 7     | gcc-12.2.0   | +0.007% | +0.13%  |
| Silesia | 12    | clang-14.0.6 | +0.011% | +22.70% |
| Silesia | 12    | gcc-12.2.0   | +0.011% | -6.68%  |
| Enwik8  | 5     | clang-14.0.6 | 0.0%    | -1.02%  |
| Enwik8  | 5     | gcc-12.2.0   | 0.0%    | +0.34%  |
| Enwik8  | 7     | clang-14.0.6 | 0.0%    | -1.22%  |
| Enwik8  | 7     | gcc-12.2.0   | 0.0%    | -0.72%  |
| Enwik8  | 12    | clang-14.0.6 | 0.0%    | +26.19% |
| Enwik8  | 12    | gcc-12.2.0   | 0.0%    | -5.70%  |

The speed difference for clang at level 12 is real, but is probably caused by some sort of alignment or codegen issue. clang is significantly slower than gcc before this PR, and gets up to parity with it after.

I also measured the ratio difference for the HC match finder, and it looks basically the same as for the row-based match finder. The speedup on random data looks similar, and performance is otherwise about neutral, without the big difference at level 12 for either clang or gcc.

Fixes #3539.

yoniko · 2023-03-15T20:08:39Z

Overall looks good. I wonder if you have a benchmark for compression ratio when a bunch of random data is followed by compressible data?

facebook-github-bot added the CLA Signed label Mar 14, 2023

terrelln force-pushed the 2023-03-09-fix-row-uncompressible-speed branch 3 times, most recently from acca036 to 6ddbf8d, March 14, 2023 23:36

Cyan4973 approved these changes Mar 16, 2023

terrelln merged commit a3c3a38 into facebook:dev Mar 20, 2023

Cyan4973 mentioned this pull request Apr 1, 2023

Preparation for release v1.5.5 #3585

Merged

MOHAMED19OS mentioned this pull request Apr 19, 2023

zstd: add Zstandard v1.5.5 crosstool-ng/crosstool-ng#1936

Closed

rincebrain mentioned this pull request Sep 8, 2023

Bump ZSTD to v1.5.0 openzfs/zfs#12081

Open
