Refactor prefetching for the decoding loop #2547
Conversation
I have observed a similar problem, which caused me to suspend my attempt at manual inlining.
I see the same performance instability on my i9-9900K.
- gcc level 1: -5% decompression speed
- clang level 1: +7%
But on my MacBook and devserver I see approximately neutral performance.
return prefixPos + sequence.matchLength;
}

/* This decoding function employs pre-prefetching
nit: Did you mean prefetching?
yep, good catch, I thought I fixed that ...
Following #2545, I noticed that one field (match) in seq_t is optional, and only used in combination with prefetching. (This may have contributed to the static analyzer's failure to detect correct initialization.) I then wondered if it would be possible to rewrite the code so that this optional part is handled directly by the prefetching code, rather than delegated as an option into ZSTD_decodeSequence(). This resulted in this refactoring exercise, where the prefetching responsibility is better isolated into its own function and ZSTD_decodeSequence() is streamlined to contain strictly sequence decoding operations. Incidentally, due to better code locality, it reduces the need to send information around, leading to a simplified interface and smaller state structures.
Force-pushed from 20d4520 to f543466.
On an indirectly related note: in testing, I noticed a gain from extending the prefetching history from 4 to 8 matches, though I haven't tested enough yet (more files, more settings, more CPUs and compilers) to be sure that the gain is generic.
Branch updated, and rebased on dev.
The total average difference is -0.82%. It is not that much; I'm actually surprised, considering there are a few big outliers which are detrimental to the average. Anyway, in spite of all these minuses, the average, while still negative, is "only" -0.82%. I'm still a bit embarrassed that some scenarios feature important speed losses. Maybe we have to update the way we think about and measure performance differences.
Changed strategy: now unconditionally prefetch the first 2 cache lines, instead of the cache lines corresponding to the first and last bytes of the match. This better corresponds to CPU expectation, which should auto-prefetch the following cache lines on detecting the sequential nature of the read. This is globally positive, by +5%, though exact gains depend on the compiler (from -2% to +15%). The only negative counter-example is gcc-9.
Latest update changes the prefetching strategy. With this change, this PR becomes globally speed positive: the average change is now +0.42%. Changing perspective, I also really like the code simplification this refactoring exercise brings.
This seems to bring an additional ~+1.2% decompression speed on average across 10 compilers x 6 scenarios.
I've updated the alignment rule of the main (fast) decoder hot loop. For reference:
The total average decompression speed gain compared to
As an informational follow-up: there are large differences between compilers.
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
That is a bummer, since this is our main compiler internally. But I think that may just be instruction alignment, which isn't anywhere near as present on Skylake.
That's a good point! edit: this exercise forced me to rescan benchmark measurements,
caveat:
While correctness is fine, I measured a decompression speed regression on both my laptop (clang) and desktop (gcc) (up to -100 MB/s on an i7-9700k @ 5 GHz). This is unexpected, as this PR preserves the same operations, mostly moving some code around.
That being said, I measured this performance regression even when using the non-prefetching mode, and even when disabling it entirely (with -DZSTD_FORCE_DECOMPRESS_SEQUENCES_SHORT), which should not make sense since only the "long" prefetching mode has been modified. This makes me believe that it could be an issue related to instruction alignment, hence only indirectly related to this PR. Nonetheless, it makes me worried about merging this PR as is. A mitigation strategy to harness this instruction alignment issue would be welcome.
@terrelln: this proposed modification may impact the __asm__(".p2align") strategy or values.
edit: actually, the decompression speed difference seems concentrated on the short (no prefetch) version. When I force usage of the long variant, the performance difference becomes much smaller (~3%).
edit 2: measuring cycles spent in DSB (Decoded Stream Buffer) & MITE (Micro-instruction Translation Engine) with perf:
- Decompressing enwik9 compressed with --ultra -22 --long=30, thus ensuring presence of a lot of long-distance matches (dev comparison table not preserved in this extract).
- Decompressing enwik9 compressed with -1, thus testing the "normal" short-distance decoder (dev comparison table not preserved in this extract).
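For reference, front-end delivery cycles like these can be collected with perf roughly as follows. The event names are Intel-specific and vary by CPU generation (check perf list on the target machine), and the file names here are illustrative assumptions.

```shell
# Count cycles in which uops were delivered from the DSB (uop cache)
# vs. the legacy MITE decode path, while decompressing.
perf stat -e idq.dsb_cycles,idq.mite_cycles \
    ./zstd -d -f enwik9.zst -o /dev/null
```

A shift of cycles from DSB to MITE between dev and the PR would support the instruction-alignment hypothesis, since the uop cache is sensitive to where hot code lands.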
edit 3: experimenting with manual instruction alignment and compilation flags on linux + gcc 9.3.0, decompressing the -22 --long=30 payload. Configurations compared: dev, PR, .p2align 5, -march=skylake, and .p2align 5 + -march=skylake (results table not preserved in this extract).