
Refactor prefetching for the decoding loop #2547

Merged
Cyan4973 merged 7 commits into dev from d_prefetch_refactor on May 7, 2021

Conversation

@Cyan4973 Cyan4973 (Contributor) commented Mar 18, 2021

Following #2545,
I noticed that one field (match) in seq_t is optional,
and only used in combination with prefetching.
(This may have contributed to the static analyzer's failure to detect correct initialization.)

I then wondered if it would be possible to rewrite the code
so that this optional part is handled directly by the prefetching code,
since it's the only one needing it,
rather than delegated as an optional workload to the distant ZSTD_decodeSequence().

This resulted in this refactoring exercise,
where the prefetching responsibility is better isolated into its own function,
and ZSTD_decodeSequence() is streamlined to strictly sequence-decoding operations.
Incidentally, due to better code locality,
it reduces the need to send information around,
leading to a simplified interface and smaller state structures.
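
To illustrate the intended shape, here is a minimal sketch (hypothetical names and simplified match addressing that ignores the dictionary/extDict cases, so not the exact zstd internals):

```c
#include <stddef.h>

typedef struct {
    size_t litLength;
    size_t matchLength;
    size_t offset;
    /* note: no optional `match` pointer carried around anymore */
} seq_t;

/* Hypothetical helper: compute the match source address and issue the
 * prefetch. Only the prefetching decoder variant calls it, so the plain
 * ZSTD_decodeSequence() no longer has to fill an optional field. */
static void prefetchMatch(const unsigned char* nextOutput, seq_t seq)
{
    /* the match starts `offset` bytes behind the output position
     * reached once the literals have been copied */
    const unsigned char* const match = nextOutput + seq.litLength - seq.offset;
    __builtin_prefetch(match);
}
```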

caveat :
While correctness is fine,
I measured a decompression speed regression
on both my laptop (clang) and desktop (gcc)
(up to -100 MB/s on an i7-9700K @ 5 GHz).

This is unexpected, as this PR preserves the same operations, mostly moving some code around.

That being said, I measured this performance regression even when using the non-prefetching mode,
and even when disabling it entirely (with -DZSTD_FORCE_DECOMPRESS_SEQUENCES_SHORT),
which makes no sense, since only the "long" prefetching mode has been modified.

This makes me believe that it could be an issue related to instruction alignment,
hence only indirectly related to this PR.

Nonetheless, it makes me worried about merging this PR as is.

A mitigation strategy to tame this instruction alignment issue would be welcome.

@terrelln : this proposed modification may impact the __asm__(".p2align ...") strategy or values.

edit : actually, the decompression speed difference seems concentrated on the short (no prefetch) version. When I force usage of the long variant, the performance difference becomes much smaller (~3%).

edit 2 : measuring cycles spent in the DSB (Decoded Stream Buffer) & MITE (Micro-instruction Translation Engine) with perf:

Decompressing enwik9 compressed with --ultra -22 --long=30, thus ensuring the presence of a lot of long-distance matches:

| branch | DSB | MITE | comment |
| --- | --- | --- | --- |
| dev | 1.46 G | 2.07 G | |
| this PR | 1.81 G | 1.75 G | DSB share improved, yet performance down ~-3% |

Decompressing enwik9 compressed with -1, thus testing the "normal" short-distance decoder:

| branch | DSB | MITE | comment |
| --- | --- | --- | --- |
| dev | 1.48 G | 1.26 G | |
| this PR | 0.98 G | 1.90 G | MITE largely increased: consistent with instruction buffer issues |
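
For reference, a sketch of how such figures can be collected with perf (the event names are the Skylake-family ones, `idq.dsb_cycles` and `idq.mite_cycles`; the exact invocation below is an assumption, not the command used for the tables):

```
perf stat -e idq.dsb_cycles,idq.mite_cycles ./zstd -d -c enwik9.zst > /dev/null
```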

edit 3 : experimenting with manual instruction alignment and compilation flags on linux + gcc 9.3.0:

| Variant | dec. speed (L1) | -22 --long=30 |
| --- | --- | --- |
| dev | 1485 MB/s | 941 MB/s |
| PR | 1400 MB/s | 912 MB/s |
| + .p2align 5 | 1422 MB/s | 907 MB/s |
| + -march=skylake | 1439 MB/s | 868 MB/s |
| + .p2align 5 + -march=skylake | 1490 MB/s | 870 MB/s |
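
For context, the alignment trick referenced above looks roughly like this (a sketch of the general technique; the exact directives, values, and placement inside zstd's decoder are precisely what is being tuned here):

```c
/* GCC/Clang on x86-64: .p2align N pads the instruction stream so the
 * next instruction (here, the hot loop entry) lands on a 2^N-byte
 * boundary; N = 5 gives 32-byte alignment. */
static void decodeLoop(int nbSeq)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__(".p2align 5");
#endif
    for ( ; nbSeq > 0; nbSeq--) {
        /* hot sequence-decoding work */
    }
}
```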

@ghost ghost commented Mar 19, 2021

> This is unexpected, as this PR preserves the same operations, mostly moving some code around.

I have observed a similar problem, which caused me to suspend my attempt at manual inlining.
See #2481 (comment)

@terrelln terrelln (Contributor) left a comment:

I see the same performance instability on my i9-9900K.

  • gcc level 1: -5% decompression speed
  • clang level 1: +7%

But on my Macbook and devserver I see approximately neutral performance.

```c
    return prefixPos + sequence.matchLength;
}

/* This decoding function employs pre-prefetching
```
Contributor:

nit: Did you mean prefetching?

Contributor Author:
yep, good catch, I thought I fixed that ...

@Cyan4973 (Contributor Author):

On an indirectly related note :

in testing, I noticed that extending the prefetching history from 4 to 8 matches
was a performance win of as much as 10%
when decompressing enwik9 compressed with --ultra -22 --long=30,
on an i7-9700K, compiled with gcc v9.3.0.

I haven't tested enough yet (more files, more settings, more cpus and compilers) to be sure that the gain is generic,
though that's a reasonable expectation.
I felt it was a different topic, one that would be worth its own PR & discussion.
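
For context, a minimal sketch of the prefetch-pipeline idea being tuned here (the names and the ring-buffer shape are illustrative assumptions, not the exact zstd internals): sequences are decoded a few steps ahead into a small buffer, so the match data prefetched for a sequence has time to arrive in cache before that sequence is actually executed.

```c
#include <stddef.h>

typedef struct { size_t litLength, matchLength, offset; } seq_t;

/* Prefetch history: widening this window from 4 to 8 is the change
 * measured above. A power-of-2 size keeps the ring index cheap. */
#define STORED_SEQS       8
#define STORED_SEQS_MASK  (STORED_SEQS - 1)

/* Sequence seqNb is decoded and its match prefetched, while sequence
 * (seqNb - STORED_SEQS + 1), whose data should be in cache by now,
 * is the one actually executed. */
static seq_t sequences[STORED_SEQS];
```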

@Cyan4973 Cyan4973 (Contributor Author) commented May 7, 2021

Branch updated, and rebased on dev, so that the two can be directly compared.
Here is a fairly thorough benchmark comparison, measuring different compilers, files and compression levels (decompression speed, in MB/s):

dev

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1585 | 1436 | 1390 | 1157 | 983 | 982 |
| gcc-8 | 1542 | 1368 | 1362 | 1121 | 1007 | 1004 |
| gcc-9 | 1677 | 1495 | 1471 | 1200 | 1037 | 1036 |
| gcc-10 | 1688 | 1501 | 1483 | 1208 | 997 | 995 |
| clang-6.0 | 1480 | 1293 | 1336 | 1092 | 927 | 929 |
| clang-7 | 1560 | 1389 | 1395 | 1146 | 958 | 957 |
| clang-8 | 1541 | 1376 | 1326 | 1067 | 892 | 893 |
| clang-9 | 1546 | 1365 | 1363 | 1106 | 935 | 932 |
| clang-10 | 1510 | 1328 | 1334 | 1084 | 950 | 954 |
| clang-11 | 1522 | 1337 | 1328 | 1076 | 954 | 952 |

d_prefetch_refactor

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1607 | 1433 | 1403 | 1148 | 962 | 967 |
| gcc-8 | 1559 | 1345 | 1358 | 1086 | 863 | 862 |
| gcc-9 | 1589 | 1418 | 1358 | 1112 | 991 | 990 |
| gcc-10 | 1605 | 1424 | 1374 | 1120 | 898 | 897 |
| clang-6.0 | 1541 | 1345 | 1381 | 1119 | 999 | 999 |
| clang-7 | 1524 | 1352 | 1356 | 1117 | 965 | 963 |
| clang-8 | 1566 | 1388 | 1360 | 1090 | 917 | 917 |
| clang-9 | 1571 | 1395 | 1377 | 1111 | 944 | 942 |
| clang-10 | 1569 | 1389 | 1371 | 1114 | 941 | 938 |
| clang-11 | 1549 | 1355 | 1358 | 1099 | 977 | 976 |

COMPARISON

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1.39% | -0.21% | 0.94% | -0.78% | -2.14% | -1.53% |
| gcc-8 | 1.10% | -1.68% | -0.29% | -3.12% | -14.30% | -14.14% |
| gcc-9 | -5.25% | -5.15% | -7.68% | -7.33% | -4.44% | -4.44% |
| gcc-10 | -4.92% | -5.13% | -7.35% | -7.28% | -9.93% | -9.85% |
| clang-6.0 | 4.12% | 4.02% | 3.37% | 2.47% | 7.77% | 7.53% |
| clang-7 | -2.31% | -2.66% | -2.80% | -2.53% | 0.73% | 0.63% |
| clang-8 | 1.62% | 0.87% | 2.56% | 2.16% | 2.80% | 2.69% |
| clang-9 | 1.62% | 2.20% | 1.03% | 0.45% | 0.96% | 1.07% |
| clang-10 | 3.91% | 4.59% | 2.77% | 2.77% | -0.95% | -1.68% |
| clang-11 | 1.77% | 1.35% | 2.26% | 2.14% | 2.41% | 2.52% |
| average | 0.31% | -0.18% | -0.52% | -1.11% | -1.71% | -1.72% |

The total average difference is -0.82%, with d_prefetch_refactor globally slower than dev.

That's not much; I'm actually surprised, considering there are a few big outliers detrimental to d_prefetch_refactor,
such as gcc-8 on the long scenarios (about -14%),
or gcc-10 on the same scenarios (about -10%).
In spite of this, the average of this category is "only" ~-1.7%.
Also notable: gcc-9, which was my baseline, is substantially negative in all categories.

Anyway, in spite of all these minuses, the average, while still negative, is "only" -0.82%.
This is compensated by a few notable wins (like clang-6.0) and a lot of mostly neutral results.
This seems to confirm the hypothesis that the speed impact is mostly noise.

I'm still a bit embarrassed that some scenarios feature important speed losses,
but given the average, it's not a strong position.

Maybe we have to update the way we think about and measure performance differences.
A meta-lesson is that it's not enough to observe a sensible speed win on one compiler and one scenario;
the claim should instead be validated across a wider range of scenarios.

changed strategy :
now unconditionally prefetch the first 2 cache lines,
instead of the cache lines corresponding to the first and last bytes of the match.

This better corresponds to what the CPU expects:
it should auto-prefetch the following cache lines upon detecting the sequential nature of the read.

This is globally positive, by +5%,
though exact gains depend on the compiler (from -2% to +15%).
The only negative counter-example is gcc-9.
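
A minimal sketch of the new strategy (assuming a 64-byte cache line and the GCC/Clang `__builtin_prefetch`; zstd wraps the prefetch in its own portability macros):

```c
/* Touch only the first two cache lines of the match. The hardware
 * prefetcher is expected to stream in the rest once it detects the
 * sequential read pattern; prefetching the last byte of the match,
 * as before, gave it no such hint. */
#define CACHELINE_SIZE 64  /* assumed */

static void prefetchMatchStart(const void* match)
{
    __builtin_prefetch(match);                                /* 1st cache line */
    __builtin_prefetch((const char*)match + CACHELINE_SIZE);  /* 2nd cache line */
}
```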
@Cyan4973 Cyan4973 (Contributor Author) commented May 7, 2021

Latest update changes the prefetching strategy,
resulting in an average speed gain of +5% (measured across 10 compilers)
for decoding frames with many large offsets.

With this change, this PR becomes globally speed-positive,
across multiple compilers and multiple decoding scenarios,
as detailed below.

d_prefetch_refactor:64 (MB/s)

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1605 | 1432 | 1386 | 1126 | 1023 | 1020 |
| gcc-8 | 1558 | 1345 | 1347 | 1072 | 995 | 993 |
| gcc-9 | 1588 | 1417 | 1361 | 1107 | 974 | 971 |
| gcc-10 | 1601 | 1421 | 1356 | 1095 | 908 | 905 |
| clang-6.0 | 1538 | 1338 | 1366 | 1088 | 1024 | 1021 |
| clang-7 | 1520 | 1346 | 1337 | 1093 | 1005 | 1000 |
| clang-8 | 1564 | 1383 | 1351 | 1076 | 1007 | 1007 |
| clang-9 | 1574 | 1395 | 1361 | 1090 | 1023 | 1021 |
| clang-10 | 1565 | 1386 | 1355 | 1092 | 1010 | 1005 |
| clang-11 | 1549 | 1358 | 1345 | 1078 | 996 | 994 |

Comparison with dev

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1.26% | -0.28% | -0.29% | -2.68% | 4.07% | 3.87% |
| gcc-8 | 1.04% | -1.68% | -1.10% | -4.37% | -1.19% | -1.10% |
| gcc-9 | -5.31% | -5.22% | -7.48% | -7.75% | -6.08% | -6.27% |
| gcc-10 | -5.15% | -5.33% | -8.56% | -9.35% | -8.93% | -9.05% |
| clang-6.0 | 3.92% | 3.48% | 2.25% | -0.37% | 10.46% | 9.90% |
| clang-7 | -2.56% | -3.10% | -4.16% | -4.62% | 4.91% | 4.49% |
| clang-8 | 1.49% | 0.51% | 1.89% | 0.84% | 12.89% | 12.77% |
| clang-9 | 1.81% | 2.20% | -0.15% | -1.45% | 9.41% | 9.55% |
| clang-10 | 3.64% | 4.37% | 1.57% | 0.74% | 6.32% | 5.35% |
| clang-11 | 1.77% | 1.57% | 1.28% | 0.19% | 4.40% | 4.41% |
| average | 0.19% | -0.35% | -1.47% | -2.88% | 3.63% | 3.39% |

The average change is now positive, by +0.42%.
We still have the issue that gcc-9 and gcc-10 are globally negative, which is a bummer.

Changing perspective, I also really like the code simplification this refactoring exercise brings.
It feels like it corrects a misplaced responsibility that had introduced unwarranted complexity in the wrong place.
That alone seems like a good reason to proceed.

This seems to bring an additional ~+1.2% decompression speed
on average across 10 compilers x 6 scenarios.
@Cyan4973 Cyan4973 (Contributor Author) commented May 7, 2021

I've updated the alignment rule of the main (fast) decoder hot loop.
This seems to lead to an average speed improvement of +1.24%.
It doesn't change much for gcc-9 and gcc-10, which were my initial targets,
but it's a massive improvement for gcc-7 and gcc-8.
The only compiler that doesn't like the new alignment is clang-8, and not by much (~-2%).

For reference :

d_prefetch_refactor:align6+5+4 (MB/s)

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1671 | 1518 | 1459 | 1209 | 1033 | 1029 |
| gcc-8 | 1599 | 1414 | 1410 | 1160 | 1013 | 1011 |
| gcc-9 | 1588 | 1416 | 1358 | 1108 | 972 | 969 |
| gcc-10 | 1603 | 1424 | 1370 | 1118 | 921 | 917 |
| clang-6.0 | 1542 | 1345 | 1380 | 1117 | 1037 | 1030 |
| clang-7 | 1495 | 1334 | 1328 | 1100 | 1017 | 1011 |
| clang-8 | 1539 | 1360 | 1318 | 1056 | 996 | 993 |
| clang-9 | 1588 | 1408 | 1397 | 1132 | 1038 | 1033 |
| clang-10 | 1587 | 1410 | 1397 | 1132 | 1022 | 1013 |
| clang-11 | 1511 | 1334 | 1350 | 1106 | 1006 | 1004 |

Comparison with dev

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 5.43% | 5.71% | 4.96% | 4.49% | 5.09% | 4.79% |
| gcc-8 | 3.70% | 3.36% | 3.52% | 3.48% | 0.60% | 0.70% |
| gcc-9 | -5.31% | -5.28% | -7.68% | -7.67% | -6.27% | -6.47% |
| gcc-10 | -5.04% | -5.13% | -7.62% | -7.45% | -7.62% | -7.84% |
| clang-6.0 | 4.19% | 4.02% | 3.29% | 2.29% | 11.87% | 10.87% |
| clang-7 | -4.17% | -3.96% | -4.80% | -4.01% | 6.16% | 5.64% |
| clang-8 | -0.13% | -1.16% | -0.60% | -1.03% | 11.66% | 11.20% |
| clang-9 | 2.72% | 3.15% | 2.49% | 2.35% | 11.02% | 10.84% |
| clang-10 | 5.10% | 6.17% | 4.72% | 4.43% | 7.58% | 6.18% |
| clang-11 | -0.72% | -0.22% | 1.66% | 2.79% | 5.45% | 5.46% |
| average | 0.58% | 0.67% | -0.01% | -0.03% | 4.55% | 4.14% |

The total average decompression speed gain compared to dev is now ~+1.65%.

@Cyan4973 Cyan4973 merged commit 5b6d38a into dev May 7, 2021
@Cyan4973 Cyan4973 (Contributor Author) commented May 7, 2021

As an informational follow-up :
this new version improves decompression speed quite considerably over v1.4.9,
by an average of +5.5% over all compilers and scenarios.
--long mode gets most of the benefits, with average speed gains of +14%.

However, there are large differences between compilers.
The overall winner here is clang, with average speed gains in the +10% range, and up to +20% for --long mode.
On the other hand, gcc tends to lose decompression speed with this version,
the worst offenders being gcc-9 and gcc-10, which lose a substantial ~7% speed on "normal" (non --long) scenarios.

@ghost ghost commented May 8, 2021

gcc has a hot function attribute; I don't know if it's useful.

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html

> The hot attribute on a function is used to inform the compiler that the function
> is a hot spot of the compiled program. The function is optimized more aggressively
> and on many targets it is placed into a special subsection of the text section so all
> hot functions appear close together, improving locality.
>
> When profile feedback is available, via -fprofile-use, hot functions are automatically
> detected and this attribute is ignored.
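
For illustration, a minimal sketch of how the attribute is applied (the function below is a hypothetical stand-in, not zstd code; per the follow-up below, zstd did not end up adopting it):

```c
/* GCC/Clang: mark the function as a hot spot, so it is optimized more
 * aggressively and grouped with other hot functions for locality. */
__attribute__((hot))
static size_t decode_hot_loop(const unsigned char* src, size_t n)
{
    size_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += src[i];   /* stand-in for the real decoding work */
    return sum;
}
```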

@terrelln terrelln (Contributor) commented May 8, 2021

> the worst offenders being gcc-9 and gcc-10, which lose a substantial ~7% speed on "normal" (non --long) scenarios.

That is a bummer, since this is our main compiler internally. But I think that may just be instruction alignment, which isn't anywhere near as present on skylake.

@Cyan4973 Cyan4973 (Contributor Author) commented May 8, 2021

> gcc has a hot function attribute; I don't know if it's useful.

That's a good point !
I tried this __attribute__((hot)) capability with gcc and clang.
Unfortunately, the results were unconvincing:
average decompression speed was globally unchanged,
most differences were < 2%, close to noise level,
with a few occasional spots at +5% / -5% cancelling each other out.

edit : this exercise forced me to rescan benchmark measurements,
and it turns out that gcc-9 and gcc-10 were receiving lower figures than they should.
It doesn't fundamentally change the picture,
but their speed is now estimated at ~5-6% slower (compared to v1.4.9), instead of 7%.

@Cyan4973 Cyan4973 deleted the d_prefetch_refactor branch December 9, 2021 00:14