
Refactor prefetching for the decoding loop #2547

Merged
Cyan4973 merged 7 commits into dev from d_prefetch_refactor on May 7, 2021

Conversation

@Cyan4973 Cyan4973 (Contributor) commented Mar 18, 2021

Following #2545,
I noticed that one field (match) in seq_t is optional,
and only used in combination with prefetching.
(This may have contributed to the static analyzer's failure to detect correct initialization.)

I then wondered if it would be possible to rewrite the code
so that this optional part is handled directly by the prefetching code,
since it's the only one needing it,
rather than delegated as an optional workload to the distant ZSTD_decodeSequence().

This resulted in this refactoring exercise,
where the prefetching responsibility is better isolated into its own function,
and ZSTD_decodeSequence() is streamlined to strictly sequence-decoding operations.
Incidentally, due to better code locality,
it reduces the need to send information around,
leading to a simplified interface and smaller state structures.
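
To illustrate the intended shape, here is a minimal sketch (hypothetical names and simplified match addressing that ignores the dictionary/extDict cases, so not the exact zstd internals):

```c
#include <stddef.h>

typedef struct {
    size_t litLength;
    size_t matchLength;
    size_t offset;
    /* note: no optional `match` pointer carried around anymore */
} seq_t;

/* Hypothetical helper: compute the match source address and issue the
 * prefetch. Only the prefetching decoder variant calls it, so the plain
 * ZSTD_decodeSequence() no longer has to fill an optional field. */
static void prefetchMatch(const unsigned char* nextOutput, seq_t seq)
{
    /* the match starts `offset` bytes behind the output position
     * reached once the literals have been copied */
    const unsigned char* const match = nextOutput + seq.litLength - seq.offset;
    __builtin_prefetch(match);
}
```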

caveat :
While correctness is fine,
I measured a decompression speed regression
on both my laptop (clang) and desktop (gcc)
(up to -100 MB/s on an i7-9700K @ 5 GHz).

This is unexpected, as this PR preserves the same operations, mostly moving some code around.

That being said, I measured this performance regression even when using the non-prefetching mode,
and even when disabling it entirely (with -DZSTD_FORCE_DECOMPRESS_SEQUENCES_SHORT),
which makes no sense, since only the "long" prefetching mode has been modified.

This makes me believe that it could be an issue related to instruction alignment,
hence only indirectly related to this PR.

Nonetheless, it makes me worried about merging this PR as is.

A mitigation strategy to tame this instruction alignment issue would be welcome.

@terrelln : this proposed modification may impact the __asm__(".p2align ...") strategy or values.

edit : actually, the decompression speed difference seems concentrated on the short (no prefetch) version. When I force usage of the long variant, the performance difference becomes much smaller (~3%).

edit 2 : measuring cycles spent in the DSB (Decoded Stream Buffer) & MITE (Micro-instruction Translation Engine) with perf:

Decompressing enwik9 compressed with --ultra -22 --long=30, thus ensuring the presence of a lot of long-distance matches:

| branch | DSB | MITE | comment |
| --- | --- | --- | --- |
| dev | 1.46 G | 2.07 G | |
| this PR | 1.81 G | 1.75 G | DSB share improved, yet performance down ~-3% |

Decompressing enwik9 compressed with -1, thus testing the "normal" short-distance decoder:

| branch | DSB | MITE | comment |
| --- | --- | --- | --- |
| dev | 1.48 G | 1.26 G | |
| this PR | 0.98 G | 1.90 G | MITE largely increased: consistent with instruction buffer issues |
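
For reference, a sketch of how such figures can be collected with perf (the event names are the Skylake-family ones, `idq.dsb_cycles` and `idq.mite_cycles`; the exact invocation below is an assumption, not the command used for the tables):

```
perf stat -e idq.dsb_cycles,idq.mite_cycles ./zstd -d -c enwik9.zst > /dev/null
```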

edit 3 : experimenting with manual instruction alignment and compilation flags on linux + gcc 9.3.0:

| Variant | dec. speed (L1) | -22 --long=30 |
| --- | --- | --- |
| dev | 1485 MB/s | 941 MB/s |
| PR | 1400 MB/s | 912 MB/s |
| + .p2align 5 | 1422 MB/s | 907 MB/s |
| + -march=skylake | 1439 MB/s | 868 MB/s |
| + .p2align 5 + -march=skylake | 1490 MB/s | 870 MB/s |
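
For context, the alignment trick referenced above looks roughly like this (a sketch of the general technique; the exact directives, values, and placement inside zstd's decoder are precisely what is being tuned here):

```c
/* GCC/Clang on x86-64: .p2align N pads the instruction stream so the
 * next instruction (here, the hot loop entry) lands on a 2^N-byte
 * boundary; N = 5 gives 32-byte alignment. */
static void decodeLoop(int nbSeq)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__(".p2align 5");
#endif
    for ( ; nbSeq > 0; nbSeq--) {
        /* hot sequence-decoding work */
    }
}
```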

@ghost ghost commented Mar 19, 2021

> This is unexpected, as this PR preserves the same operations, mostly moving some code around.

I have observed a similar problem, which caused me to suspend my attempt at manual inlining.
See #2481 (comment)

@terrelln terrelln (Contributor) left a comment:

I see the same performance instability on my i9-9900K.

  • gcc level 1: -5% decompression speed
  • clang level 1: +7%

But on my Macbook and devserver I see approximately neutral performance.

```c
    return prefixPos + sequence.matchLength;
}

/* This decoding function employs pre-prefetching
```
Contributor:

nit: Did you mean prefetching?

Contributor Author:
yep, good catch, I thought I fixed that ...

@Cyan4973 (Contributor Author):

On an indirectly related note :

in testing, I noticed that extending the prefetching history from 4 to 8 matches
was a performance win of as much as 10%
when decompressing enwik9 compressed with --ultra -22 --long=30,
on an i7-9700K, compiled with gcc v9.3.0.

I haven't tested enough yet (more files, more settings, more cpus and compilers) to be sure that the gain is generic,
though that's a reasonable expectation.
I felt it was a different topic, one that would be worth its own PR & discussion.
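
For context, a minimal sketch of the prefetch-pipeline idea being tuned here (the names and the ring-buffer shape are illustrative assumptions, not the exact zstd internals): sequences are decoded a few steps ahead into a small buffer, so the match data prefetched for a sequence has time to arrive in cache before that sequence is actually executed.

```c
#include <stddef.h>

typedef struct { size_t litLength, matchLength, offset; } seq_t;

/* Prefetch history: widening this window from 4 to 8 is the change
 * measured above. A power-of-2 size keeps the ring index cheap. */
#define STORED_SEQS       8
#define STORED_SEQS_MASK  (STORED_SEQS - 1)

/* Sequence seqNb is decoded and its match prefetched, while sequence
 * (seqNb - STORED_SEQS + 1), whose data should be in cache by now,
 * is the one actually executed. */
static seq_t sequences[STORED_SEQS];
```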

@Cyan4973 Cyan4973 (Contributor Author) commented May 7, 2021

Branch updated, and rebased on dev, so that the two can be directly compared.
Here is a fairly thorough benchmark comparison, measuring different compilers, files and compression levels (decompression speed, in MB/s):

dev

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1585 | 1436 | 1390 | 1157 | 983 | 982 |
| gcc-8 | 1542 | 1368 | 1362 | 1121 | 1007 | 1004 |
| gcc-9 | 1677 | 1495 | 1471 | 1200 | 1037 | 1036 |
| gcc-10 | 1688 | 1501 | 1483 | 1208 | 997 | 995 |
| clang-6.0 | 1480 | 1293 | 1336 | 1092 | 927 | 929 |
| clang-7 | 1560 | 1389 | 1395 | 1146 | 958 | 957 |
| clang-8 | 1541 | 1376 | 1326 | 1067 | 892 | 893 |
| clang-9 | 1546 | 1365 | 1363 | 1106 | 935 | 932 |
| clang-10 | 1510 | 1328 | 1334 | 1084 | 950 | 954 |
| clang-11 | 1522 | 1337 | 1328 | 1076 | 954 | 952 |

d_prefetch_refactor

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1607 | 1433 | 1403 | 1148 | 962 | 967 |
| gcc-8 | 1559 | 1345 | 1358 | 1086 | 863 | 862 |
| gcc-9 | 1589 | 1418 | 1358 | 1112 | 991 | 990 |
| gcc-10 | 1605 | 1424 | 1374 | 1120 | 898 | 897 |
| clang-6.0 | 1541 | 1345 | 1381 | 1119 | 999 | 999 |
| clang-7 | 1524 | 1352 | 1356 | 1117 | 965 | 963 |
| clang-8 | 1566 | 1388 | 1360 | 1090 | 917 | 917 |
| clang-9 | 1571 | 1395 | 1377 | 1111 | 944 | 942 |
| clang-10 | 1569 | 1389 | 1371 | 1114 | 941 | 938 |
| clang-11 | 1549 | 1355 | 1358 | 1099 | 977 | 976 |

COMPARISON

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1.39% | -0.21% | 0.94% | -0.78% | -2.14% | -1.53% |
| gcc-8 | 1.10% | -1.68% | -0.29% | -3.12% | -14.30% | -14.14% |
| gcc-9 | -5.25% | -5.15% | -7.68% | -7.33% | -4.44% | -4.44% |
| gcc-10 | -4.92% | -5.13% | -7.35% | -7.28% | -9.93% | -9.85% |
| clang-6.0 | 4.12% | 4.02% | 3.37% | 2.47% | 7.77% | 7.53% |
| clang-7 | -2.31% | -2.66% | -2.80% | -2.53% | 0.73% | 0.63% |
| clang-8 | 1.62% | 0.87% | 2.56% | 2.16% | 2.80% | 2.69% |
| clang-9 | 1.62% | 2.20% | 1.03% | 0.45% | 0.96% | 1.07% |
| clang-10 | 3.91% | 4.59% | 2.77% | 2.77% | -0.95% | -1.68% |
| clang-11 | 1.77% | 1.35% | 2.26% | 2.14% | 2.41% | 2.52% |
| average | 0.31% | -0.18% | -0.52% | -1.11% | -1.71% | -1.72% |

The total average difference is -0.82%, with d_prefetch_refactor globally slower than dev.

That's not much; I'm actually surprised, considering there are a few big outliers detrimental to d_prefetch_refactor,
such as gcc-8 on the long scenarios (about -14%),
or gcc-10 on the same scenarios (about -10%).
In spite of this, the average of this category is "only" ~-1.7%.
Also notable: gcc-9, which was my baseline, is substantially negative in all categories.

Anyway, in spite of all these minuses, the average, while still negative, is "only" -0.82%.
This is compensated by a few notable wins (like clang-6.0) and a lot of mostly neutral results.
This seems to confirm the hypothesis that the speed impact is mostly noise.

I'm still a bit embarrassed that some scenarios feature important speed losses,
but given the average, it's not a strong position.

Maybe we have to update the way we think about and measure performance differences.
A meta-lesson is that it's not enough to observe a sensible speed win on one compiler and one scenario;
the claim should instead be validated across a wider range of scenarios.

changed strategy :
now unconditionally prefetch the first 2 cache lines,
instead of the cache lines corresponding to the first and last bytes of the match.

This better corresponds to what the CPU expects:
it should auto-prefetch the following cache lines upon detecting the sequential nature of the read.

This is globally positive, by +5%,
though exact gains depend on the compiler (from -2% to +15%).
The only negative counter-example is gcc-9.
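
A minimal sketch of the new strategy (assuming a 64-byte cache line and the GCC/Clang `__builtin_prefetch`; zstd wraps the prefetch in its own portability macros):

```c
/* Touch only the first two cache lines of the match. The hardware
 * prefetcher is expected to stream in the rest once it detects the
 * sequential read pattern; prefetching the last byte of the match,
 * as before, gave it no such hint. */
#define CACHELINE_SIZE 64  /* assumed */

static void prefetchMatchStart(const void* match)
{
    __builtin_prefetch(match);                                /* 1st cache line */
    __builtin_prefetch((const char*)match + CACHELINE_SIZE);  /* 2nd cache line */
}
```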
@Cyan4973 Cyan4973 (Contributor Author) commented May 7, 2021

Latest update changes the prefetching strategy,
resulting in an average speed gain of +5% (measured across 10 compilers)
for decoding frames with many large offsets.

With this change, this PR becomes globally speed-positive,
across multiple compilers and multiple decoding scenarios,
as detailed below.

d_prefetch_refactor:64 (MB/s)

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1605 | 1432 | 1386 | 1126 | 1023 | 1020 |
| gcc-8 | 1558 | 1345 | 1347 | 1072 | 995 | 993 |
| gcc-9 | 1588 | 1417 | 1361 | 1107 | 974 | 971 |
| gcc-10 | 1601 | 1421 | 1356 | 1095 | 908 | 905 |
| clang-6.0 | 1538 | 1338 | 1366 | 1088 | 1024 | 1021 |
| clang-7 | 1520 | 1346 | 1337 | 1093 | 1005 | 1000 |
| clang-8 | 1564 | 1383 | 1351 | 1076 | 1007 | 1007 |
| clang-9 | 1574 | 1395 | 1361 | 1090 | 1023 | 1021 |
| clang-10 | 1565 | 1386 | 1355 | 1092 | 1010 | 1005 |
| clang-11 | 1549 | 1358 | 1345 | 1078 | 996 | 994 |

Comparison with dev

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1.26% | -0.28% | -0.29% | -2.68% | 4.07% | 3.87% |
| gcc-8 | 1.04% | -1.68% | -1.10% | -4.37% | -1.19% | -1.10% |
| gcc-9 | -5.31% | -5.22% | -7.48% | -7.75% | -6.08% | -6.27% |
| gcc-10 | -5.15% | -5.33% | -8.56% | -9.35% | -8.93% | -9.05% |
| clang-6.0 | 3.92% | 3.48% | 2.25% | -0.37% | 10.46% | 9.90% |
| clang-7 | -2.56% | -3.10% | -4.16% | -4.62% | 4.91% | 4.49% |
| clang-8 | 1.49% | 0.51% | 1.89% | 0.84% | 12.89% | 12.77% |
| clang-9 | 1.81% | 2.20% | -0.15% | -1.45% | 9.41% | 9.55% |
| clang-10 | 3.64% | 4.37% | 1.57% | 0.74% | 6.32% | 5.35% |
| clang-11 | 1.77% | 1.57% | 1.28% | 0.19% | 4.40% | 4.41% |
| average | 0.19% | -0.35% | -1.47% | -2.88% | 3.63% | 3.39% |

The average change is now positive, by +0.42%.
We still have the issue that gcc-9 and gcc-10 are globally negative, which is a bummer.

Changing perspective, I also really like the code simplification this refactoring exercise brings.
It feels like it corrects a misplaced responsibility that had introduced unwarranted complexity in the wrong place.
That alone seems like a good reason to proceed.

This seems to bring an additional ~+1.2% decompression speed
on average across 10 compilers x 6 scenarios.
@Cyan4973 Cyan4973 (Contributor Author) commented May 7, 2021

I've updated the alignment rule of the main (fast) decoder hot loop.
This seems to lead to an average speed improvement of +1.24%.
It doesn't change much for gcc-9 and gcc-10, which were my initial targets,
but it's a massive improvement for gcc-7 and gcc-8.
The only compiler that doesn't like the new alignment is clang-8, and not by much (~-2%).

For reference :

d_prefetch_refactor:align6+5+4 (MB/s)

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 1671 | 1518 | 1459 | 1209 | 1033 | 1029 |
| gcc-8 | 1599 | 1414 | 1410 | 1160 | 1013 | 1011 |
| gcc-9 | 1588 | 1416 | 1358 | 1108 | 972 | 969 |
| gcc-10 | 1603 | 1424 | 1370 | 1118 | 921 | 917 |
| clang-6.0 | 1542 | 1345 | 1380 | 1117 | 1037 | 1030 |
| clang-7 | 1495 | 1334 | 1328 | 1100 | 1017 | 1011 |
| clang-8 | 1539 | 1360 | 1318 | 1056 | 996 | 993 |
| clang-9 | 1588 | 1408 | 1397 | 1132 | 1038 | 1033 |
| clang-10 | 1587 | 1410 | 1397 | 1132 | 1022 | 1013 |
| clang-11 | 1511 | 1334 | 1350 | 1106 | 1006 | 1004 |

Comparison with dev

| Compiler | silesia-L1 | enwik8-L1 | silesia-L5 | enwik8-L5 | enwik9-L22 | enwik9-long30 |
| --- | --- | --- | --- | --- | --- | --- |
| gcc-7 | 5.43% | 5.71% | 4.96% | 4.49% | 5.09% | 4.79% |
| gcc-8 | 3.70% | 3.36% | 3.52% | 3.48% | 0.60% | 0.70% |
| gcc-9 | -5.31% | -5.28% | -7.68% | -7.67% | -6.27% | -6.47% |
| gcc-10 | -5.04% | -5.13% | -7.62% | -7.45% | -7.62% | -7.84% |
| clang-6.0 | 4.19% | 4.02% | 3.29% | 2.29% | 11.87% | 10.87% |
| clang-7 | -4.17% | -3.96% | -4.80% | -4.01% | 6.16% | 5.64% |
| clang-8 | -0.13% | -1.16% | -0.60% | -1.03% | 11.66% | 11.20% |
| clang-9 | 2.72% | 3.15% | 2.49% | 2.35% | 11.02% | 10.84% |
| clang-10 | 5.10% | 6.17% | 4.72% | 4.43% | 7.58% | 6.18% |
| clang-11 | -0.72% | -0.22% | 1.66% | 2.79% | 5.45% | 5.46% |
| average | 0.58% | 0.67% | -0.01% | -0.03% | 4.55% | 4.14% |

The total average decompression speed gain compared to dev is now ~+1.65%.

@Cyan4973 Cyan4973 merged commit 5b6d38a into dev May 7, 2021
@Cyan4973 Cyan4973 (Contributor Author) commented May 7, 2021

As an informational follow-up :
this new version improves decompression speed quite considerably over v1.4.9,
by an average of +5.5% over all compilers and scenarios.
--long mode gets most of the benefits, with average speed gains of +14%.

However, there are large differences between compilers.
The overall winner here is clang, with average speed gains in the +10% range, and up to +20% for --long mode.
On the other hand, gcc tends to lose decompression speed with this version,
the worst offenders being gcc-9 and gcc-10, which lose a substantial ~7% speed on "normal" (non --long) scenarios.

@ghost ghost commented May 8, 2021

gcc has a hot function attribute; I don't know if it's useful.

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html

> The hot attribute on a function is used to inform the compiler that the function
> is a hot spot of the compiled program. The function is optimized more aggressively
> and on many targets it is placed into a special subsection of the text section so all
> hot functions appear close together, improving locality.
>
> When profile feedback is available, via -fprofile-use, hot functions are automatically
> detected and this attribute is ignored.
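
For illustration, a minimal sketch of how the attribute is applied (the function below is a hypothetical stand-in, not zstd code; per the follow-up below, zstd did not end up adopting it):

```c
/* GCC/Clang: mark the function as a hot spot, so it is optimized more
 * aggressively and grouped with other hot functions for locality. */
__attribute__((hot))
static size_t decode_hot_loop(const unsigned char* src, size_t n)
{
    size_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += src[i];   /* stand-in for the real decoding work */
    return sum;
}
```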

@terrelln terrelln (Contributor) commented May 8, 2021

> the worst offenders being gcc-9 and gcc-10, which lose a substantial ~7% speed on "normal" (non --long) scenarios.

That is a bummer, since this is our main compiler internally. But I think that may just be instruction alignment, which isn't anywhere near as present on skylake.

@Cyan4973 Cyan4973 (Contributor Author) commented May 8, 2021

> gcc has a hot function attribute; I don't know if it's useful.

That's a good point !
I tried this __attribute__((hot)) capability with gcc and clang.
Unfortunately, the results were unconvincing:
average decompression speed was globally unchanged,
most differences were < 2%, close to noise level,
with a few occasional spots at +5% / -5% cancelling each other out.

edit : this exercise forced me to rescan benchmark measurements,
and it turns out that gcc-9 and gcc-10 were receiving lower figures than they should.
It doesn't fundamentally change the picture,
but their speed is now estimated at ~5-6% slower (compared to v1.4.9), instead of 7%.

@Cyan4973 Cyan4973 deleted the d_prefetch_refactor branch December 9, 2021 00:14