[huff0] Add x86 specialisation of Decode4X #512

WojciechMula · 2022-03-03T15:55:02Z

Hi, first of all, thank you for such a great library! I have been working on speeding up the Zstd decompression, mainly by porting hot loops into the assembly. This is the first PR, that's pretty small and I'd like to make it an opportunity to discuss code shape. Is it something acceptable, or not.

I'm marking it as a draft because not all tests in Zstd pass now; I branched some time ago and seems there are were some changes I have to investigate.

Any way, below is comparison of decompressing speed for Zstd after applying the patch. Benchmarks were run on an Ice Lake machine.

benchmark                                                                 old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            5064292       5055128       -0.18%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        924146        889296        -3.77%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           12552928      12475253      -0.62%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         3884720       3815638       -1.78%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          5320410       5307378       -0.24%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1458895       1419301       -2.71%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       218670        219073        +0.18%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       129945        121948        -6.15%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             14234754      13921832      -2.20%
BenchmarkDecoder_DecoderSmall/html.zst-16                                 1028808       1002782       -2.53%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        83589         77477         -7.31%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               605687        603714        -0.33%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           111881        106144        -5.13%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              1445193       1424825       -1.41%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            481721        470827        -2.26%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             643764        641131        -0.41%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                234945        233164        -0.76%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          23411         23627         +0.92%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          11293         11302         +0.08%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                1644661       1592756       -3.16%
BenchmarkDecoder_DecodeAll/html.zst-16                                    126047        121406        -3.68%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           10396         9758          -6.14%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      1503668       1447556       -3.73%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      1526024       1498882       -1.78%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       1453765       1415595       -2.63%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         1497672       1473028       -1.65%
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9182          9186          +0.04%
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          355812        359097        +0.92%
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           271279        273078        +0.66%
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             186720        192951        +3.34%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              3131          3181          +1.60%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              2929          2949          +0.68%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               3387          3451          +1.89%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 9063          9140          +0.85%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 5390          4876          -9.54%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 7683          7746          +0.82%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  7679          7760          +1.05%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    7923          7905          -0.23%
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       91261         84536         -7.37%
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       94548         89934         -4.88%
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        87788         83488         -4.90%
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          99753         97126         -2.63%
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9213          9190          -0.25%
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         360974        364403        +0.95%
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          268990        269969        +0.36%
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            185887        192209        +3.40%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    27607         27162         -1.61%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    30728         30104         -2.03%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     25116         24681         -1.73%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       30037         29093         -3.14%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9178          9183          +0.05%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9170          9174          +0.04%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9171          9179          +0.09%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9174          9185          +0.12%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     160969        171452        +6.51%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     172487        157895        -8.46%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      159815        145387        -9.03%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        160579        155217        -3.34%
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         1077          1038          -3.62%
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         42039         43817         +4.23%
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          33769         34422         +1.93%
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            25179         25295         +0.46%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             493           507           +2.94%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             495           503           +1.68%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              517           550           +6.41%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                888           880           -0.89%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                756           693           -8.25%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                877           892           +1.66%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 834           875           +4.99%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   1001          941           -6.01%
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      13288         12076         -9.12%
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      14990         12745         -14.98%
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       12205         11149         -8.65%
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         13920         12165         -12.61%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        1039          1025          -1.35%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        43506         42534         -2.23%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         33428         33474         +0.14%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           25168         25283         +0.46%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   3781          3688          -2.46%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   3976          3873          -2.59%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    3204          3178          -0.81%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      3605          3329          -7.66%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    1031          1029          -0.19%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    1028          1081          +5.16%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     1034          1029          -0.48%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       1028          1038          +0.97%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       83577         71934         -13.93%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   16304         14373         -11.84%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      209034        177289        -15.19%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    67474         58313         -13.58%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     90482         78433         -13.32%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        29896         28681         -4.06%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  3488          3332          -4.47%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  1261          1246          -1.19%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        210072        182726        -13.02%
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            18952         16590         -12.46%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1496          1332          -10.96%

klauspost · 2022-03-03T17:35:55Z

@WojciechMula Thanks for the great work. I will take a look and do some tests here as well.

A few pre-review notes (ignoring what is reported by tests, assuming you'll fix that)

Please make a noasm build tag, exclude gcc and appengine. You can grab build tags from s2: https://github.com/klauspost/compress/blob/master/s2/encodeblock_amd64.go#L3-L4

For the Go version, I found that interleaving 2 streams would give better pipelining. You removed that from the non-asm version. I will test that there isn't a regression here.

Please run asmfmt on the assembly.

Is SHRXQ and SHLXQ the only bmi used? They are (in my experience) not really faster, so a generic amd64 version should be just as fast.

Try breaking dependency chains by interleaving operations more. Your assembly is pretty much "serial", making the cpu having to work hard to re-order your code.

Zstd tests can be noisy. Checking here, literal decoding takes op 9.7% of cpu time in DecodeAllParallel, so I doubt you are saving 10-15%. BenchmarkDecompress4XNoTable is the cleanest 4X benchmark for direct comparisons.

klauspost · 2022-03-03T17:53:25Z

Numbers are looking good 👍🏼

BenchmarkDecompress4XNoTable/gettysburg-32           593.83       641.27       1.08x
BenchmarkDecompress4XNoTable/twain-32                491.42       643.20       1.31x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       829.51       1.15x

(these are also the only ones that have tablelog >8)

Some small regressions without asm:

BenchmarkDecompress4XNoTable/gettysburg-32           593.83       565.72       0.95x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       683.05       0.95x

WojciechMula · 2022-03-03T18:06:32Z

For the Go version, I found that interleaving 2 streams would give better pipelining. You removed that from the non-asm version. I will test that there isn't a regression here.

Thank you for looking at this. Yeah, I restore that freshest version. Got lost with rebasing at some point.

Please run asmfmt on the assembly.

Sure!

Is SHRXQ and SHLXQ the only bmi used? They are (in my experience) not really faster, so a generic amd64 version should be just as fast.

Yes, I checked also SHR and SHL and the noticed BMI was faster, maybe not significantly.

Try breaking dependency chains by interleaving operations more. Your assembly is pretty much "serial", making the cpu having to work hard to re-order your code.

Sure, will do.

Zstd tests can be noisy. Checking here, literal decoding takes op 9.7% of cpu time in DecodeAllParallel, so I doubt you are saving 10-15%. BenchmarkDecompress4XNoTable is the cleanest 4X benchmark for direct comparisons.

Thank you for so quick response! I'll look at the regression for the plain Go version.

klauspost · 2022-03-03T18:12:00Z

Pure "amd64" speed is extremely similar, so for now bmi doesn't seem worth it:

BenchmarkDecompress4XNoTable/twain-32                491.42       638.31       1.30x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       819.40       1.14x

You can maybe have the entire func (d *Decoder) Decompress4X(dst, src []byte) in an _amd64.go, so you can just keep the existing ask is, with opposite build tags.

My quickly thrown together code here: https://gist.github.com/klauspost/82d5c9b85c067d06d606f1c12c82615c

klauspost · 2022-03-03T18:13:51Z

I will do a more detailed review tomorrow. Obviously we need to fix the bugs.

huff0/decompress_amd64.s

huff0/decompress.go

huff0/decompress_amd64.s

klauspost · 2022-03-04T09:34:22Z

Without the extra "AND":

BenchmarkDecompress4XNoTable/gettysburg-32           593.83       681.84       1.15x
BenchmarkDecompress4XNoTable/twain-32                491.42       680.16       1.38x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       870.23       1.21x

This also makes it "competitive" to replace that tablelog <= 8 specialized versions, though obviously dedicated versions would likely be even better.

Only very small payloads (unlikely) are worse.

WojciechMula · 2022-03-04T10:05:16Z

Without the extra "AND":
BenchmarkDecompress4XNoTable/gettysburg-32           593.83       681.84       1.15x
BenchmarkDecompress4XNoTable/twain-32                491.42       680.16       1.38x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       870.23       1.21x
This also makes it "competitive" to replace that tablelog <= 8 specialized versions, though obviously dedicated versions would likely be even better.

Only very small payloads (unlikely) are worse.

Thank you for the review. I changed almost everything that you asked for. The only big thing is to reshuffle instructions (if it's possible).

Perf results are similar on IceLake.

klauspost · 2022-03-04T10:18:42Z

The only big thing is to reshuffle instructions (if it's possible).

You can just work on that later. The improvement is already significant.

WojciechMula · 2022-03-04T12:28:23Z

The only big thing is to reshuffle instructions (if it's possible).

You can just work on that later. The improvement is already significant.

@klauspost I tried to interleave the operations as the Go code does. However, didn't notice any significant performance changes -- I'd rather say it's a noise. You can check what I did: https://github.com/WojciechMula/compress/tree/experiment. Any hints highly appreciate. :)

klauspost · 2022-03-04T12:54:50Z

Yeah, SSE <-> GPR usually is pretty slow. Don't worry about it, let's get it working with the current improvements, we can tweak later.

klauspost · 2022-03-04T13:03:21Z

Seems the file for other platforms is missing:

Error: huff0/decompress.go:207:26: s.Decoder().Decompress4X undefined (type *Decoder has no field or method Decompress4X)

WojciechMula · 2022-03-06T19:18:43Z

Yeah, SSE <-> GPR usually is pretty slow. Don't worry about it, let's get it working with the current improvements, we can tweak later.

I also checked if interleaving decoding of two streams is profitable. It's not. Yeah, I learned recently that using the stack is way faster (and easier) than trying to keep temp values in SSE or AVX512 kregs.

Hopefully, I fixed the build tags.

klauspost

This should fix the tags.

klauspost · 2022-03-07T07:48:37Z

huff0/decompress_generic.go

+//go:build noasm
+// +build noasm


Suggested change

//go:build noasm

// +build noasm

//go:build !amd64 || appengine || !gc || noasm

// +build !amd64 appengine !gc noasm

Thank you! Fixed.

huff0/decompress_amd64.s.in

klauspost · 2022-03-07T13:02:57Z

I have fuzz tested the current version, and it looks fine. 👍🏼 Once we get the build tags sorted, I will do a final benchmark and we can move to merging.

WojciechMula · 2022-03-07T17:54:58Z

I have fuzz tested the current version, and it looks fine. 👍🏼 Once we get the build tags sorted, I will do a final benchmark and we can move to merging.

Great! Thank you very much for checking this. And sorry for build tags problems, TBH I'm not familiar with them. So far never used them.

klauspost

👍🏼 Merging when tests complete.

WojciechMula · 2022-03-07T17:57:06Z

👍🏼 Merging when tests complete.

Let me squash the commits before.

BenchmarkDecompress4XNoTable/gettysburg-32 593.83 681.84 1.15x BenchmarkDecompress4XNoTable/twain-32 491.42 680.16 1.38x BenchmarkDecompress4XNoTable/pngdata.001-32 718.28 870.23 1.21x

WojciechMula · 2022-03-07T18:24:22Z

👍🏼 Merging when tests complete.

Let me squash the commits before.

@klauspost OK, squshed the changes and added perf results to the commit message

klauspost reviewed Mar 4, 2022

View reviewed changes

huff0/decompress_amd64.s Show resolved Hide resolved

huff0/decompress.go Outdated Show resolved Hide resolved

huff0/decompress.go Outdated Show resolved Hide resolved

huff0/decompress_amd64.s Outdated Show resolved Hide resolved

huff0/decompress_amd64.s Outdated Show resolved Hide resolved

WojciechMula changed the title ~~[huff0] Add x86 BMI1 specialisation of Decode4X~~ [huff0] Add x86 specialisation of Decode4X Mar 4, 2022

WojciechMula marked this pull request as ready for review March 4, 2022 12:28

klauspost reviewed Mar 7, 2022

View reviewed changes

klauspost approved these changes Mar 7, 2022

View reviewed changes

[huff0] Add x86 specialisation of Decode4X

20a106f

BenchmarkDecompress4XNoTable/gettysburg-32 593.83 681.84 1.15x BenchmarkDecompress4XNoTable/twain-32 491.42 680.16 1.38x BenchmarkDecompress4XNoTable/pngdata.001-32 718.28 870.23 1.21x

WojciechMula force-pushed the huff0-amd64 branch from 2a34ea9 to 20a106f Compare March 7, 2022 18:23

klauspost merged commit 76e0660 into klauspost:master Mar 8, 2022

WojciechMula deleted the huff0-amd64 branch March 10, 2022 11:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[huff0] Add x86 specialisation of Decode4X #512

[huff0] Add x86 specialisation of Decode4X #512

WojciechMula commented Mar 3, 2022

klauspost commented Mar 3, 2022 •

edited

Loading

klauspost commented Mar 3, 2022

WojciechMula commented Mar 3, 2022

klauspost commented Mar 3, 2022

klauspost commented Mar 3, 2022

klauspost commented Mar 4, 2022 •

edited

Loading

WojciechMula commented Mar 4, 2022

klauspost commented Mar 4, 2022

WojciechMula commented Mar 4, 2022

klauspost commented Mar 4, 2022

klauspost commented Mar 4, 2022

WojciechMula commented Mar 6, 2022

klauspost left a comment

klauspost Mar 7, 2022

WojciechMula Mar 7, 2022

klauspost commented Mar 7, 2022

WojciechMula commented Mar 7, 2022

klauspost left a comment

WojciechMula commented Mar 7, 2022

WojciechMula commented Mar 7, 2022

[huff0] Add x86 specialisation of Decode4X #512

[huff0] Add x86 specialisation of Decode4X #512

Conversation

WojciechMula commented Mar 3, 2022

klauspost commented Mar 3, 2022 • edited Loading

klauspost commented Mar 3, 2022

WojciechMula commented Mar 3, 2022

klauspost commented Mar 3, 2022

klauspost commented Mar 3, 2022

klauspost commented Mar 4, 2022 • edited Loading

WojciechMula commented Mar 4, 2022

klauspost commented Mar 4, 2022

WojciechMula commented Mar 4, 2022

klauspost commented Mar 4, 2022

klauspost commented Mar 4, 2022

WojciechMula commented Mar 6, 2022

klauspost left a comment

Choose a reason for hiding this comment

klauspost Mar 7, 2022

Choose a reason for hiding this comment

WojciechMula Mar 7, 2022

Choose a reason for hiding this comment

klauspost commented Mar 7, 2022

WojciechMula commented Mar 7, 2022

klauspost left a comment

Choose a reason for hiding this comment

WojciechMula commented Mar 7, 2022

WojciechMula commented Mar 7, 2022

klauspost commented Mar 3, 2022 •

edited

Loading

klauspost commented Mar 4, 2022 •

edited

Loading