zstd: x86 assembler implementation of sequenceDecs.decode #528

WojciechMula · 2022-03-10T15:01:12Z

This PR adds plain x86 and x86-with-BMI2 implementations of sequenceDecs.decode. Part of #515.

Since the benchmarks use decodeSync, I temporarily replaced its implementation with one that calls decode and execute, at the cost of allocating the seqVals array every time.

There are some (IMHO) nice improvements and small regressions in a few cases. From my previous experience I can tell that we'll get quite a big speedup when we rewrite execute. And of course we'll get the biggest speedup when we fuse decode and execute into a single procedure.

~~Marking PR as a draft as just one test TestNewDecoderBad/Reader-4/6f88497edbc9059998f9e6d0ea0d0eed8d8af38d.zst fails. Have to investigate why.~~ [fixed]

Below are benchmarks.

old.txt was produced by the command go generate && go test -tags noasm -run XYZ -bench BenchmarkDecoder.
new.txt was produced by the command go generate && go test -run XYZ -bench BenchmarkDecoder.
new-bmi2.txt was produced by the command go generate && GOAMD64=v3 go test -run XYZ -bench BenchmarkDecoder.

Comparison of old.txt with new.txt

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            237.60       238.72       1.00x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        764.22       868.49       1.14x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         192.39       197.70       1.03x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           226.37       233.74       1.03x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         211.46       216.24       1.02x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          189.03       190.01       1.01x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1743.29      1951.14      1.12x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3079.05      3309.04      1.07x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       6696.56      7926.76      1.18x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             337.14       365.98       1.09x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 613.59       687.53       1.12x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        345.78       374.54       1.08x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               244.27       241.55       0.99x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           785.09       912.82       1.16x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            200.18       203.32       1.02x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              237.47       239.27       1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            211.75       214.06       1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             190.52       188.88       0.99x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1397.55      1488.07      1.06x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          3428.09      3716.01      1.08x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          9548.90      10887.83     1.14x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                356.48       386.03       1.08x
BenchmarkDecoder_DecodeAll/html.zst-16                                    598.01       666.57       1.11x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           333.55       364.23       1.09x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      208.57       222.10       1.06x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      206.44       206.26       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       219.64       224.09       1.02x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         215.01       212.89       0.99x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9564.96      10855.81     1.13x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          230.88       242.77       1.05x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           300.04       357.40       1.19x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             481.01       629.48       1.31x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1153.28      1167.83      1.01x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1178.62      1197.84      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1061.04      1085.91      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 431.41       434.83       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 260.26       288.78       1.11x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 178.69       182.34       1.02x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  178.37       183.01       1.03x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    163.05       166.78       1.02x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       379.84       413.38       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       364.05       398.18       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        396.48       440.21       1.11x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          346.17       375.39       1.08x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9561.29      10873.23     1.14x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         224.96       240.51       1.07x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          303.94       364.00       1.20x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            476.76       635.21       1.33x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1602.64      1693.12      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1470.20      1556.52      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     1781.10      1894.80      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1542.00      1661.43      1.08x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9556.32      10879.80     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9571.19      10882.04     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9572.19      10887.11     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9566.54      10886.24     1.14x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     1343.71      1607.61      1.20x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     1311.12      1407.07      1.07x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      1401.88      1587.61      1.13x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        1402.53      1484.78      1.06x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         66679.48     94849.27     1.42x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         1306.06      1502.06      1.15x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          1927.42      2336.68      1.21x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            3472.36      4863.07      1.40x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             6276.41      6383.51      1.02x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5490.46      5771.14      1.05x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              6008.30      6052.66      1.01x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                3799.77      3895.76      1.03x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1412.04      1441.62      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                962.89       949.33       0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 984.52       969.09       0.98x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   801.71       795.11       0.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      1738.74      1974.58      1.14x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      1633.98      1841.22      1.13x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       1817.99      2012.59      1.11x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         1717.94      1874.86      1.09x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        66526.60     96359.49     1.45x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        1268.55      1490.20      1.17x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         1947.34      2373.92      1.22x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           3458.55      4850.24      1.40x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   8243.76      8724.07      1.06x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   8197.25      8948.34      1.09x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    9020.42      9939.28      1.10x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      10641.45     11529.73     1.08x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    66560.21     95518.08     1.44x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    66587.20     94626.59     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     66651.43     94356.64     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       66512.25     95444.30     1.43x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       1466.81      1604.80      1.09x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   4831.35      5497.25      1.14x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    1215.11      1374.36      1.13x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      1464.34      1623.73      1.11x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    1240.22      1406.06      1.13x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1094.77      1206.81      1.10x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        10029.54     11377.73     1.13x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  19167.95     22324.31     1.16x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  66383.12     95910.86     1.44x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        2687.11      3090.49      1.15x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            3748.35      4307.61      1.15x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1959.71      2145.97      1.10x

Comparison of old.txt with new-bmi2.txt

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            237.60       238.72       1.00x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        764.22       868.49       1.14x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         192.39       197.70       1.03x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           226.37       233.74       1.03x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         211.46       216.24       1.02x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          189.03       190.01       1.01x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1743.29      1951.14      1.12x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3079.05      3309.04      1.07x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       6696.56      7926.76      1.18x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             337.14       365.98       1.09x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 613.59       687.53       1.12x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        345.78       374.54       1.08x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               244.27       241.55       0.99x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           785.09       912.82       1.16x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            200.18       203.32       1.02x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              237.47       239.27       1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            211.75       214.06       1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             190.52       188.88       0.99x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1397.55      1488.07      1.06x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          3428.09      3716.01      1.08x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          9548.90      10887.83     1.14x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                356.48       386.03       1.08x
BenchmarkDecoder_DecodeAll/html.zst-16                                    598.01       666.57       1.11x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           333.55       364.23       1.09x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      208.57       222.10       1.06x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      206.44       206.26       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       219.64       224.09       1.02x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         215.01       212.89       0.99x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9564.96      10855.81     1.13x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          230.88       242.77       1.05x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           300.04       357.40       1.19x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             481.01       629.48       1.31x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1153.28      1167.83      1.01x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1178.62      1197.84      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1061.04      1085.91      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 431.41       434.83       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 260.26       288.78       1.11x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 178.69       182.34       1.02x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  178.37       183.01       1.03x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    163.05       166.78       1.02x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       379.84       413.38       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       364.05       398.18       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        396.48       440.21       1.11x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          346.17       375.39       1.08x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9561.29      10873.23     1.14x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         224.96       240.51       1.07x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          303.94       364.00       1.20x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            476.76       635.21       1.33x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1602.64      1693.12      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1470.20      1556.52      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     1781.10      1894.80      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1542.00      1661.43      1.08x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9556.32      10879.80     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9571.19      10882.04     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9572.19      10887.11     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9566.54      10886.24     1.14x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     1343.71      1607.61      1.20x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     1311.12      1407.07      1.07x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      1401.88      1587.61      1.13x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        1402.53      1484.78      1.06x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         66679.48     94849.27     1.42x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         1306.06      1502.06      1.15x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          1927.42      2336.68      1.21x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            3472.36      4863.07      1.40x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             6276.41      6383.51      1.02x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5490.46      5771.14      1.05x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              6008.30      6052.66      1.01x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                3799.77      3895.76      1.03x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1412.04      1441.62      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                962.89       949.33       0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 984.52       969.09       0.98x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   801.71       795.11       0.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      1738.74      1974.58      1.14x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      1633.98      1841.22      1.13x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       1817.99      2012.59      1.11x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         1717.94      1874.86      1.09x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        66526.60     96359.49     1.45x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        1268.55      1490.20      1.17x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         1947.34      2373.92      1.22x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           3458.55      4850.24      1.40x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   8243.76      8724.07      1.06x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   8197.25      8948.34      1.09x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    9020.42      9939.28      1.10x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      10641.45     11529.73     1.08x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    66560.21     95518.08     1.44x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    66587.20     94626.59     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     66651.43     94356.64     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       66512.25     95444.30     1.43x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       1466.81      1604.80      1.09x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   4831.35      5497.25      1.14x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    1215.11      1374.36      1.13x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      1464.34      1623.73      1.11x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    1240.22      1406.06      1.13x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1094.77      1206.81      1.10x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        10029.54     11377.73     1.13x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  19167.95     22324.31     1.16x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  66383.12     95910.86     1.44x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        2687.11      3090.49      1.15x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            3748.35      4307.61      1.15x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1959.71      2145.97      1.10x

klauspost · 2022-03-10T15:23:01Z

Looks great! Impressive numbers. Is this comparing to a similar setup, i.e. not using "decodeSync"?

I wouldn't mind a version for vbmi and one without. I guess the difference is measurable.

Yes, I expect "execute" to be much better. Especially if we "overallocate" the destination by 7 bytes, so we can do non-overlapping copies in blocks of 8 bytes.
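As a rough illustration of the over-allocation idea above, here is a minimal Go sketch. It is not the package's actual execute code: the names `slack` and `copy8` are hypothetical, and it assumes both the source and the destination carry 7 bytes of padding so that every 8-byte load and store stays in bounds.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// slack is the number of extra bytes kept past the logical end of a buffer
// so that the final 8-byte load/store never goes out of bounds.
const slack = 7

// copy8 copies n bytes from src to dst[pos:] in 8-byte blocks.
// It may read and write up to 7 bytes beyond the n requested bytes, so
// src needs at least n+slack bytes and dst at least pos+n+slack bytes.
func copy8(dst []byte, pos int, src []byte, n int) {
	for i := 0; i < n; i += 8 {
		v := binary.LittleEndian.Uint64(src[i:])
		binary.LittleEndian.PutUint64(dst[pos+i:], v)
	}
}

func main() {
	const n = 11
	lits := make([]byte, n+slack) // padded literal buffer (assumption)
	copy(lits, "hello world")

	out := make([]byte, n+slack) // destination over-allocated by 7 bytes
	copy8(out, 0, lits, n)
	fmt.Printf("%q\n", out[:n])
}
```

The bytes written past the logical end are garbage; they are either overwritten by the next copy or sliced off when the block is finished.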

WojciechMula · 2022-03-10T15:37:02Z

> Looks great! Impressive numbers. Is this comparing to a similar setup, i.e. not using "decodeSync"?

From what I can gather, there are no benchmarks that use the async API, so I had to hack decodeSync. BTW, a few days ago I compared the new master (after the split) with the code I forked from, and there is a 5-8% regression on Ice Lake.

> Yes, I expect "execute" to be much better. Especially if we "overallocate" the destination by 7 bytes, so we can do non-overlapping copies in blocks of 8 bytes.

My experiments showed that we get the biggest speedup if we copy in 32-byte blocks (i.e. using AVX2 registers). However, I optimized just the most common path, i.e. copy from literals + copy from s.out, and bailed out to the Go code to handle more complex cases.

klauspost

Some quick questions.

zstd/seqdec_amd64.s

klauspost · 2022-03-10T16:53:05Z

> From what I can gather there are no benchmarks that use async API, thus I had to hack decodeSync.

Yes, it is only useful for streams, and in that case the execute stage is usually the slowest. When we don't split decode/execute it is fastest to not write the intermediate values. What I meant was if "before" is using the old decodeSync, or the modified one?

I was actually thinking of making a microbench for decode and execute separately, so they could be tested independently. I can look into that.
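For illustration only, separate microbenchmarks for the two stages could be structured roughly like this. The `seq` type and the `decodeStage`/`executeStage` functions are toy placeholders invented for this sketch, not the package's real types or the benchmarks that were eventually added.

```go
package main

import (
	"fmt"
	"testing"
)

// seq is a hypothetical stand-in for one decoded sequence.
type seq struct{ litLen, matchLen, offset int }

// decodeStage fills dst with synthetic sequences (placeholder for decode).
func decodeStage(dst []seq) {
	for i := range dst {
		dst[i] = seq{litLen: i & 7, matchLen: 3 + (i & 15), offset: 1 + i}
	}
}

// executeStage consumes sequences and returns the output size they would
// produce (placeholder for execute).
func executeStage(seqs []seq) (n int) {
	for _, s := range seqs {
		n += s.litLen + s.matchLen
	}
	return n
}

// sink keeps the compiler from optimizing the benchmarked call away.
var sink int

func main() {
	seqs := make([]seq, 1000)

	// Time the decode stage on its own.
	dec := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			decodeStage(seqs)
		}
	})

	// Time the execute stage on already-decoded input.
	decodeStage(seqs)
	exe := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = executeStage(seqs)
		}
	})

	fmt.Println("decode: ", dec)
	fmt.Println("execute:", exe)
}
```

Feeding the execute benchmark with pre-decoded sequences keeps the cost of one stage from hiding changes in the other.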

> BTW I compared a few days ago the new master (after split) with the code I forked from and there is 5-8% regression on IceLake.

Benchmarks can be really hard to interpret. It could be that it doesn't like this change - but it could also just be something like jump alignment - it can easily vary by that much with seemingly unrelated changes.

WojciechMula · 2022-03-10T18:48:02Z

> What I meant was if "before" is using the old decodeSync, or the modified one?

The modified one.

> I was actually thinking of making a microbench for decode and execute separately, so they could be tested independently. I can look into that.

It would be great!

> Benchmarks can be really hard to interpret. It could be that it doesn't like this change - but it could also just be something like jump alignment - it can easily vary by that much with seemingly unrelated changes.

Yeah, I know. The Ice Lake machine I use is an AWS one, and sometimes exactly the same code run twice can be slower or faster for no clear reason. So I gave up on micro-optimizations, as I couldn't tell whether the differences were due to my changes or the Moon's phase. BTW, speaking of loop alignment, I wanted to enforce it, but it seems that it is disabled right now for the x86 target: https://groups.google.com/g/golang-nuts/c/M86PTw1jl6w.

WojciechMula · 2022-03-11T09:18:43Z

The failed CI was due to a race test of flate/deflate_test.go: https://github.com/klauspost/compress/runs/5508309654?check_suite_focus=true, unrelated to this PR.

klauspost · 2022-03-11T09:39:59Z

> The failed CI was due to a race test of flate/deflate_test.go: https://github.com/klauspost/compress/runs/5508309654?check_suite_focus=true, unrelated to this PR.

Yeah, sometimes we get a very slow machine. This is just a timeout. Restarting the actions.

klauspost · 2022-03-11T12:52:16Z

@WojciechMula I have added benchmarks for decode, execute and decodeSync in #530

I hope it doesn't conflict too much. I will merge when tests pass.

WojciechMula · 2022-03-11T13:10:39Z

> @WojciechMula I have added benchmarks for decode, execute and decodeSync in #530
>
> I hope it doesn't conflict too much. I will merge when tests pass.

@klauspost Great! I'll rebase onto master once you merge the benchmarks, and then squash the commits.

klauspost · 2022-03-11T13:25:31Z

Pure AMD64:

benchmark                                                                                          old ns/op     new ns/op     delta
Benchmark_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32            141344        105927        -25.06%
Benchmark_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32           146059        104225        -28.64%
Benchmark_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32                   132771        97027         -26.92%
Benchmark_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32            16573         10860         -34.47%
Benchmark_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32            40159         25812         -35.73%
Benchmark_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32             98292         69494         -29.30%
Benchmark_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                      9979          7078          -29.07%
Benchmark_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32         184877        133433        -27.83%
Benchmark_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                       58.7          138           +135.21%
Benchmark_seqdec_decode/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                        950           760           -19.91%
Benchmark_seqdec_decode/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                    9333          6521          -30.13%
Benchmark_seqdec_decode/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                       33926         21103         -37.80%
Benchmark_seqdec_decode/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32                  50962         31479         -38.23%
Benchmark_seqdec_decode/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-32                        99586         67130         -32.59%

With BMI:

benchmark                                                                                          old ns/op     new ns/op     delta
Benchmark_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32            141344        92330         -34.68%
Benchmark_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32           146059        95747         -34.45%
Benchmark_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32                   132771        85915         -35.29%
Benchmark_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32            16573         9315          -43.79%
Benchmark_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32            40159         22776         -43.29%
Benchmark_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32             98292         62976         -35.93%
Benchmark_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                      9979          5995          -39.92%
Benchmark_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32         184877        117648        -36.36%
Benchmark_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                       58.7          141           +139.82%
Benchmark_seqdec_decode/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                        950           678           -28.56%
Benchmark_seqdec_decode/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                    9333          5621          -39.77%
Benchmark_seqdec_decode/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                       33926         18070         -46.74%
Benchmark_seqdec_decode/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32                  50962         27972         -45.11%
Benchmark_seqdec_decode/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-32                        99586         60449         -39.30%

BMI speedup alone:

benchmark                                                                                          old ns/op     new ns/op     delta
Benchmark_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32            105927        92330         -12.84%
Benchmark_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32           104225        95747         -8.13%
Benchmark_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32                   97027         85915         -11.45%
Benchmark_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32            10860         9315          -14.23%
Benchmark_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32            25812         22776         -11.76%
Benchmark_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32             69494         62976         -9.38%
Benchmark_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                      7078          5995          -15.30%
Benchmark_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32         133433        117648        -11.83%
Benchmark_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                       138           141           +1.96%
Benchmark_seqdec_decode/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                        760           678           -10.80%
Benchmark_seqdec_decode/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                    6521          5621          -13.80%
Benchmark_seqdec_decode/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                       21103         18070         -14.37%
Benchmark_seqdec_decode/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32                  31479         27972         -11.14%

So it seems like it is worth it to have runtime BMI detection.

With only 2 sequences there is a small regression. Good to know, but not worth special code IMO.

klauspost · 2022-03-11T13:32:50Z

> @klauspost Great! I'll rebase to the master when you merge the benchmarks and then squash commits.

Merged. The conflict should be trivial. Please remove the decodeSync change for now. It looks good and we can move forward from here.

WojciechMula · 2022-03-11T14:42:03Z

These are the results from Ice Lake (noasm vs GOAMD64=v3):

benchmark                                                                                          old ns/op     new ns/op     delta
Benchmark_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-16            163287        131585        -19.41%
Benchmark_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-16           173628        134655        -22.45%
Benchmark_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-16                   145739        131145        -10.01%
Benchmark_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-16            17244         13604         -21.11%
Benchmark_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-16            39179         31955         -18.44%
Benchmark_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-16             113786        86306         -24.15%
Benchmark_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-16                      10915         8957          -17.94%
Benchmark_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-16         203105        154422        -23.97%
Benchmark_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-16                       62.8          106           +69.24%

> So it seems like it is worth it to have runtime BMI detection.

Do you think a simple CPUID invocation, just to detect BMI1/BMI2, would be sufficient? Or would you prefer to add your library https://github.com/klauspost/cpuid as a dependency?
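
Whichever detection method is chosen, the selection itself can be sketched roughly as below. This is only an illustration under assumptions: `decodeFunc`, `decodeGeneric`, `decodeBMI2` and `selectDecode` are hypothetical names, the stand-in routines just return a label, and the `hasBMI2` flag is assumed to have been obtained elsewhere (from a raw CPUID query or from a library such as klauspost/cpuid). Detection itself is deliberately left out.

```go
package main

import "fmt"

// decodeFunc is a hypothetical signature standing in for the real
// assembly entry points, which operate on the decoder state.
type decodeFunc func(input []byte) string

// decodeGeneric stands in for the plain x86 routine.
func decodeGeneric(input []byte) string { return "x86" }

// decodeBMI2 stands in for the routine that uses BMI2 instructions.
func decodeBMI2(input []byte) string { return "bmi2" }

// selectDecode picks one routine based on a feature flag, so the
// check is done once rather than per call.
func selectDecode(hasBMI2 bool) decodeFunc {
	if hasBMI2 {
		return decodeBMI2
	}
	return decodeGeneric
}

func main() {
	// Assumption: in real code this flag comes from CPUID at startup.
	hasBMI2 := false
	decode := selectDecode(hasBMI2)
	fmt.Println(decode(nil))
}
```

Resolving the choice once keeps the hot path free of feature checks, and the noasm build can keep using the pure Go implementation.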

WojciechMula · 2022-03-17T08:22:35Z

@klauspost What do you plan to do with this PR? Merge or amend with your avo implementation?

klauspost · 2022-03-17T10:12:14Z

@WojciechMula Let's put in the avo version, add CPU tests and remove the decodeSync hack.

WojciechMula · 2022-03-17T10:50:41Z

> @WojciechMula Let's put in the avo version, add CPU tests and remove the decodeSync hack.

@klauspost OK, I'm doing this right now.

Differences with the Go implementation:
- ml and mo are checked in the main loop,
- s.seqSize and litRemain are checked at the end.
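
The ordering of these checks can be pictured with a small Go sketch. It is not the actual assembly or the library's code: the `seq` type, the `validate` function, the error messages and the two limits are hypothetical and chosen only for illustration. It shows per-sequence checks on ml and mo inside the loop, with the accumulated size and the remaining literals checked once after the loop.

```go
package main

import (
	"errors"
	"fmt"
)

// seq is a hypothetical, simplified sequence: literal length,
// match length and match offset.
type seq struct{ ll, ml, mo int }

// Illustrative limits only; the real decoder defines its own constants.
const (
	maxMatchLen  = 131074
	maxBlockSize = 128 << 10
)

// validate checks ml and mo for every sequence inside the loop and
// defers the seqSize and litRemain checks until the loop has finished.
func validate(seqs []seq, litLen int) error {
	seqSize, litRemain := 0, litLen
	for _, s := range seqs {
		if s.ml > maxMatchLen {
			return errors.New("match length too big")
		}
		if s.mo == 0 && s.ml > 0 {
			return errors.New("zero offset with non-zero match length")
		}
		seqSize += s.ll + s.ml
		litRemain -= s.ll
	}
	if seqSize > maxBlockSize {
		return errors.New("output larger than block size")
	}
	if litRemain < 0 {
		return errors.New("more literals requested than available")
	}
	return nil
}

func main() {
	fmt.Println(validate([]seq{{ll: 1, ml: 4, mo: 2}}, 1))
}
```

Deferring the two accumulated checks removes two branches from every loop iteration, at the cost of reporting those errors only after all sequences have been decoded.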

Code by Klaus Post: https://gist.github.com/klauspost/8949f70d98dd94116392019f119087e5

klauspost · 2022-03-17T12:47:14Z

The failed test can be ignored. I will add a check for it separately.

klauspost

Great stuff. Do you want to add anything else before I merge?

WojciechMula · 2022-03-17T13:02:02Z

Thanks :) And thank you for such great support. I think there's nothing to add.

WojciechMula force-pushed the asm-seqdec-decode branch from 0dcdbd0 to f90b72c on March 10, 2022 15:03

klauspost reviewed Mar 10, 2022

WojciechMula marked this pull request as ready for review March 10, 2022 18:50

klauspost reviewed Mar 11, 2022

klauspost reviewed Mar 11, 2022

WojciechMula force-pushed the asm-seqdec-decode branch from 8a7fb0d to 1f0baa2 on March 11, 2022 14:33

WojciechMula mentioned this pull request Mar 11, 2022

zstd: x86 assembler implementation of sequenceDecs.executeSimple #531

Merged

klauspost reviewed Mar 14, 2022

WojciechMula added 6 commits March 17, 2022 12:03

zstd: x86 assembler implementation of sequenceDecs.decode

dd39c9c

Differences with the Go implementation:
- ml and mo are checked in the main loop,
- s.seqSize and litRemain are checked at the end.

Generate assembly routines with avo

23aa85e

Code by Klaus Post: https://gist.github.com/klauspost/8949f70d98dd94116392019f119087e5

Fix asm label

c4e0fdf

Remove text templates

a4a5a99

Generate read-only asm files

11be8b5

Add runtime selection between plain x86 and BMI2 routines

8ca6a66

WojciechMula force-pushed the asm-seqdec-decode branch from 1f0baa2 to 8ca6a66 on March 17, 2022 12:17

klauspost approved these changes Mar 17, 2022

klauspost merged commit 2d457e5 into klauspost:master Mar 17, 2022

WojciechMula deleted the asm-seqdec-decode branch March 17, 2022 13:17
