zstd: asm version of decodeSync #545

WojciechMula · 2022-03-31T11:09:29Z

A little hacking in the current generator allowed us to reuse almost all code. That's nice!

Part of #515.

For now go test -run TestDecoder passes; I'm working on fixing the remaining tests. Also, this PR does not yet incorporate history support (waiting for #542).

Benchmark results from an Ice Lake machine. Some quite good speed ups are there!

benchmark                                                                 old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            4948849       3115659       -37.04%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        884190        525662        -40.55%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         16820567      13477758      -19.87%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           12468563      9806714       -21.35%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         3829515       1815824       -52.58%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          5235045       2684060       -48.73%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1494667       1004488       -32.80%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       226176        180516        -20.19%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       126631        125630        -0.79%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             14001037      12040286      -14.00%
BenchmarkDecoder_DecoderSmall/html.zst-16                                 1013496       580645        -42.71%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        78160         62866         -19.57%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               592552        291325        -50.84%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           106829        62462         -41.53%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            1901163       902247        -52.54%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              1417726       673019        -52.53%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            475107        223363        -52.99%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             634311        277013        -56.33%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                299068        201638        -32.58%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          24184         18814         -22.20%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          11298         11294         -0.04%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                1568542       949901        -39.44%
BenchmarkDecoder_DecodeAll/html.zst-16                                    124118        70389         -43.29%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           9838          7870          -20.00%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      1452726       760812        -47.63%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      1495242       710737        -52.47%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       1407918       696247        -50.55%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         1459001       698224        -52.14%
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9190          9184          -0.07%
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          346024        169963        -50.88%
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           249534        148022        -40.68%
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             148253        132778        -10.44%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              3584          3385          -5.55%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              3293          2842          -13.70%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               3897          3796          -2.59%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 9021          11001         +21.95%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 4922          4361          -11.40%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 7764          6750          -13.06%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  7758          6751          -12.98%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    7987          6581          -17.60%
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       86160         46936         -45.52%
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       92630         47010         -49.25%
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        85408         44996         -47.32%
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          102427        46479         -54.62%
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9195          9192          -0.03%
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         351130        170379        -51.48%
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          245196        147856        -39.70%
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            146306        132574        -9.39%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    31964         25333         -20.75%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    33961         31073         -8.50%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     26667         24121         -9.55%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       34375         31844         -7.36%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9212          9188          -0.26%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9176          9176          +0.00%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9179          9182          +0.03%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9190          9193          +0.03%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     182557        85521         -53.15%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     160984        83577         -48.08%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      147865        78830         -46.69%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        157513        81901         -48.00%
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         1034          1029          -0.48%
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         45459         22683         -50.10%
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          29775         18967         -36.30%
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            18079         15816         -12.52%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             539           502           -6.97%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             546           524           -4.19%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              522           527           +0.88%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                883           854           -3.27%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                730           631           -13.46%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                846           680           -19.67%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 874           681           -22.07%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   970           707           -27.08%
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      12545         6806          -45.75%
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      13652         6875          -49.64%
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       11274         6569          -41.73%
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         12746         6791          -46.72%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        1044          1063          +1.82%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        40948         22787         -44.35%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         28497         18869         -33.79%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           17973         15798         -12.10%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   4327          2945          -31.94%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   4279          3075          -28.14%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    3562          2528          -29.03%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      4120          3120          -24.27%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    1045          1049          +0.38%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    1047          1039          -0.76%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     1046          1034          -1.15%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       1047          1044          -0.29%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       70555         35143         -50.19%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   14388         8518          -40.80%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    235269        108394        -53.93%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      176150        81790         -53.57%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    57606         27782         -51.77%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     77776         35865         -53.89%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        36842         22593         -38.68%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  3396          2428          -28.50%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  1260          1245          -1.19%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        173875        101967        -41.36%
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            16539         9267          -43.97%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1347          1081          -19.75%

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            297.96       473.27       1.59x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        1072.96      1804.78      1.68x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         229.18       286.02       1.25x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           273.81       348.13       1.27x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         261.50       551.50       2.11x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          232.42       453.31       1.95x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             2192.33      3262.16      1.49x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3621.95      4538.10      1.25x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       7776.50      7838.44      1.01x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             401.16       466.49       1.16x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 808.29       1410.85      1.75x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        417.20       518.69       1.24x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               311.06       632.70       2.03x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           1110.07      1898.57      1.71x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            253.46       534.07       2.11x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              301.01       634.09       2.11x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            263.48       560.43       2.13x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             239.77       549.03       2.29x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1369.59      2031.36      1.48x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          4234.23      5442.83      1.29x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          10894.85     10899.45     1.00x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                447.60       739.12       1.65x
BenchmarkDecoder_DecodeAll/html.zst-16                                    825.02       1454.76      1.76x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           414.31       517.91       1.25x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      267.06       509.93       1.91x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      259.47       545.86       2.10x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       275.56       557.22       2.02x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         265.91       555.64       2.09x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          10881.20     10888.55     1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          289.01       588.38       2.04x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           400.76       675.60       1.69x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             674.54       753.16       1.12x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1148.38      1215.84      1.06x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1249.77      1448.41      1.16x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1056.17      1084.20      1.03x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 456.25       374.13       0.82x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 314.49       354.95       1.13x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 199.38       229.32       1.15x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  199.54       229.31       1.15x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    193.81       235.24       1.21x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       516.22       947.62       1.84x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       480.16       946.12       1.97x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        520.76       988.46       1.90x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          434.23       956.93       2.20x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         10876.10     10879.70     1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         284.80       586.94       2.06x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          407.85       676.35       1.66x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            683.52       754.32       1.10x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1601.81      2021.07      1.26x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1507.63      1647.72      1.09x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     1919.97      2122.61      1.11x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1489.44      1607.82      1.08x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     10856.04     10884.37     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     10898.82     10898.93     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      10895.15     10891.77     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        10882.01     10878.53     1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     2125.17      4536.47      2.13x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     2409.95      4642.02      1.93x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      2623.78      4921.51      1.88x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        2463.05      4737.00      1.92x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         96723.30     97181.69     1.00x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         2199.84      4408.78      2.00x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          3358.67      5272.34      1.57x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            5531.33      6322.87      1.14x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             7633.80      8205.11      1.07x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             7531.33      7861.34      1.04x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              7878.01      7808.28      0.99x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                4663.70      4821.59      1.03x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                2121.74      2451.88      1.16x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1828.81      2276.38      1.24x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1770.70      2272.06      1.28x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   1595.73      2188.35      1.37x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      3545.30      6534.77      1.84x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      3257.99      6469.33      1.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       3945.06      6770.89      1.72x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         3489.38      6549.74      1.88x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        95753.69     94059.51     0.98x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        2442.18      4388.66      1.80x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         3509.25      5299.73      1.51x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           5564.01      6330.20      1.14x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   11832.49     17385.80     1.47x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   11965.02     16651.74     1.39x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    14375.72     20255.73     1.41x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      12425.77     16411.38     1.32x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    95659.72     95287.85     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    95492.90     96289.36     1.01x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     95576.73     96741.03     1.01x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       95552.41     95804.44     1.00x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       2612.42      5244.90      2.01x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   8242.15      13921.55     1.69x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    2048.12      4445.45      2.17x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      2422.67      5217.70      2.15x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    2173.02      4505.70      2.07x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1955.48      4240.60      2.17x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        11117.71     18129.71     1.63x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  30154.29     42167.57     1.40x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  97717.39     98890.90     1.01x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        4037.89      6885.44      1.71x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            6191.33      11050.07     1.78x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   3025.76      3768.91      1.25x

klauspost · 2022-04-06T09:51:14Z

The current code appears stable and we can proceed with this one when you have the time.

WojciechMula · 2022-04-08T08:49:49Z

> The current code appears stable and we can proceed with this one when you have the time.

I'm fighting with avo right now, after rebasing onto master. I wish your PR mmcloughlin/avo#105 had been merged.

klauspost · 2022-04-08T09:52:24Z

@WojciechMula Yeah, I squeezed out the last registers, and it can sometimes be quite hard to figure out where allocs are failing.

If you need to free one or more registers, you can use:

		seqsBaseStash := AllocLocal(8)
		MOVQ(seqsBase, seqsBaseStash)
		seqsBase = nil
		//
		// Code that doesn't need seqsBase
		//
		// reload:
		seqsBase = GP64()
		MOVQ(seqsBaseStash, seqsBase)

This will allocate stack for the value.

> A little hacking in the current generator allowed us to reuse almost all code. That's nice!

Seems we'll need to refactor the generator so we don't get lost in an if-maze.

After spilling all execute-related values to the stack, we're still running out of registers. I need to figure out which values from decode can be spilled too. However, I'm unhappy with this, as my initial version of decodeSync didn't use any stack. I believe it can be done, but I'd like to share as much code as possible. On the other hand, code full of short if-statements is not readable.

WojciechMula · 2022-04-10T19:17:54Z

> @WojciechMula Yeah, I squeezed out the last registers, and it can sometimes be quite hard to figure out where allocs are failing.
>
> If you need to free one or more registers, you can use:
>
> 		seqsBaseStash := AllocLocal(8)
> 		MOVQ(seqsBase, seqsBaseStash)
> 		seqsBase = nil
> 		//
> 		// Code that doesn't need seqsBase
> 		//
> 		// reload:
> 		seqsBase = GP64()
> 		MOVQ(seqsBaseStash, seqsBase)
>
> This will allocate stack for the value.

Thank you. I spilled some values, but there is still one place where allocation fails. I will work on it on Monday. I hope it will finally compile, but I'm afraid performance improvement will drop when we put too many values on the stack.

I reclaimed three registers from the adjustOffsets method by reintroducing its old version, which didn't cache values in registers. There are more and more if statements...

go test -run TestDecoder passes; the multiframe tests still fail.

klauspost · 2022-04-19T06:44:29Z

> I hope it will finally compile, but I'm afraid performance improvement will drop when we put too many values on the stack.

Don't worry too much about that. The stack is pretty much always in L1, and reads/writes can be done asynchronously to most other code.

Furthermore, some archs (Zen 2 for instance) can do memory-to-virtual-register mapping, meaning that values written to non-changing addresses are kept in virtual CPU registers. There is so much magic going on with modern super-scalar CPUs :)

(The prev-offsets that I moved to registers were not provably static to the CPU, since you used different addressing for them. That is why they benefited from being in regs.)

If you would like assistance, I can take a stab at finishing up the code. For now I will review the code and test it out.

klauspost

With these two fixes the tests pass. Not sure how often it will be used then, but at least this should give an idea of the direction.

zstd/_generate/gen.go

zstd/seqdec_amd64.go

klauspost · 2022-04-21T10:09:47Z

I think for now we should ditch the output reallocate logic.

For blocks that have the frame-content-size set we can allocate exactly the output we need, and we would never need to extend the output. We must fail these if we exceed frame-content-size anyway.

For decodes without frame content size we allocate the default and let it fall back once we have less than a full frame left on output. We can use d.o.lowMem to allocate at least maxCompressedBlockSizeAlloc when it is set to true.
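
A rough sketch of that allocation policy, to make the three cases concrete. Everything here is an assumption made for illustration: planOutput, outputPlan and defaultOutputAlloc are hypothetical names, the constant values are placeholders rather than the package's real sizes, and the lowMem branch reflects only one reading of the sentence above (lowMem selects the smaller maxCompressedBlockSizeAlloc-sized buffer).

```go
package main

import "fmt"

// Placeholder sizes chosen only for this sketch; they are not the
// values used by the zstd package.
const (
	defaultOutputAlloc          = 1 << 20   // hypothetical default allocation
	maxCompressedBlockSizeAlloc = 128 << 10 // name from the discussion, value assumed
)

// outputPlan says how much output to allocate up front and what
// happens when that space runs out.
type outputPlan struct {
	capacity       int
	failOnOverflow bool // true: overflow is an error; false: fall back to the other decode path
}

// planOutput picks the initial output size without any reallocation logic.
func planOutput(hasFCS bool, frameContentSize int, lowMem bool) outputPlan {
	if hasFCS {
		// The size is known: allocate exactly that and treat overflow as corrupt input.
		return outputPlan{capacity: frameContentSize, failOnOverflow: true}
	}
	if lowMem {
		// Assumed reading: in low-memory mode, allocate the smaller buffer.
		return outputPlan{capacity: maxCompressedBlockSizeAlloc}
	}
	return outputPlan{capacity: defaultOutputAlloc}
}

func main() {
	fmt.Printf("%+v\n", planOutput(true, 4096, false))
	fmt.Printf("%+v\n", planOutput(false, 0, true))
	fmt.Printf("%+v\n", planOutput(false, 0, false))
}
```

The point of the design is that the assembly loop never has to grow its destination: either the size is known and overflow is an error, or the caller switches to the fallback path when space runs low.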

WojciechMula · 2022-04-21T10:41:56Z

> I think for now we should ditch the output reallocate logic.
>
> For blocks that have the frame-content-size set we can allocate exactly the output we need, and we would never need to extend the output. We must fail these if we exceed frame-content-size anyway.
>
> For decodes without frame content size we allocate the default and let it fall back once we have less than a full frame left on output. We can use d.o.lowMem to allocate at least maxCompressedBlockSizeAlloc when it is set to true.

Thank you very much for checking this. TBH I got stuck. Removing the reallocation logic simplifies the whole code a lot.

WojciechMula · 2022-04-21T11:37:04Z

There are significant regressions compared to the initial version. My guess is that this is because we now use the "precise" copy everywhere. Benchmark results from Ice Lake:

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            298.97       402.92       1.35x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        1080.81      1511.33      1.40x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         231.92       304.96       1.31x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           277.62       347.55       1.25x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         262.13       330.49       1.26x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          233.12       320.57       1.38x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             2307.14      2480.88      1.08x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3678.77      4256.91      1.16x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       7777.68      7687.15      0.99x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             408.93       478.49       1.17x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 829.92       905.19       1.09x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        422.25       496.58       1.18x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               310.70       379.80       1.22x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           1135.69      1156.62      1.02x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            253.92       309.99       1.22x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              301.65       365.59       1.21x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            265.22       270.16       1.02x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             241.13       315.82       1.31x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1778.35      1709.25      0.96x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          4308.74      4355.15      1.01x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          10892.78     10889.65     1.00x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                448.55       504.99       1.13x
BenchmarkDecoder_DecodeAll/html.zst-16                                    848.09       855.96       1.01x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           422.79       423.86       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      268.38       308.13       1.15x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      258.85       302.06       1.17x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       276.64       313.69       1.13x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         266.39       303.41       1.14x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          10876.64     10872.46     1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          292.36       301.26       1.03x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           410.18       419.76       1.02x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             659.68       662.11       1.00x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1314.22      1310.02      1.00x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1401.92      1407.34      1.00x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1210.26      1213.66      1.00x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 450.06       452.47       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 318.46       321.59       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 198.23       199.56       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  198.42       200.09       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    194.27       194.29       1.00x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       528.34       534.18       1.01x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       495.02       496.26       1.00x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        537.33       541.58       1.01x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          469.75       474.14       1.01x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         10875.56     10874.68     1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         288.77       297.37       1.03x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          415.44       426.39       1.03x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            658.64       660.88       1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1881.77      1901.21      1.01x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1705.56      1713.97      1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     2078.31      2093.55      1.01x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1759.75      1769.21      1.01x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     10873.65     10876.39     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     10888.91     10886.96     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      10879.28     10887.44     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        10890.26     10874.18     1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     2545.69      3023.27      1.19x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     2289.24      2975.88      1.30x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      2641.03      3138.38      1.19x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        2464.36      2991.30      1.21x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         97285.92     97543.96     1.00x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         2490.63      2238.33      0.90x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          3479.79      3533.35      1.02x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            5463.08      5490.45      1.01x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             8145.22      8710.44      1.07x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             7677.89      8234.17      1.07x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              8222.46      7182.99      0.87x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                4651.80      4730.17      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                2416.21      2375.82      0.98x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1668.39      1739.80      1.04x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1772.96      1815.81      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   1550.65      1572.64      1.01x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      3578.07      3644.92      1.02x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      3616.48      2850.02      0.79x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       3752.95      3823.40      1.02x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         3384.83      3573.37      1.06x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        96364.71     92037.59     0.96x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        2532.67      2585.50      1.02x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         3524.44      3574.40      1.01x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           5459.39      5494.19      1.01x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   14122.42     13928.00     0.99x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   13864.08     13946.07     1.01x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    16849.07     16553.61     0.98x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      15335.58     15184.70     0.99x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    95783.03     97260.35     1.02x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    96839.55     97609.78     1.01x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     97174.75     97191.77     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       96662.30     97199.36     1.01x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       2622.77      3620.98      1.38x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   8443.45      8086.97      0.96x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    2039.41      3032.86      1.49x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      2424.77      3648.18      1.50x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    2185.83      2092.32      0.96x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1949.10      2987.74      1.53x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        14493.71     15985.46     1.10x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  31042.31     30409.04     0.98x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  97910.07     98384.66     1.00x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        3946.60      5134.47      1.30x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            6267.43      6056.68      0.97x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   3133.55      3127.34      1.00x

klauspost · 2022-04-21T12:44:59Z

@WojciechMula We need to make a few tweaks to allocations. Right now the asm version will not be used in some cases; the decoder basically falls back.

I can take over from here, if you'd like, so we can use it in the majority now.

I also see that with the changes we can now use the simple copying, but I will need to make some changes to safely enable it.

Should we try to merge now and do the improvements as separate PR(s)?

WojciechMula · 2022-04-21T13:01:35Z

@klauspost If you are happy with the current shape, we can merge the PR and improve the code in other PRs.

I'd be happy if you could just point out where to start; I'd like to do some of the coding and testing myself. I guess you already have another job. :)

klauspost · 2022-04-21T13:13:43Z

@WojciechMula ok, yeah, the outline of what needs to be done:

decodeSyncSimple needs to know whether we are trying to hit a specific FrameContentSize. The destination should be allocated accordingly. If we are, we can use it, provided we have enough space to hit that size. If it runs out of space, we know we have also exceeded the FrameContentSize and we can just return an error.

We should build a version of sequenceDecs_decodeSync_safe_xxx that does precise output copying. This will only be used if we are decoding a block and we don't have enough space for a full block and extra overhead. Otherwise we always have to allocate 128KB+16B for every block, which is rather bad.

With these two things in place we can always decode using asm when we have FrameContentSize, and most of the time when we don't, as long as we have a maxBlock of output space available.
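
One way to read the outline above is as a path selection based on the free output space. The sketch below is only an illustration of that reading, not code from the PR: `choosePath`, the `decodePath` constants, `overAlloc`, and the use of `-1` for "no FrameContentSize" are all hypothetical names and conventions. The 128KB and 16B figures come from the comment above.

```go
package main

import "fmt"

const (
	maxBlockSize = 128 << 10 // 128KB, taken from the comment above
	overAlloc    = 16        // extra 16B of slack, taken from the comment above
)

type decodePath int

const (
	pathFastAsm    decodePath = iota // asm, needs a full block plus slack
	pathSafeAsm                      // asm with precise output copying
	pathGoFallback                   // plain Go decoder
)

// choosePath picks a decode path (hypothetical helper).
// free is the unused capacity of the output buffer.
// fcsRemaining is the number of bytes still expected when the frame
// declares a FrameContentSize, or -1 when it does not.
func choosePath(free, fcsRemaining int) decodePath {
	switch {
	case free >= maxBlockSize+overAlloc:
		// Room for a whole block and the slack: the fast version is fine.
		return pathFastAsm
	case fcsRemaining >= 0 && free >= fcsRemaining:
		// The buffer was sized for the frame; running out of space would
		// mean the frame exceeds its declared size, which is an error.
		return pathSafeAsm
	case fcsRemaining < 0 && free >= maxBlockSize:
		// No declared size, but a full block still fits.
		return pathSafeAsm
	default:
		return pathGoFallback
	}
}

func main() {
	fmt.Println(choosePath(maxBlockSize+overAlloc, -1) == pathFastAsm) // true
	fmt.Println(choosePath(4096, 4096) == pathSafeAsm)                 // true
	fmt.Println(choosePath(4096, -1) == pathGoFallback)                // true
}
```

Under this reading, the fallback is only reached without a FrameContentSize and with less than one maxBlock of space, or when the destination was not sized for the declared frame size.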

WojciechMula · 2022-04-21T13:31:28Z

@klauspost Thank you, that's clear.

klauspost · 2022-04-21T13:45:30Z

The changes are not super trivial. Let me see what I can cook up.

WojciechMula · 2022-04-21T13:58:07Z

> The changes are not super trivial. Let me see what I can cook up.

Don't worry, I worked with legacy C++ code. :) Working with Go code is easy.

klauspost mentioned this pull request Apr 4, 2022

zstd: Store previous offsets in registers #548

Merged

WojciechMula added 7 commits April 8, 2022 13:45

[skip ci] zstd: asm version of decodeSync

9a8de84

A little hacking in current generator allowed to reuse almost all code. That's nice!

[skip ci] Few fixes, still some tests fail

11a94cf

Seems we'll need to refactor generator not to get lost in an if-maze.

[skip ci] Resize out if needed

13ed3e4

[skip ci] Forgot we didn't proceess the PR with history support...

7044c01

Revert for a while

dba9cee

Reapplying changes

2177891

WojciechMula force-pushed the asm-seqdec-decode-sync branch from 18e29ce to 2694088 Compare April 10, 2022 19:14

WojciechMula added 9 commits April 11, 2022 21:18

[skip ci] A little success - avo compiles the new code

5ac387b

I reclaimed three registers from adjustOffsets method, by reintroducing its old version that didn't cache values in registers. There are more and more if statements...

[skip ci] Use a register for outBase

6d9cfd0

[skip ci] Use a register for outPosition

1096f53

[skip ci] Use a register for literals

3f55abb

[skip ci] simplify helper functions

8405471

[skip ci] Fixed a few omissions

ef4f8ce

[skip ci] Add missing/wrong sanity checks

5cf31ea

[skip ci] Fixed another tests

b3ad687

[skip ci] More fixes to retrying code

7273347

go test -run TestDecoder pass, multiframe tests still fail.

klauspost reviewed Apr 20, 2022

zstd/_generate/gen.go (outdated)

zstd/seqdec_amd64.go (outdated)

[skip ci] Apply Klaus' fixes

5d8c6e5

WojciechMula added 2 commits April 21, 2022 13:05

Remove reallocation logic from asm implementation

df645f6

Fix error reported by go vet

5c4a187

WojciechMula marked this pull request as ready for review April 21, 2022 12:58

klauspost merged commit 4a51c29 into klauspost:master Apr 21, 2022

WojciechMula deleted the asm-seqdec-decode-sync branch April 21, 2022 13:23
