zstd: extend executeSimple with history support #542

Merged (5 commits) on Apr 4, 2022

Contributor

@WojciechMula WojciechMula commented Mar 24, 2022

Add history support in the asm implementation. Part of task #515.

As usual, I'm marking this as a draft because of a few failing tests. [fixed]

The performance comparison between the current master and this branch, run on an Ice Lake machine with the hacked decodeSync, is below. There are some nice speedups, but there are also regressions in almost all cases without history. To overcome that, we could have two specialisations: executeWithoutDictionary and executeWithoutDictionaryAndHistory. What do you think?

benchmark                                                                 old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            3819656       3185382       -16.61%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        687254        683309        -0.57%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         15227858      10251054      -32.68%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           10867594      7523672       -30.77%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         2261701       2320133       +2.58%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          3265967       3064706       -6.16%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1177235       1232534       +4.70%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       206482        207584        +0.53%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       127624        126746        -0.69%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             12893835      9988261       -22.53%
BenchmarkDecoder_DecoderSmall/html.zst-16                                 703112        755146        +7.40%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        69432         70033         +0.87%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               403208        405417        +0.55%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           83419         83828         +0.49%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            1156823       1164450       +0.66%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              860108        872507        +1.44%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            297202        303211        +2.02%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             373296        380265        +1.87%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                224877        228995        +1.83%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          21549         21690         +0.65%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          11303         11298         -0.04%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                1097797       1105211       +0.68%
BenchmarkDecoder_DecodeAll/html.zst-16                                    90909         93460         +2.81%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           8817          8797          -0.23%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      926242        940397        +1.53%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      895830        905638        +1.09%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       875836        863300        -1.43%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         927245        950947        +2.56%
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9188          9192          +0.04%
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          223014        218049        -2.23%
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           185555        185699        +0.08%
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             150550        153267        +1.80%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              3701          3366          -9.05%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              3487          2965          -14.97%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               4076          3944          -3.24%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 11714         10794         -7.85%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 4748          4653          -2.00%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 7346          7346          +0.00%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  7374          7364          -0.14%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    7802          7340          -5.92%
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       59084         59315         +0.39%
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       59571         60995         +2.39%
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        56208         57822         +2.87%
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          63838         62055         -2.79%
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9188          9244          +0.61%
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         222181        219850        -1.05%
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          183699        183255        -0.24%
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            150052        152918        +1.91%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    27367         28237         +3.18%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    32047         31990         -0.18%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     25215         25213         -0.01%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       33244         32875         -1.11%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9194          9213          +0.21%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9184          9177          -0.08%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9174          9177          +0.03%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9226          9211          -0.16%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     124135        126397        +1.82%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     127338        129774        +1.91%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      116233        117785        +1.34%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        124661        124190        -0.38%
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         1045          1032          -1.24%
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         35303         35490         +0.53%
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          28175         28136         -0.14%
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            18557         19057         +2.69%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             688           619           -10.05%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             712           753           +5.76%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              621           626           +0.81%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                984           991           +0.68%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                940           930           -1.09%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1398          1377          -1.50%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1365          1343          -1.61%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   1608          1601          -0.44%
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      12915         13114         +1.54%
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      13258         13571         +2.36%
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       12649         12879         +1.82%
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         13239         13779         +4.08%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        1044          1068          +2.30%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        35834         35656         -0.50%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         27434         27765         +1.21%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           18431         18744         +1.70%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   4915          4912          -0.06%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   5031          4972          -1.17%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    4317          4340          +0.53%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      3984          4006          +0.55%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    1035          1033          -0.19%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    1035          1068          +3.19%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     1036          1032          -0.39%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       1046          1058          +1.15%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       67482         68458         +1.45%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   19201         19512         +1.62%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    197985        200836        +1.44%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      152815        153857        +0.68%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    47876         48784         +1.90%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     67123         68004         +1.31%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        58916         59247         +0.56%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  7954          8082          +1.61%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  1269          1290          +1.65%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        178699        176447        -1.26%
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            19159         19408         +1.30%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1988          1998          +0.50%

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            386.05       462.91       1.20x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        1380.43      1388.40      1.01x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         253.15       376.05       1.49x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           314.15       453.77       1.44x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         442.78       431.63       0.97x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          372.54       397.01       1.07x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             2783.47      2658.59      0.96x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3967.42      3946.36      0.99x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       7715.99      7769.40      1.01x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             435.61       562.33       1.29x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 1165.11      1084.82      0.93x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        469.64       465.61       0.99x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               457.13       454.64       0.99x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           1421.59      1414.67      1.00x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            416.54       413.81       0.99x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              496.16       489.11       0.99x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            421.19       412.85       0.98x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             407.42       399.96       0.98x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1821.44      1788.69      0.98x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          4751.98      4721.03      0.99x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          10890.36     10895.22     1.00x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                639.54       635.25       0.99x
BenchmarkDecoder_DecodeAll/html.zst-16                                    1126.40      1095.66      0.97x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           462.27       463.36       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      418.86       412.55       0.98x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      433.08       428.39       0.99x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       442.96       449.40       1.01x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         418.40       407.98       0.98x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          10883.57     10879.92     1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          448.42       458.63       1.02x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           538.94       538.52       1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             664.25       652.48       0.98x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1111.98      1222.64      1.10x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1180.47      1388.41      1.18x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1009.87      1043.58      1.03x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 351.37       381.32       1.09x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 326.01       332.71       1.02x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 210.73       210.73       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  209.93       210.21       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    198.41       210.90       1.06x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       752.78       749.85       1.00x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       746.62       729.19       0.98x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        791.29       769.20       0.97x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          696.71       716.74       1.03x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         10883.72     10818.13     0.99x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         450.10       454.87       1.01x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          544.39       545.70       1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            666.46       653.97       0.98x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1870.83      1813.22      0.97x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1597.68      1600.52      1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     2030.51      2030.70      1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1540.11      1557.41      1.01x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     10876.68     10854.61     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     10888.55     10897.00     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      10900.78     10897.23     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        10838.94     10857.06     1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     3125.33      3069.41      0.98x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     3046.72      2989.53      0.98x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      3337.82      3293.84      0.99x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        3112.14      3123.96      1.00x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         95716.88     96892.28     1.01x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         2832.71      2817.79      0.99x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          3549.41      3554.26      1.00x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            5388.87      5247.50      0.97x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             5984.25      6652.65      1.11x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5783.78      5469.36      0.95x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              6627.33      6574.29      0.99x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                4183.29      4154.88      0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1646.95      1664.97      1.01x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1107.64      1123.95      1.01x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1134.36      1152.28      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   962.67       967.12       1.00x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      3443.93      3391.58      0.98x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      3354.65      3277.40      0.98x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       3516.30      3453.57      0.98x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         3359.52      3227.77      0.96x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        95765.44     93621.55     0.98x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        2790.76      2804.67      1.00x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         3645.23      3601.82      0.99x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           5425.74      5335.08      0.98x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   10417.13     10422.70     1.00x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   10176.13     10296.83     1.01x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    11860.57     11796.01     0.99x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      12851.09     12779.36     0.99x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    96638.52     96769.88     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    96660.04     93623.18     0.97x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     96528.12     96886.29     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       95635.43     94530.39     0.99x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       2731.40      2692.47      0.99x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   6176.03      6077.82      0.98x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    2433.82      2399.27      0.99x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      2792.61      2773.71      0.99x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    2614.65      2565.97      0.98x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     2265.83      2236.47      0.99x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        6952.21      6913.46      0.99x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  12874.36     12669.91     0.98x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  96991.66     95392.25     0.98x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        3928.89      3979.03      1.01x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            5344.75      5276.27      0.99x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   2050.37      2039.73      0.99x
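
To make the two-specialisation idea from the description concrete, here is a minimal pure-Go sketch of sequence execution with and without history. It is not the asm code from this PR. The names `seq`, `copyMatch`, `executeNoHistory`, `executeWithHistory` and `execute` are hypothetical, dictionaries are ignored, and offsets are assumed to be valid (a real decoder has to check them).

```go
package main

import "fmt"

// seq is a hypothetical, simplified sequence: copy litLen literal bytes,
// then copy matchLen bytes starting offset bytes back in the output.
type seq struct {
	litLen, matchLen, offset int
}

// copyMatch appends matchLen bytes starting offset bytes before the end of out.
// It copies byte by byte so overlapping matches (offset < matchLen) repeat correctly.
func copyMatch(out []byte, offset, matchLen int) []byte {
	start := len(out) - offset
	for i := 0; i < matchLen; i++ {
		out = append(out, out[start+i])
	}
	return out
}

// executeNoHistory assumes every offset points inside the current block's output.
func executeNoHistory(literals []byte, seqs []seq) []byte {
	var out []byte
	for _, s := range seqs {
		out = append(out, literals[:s.litLen]...)
		literals = literals[s.litLen:]
		out = copyMatch(out, s.offset, s.matchLen)
	}
	return append(out, literals...) // trailing literals
}

// executeWithHistory also resolves offsets that reach back past the start
// of the block, taking those bytes from the tail of hist.
func executeWithHistory(hist, literals []byte, seqs []seq) []byte {
	var out []byte
	for _, s := range seqs {
		out = append(out, literals[:s.litLen]...)
		literals = literals[s.litLen:]
		matchLen := s.matchLen
		if over := s.offset - len(out); over > 0 {
			// The match starts `over` bytes before the block start.
			n := over
			if n > matchLen {
				n = matchLen
			}
			start := len(hist) - over
			out = append(out, hist[start:start+n]...)
			matchLen -= n
		}
		if matchLen > 0 {
			// The rest of the match lies in the current output.
			out = copyMatch(out, s.offset, matchLen)
		}
	}
	return append(out, literals...)
}

// execute picks the cheaper loop when there is no history to consult.
func execute(hist, literals []byte, seqs []seq) []byte {
	if len(hist) == 0 {
		return executeNoHistory(literals, seqs)
	}
	return executeWithHistory(hist, literals, seqs)
}

func main() {
	fmt.Println(string(execute(nil, []byte("ab"), []seq{{2, 4, 2}})))
	fmt.Println(string(execute([]byte("abc"), []byte("XY"), []seq{{1, 4, 3}})))
}
```

The history-free loop avoids the extra comparison and branch per sequence, which is the kind of overhead that separate specialisations would aim to remove.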

@klauspost
Owner

The "regressions" seem to be within the margin of error. Good job, I will go through it.

@WojciechMula
Contributor Author

Impact of commit 3a35124 (comparison with the previous commit from this branch).

I can't tell whether it's significantly better or worse, except in a few cases.

benchmark                                                                 old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            3275140       3197659       -2.37%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        688157        688801        +0.09%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         10174938      9977297       -1.94%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           7464937       7488839       +0.32%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         2315824       2322484       +0.29%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          2971254       2984982       +0.46%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1218236       1199092       -1.57%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       204441        203376        -0.52%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       126309        126411        +0.08%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             9711496       9721413       +0.10%
BenchmarkDecoder_DecoderSmall/html.zst-16                                 729087        731648        +0.35%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        69411         69318         -0.13%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               401512        404052        +0.63%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           83962         83687         -0.33%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            1155677       1168934       +1.15%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              860139        873312        +1.53%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            299327        301737        +0.81%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             375086        380545        +1.46%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                228614        228758        +0.06%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          21867         21759         -0.49%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          11299         11300         +0.01%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                1107666       1099319       -0.75%
BenchmarkDecoder_DecodeAll/html.zst-16                                    93401         93670         +0.29%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           8788          8786          -0.02%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      933956        940400        +0.69%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      904977        908869        +0.43%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       857884        864611        +0.78%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         902840        959689        +6.30%
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9180          9194          +0.15%
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          218821        217876        -0.43%
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           186824        186077        -0.40%
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             152161        150956        -0.79%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              3326          3268          -1.74%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              3008          2954          -1.80%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               3917          3958          +1.05%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 10700         10746         +0.43%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 4753          4665          -1.85%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 7352          7277          -1.02%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  7367          7333          -0.46%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    7250          7770          +7.17%
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       59693         60146         +0.76%
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       61358         61251         -0.17%
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        57739         57833         +0.16%
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          62846         61222         -2.58%
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9191          9209          +0.20%
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         220163        219331        -0.38%
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          183516        182773        -0.40%
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            151235        152682        +0.96%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    27742         27312         -1.55%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    31985         31819         -0.52%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     25521         25129         -1.54%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       33093         32938         -0.47%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9187          9186          -0.01%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9181          9184          +0.03%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9176          9178          +0.02%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9202          9210          +0.09%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     125946        124304        -1.30%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     129010        128391        -0.48%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      118192        117074        -0.95%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        123044        123888        +0.69%
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         1033          1036          +0.29%
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         35850         35520         -0.92%
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          28095         27881         -0.76%
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            18661         18632         -0.16%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             614           717           +16.80%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             725           655           -9.70%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              687           639           -7.00%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                983           985           +0.15%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                925           932           +0.70%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1362          1372          +0.73%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1351          1346          -0.37%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   1591          1586          -0.31%
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      13069         13083         +0.11%
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      13489         13615         +0.93%
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       12777         12862         +0.67%
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         13337         13475         +1.03%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        1036          1032          -0.39%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        35567         35691         +0.35%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         27402         27815         +1.51%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           18755         18689         -0.35%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   4964          4908          -1.13%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   5051          5038          -0.26%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    4404          4347          -1.29%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      4081          4033          -1.18%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    1040          1031          -0.87%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    1034          1038          +0.39%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     1034          1038          +0.39%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       1041          1049          +0.77%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       67962         68051         +0.13%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   19570         19321         -1.27%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    199859        197860        -1.00%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      151725        153752        +1.34%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    48884         48180         -1.44%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     67054         67496         +0.66%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        59590         58551         -1.74%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  8079          8055          -0.30%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  1266          1262          -0.32%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        179567        179464        -0.06%
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            19411         19355         -0.29%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1996          1985          -0.55%

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            450.23       461.14       1.02x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        1378.62      1377.33      1.00x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         378.86       386.37       1.02x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           457.34       455.88       1.00x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         432.43       431.19       1.00x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          409.49       407.61       1.00x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             2689.79      2732.74      1.02x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       4007.02      4028.00      1.01x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       7796.32      7789.99      1.00x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             578.36       577.77       1.00x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 1123.60      1119.66      1.00x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        469.78       470.41       1.00x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               459.06       456.18       0.99x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           1412.40      1417.05      1.00x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            416.95       412.22       0.99x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              496.15       488.66       0.98x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            418.20       414.86       0.99x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             405.48       399.66       0.99x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1791.67      1790.54      1.00x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          4682.77      4706.12      1.00x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          10894.56     10892.92     1.00x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                633.84       638.66       1.01x
BenchmarkDecoder_DecodeAll/html.zst-16                                    1096.34      1093.20      1.00x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           463.80       463.91       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      415.40       412.55       0.99x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      428.70       426.86       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       452.23       448.72       0.99x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         429.71       404.26       0.94x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          10893.36     10876.76     1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          457.01       458.99       1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           535.28       537.43       1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             657.22       662.46       1.01x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1237.51      1259.51      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1368.29      1393.13      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1050.82      1039.90      0.99x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 384.69       383.03       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 325.67       331.85       1.02x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 210.55       212.73       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  210.12       211.09       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    213.53       199.22       0.93x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       745.09       739.48       0.99x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       724.88       726.15       1.00x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        770.31       769.06       1.00x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          707.72       726.49       1.03x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         10880.58     10859.80     1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         454.22       455.95       1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          544.93       547.14       1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            661.24       654.98       0.99x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1845.59      1874.61      1.02x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1600.73      1609.12      1.01x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     2006.19      2037.47      1.02x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1547.17      1554.45      1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     10884.82     10886.19     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     10892.63     10889.08     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      10898.49     10895.68     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        10867.64     10858.60     1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     3080.40      3121.08      1.01x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     3007.25      3021.74      1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      3282.49      3313.83      1.01x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        3153.04      3131.57      0.99x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         96806.80     96550.57     1.00x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         2789.45      2815.42      1.01x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          3559.46      3586.81      1.01x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            5358.99      5367.39      1.00x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             6708.35      5743.08      0.86x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5677.13      6287.12      1.11x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              5992.62      6444.28      1.08x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                4186.41      4179.85      1.00x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1673.25      1661.70      0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1136.67      1127.95      0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1145.72      1149.76      1.00x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   972.79       976.17       1.00x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      3403.28      3399.65      1.00x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      3297.31      3266.68      0.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       3481.07      3457.98      0.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         3334.76      3300.76      0.99x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        96573.18     96921.38     1.00x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        2811.68      2801.87      1.00x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         3649.51      3595.32      0.99x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           5332.21      5350.86      1.00x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   10314.11     10431.19     1.01x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   10136.59     10162.19     1.00x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    11624.63     11779.27     1.01x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      12544.69     12695.27     1.01x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    96189.69     96984.07     1.01x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    96669.58     96320.20     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     96674.84     96374.67     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       96030.92     95325.25     0.99x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       2712.09      2708.56      1.00x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   6059.77      6137.90      1.01x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    2411.01      2435.36      1.01x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      2812.68      2775.60      0.99x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    2560.74      2598.17      1.01x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     2268.14      2253.30      0.99x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        6873.66      6995.65      1.02x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  12674.81     12712.86     1.00x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  97258.21     97565.20     1.00x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        3909.88      3912.14      1.00x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            5275.23      5290.63      1.00x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   2041.86      2053.73      1.01x

@WojciechMula WojciechMula marked this pull request as ready for review March 24, 2022 11:40
@klauspost
Owner

One tricky point is that we cannot over-read from history, since the read may run past the end of the current memory page.

So we need a "precise" memory copier. I have one for s2 which can be adapted. There it is used for sizes 1 -> 64, but it could handle 0 -> 16 so it can be called whenever there are fewer than 16 bytes left.

Untested, but this should be something like it:

// genMemMoveShort emits a copy of length bytes from src to dst.
// src and dst must not overlap.
// No registers are updated.
// length must be in the range 0 -> 16 bytes.
func genMemMoveShort(name string, dst, src, length reg.GPVirtual, end LabelRef) {
	Comment("genMemMoveShort")
	AX, CX := GP64(), GP64()
	name += "_memmove_"

	// Only enable if length can be 0.
	if true {
		TESTQ(length, length)
		JEQ(end)
	}

	CMPQ(length, U8(3))
	JB(LabelRef(name + "move_1or2"))
	JE(LabelRef(name + "move_3"))
	CMPQ(length, U8(8))
	JB(LabelRef(name + "move_4through7"))

	//Label(name + "move_8through16")
	MOVQ(Mem{Base: src}, AX)
	MOVQ(Mem{Base: src, Disp: -8, Index: length, Scale: 1}, CX)
	MOVQ(AX, Mem{Base: dst})
	MOVQ(CX, Mem{Base: dst, Disp: -8, Index: length, Scale: 1})
	JMP(end)

	Label(name + "move_1or2")
	MOVB(Mem{Base: src}, AX.As8())
	MOVB(Mem{Base: src, Disp: -1, Index: length, Scale: 1}, CX.As8())
	MOVB(AX.As8(), Mem{Base: dst})
	MOVB(CX.As8(), Mem{Base: dst, Disp: -1, Index: length, Scale: 1})
	JMP(end)

	Label(name + "move_3")
	MOVW(Mem{Base: src}, AX.As16())
	MOVB(Mem{Base: src, Disp: 2}, CX.As8())
	MOVW(AX.As16(), Mem{Base: dst})
	MOVB(CX.As8(), Mem{Base: dst, Disp: 2})
	JMP(end)

	Label(name + "move_4through7")
	MOVL(Mem{Base: src}, AX.As32())
	MOVL(Mem{Base: src, Disp: -4, Index: length, Scale: 1}, CX.As32())
	MOVL(AX.As32(), Mem{Base: dst})
	MOVL(CX.As32(), Mem{Base: dst, Disp: -4, Index: length, Scale: 1})
	JMP(end)
}
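
For readers less familiar with the overlapping load/store trick used above, here is a minimal pure-Go sketch of the same branch layout. It is only a model under stated assumptions, not the generated assembly: `copyShort` is a hypothetical name, and the slices are assumed to be exactly `n` bytes long so that Go's bounds checks would catch any read or write outside the requested range.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// copyShort is a hypothetical pure-Go model of the branches in
// genMemMoveShort. It copies exactly n bytes (0 <= n <= 16) from src to dst
// and never touches memory outside src[:n] or dst[:n].
// Both loads happen before both stores, as in the assembly version.
func copyShort(dst, src []byte, n int) {
	switch {
	case n == 0:
		return
	case n < 3: // 1 or 2 bytes: first and last byte (the same byte when n == 1)
		a, b := src[0], src[n-1]
		dst[0], dst[n-1] = a, b
	case n == 3: // one 16-bit load plus one byte load
		a := binary.LittleEndian.Uint16(src)
		b := src[2]
		binary.LittleEndian.PutUint16(dst, a)
		dst[2] = b
	case n < 8: // 4..7 bytes: two 32-bit loads that may overlap
		a := binary.LittleEndian.Uint32(src)
		b := binary.LittleEndian.Uint32(src[n-4:])
		binary.LittleEndian.PutUint32(dst, a)
		binary.LittleEndian.PutUint32(dst[n-4:], b)
	default: // 8..16 bytes: two 64-bit loads that may overlap
		a := binary.LittleEndian.Uint64(src)
		b := binary.LittleEndian.Uint64(src[n-8:])
		binary.LittleEndian.PutUint64(dst, a)
		binary.LittleEndian.PutUint64(dst[n-8:], b)
	}
}

func main() {
	data := []byte("0123456789abcdef")
	for n := 0; n <= len(data); n++ {
		// Slices of exactly n bytes: any over-read or over-write would panic.
		dst := make([]byte, n)
		copyShort(dst, data[:n], n)
		if string(dst) != string(data[:n]) {
			panic(fmt.Sprintf("mismatch for n=%d", n))
		}
	}
	fmt.Println("all lengths 0..16 copied exactly")
}
```

The point of the second load in each branch is that it is anchored to the end of the range (`n-4`, `n-8`), so the two accesses together cover every byte without going past either end of the buffer.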

@WojciechMula
Contributor Author

Right, I will change it.

@WojciechMula WojciechMula force-pushed the asm-seqdec-execute-history branch from 3a35124 to 83ad8f9 Compare March 31, 2022 07:33
@WojciechMula
Contributor Author

WojciechMula commented Mar 31, 2022

OK, I used another memory copy routine for history. Updated the main issue with current timings. Still nice improvements for cases with history.

Owner

@klauspost klauspost left a comment


Let's take this out. Then I don't see any problems with merging.

zstd/seqdec.go (outdated, resolved)
Co-authored-by: Klaus Post <klauspost@gmail.com>
@klauspost klauspost merged commit c4e0096 into klauspost:master Apr 4, 2022
@WojciechMula WojciechMula deleted the asm-seqdec-execute-history branch April 7, 2022 08:35