zstd: add arm64 xxhash assembly #464

lizthegrey · 2022-01-09T03:38:57Z

benchmark                                                    old ns/op      new ns/op      delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               14740529       14093294       -4.39%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           3003737        3008035        +0.14%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            70052218       70885931       +1.19%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              33424884       34714144       +3.86%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            9694846        9563735        -1.35%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             12322865       12782681       +3.73%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                7290319        7134523        -2.14%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          825251         816683         -1.04%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          462903         486823         +5.17%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                1333159120     1114784390     -16.38%
BenchmarkDecoder_DecoderSmall/html.zst-16                    3117950        3095118        -0.73%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           272085         271508         -0.21%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  1411294        1407632        -0.26%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              370870         367499         -0.91%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               4721330        4718339        -0.06%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 3517766        3487756        -0.85%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               1186672        1180367        -0.53%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                1498383        1502922        +0.30%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   748113         742537         -0.75%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             99132          98206          -0.93%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             53805          53209          -1.11%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   4041818        4028347        -0.33%
BenchmarkDecoder_DecodeAll/html.zst-16                       387794         383309         -1.16%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              34390          34296          -0.27%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          89510          88785          -0.81%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      23315          23128          -0.80%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       306437         325176         +6.12%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         226433         222589         -1.70%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       74612          74182          -0.58%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        95066          94304          -0.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           47498          46946          -1.16%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     6291           6237           -0.86%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     3498           3453           -1.29%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           295850         318916         +7.80%
BenchmarkDecoder_DecodeAllParallel/html.zst-16               24340          24196          -0.59%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      2220           2199           -0.95%

benchmark                                                    old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               100.03       104.63       1.05x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           315.84       315.39       1.00x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            55.03        54.38        0.99x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              102.14       98.35        0.96x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            103.30       104.71       1.01x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             98.74        95.18        0.96x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                449.47       459.29       1.02x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          992.67       1003.08      1.01x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          2127.32      2022.80      0.95x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                4.21         5.04         1.20x
BenchmarkDecoder_DecoderSmall/html.zst-16                    262.74       264.67       1.01x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           119.84       120.10       1.00x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  130.60       130.94       1.00x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              319.76       322.69       1.01x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               102.06       102.13       1.00x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 121.31       122.36       1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               105.49       106.05       1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                101.50       101.20       1.00x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   547.51       551.62       1.01x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             1032.97      1042.70      1.01x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             2287.77      2313.39      1.01x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   173.71       174.29       1.00x
BenchmarkDecoder_DecodeAll/html.zst-16                       264.06       267.15       1.01x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              118.52       118.85       1.00x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          2059.22      2076.03      1.01x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      5086.37      5127.46      1.01x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       1572.47      1481.85      0.94x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         1884.68      1917.22      1.02x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       1677.73      1687.46      1.01x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        1599.82      1612.75      1.01x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           8623.46      8724.94      1.01x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     16276.14     16418.01     1.01x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     35194.18     35648.95     1.01x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           2373.12      2201.48      0.93x
BenchmarkDecoder_DecodeAllParallel/html.zst-16               4207.06      4232.02      1.01x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      1835.78      1853.29      1.01x

benchmark                                                    old allocs     new allocs     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               1              3              +200.00%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           1              1              +0.00%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            4              5              +25.00%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              1              1              +0.00%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            1              1              +0.00%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             1              1              +0.00%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                1              1              +0.00%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          1              1              +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          1              1              +0.00%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                52             50             -3.85%
BenchmarkDecoder_DecoderSmall/html.zst-16                    1              1              +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           1              1              +0.00%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  0              0              +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              0              0              +0.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               0              0              +0.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 0              0              +0.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               0              0              +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                0              0              +0.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   0              0              +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             0              0              +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             0              0              +0.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   0              0              +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16                       0              0              +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/html.zst-16               0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      0              0              +0.00%

benchmark                                                    old bytes     new bytes     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               89288         113942        +27.61%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           10888         12311         +13.07%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            981385        1492955       +52.13%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              385552        367000        -4.81%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            30752         30503         -0.81%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             108837        56868         -47.75%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                75923         97336         +28.20%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          48            48            +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          1964          1909          -2.80%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                25456880      25432640      -0.10%
BenchmarkDecoder_DecoderSmall/html.zst-16                    48            48            +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           51            48            -5.88%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  0             0             +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              5             0             -100.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               2             0             -100.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 26            0             -100.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               0             0             +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                10            0             -100.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   0             0             +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             0             0             +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             1             0             -100.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   0             0             +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16                       0             0             +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              0             0             +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          234           231           -1.28%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      38            38            +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       1991          2287          +14.87%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         1501          1298          -13.52%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       132           130           -1.52%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        214           220           +2.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           259           258           -0.39%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     9             9             +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     6             6             +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           2868          3358          +17.09%
BenchmarkDecoder_DecodeAllParallel/html.zst-16               34            34            +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      0             0             +0.00%

klauspost · 2022-01-09T09:50:47Z

@lizthegrey Thanks a bunch!

JoeUX · 2022-01-17T20:34:19Z

Is this a net benefit or speed-up? More of the test cases show a speed-up than a slowdown, but it's close, and the effect size distribution is asymmetric. There are more large slowdowns than large speed-ups (say >2%). I see two 4% slowdowns, one 5%, one 6%, and an 8%. For speed-ups, I see one 16%, a 4%, and two 2%. Most speed-ups are ≤1%.

The XXHASH project didn't accept this code because it didn't appear to offer a net speed-up, was too mixed. What's the benefit for compress? With the results so mixed, I'd weigh the different test cases based on contextual relevance and impact on the use case, in this case the Zstd implementation. Is that what you already did?

More broadly, it looks like performance improvement will require vectorization and other newer instructions. That's where assembly shines the most, especially given the fact that the Go compiler makes little use of autovectorization and newer instructions in general. I don't know if Armv8.x or 9 has bit manipulation instructions, but that's a good example of newer instructions that pay off, and aren't SIMD. (They're in Haswell and later, and the AMD counterparts.)

klauspost · 2022-01-18T08:51:37Z

@JoeUX What do you propose?

lizthegrey · 2022-01-18T08:57:11Z

In our empirical testing and profiling on a fairly substantial workload it was a net speed-up.

But I'm always happy to put in effort to further iterate and vectorize now that we have a version that we believe does the same thing the original Go code does.

JoeUX · 2022-01-19T00:30:28Z

@JoeUX What do you propose?

I don't propose anything. I'm just asking why you checked this in since there's no actual speed-up in the published tests. Code changes always carry risk, especially assembly, and normally we'd want to have reasons for a change.

JoeUX · 2022-01-19T00:33:55Z

In our empirical testing and profiling on a fairly substantial workload it was a net speed-up.

But I'm always happy to put in effort to further iterate and vectorize now that we have a version that we believe does the same thing the original Go code does.

Was this testing published somewhere else? The ones published here show more chunky slowdowns than speed-ups.

lizthegrey · 2022-01-19T00:35:33Z

In our empirical testing and profiling on a fairly substantial workload it was a net speed-up.
But I'm always happy to put in effort to further iterate and vectorize now that we have a version that we believe does the same thing the original Go code does.

Was this testing published somewhere else? The ones published here show more chunky slowdowns than speed-ups.

No, sadly, this is the results of our production Kafka Sarama -> Klauspost Compress workload which we profiled, rather than something that can handily be digested into a separate benchmark.

klauspost · 2022-01-19T08:26:20Z

I don't propose anything. I'm just asking why you checked this in since there's no actual speed-up in the published tests. Code changes always carry risk, especially assembly, and normally we'd want to have reasons for a change.

I don't know who "we" are. The speedup is minor, but seems to be an overall gain. If "we" are concerned, you can use the build tags to exclude the assembly.

Unless you contribute your own benchmarks that prove the opposite, I consider this matter closed.

lizthegrey · 2022-01-19T09:00:09Z

I don't propose anything. I'm just asking why you checked this in since there's no actual speed-up in the published tests. Code changes always carry risk, especially assembly, and normally we'd want to have reasons for a change.

I don't know who "we" are. The speedup is minor, but seems to be an overall gain. If "we" are concerned, you can use the build tags to exclude the assembly.

Unless you contribute your own benchmarks that prove the opposite, I consider this matter closed.

I've solicited some help from the AWS Graviton team with further optimising this ASM. Even if there is a small regression on some practical workloads, my hope is that getting some mileage/confidence on this current ASM and then improving will help us make the changes incrementally rather than big bang all at once trying to both convert and optimise at the same time.

Our data does show a modest improvement, but hey, maybe we're not a representative workload :)

lizthegrey · 2022-01-25T21:53:25Z

See cespare/xxhash#51 (comment)

The issue is that no matter whether you're using ASM or native Go, on Neoverse N1 and on Cortex A72 (and older) you'll see bottlenecking on floating point units, but once that bottleneck is removed (Graviton3), the ASM is much faster. That definitely explains why benchmarking results are inconsistent on earlier hardware.

backport greatroar arm64 xxhash

65ccdff

klauspost approved these changes Jan 9, 2022

View reviewed changes

klauspost changed the title ~~feat: backport greatroar arm64 xxhash into zstd~~ zstd: add arm64 xxhash assembly Jan 9, 2022

klauspost merged commit 35a5ed5 into klauspost:master Jan 9, 2022

lizthegrey deleted the lizf.include-greatroar branch January 10, 2022 20:14

klauspost mentioned this pull request Oct 5, 2022

"illegal combination" error when used in gomobile bind #681

Closed

zekena2 mentioned this pull request Mar 21, 2023

Add compression for the file output plugin (prefereably zstd) influxdata/telegraf#12875

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zstd: add arm64 xxhash assembly #464

zstd: add arm64 xxhash assembly #464

lizthegrey commented Jan 9, 2022 •

edited

Loading

klauspost commented Jan 9, 2022

JoeUX commented Jan 17, 2022 •

edited

Loading

klauspost commented Jan 18, 2022

lizthegrey commented Jan 18, 2022

JoeUX commented Jan 19, 2022

JoeUX commented Jan 19, 2022

lizthegrey commented Jan 19, 2022

klauspost commented Jan 19, 2022

lizthegrey commented Jan 19, 2022

lizthegrey commented Jan 25, 2022 •

edited

Loading

zstd: add arm64 xxhash assembly #464

zstd: add arm64 xxhash assembly #464

Conversation

lizthegrey commented Jan 9, 2022 • edited Loading

klauspost commented Jan 9, 2022

JoeUX commented Jan 17, 2022 • edited Loading

klauspost commented Jan 18, 2022

lizthegrey commented Jan 18, 2022

JoeUX commented Jan 19, 2022

JoeUX commented Jan 19, 2022

lizthegrey commented Jan 19, 2022

klauspost commented Jan 19, 2022

lizthegrey commented Jan 19, 2022

lizthegrey commented Jan 25, 2022 • edited Loading

lizthegrey commented Jan 9, 2022 •

edited

Loading

JoeUX commented Jan 17, 2022 •

edited

Loading

lizthegrey commented Jan 25, 2022 •

edited

Loading