zstd: add arm64 xxhash assembly #464

Merged Jan 9, 2022 (1 commit)

Contributor

@lizthegrey lizthegrey commented Jan 9, 2022

see cespare/xxhash#51

benchmark                                                    old ns/op      new ns/op      delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               14740529       14093294       -4.39%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           3003737        3008035        +0.14%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            70052218       70885931       +1.19%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              33424884       34714144       +3.86%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            9694846        9563735        -1.35%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             12322865       12782681       +3.73%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                7290319        7134523        -2.14%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          825251         816683         -1.04%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          462903         486823         +5.17%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                1333159120     1114784390     -16.38%
BenchmarkDecoder_DecoderSmall/html.zst-16                    3117950        3095118        -0.73%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           272085         271508         -0.21%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  1411294        1407632        -0.26%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              370870         367499         -0.91%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               4721330        4718339        -0.06%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 3517766        3487756        -0.85%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               1186672        1180367        -0.53%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                1498383        1502922        +0.30%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   748113         742537         -0.75%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             99132          98206          -0.93%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             53805          53209          -1.11%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   4041818        4028347        -0.33%
BenchmarkDecoder_DecodeAll/html.zst-16                       387794         383309         -1.16%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              34390          34296          -0.27%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          89510          88785          -0.81%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      23315          23128          -0.80%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       306437         325176         +6.12%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         226433         222589         -1.70%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       74612          74182          -0.58%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        95066          94304          -0.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           47498          46946          -1.16%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     6291           6237           -0.86%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     3498           3453           -1.29%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           295850         318916         +7.80%
BenchmarkDecoder_DecodeAllParallel/html.zst-16               24340          24196          -0.59%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      2220           2199           -0.95%

benchmark                                                    old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               100.03       104.63       1.05x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           315.84       315.39       1.00x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            55.03        54.38        0.99x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              102.14       98.35        0.96x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            103.30       104.71       1.01x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             98.74        95.18        0.96x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                449.47       459.29       1.02x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          992.67       1003.08      1.01x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          2127.32      2022.80      0.95x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                4.21         5.04         1.20x
BenchmarkDecoder_DecoderSmall/html.zst-16                    262.74       264.67       1.01x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           119.84       120.10       1.00x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  130.60       130.94       1.00x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              319.76       322.69       1.01x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               102.06       102.13       1.00x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 121.31       122.36       1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               105.49       106.05       1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                101.50       101.20       1.00x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   547.51       551.62       1.01x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             1032.97      1042.70      1.01x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             2287.77      2313.39      1.01x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   173.71       174.29       1.00x
BenchmarkDecoder_DecodeAll/html.zst-16                       264.06       267.15       1.01x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              118.52       118.85       1.00x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          2059.22      2076.03      1.01x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      5086.37      5127.46      1.01x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       1572.47      1481.85      0.94x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         1884.68      1917.22      1.02x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       1677.73      1687.46      1.01x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        1599.82      1612.75      1.01x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           8623.46      8724.94      1.01x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     16276.14     16418.01     1.01x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     35194.18     35648.95     1.01x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           2373.12      2201.48      0.93x
BenchmarkDecoder_DecodeAllParallel/html.zst-16               4207.06      4232.02      1.01x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      1835.78      1853.29      1.01x

benchmark                                                    old allocs     new allocs     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               1              3              +200.00%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           1              1              +0.00%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            4              5              +25.00%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              1              1              +0.00%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            1              1              +0.00%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             1              1              +0.00%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                1              1              +0.00%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          1              1              +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          1              1              +0.00%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                52             50             -3.85%
BenchmarkDecoder_DecoderSmall/html.zst-16                    1              1              +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           1              1              +0.00%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  0              0              +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              0              0              +0.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               0              0              +0.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 0              0              +0.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               0              0              +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                0              0              +0.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   0              0              +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             0              0              +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             0              0              +0.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   0              0              +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16                       0              0              +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/html.zst-16               0              0              +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      0              0              +0.00%

benchmark                                                    old bytes     new bytes     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16               89288         113942        +27.61%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16           10888         12311         +13.07%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16            981385        1492955       +52.13%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16              385552        367000        -4.81%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16            30752         30503         -0.81%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16             108837        56868         -47.75%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                75923         97336         +28.20%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16          48            48            +0.00%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16          1964          1909          -2.80%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                25456880      25432640      -0.10%
BenchmarkDecoder_DecoderSmall/html.zst-16                    48            48            +0.00%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16           51            48            -5.88%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                  0             0             +0.00%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16              5             0             -100.00%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16               2             0             -100.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                 26            0             -100.00%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16               0             0             +0.00%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                10            0             -100.00%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                   0             0             +0.00%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16             0             0             +0.00%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16             1             0             -100.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                   0             0             +0.00%
BenchmarkDecoder_DecodeAll/html.zst-16                       0             0             +0.00%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16              0             0             +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16          234           231           -1.28%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16      38            38            +0.00%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16       1991          2287          +14.87%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16         1501          1298          -13.52%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16       132           130           -1.52%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16        214           220           +2.80%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16           259           258           -0.39%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16     9             9             +0.00%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16     6             6             +0.00%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16           2868          3358          +17.09%
BenchmarkDecoder_DecodeAllParallel/html.zst-16               34            34            +0.00%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16      0             0             +0.00%

@klauspost
Owner

@lizthegrey Thanks a bunch!

@klauspost changed the title from "feat: backport greatroar arm64 xxhash into zstd" to "zstd: add arm64 xxhash assembly" on Jan 9, 2022
@klauspost merged commit 35a5ed5 into klauspost:master on Jan 9, 2022
@lizthegrey deleted the lizf.include-greatroar branch on January 10, 2022 20:14
@JoeUX

JoeUX commented Jan 17, 2022

Is this a net benefit or speed-up? More of the test cases show a speed-up than a slowdown, but it's close, and the effect size distribution is asymmetric. There are more large slowdowns than large speed-ups (say >2%). I see two 4% slowdowns, one 5%, one 6%, and an 8%. For speed-ups, I see one 16%, a 4%, and two 2%. Most speed-ups are ≤1%.

The XXHASH project didn't accept this code because the results were too mixed to show a net speed-up. What's the benefit for compress? With results this mixed, I'd weigh the different test cases by contextual relevance and impact on the use case, in this case the Zstd implementation. Is that what you already did?

More broadly, it looks like performance improvement will require vectorization and other newer instructions. That's where assembly shines the most, especially given that the Go compiler makes little use of autovectorization and newer instructions in general. I don't know if Armv8.x or 9 has bit manipulation instructions, but that's a good example of newer instructions that pay off and aren't SIMD. (On x86, they're in Haswell and later, and the AMD counterparts.)

@klauspost
Owner

@JoeUX What do you propose?

@lizthegrey
Contributor Author

In our empirical testing and profiling on a fairly substantial workload it was a net speed-up.

But I'm always happy to put in effort to further iterate and vectorize now that we have a version that we believe does the same thing the original Go code does.

@JoeUX

JoeUX commented Jan 19, 2022

> @JoeUX What do you propose?

I don't propose anything. I'm just asking why you checked this in since there's no actual speed-up in the published tests. Code changes always carry risk, especially assembly, and normally we'd want to have reasons for a change.

@JoeUX

JoeUX commented Jan 19, 2022

> In our empirical testing and profiling on a fairly substantial workload it was a net speed-up.
>
> But I'm always happy to put in effort to further iterate and vectorize now that we have a version that we believe does the same thing the original Go code does.

Was this testing published somewhere else? The ones published here show more chunky slowdowns than speed-ups.

@lizthegrey
Contributor Author

> > In our empirical testing and profiling on a fairly substantial workload it was a net speed-up.
> > But I'm always happy to put in effort to further iterate and vectorize now that we have a version that we believe does the same thing the original Go code does.
>
> Was this testing published somewhere else? The ones published here show more chunky slowdowns than speed-ups.

No, sadly. These are the results of profiling our production Kafka Sarama -> Klauspost Compress workload, rather than something that can handily be distilled into a separate benchmark.

@klauspost
Owner

> I don't propose anything. I'm just asking why you checked this in since there's no actual speed-up in the published tests. Code changes always carry risk, especially assembly, and normally we'd want to have reasons for a change.

I don't know who "we" are. The speedup is minor, but seems to be an overall gain. If "we" are concerned, you can use the build tags to exclude the assembly.

Unless you contribute your own benchmarks that prove the opposite, I consider this matter closed.

@lizthegrey
Contributor Author

> > I don't propose anything. I'm just asking why you checked this in since there's no actual speed-up in the published tests. Code changes always carry risk, especially assembly, and normally we'd want to have reasons for a change.
>
> I don't know who "we" are. The speedup is minor, but seems to be an overall gain. If "we" are concerned, you can use the build tags to exclude the assembly.
>
> Unless you contribute your own benchmarks that prove the opposite, I consider this matter closed.

I've solicited some help from the AWS Graviton team with further optimising this ASM. Even if there is a small regression on some practical workloads, my hope is that getting some mileage and confidence on the current ASM first, and then improving it, will let us make changes incrementally rather than trying to both convert and optimise in one big-bang change.

Our data does show a modest improvement, but hey, maybe we're not a representative workload :)

@lizthegrey
Contributor Author

lizthegrey commented Jan 25, 2022

See cespare/xxhash#51 (comment)

The issue is that whether you're using the ASM or native Go, on Neoverse N1 and on Cortex A72 (and older) you'll see bottlenecking on the floating-point units, but once that bottleneck is removed (Graviton3), the ASM is much faster. That definitely explains why benchmarking results are inconsistent on earlier hardware.
