-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[huff0] Add x86 specialisation of Decode4X #512
Conversation
@WojciechMula Thanks for the great work. I will take a look and do some tests here as well. A few pre-review notes (ignoring what is reported by tests, assuming you'll fix that) Please make a For the Go version, I found that interleaving 2 streams would give better pipelining. You removed that from the non-asm version. I will test that there isn't a regression here. Please run asmfmt on the assembly. Is Try breaking dependency chains by interleaving operations more. Your assembly is pretty much "serial", making the cpu having to work hard to re-order your code. Zstd tests can be noisy. Checking here, literal decoding takes op 9.7% of cpu time in |
Numbers are looking good 👍🏼
(these are also the only ones that have tablelog >8) Some small regressions without asm:
|
Thank you for looking at this. Yeah, I restore that freshest version. Got lost with rebasing at some point.
Sure!
Yes, I checked also SHR and SHL and the noticed BMI was faster, maybe not significantly.
Sure, will do.
Thank you for so quick response! I'll look at the regression for the plain Go version. |
Pure "amd64" speed is extremely similar, so for now bmi doesn't seem worth it:
You can maybe have the entire My quickly thrown together code here: https://gist.github.com/klauspost/82d5c9b85c067d06d606f1c12c82615c |
I will do a more detailed review tomorrow. Obviously we need to fix the bugs. |
Without the extra "AND":
This also makes it "competitive" to replace that Only very small payloads (unlikely) are worse. |
Thank you for the review. I changed almost everything that you asked for. The only big thing is to reshuffle instructions (if it's possible). Perf results are similar on IceLake. |
You can just work on that later. The improvement is already significant. |
@klauspost I tried to interleave the operations as the Go code does. However, didn't notice any significant performance changes -- I'd rather say it's a noise. You can check what I did: https://github.com/WojciechMula/compress/tree/experiment. Any hints highly appreciate. :) |
Yeah, SSE <-> GPR usually is pretty slow. Don't worry about it, let's get it working with the current improvements, we can tweak later. |
Seems the file for other platforms is missing:
|
I also checked if interleaving decoding of two streams is profitable. It's not. Yeah, I learned recently that using the stack is way faster (and easier) than trying to keep temp values in SSE or AVX512 kregs. Hopefully, I fixed the build tags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should fix the tags.
huff0/decompress_generic.go
Outdated
//go:build noasm | ||
// +build noasm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//go:build noasm | |
// +build noasm | |
//go:build !amd64 || appengine || !gc || noasm | |
// +build !amd64 appengine !gc noasm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Fixed.
I have fuzz tested the current version, and it looks fine. 👍🏼 Once we get the build tags sorted, I will do a final benchmark and we can move to merging. |
Great! Thank you very much for checking this. And sorry for build tags problems, TBH I'm not familiar with them. So far never used them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏼 Merging when tests complete.
Let me squash the commits before. |
BenchmarkDecompress4XNoTable/gettysburg-32 593.83 681.84 1.15x BenchmarkDecompress4XNoTable/twain-32 491.42 680.16 1.38x BenchmarkDecompress4XNoTable/pngdata.001-32 718.28 870.23 1.21x
2a34ea9
to
20a106f
Compare
@klauspost OK, squshed the changes and added perf results to the commit message |
Hi, first of all, thank you for such a great library! I have been working on speeding up the Zstd decompression, mainly by porting hot loops into the assembly. This is the first PR, that's pretty small and I'd like to make it an opportunity to discuss code shape. Is it something acceptable, or not.
I'm marking it as a draft because not all tests in Zstd pass now; I branched some time ago and seems there are were some changes I have to investigate.
Any way, below is comparison of decompressing speed for Zstd after applying the patch. Benchmarks were run on an Ice Lake machine.