-
Notifications
You must be signed in to change notification settings - Fork 325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zstd: x86 assembler implementation of sequenceDecs.decode #528
Conversation
0dcdbd0
to
f90b72c
Compare
Looks great! Impressive numbers - is this comparing to a similar setup, ie not using "decodeSync"? I wouldn't mind a version for vbmi and one without. I guess the difference is measurable. Yes, I expect "execute" to be much better. Especially if we "overallocate" the destination by 7 bytes, so we can do non-overlapping copies in blocks of 8 bytes. |
From what I can gather there are no benchmarks that use async API, thus I had to hack
My experiments showed that if we copy in 32-byte blocks (i.e. using AVX2 registers) we're getting the biggest speedup. However, I optimized just the most common path: i.e. copy from literals + copy from s.out, and bail out to the Go code to handle more complex cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some quick questions.
Yes, it is only useful for streams, and in that case the I was actually thinking of making a microbench for
Benchmarks can be really hard to interpret. It could be that it doesn't like this change - but it could also just be something like jump alignment - it can easily vary by that much with seemingly unrelated changes. |
The modified one.
It would be great!
Yeah, I know. That Ice Lake machine I use is an AWS one, and sometimes exactly the same code run twice can be slower/faster for no clear reason. Thus I gave up any microoptimizations, as I couldn't tell if differences are due to my changes or the Moon's phase. BTW, speaking of loop alignment, I wanted to enforce it, but seems that is disabled right now for the x86 target: https://groups.google.com/g/golang-nuts/c/M86PTw1jl6w. |
The failed CI was due to a race test of |
Yeah, sometimes we get a very slow machine. This is just a timeout. Restarting the actions. |
@WojciechMula I have added benchmarks for decode, execute and decodeSync in #530 I hope it doesn't conflict too much. I will merge when tests pass. |
@klauspost Great! I'll rebase to the master when you merge the benchmarks and then squash commits. |
Pure AMD64:
With BMI:
BMI speedup alone:
So it seems like it is worth it to have runtime BMI detection. With only 2 sequences there is a small regressions. Good to know, but not worth special code IMO. |
Merged. The conflict should be trivial. Please remove the |
8a7fb0d
to
1f0baa2
Compare
These are results from Ice Lake (noasm vs GOAMD64=v3)
Do you think simple CPUID invocation just to detect BMI1/BMI2 would be sufficient? Or do you prefer to add to dependencies your library https://github.com/klauspost/cpuid? |
@klauspost What do you plan to do with this PR? Merge or amend with your |
@WojciechMula Let's put in the avo version, add CPU tests and remove the decodeSync hack. |
@klauspost OK, I'm doing this right now. |
Differences with the Go implementation: - check ml and mo in the main loop, - s.seqSize and litRemain are checked in the end.
1f0baa2
to
8ca6a66
Compare
The failed test can be ignored. I will add a check for it separately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff. Do you want to add anything else before I merge?
Thanks :) And thank you for such great support. I think there's nothing to add. |
This is plain x86 and x86 with BMI2 implementation of
sequenceDecs.decode
. Part of #515.Since the benchmarks use
decodeSync
I temporarily replaced its implementation with one usingdecode
andexecute
, at cost of allocation of theseqVals
array every time.There are some IMHO nice improvements and small regressions in few cases. From my previous experience can tell that we'll get quite big speedup when rewrite
execute
. And of course we'll get the biggest speedup when fusedecode
andexecute
into a single procedure.Marking PR as a draft as just one test[fixed]TestNewDecoderBad/Reader-4/6f88497edbc9059998f9e6d0ea0d0eed8d8af38d.zst
fails. Have to investigate why.Below are benchmarks.
old.txt
was produced by the commandgo generate && go test -tags noasm -run XYZ -bench BenchmarkDecoder
.new.txt
was produced by the commandgo generate && go test -run XYZ -bench BenchmarkDecoder
.new-bmi2.txt
was produced by the commandgo generate && GOAMD64=v3 go test -run XYZ -bench BenchmarkDecoder
.Comparison of
old.txt
withnew.txt
Comparison of
old.txt
withnew-bmi2.txt