-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zstd: asm version of decodeSync #545
zstd: asm version of decodeSync #545
Conversation
The current code appears stable and we can proceed with this one when you have the time. |
I'm fighting right now with avo, after rebasing to the master. I wish your PR mmcloughlin/avo#105 got merged. |
@WojciechMula Yeah, I squeezed out the last registers, and it can sometimes we quite hard to figure out where allocs are failing. If you need to free one or more registers, you can use:
This will allocate stack for the value. |
A little hacking in current generator allowed to reuse almost all code. That's nice!
Seems we'll need to refactor generator not to get lost in an if-maze.
After spilling to the stack all execute-related values, we're still running out of registers. Need to figure out which values from decode can be spilled too. However, I'm unhappy with this, as my initial version of decodeSync didn't use any stack. I believe it can be done, but I'd like to share as many code as possible. On the other hand, having code full of short if-statements is not readable.
18e29ce
to
2694088
Compare
Thank you. I spilled some values, but still, have one place where allocation fails. I will work on it on Monday. I hope it will finally compile, but I'm afraid performance improvement will drop when we put too many values on the stack. |
I reclaimed three registers from adjustOffsets method, by reintroducing its old version that didn't cache values in registers. There are more and more if statements...
go test -run TestDecoder pass, multiframe tests still fail.
Don't worry too much about that. Stack is pretty much always in L1 and reads/writes can be done async to most other code. Futhermore, some archs (Zen 2 for instance) can do memory to virtual register mapping, meaning that writing to non-changing addresses are kept in virtual CPU registers. There is so much magic going on with modern super-scalar CPUs :) (the prev-offsets, that I moved to registers were not provable static to the CPU, since you used different addressing for them - that is why they benefitted to be in regs). If you would like assistance, I can take a stab at finishing up the code. For now I will review the code and test it out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With these two fixes the tests pass. Not sure how often it will be used then, but at least this should get an idea of the direction.
I think for now we should ditch the output reallocate logic. For blocks that have the frame-content-size set we can allocate the output what we need, and we would never need to extend the output. We must fail these if we exceed frame-content-size anyway. For decodes without frame content size we allocate the default and let it fall back once we have less than a full frame left on output. We can use |
Thank you very much for checking this. TBH I got stuck. Removing reallocation logic simplify a lot the whole code. |
There are significant regressions when compared to the initial version. My guess is that we use "precise" copy everywhere. Benchmark results from Ice Lake.
|
@WojciechMula We need to make a few tweaks to allocations. It will not be used in some cases right now, basically falling back. I can take over from here, if you'd like, so we can use it in the majority now. I also see that with the changes we can now use the simple copying, but I will need to make some changes to safely enable it. Should we try to merge and we can do improvement as separate PR(s)? |
@klauspost If you are happy with the current shape, we can merge the PR and improve the code in others PRs. I'd be happy if you just point me out where to start, I'd like to do some coding and testing stuff. I guess you already have another job. :) |
@WojciechMula ok, yeah, the outline of what needs to be done:
We should build a version of With these two things in place we can always decode using asm when we have FrameContentSize, and most of the time when we don't, as longs as we have a maxBlock of output space available. |
@klauspost Thank you, that's clear. |
The changes are not super trivial. Let me see what I can cook up. |
Don't worry, I worked with legacy C++ code. :) Working with Go code is easy. |
A little hacking in the current generator allowed us to reuse almost all code. That's nice!
Part of #515.
For now
go test -run TestDecoder
pass, I'm working on fixing the remaining tests. Another thing is that PR does not incorporate yet the history support (waiting for #542)Benchmark results from an Ice Lake machine. Some quite good speed ups are there!