cmd/compile: add unrolling stage for automatic loop unrolling #51302
Comments
Related: #49997 |
This is not an API change, so taking this out of the proposal process. In my opinion the most important consideration is compile time. The Go compiler has a much greater emphasis on compile time than a compiler like clang. CC @randall77 @golang/runtime |
Another consideration is binary size. We don't want to just unroll everything, as that can lead to binary bloat. I think there's definitely something we can proceed with here. The way I see it there are 2 parts:
It would be good to start with figuring out how to implement 1, and have a very conservative heuristic for 2. Once we have that working, we can investigate improving the heuristic to encompass more cases. Whether that is using profile-directed feedback, or analysis of what kind of loop bodies would benefit the most, or whatever. |
Maybe we should do the loop invariant optimization first? |
Yeah, I agree that there is no necessary connection between the two. What I mean is: if loop unrolling were built on top of loop-invariant hoisting, would this optimization be a little easier to implement? In addition, perhaps not very relevant to this topic: should we consider loop-related optimizations from a higher level? We may also want auto-vectorization in the future, so I hope we take a general view. |
cc @dr2chase who did some experiment with loop unrolling in SSA. |
I've played with both loop unrolling and loop invariant hoisting, and when the Go world was very very Intel-oriented neither of these reliably made sense (and their code was unpleasant). I recently attempted to revive the invariant-hoisting CL and could not easily get it to work (and I am not likely to look at this again before March 7, i.e., the likely go1.18 release date). But in general, the wins were less than awesome, binary sizes go up a bit with loop unrolling, and for a given binary-size/compile-time budget it was not at all clear that this was the best way to increase performance. It also creates problems for debugging. |
I'm interested in invariant loop hoisting. Where is the CL you revived? I suspect it would help more on Power or maybe even ARM64. |
Here's the hoisting CL, newly revived: https://go-review.googlesource.com/c/go/+/37338. Unrolling, I didn't do as good a job on that CL (it merely boilerplated the body, didn't handle iterator arithmetic cleverly). Unrolling is tricky because updating SSA in place is a pain (I can try to find that one, nonetheless). |
Note that this CL changes arm64 code generation for one of the codegen tests (so it fails), which is something I'll need to look at. I'm also doing a quick round of benchmarking on amd64 and a low-end arm64. |
I think there are still valid use cases for this feature, especially for more performance-sensitive Go applications. At the very least, we can provide an option for Go users to enable better performance at the cost of larger binary size, longer compile time, and harder debugging as mentioned. By providing this option to Go developers, they can select the appropriate optimization levels depending on the specific tradeoffs of their applications and whether they have performance-critical code or not. For example, we could add this feature behind an opt-in experimental flag. This approach is adopted by compilers of other programming languages such as:
|
Now that we have PGO, maybe we can add this just to the most "hot paths"? |
All in all, a more sophisticated approach to unrolling would yield a better byte/time ratio, IMHO |
I agree that PGO would make loop unrolling far more tenable, so I've added it to the PGO umbrella issue (#62463). Note however that loop unrolling is not just a balance with binary size/i-cache pressure. On modern CPUs, simply unrolling a loop is often a pessimization, for example if it contains conditional branches then unrolling it will increase pressure on the branch predictor (Intel Optimization Manual, "Loop Unrolling"). It's really a lot like inlining: the main benefit is generally not the unrolling itself, but the follow-on optimizations it enables. That said, that just makes PGO an even better fit for this, since it would allow us to focus where we spend compile time analyzing whether unrolling a loop is even worthwhile. |
Loop unrolling is a technique intended to speed up loops. It's supported by other mature compilers such as Clang.
This proposal consists of possible implementation ideas: general loop unrolling rules and how they can apply to the Go compiler, plus some simple benchmarks showing the effect of this optimization on simple constant-range loops.
It's easy to begin with simple constant-range for loops such as:
More features can then be added inside an unroll package, which will represent a loop unrolling optimization pass.
Unroll package implementation ideas
The following approach could already be integrated into the Go compiler's optimization pipeline, right after the inlining stage.
UnrollPackage() will traverse each function to find for loops and check if it's appropriate for unrolling, then perform unrolling by calling Unroll() function if so.
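The traversal described above can be illustrated outside the compiler. The sketch below is only an approximation: it uses the public go/ast package rather than the compiler's internal IR, and the helper names (findConstLoops, constTripCount, intLit, isIdent) are hypothetical. It recognises only loops of the exact shape `for i := <literal>; i < <literal>; i++`, and it does not check whether the body modifies the induction variable, which a real pass would have to do.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// intLit returns the value of an integer literal expression, if e is one.
func intLit(e ast.Expr) (int, bool) {
	lit, ok := e.(*ast.BasicLit)
	if !ok || lit.Kind != token.INT {
		return 0, false
	}
	v, err := strconv.ParseInt(lit.Value, 0, 64)
	return int(v), err == nil
}

// isIdent reports whether e is the identifier called name.
func isIdent(e ast.Expr, name string) bool {
	id, ok := e.(*ast.Ident)
	return ok && id.Name == name
}

// constTripCount (hypothetical helper) returns the trip count of a loop
// shaped exactly like `for i := lo; i < hi; i++` with literal bounds.
// It does NOT verify that the body leaves i untouched.
func constTripCount(loop *ast.ForStmt) (int, bool) {
	init, ok := loop.Init.(*ast.AssignStmt)
	if !ok || init.Tok != token.DEFINE || len(init.Lhs) != 1 || len(init.Rhs) != 1 {
		return 0, false
	}
	iv, ok := init.Lhs[0].(*ast.Ident)
	if !ok {
		return 0, false
	}
	lo, ok := intLit(init.Rhs[0])
	if !ok {
		return 0, false
	}
	cond, ok := loop.Cond.(*ast.BinaryExpr)
	if !ok || cond.Op != token.LSS || !isIdent(cond.X, iv.Name) {
		return 0, false
	}
	hi, ok := intLit(cond.Y)
	if !ok || hi <= lo {
		return 0, false
	}
	post, ok := loop.Post.(*ast.IncDecStmt)
	if !ok || post.Tok != token.INC || !isIdent(post.X, iv.Name) {
		return 0, false
	}
	return hi - lo, true
}

// findConstLoops parses src and returns the trip counts of all
// constant-range for loops it finds, in source order.
func findConstLoops(src string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	var counts []int
	ast.Inspect(file, func(n ast.Node) bool {
		if loop, ok := n.(*ast.ForStmt); ok {
			if c, ok := constTripCount(loop); ok {
				counts = append(counts, c)
			}
		}
		return true // keep descending to find nested loops
	})
	return counts, nil
}

func main() {
	src := `package p
func f(xs []int) {
	for i := 0; i < 101; i++ { _ = i }
	for j := 0; j < len(xs); j++ { _ = xs[j] }
}`
	counts, err := findConstLoops(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(counts) // only the first loop has constant bounds
}
```

A real pass would run on the compiler's own node types after inlining, as the text says; the AST version is used here only because it is self-contained.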
It's important to calculate the unrolling factor correctly. If it's too big, the unrolled loop body may no longer fit in the instruction cache. One possible way to pick the factor is to reuse part of the inlining stage, namely hairyVisitor, since a lot of work has already gone into choosing node weights there.
If A = the maximum weight of a for loop that is short enough to be kept in cache, and B = the weight of the for loop body as calculated by an extended version of hairyVisitor, then the unrolling factor = A / B. If it's greater than 1, then loop unrolling is beneficial for that loop.
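As a rough illustration of the A / B rule, here is a minimal sketch. The budget value and the body weights are made-up numbers, and unrollFactor is a hypothetical helper, not something taken from the compiler; in the proposal the weights would come from an extended hairyVisitor.

```go
package main

import "fmt"

// maxLoopWeight plays the role of A: an assumed budget for how heavy a
// loop may become and still fit in the instruction cache. The value 80
// is purely illustrative.
const maxLoopWeight = 80

// unrollFactor returns A / B (integer division), clamped to at least 1.
// A result of 1 means "do not unroll".
func unrollFactor(maxWeight, bodyWeight int) int {
	if maxWeight <= 0 || bodyWeight <= 0 {
		return 1
	}
	if f := maxWeight / bodyWeight; f > 1 {
		return f
	}
	return 1
}

func main() {
	// Hypothetical body weights (the B values).
	for _, w := range []int{10, 20, 50, 100} {
		f := unrollFactor(maxLoopWeight, w)
		fmt.Printf("body weight %d: factor %d, unroll: %v\n", w, f, f > 1)
	}
}
```

Integer division is used because only whole copies of the body can be emitted; a body heavier than half the budget therefore stays rolled.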
Unroll function implementation ideas
Once the unroll factor is picked, a for loop can be unrolled in 4 steps. Once again, we're dealing with constant values here:
This step is a little more complex, since only copying the body isn't enough. Suppose unroll is 4 and the body of the loop is:
Just copying the body 4 times isn't enough, since it gives us:
The correct version is:
First, we must find all induction variables and then, after copying the body, shift them in each copy.
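To make the induction-variable point concrete, the sketch below compares a rolled loop with two hand-unrolled versions. The loop body (`a[i] = i * i`) and the trip count of 8 are assumptions chosen for illustration, since the original example is not reproduced here; the unroll factor of 4 matches the text.

```go
package main

import "fmt"

const n = 8 // assumed trip count, divisible by 4 so no tail is needed

// rolled is the original loop.
func rolled() [n]int {
	var a [n]int
	for i := 0; i < n; i++ {
		a[i] = i * i
	}
	return a
}

// naiveCopy repeats the body 4 times and bumps the step to 4, but leaves
// the induction variable unshifted, so every copy touches the same index.
func naiveCopy() [n]int {
	var a [n]int
	for i := 0; i < n; i += 4 {
		a[i] = i * i
		a[i] = i * i
		a[i] = i * i
		a[i] = i * i
	}
	return a
}

// shifted repeats the body 4 times and replaces i with i+k in the k-th copy.
func shifted() [n]int {
	var a [n]int
	for i := 0; i < n; i += 4 {
		a[i] = i * i
		a[i+1] = (i + 1) * (i + 1)
		a[i+2] = (i + 2) * (i + 2)
		a[i+3] = (i + 3) * (i + 3)
	}
	return a
}

func main() {
	fmt.Println("shifted matches rolled:  ", rolled() == shifted())
	fmt.Println("naiveCopy matches rolled:", rolled() == naiveCopy())
}
```

The naive version only ever writes indices 0 and 4, which is why every use of the induction variable has to be rewritten per copy.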
Keeping that in mind, body unrolling can be implemented in the following way:
Suppose we have a loop:
And the unroll factor is 3. Then it should be unrolled to this:
Since 101 isn't divisible by 3, there are 101 % 3 = 2 iterations that haven't been performed yet:
This should also be generated and placed after the for loop.
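The main-loop-plus-tail shape can be sketched as follows for 101 iterations and an unroll factor of 3. The loop body (summing an array) is an assumption made for illustration; only the trip count and the factor come from the text.

```go
package main

import "fmt"

const (
	n      = 101 // trip count from the text
	factor = 3   // unroll factor from the text
)

// sumRolled is the assumed original loop.
func sumRolled(a *[n]int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += a[i]
	}
	return sum
}

// sumUnrolled is the hand-unrolled equivalent.
func sumUnrolled(a *[n]int) int {
	sum := 0
	// Main loop: 33 groups of 3 cover indices 0..98.
	for i := 0; i+factor <= n; i += factor {
		sum += a[i]
		sum += a[i+1]
		sum += a[i+2]
	}
	// Tail: the remaining 101 % 3 = 2 iterations (indices 99 and 100).
	// Because all bounds are constant, they can be emitted as
	// straight-line code after the loop.
	sum += a[99]
	sum += a[100]
	return sum
}

func main() {
	var a [n]int
	for i := range a {
		a[i] = i
	}
	fmt.Println(sumRolled(&a), sumUnrolled(&a))
}
```

With non-constant bounds the tail would need to be a second, rolled loop rather than straight-line code; the constant case is what the proposal starts with.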
Results
The following lines of code:
currently are compiled to:
After applying a very basic version of loop unrolling with the above approach those are compiled to:
The body is repeated 4 times with shifted indices. A tail placed after the loop handles the 96th, 97th and 98th indices.
Benchmarks