-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoiding bound checks in setunion code #301
base: master
Are you sure you want to change the base?
Conversation
For me, Go v1.16.5's produces I've tried to make The results are worse than the branchy implementation, so I'm either doing something wrong (which is quite probable) or due to some factors (data dependencies between loop iterations? cost of additional arithmetical operations? something else) it's just not worth it. I'm under an impression that the proper way to approach |
It is damn hard to benchmark branchy code. For example, to see the benefits of a branchless approach, you have to have mispredicted branches to begin with, but most benchmarking techniques minimize mispredicted branches by looping over the same data. We tend to always underestimate the value of branchless approaches.
Very cheap is relative. You still incur the two instructions (maybe less with fusion) per access. A union function is basically just that... access, compare, access, compare... so if access cost you even a bit more, it should show up in the performance. But you are, of course, correct that the real solution is to use assembly. Your branchless approach is interesting but, given that this is Go, it may not compile to efficient assembly. Have you looked at the assembly? |
Yeah, I'm aware of the problem. That's why I'm using a large array of randomly generated arrays (~1M of numbers in total) here: puzpuzpuz@5630617#diff-c2898d6069ac5db5285b63cc9066e814584908dd34801fde55919838e1ef53bbR176 In theory, 1M of iterations should be enough to get rid of the effects of the branch predictor. For instance, if I change the benchmark to the similar approach as you did in the shortgun intersection experiment (generate random array on each iteration),
Yes, I did. I'm not sure about how good the compiler output is in terms of the "cost" of the emitted instructions, but I didn't notice any branches except for the loop condition one. Most likely those instructions are rather expensive since I see higher instruction per cycle value reported by Update. Higher insn per cycle value is in fact a good indicator (see the below messages). Also, it's not a surprise that the total number of instructions executed in the benchmark is higher for the |
I agree that what you did looks good. Ideally, you'd want the intersection size to vary unpredictably between 'all' to 'nothing'... I don't expect your approach to be good at that... but that's not likely to be an issue in this instance.
We want more instructions per cycle. Being at 1.46 ins per cycle is bad. Keep in mind that recent processors like the Apple M1 can consistently retire 8 instructions per cycle. But it does explain why trying to get a leaner code does not help. If we are at 1.46 ins per cycle, then trying to save instructions will not help us because we have plenty of room for more instructions. This suggest that we are primarily limited by data dependencies issues. And bound checks do nothing to harm data dependency. This suggests that the move forward would be algorithmic in nature, at least for big superscalar processors. Note that we have a faster assembly version of this function for ARM: https://github.com/RoaringBitmap/roaring/blob/master/setutil_arm64.s It is not algorithmic in nature. But it is possible that the gains are explained by the fact that it is tested on relatively small cores. |
Thanks for you inputs!
I misinterpreted the metric, my bad (probably, due to incredibly hot weather I have locally). The higher insn per cycle measurement we get, the better since in general it means that the code benefits from ILP. Thanks for the correction.
I'm also under the same impression so far.
Makes sense. As for the ARM |
It may depend on the specific ARM core microarchitecture. There are many small ARM cores that struggle to retire 2 instructions per cycle. But there are also powerful ARM cores. Historically, x64 cores have been more similar (large superscalar cores)... there are tiny x86 cores used for mobile devices but most of us don't use them very much. On small cores, saving a few instructions can make a big difference. Anyhow, I think that the long-term fix might be algorithmic. (Note: your investigation has been fruitful. Let us consider algorithmic improvements!) |
A minor update on the branchless implementation. I was able to simplify it and got rid of the lookup tables. The main part of the changes in the loop code may be seen here: puzpuzpuz@a2b2510#diff-7916b4e45da125122d637e8abb339ad262f2efc9fde0e2ceb6671ae9e75fd298R36 With this update, the Before: BenchmarkUnion2by2/union2by2_branchless-8 40 29035424 ns/op
BenchmarkUnion2by2/union2by2-8 48 24743344 ns/op After: BenchmarkUnion2by2/union2by2_branchless-8 50 22842745 ns/op
BenchmarkUnion2by2/union2by2-8 46 24849983 ns/op I'm glad to finally see the improvement. On the other hand, even in such synthetic scenario it's not significant enough to be considered as a valuable optimization. |
@puzpuzpuz A consistent 5% speed boost would be valuable. |
OK, let me run real-roaring-datasets and synthetic benchmarks and see if the branchless function makes any difference. If it does, I'll submit a PR. Thanks for your inputs once again. |
What happens when the two values are equal in your algorithm? |
I would also like to see how something stupidly simple like the following works... while(pos1 < size1 && pos2 < size2) {
v1 = input1[pos1];
v2 = input2[pos2];
if(v1 <= v2) {
output_buffer[pos++] = v1;
} else {
output_buffer[pos++] = v2;
}
if v1<=v2 {
pos1 += 1;
}
if v2<=v1 {
pos2 += 1;
}
} I am not convinced that your shifting routine is worth it. |
Duplicates between arrays are ignored, just like in the original implementation.
Compared with the following implementation of the for k1 < len1 && k2 < len2 {
s1 = set1[k1]
s2 = set2[k2]
if s1 < s2 {
buffer[pos] = s1
pos++
k1++
} else if s1 == s2 {
buffer[pos] = s1
pos++
k1++
k2++
} else { // s1 > s2
buffer[pos] = s2
pos++
k2++
}
} Not sure if that's what you asked for, but the benchmark result is the following: BenchmarkUnion2by2/union2by2_branchless-8 51 22744960 ns/op
BenchmarkUnion2by2/union2by2-8 48 24475920 ns/op The branchless function is still a bit faster. |
@puzpuzpuz Please see Faster sorted array unions by reducing branches https://lemire.me/blog/2021/07/14/faster-sorted-array-unions-by-reducing-branches/ (I credit you in the blog post and in the corresponding code.) |
Thanks for mentioning me in the blog post. As I wrote before, I'll try to conduct more benchmark runs and post an update on #327, or simply open a PR if the results are promising. |
This PR would remove most bound checks in the setunion code. Sadly it seems to make no difference to the performance. Marking it as a draft in case someone can make it work better.
The assembly does look leaner to me, at a glance.