-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RyuJit: avoid conditional jumps using cmov and similar instructions #6749
Comments
There is code for it I think but @jamesqo noticed it was never getting executed in dotnet/coreclr#6647 (comment) Function called |
@benaadams That code is in codegenlegacy.cpp, which, as far as I can tell, is the pre-RuyJit codegen and does not even support x64. So I'm not sure how relevant is that for RuyJit on x64. |
Might be why its not called then 😆 |
@benaadams I believe I tested it with x86 builds, without the x86 RyuJIT (which is disabled by default) enabled. Feel free to check for yourself, though, in case I messed something up. |
There does seem to be a lot of stuff for cmov in the code, have never seen it output though |
We used to generate CMOV/SETCC from our internal QMark operator, however we ended up expanding QMark after importer to short-circuit flow to simplify handling in downstream phases. As I recall the plan was to enable a more general if-conversion phase to identify simple cmov/setcc patterns before lowering to reconstitute them. /cc @dotnet/jit-contrib @russellhadley I believe JIT32 (used today on x86) does generate at least SETCC, if not CMOV. Regardless of the current state, I think we all agree with the previous comments both are pretty powerful features of the IA architecture and worth generating. Often result in smaller and faster code without the possibility of expensive mis-prediction penalties. |
I agree, it's pretty important to handle both of these cases. In terms of priority @svick where is this coming from? Is it in a particular workload or benchmark? What's the opportunity here? |
@russellhadley I noticed this when I tried to modify this code in Kestrel to use cmov and failed. But that was just "Hey, this code could be faster! No? Too bad.", I don't know if it would actually have measurable impact on the overall performance of Kestrel. So I don't think you should prioritize this based on me. |
@svick It's still Kestrel and it's in one of the methods we want to get inlined into the MemoryAllocatorIterator::Seek routine and optimized - so it had some priority. Thanks for the heads up. :) |
@svick, that code is probably faster with bsf/tzcnt(count trailing zeroes) and lshl(logical shift left) instead of cmov/condIf's. Whether JIT knows to generate them is another question. |
@choikwa but ternary operators, for example, are probably a much more common pattern across the ecosystem as a first step to improve; and likely more simple to recognise - since its a language construct. Have an improvement for that particular section of code in Kestrel in a PR aspnet/KestrelHttpServer#1138 |
@cmckinsey @russellhadley Anyone working on this? FYI I'm trying to add support for CMOV to RyuJIT's codegen for a different issue. Maybe I can take a look at this one after that. |
AFAIK nobody is looking into this at the moment. Feel free to dive in :) |
@mikedn I think @benaadams has provided a branchless version of the code that motivated this in Kestrel, but getting to the point where we can generate CMOVs is still interesting. @sivarv has been working on some ideas around how to improve this kind of instruction selection. It would be good to make sure this effort dovetails. |
FWIW I've made a prototype that survives diffing and testing. It detects simple half/full hammocks where the true/false BBs contain a single assignment (or a return) that involves only local variables and constants. It only works with i4 and i8 but with some care we might be able to handle small integer types. I'll probably try to extend this to indirs though it's problematic due to side-effects. A quick and dirty experiment that survives diffing could probably show if there are enough opportunities for this to be useful. The conversion is currently done when lowering FX diff shows a 2k reduction in code size but it could be better. I had to turn off a and-cmp to test optimization that caused a lot of trouble and that results in code size regressions. Also, using
Codegen for The 2 examples from the initial post generate:
The code is similar to what the native compilers generate except a couple of warts. The first example has a widening cast that can be easily removed. The second example has a useless It's worth noting that Clang assumes that bools are 0/1 and generates better code for the first example. I don't think we can do that, CLR bools are 0/non-zero. The
so we get rid of a useless jump by accident (those kind of jumps can probably appear in other situations that aren't handled by if-conversion). The for (int i = 0; i < N; i++)
sum = (bits[i] ? 1 : 0) + sum; generates
The reason why the the original code ( With some additional work this could probably turn into:
The nice thing about if-conversion is that once you generate a GT_SELECT you can look around and detect all kind of patterns relatively easily. I'll try to do some perf measurements but the perf characteristics of converting branches are probably well known to anyone familiar with the subject. If the branch isn't predictable then you get good performance (I've seen numbers in the range 2-5x but I have to double check). If the branch is predictable then performance may regress slightly. Maybe such conversions should be driven by profile data but it seems that all native compilers generate |
As a simple perf test let's sum all the positive numbers from an array:
As mentioned above, use of
With 1,000,000 numbers this runs in 630us if all numbers are positive and 4300us if we have a random mix of positive and negative numbers. Poor branch prediction takes its toll. If we change the sum expression so that if-conversion kicks in we get CMOV:
Now we always get 720us no matter what the numbers are. We're slightly slower in the "all positive" case and 6 (yes, six!) times faster in the random case. Should something be done about the slowdown we see in the "all positive" case? I don't know, as a developer I would use the following code if the numbers are known to be mostly positive:
If-conversion doesn't kick in and this runs in 490us if all numbers are positive, that's faster than all of the above. |
Also relevant: http://aakinshin.net/en/blog/dotnet/perfex-min/ |
I know. The prototype I have does generate CMOV in the |
@mikedn For your prototype did you just try and re-enable they qmark support? (not expand them early into flow) |
@russellhadley Nope, I'm generating |
@mikedn Does it take 2 full trees to be evaluated or is it restricted to taking lclVars? |
@russellhadley lclVars and constants. It recognizes: if (whatever) { lclX = lclY (or const); }
// generates STORE_LCL(lclX, SELCC<cond>(lclY, lclX))
if (whatever) { lclX = lclY (or const); } else { lclX = lclZ (or const); }
// generates STORE_LCL(lclX, SELCC<cond>(lclY, lclZ))
if (whatever) { return lclY (or const); } else { return lclZ (or const); }
// generates RETURN(SELCC<cond>(lclY, lclZ)) It should be possible to handle more patterns but there are all kinds of issues:
I think (haven't looked at it too much) that GT_QMARK allows more or less arbitrary trees and that makes it problematic, it looks like some kind of control flow embedded in an expression tree. GT_SELCC looks like any other binary operator except that it has a hidden dependency on condition flags. I mixed up my code with a few other experiments, I'll try to clean it up a bit tomorrow and push it to my GitHub fork for reference. Though at 500 lines of code (including empty lines) it may not be a pretty sight. Needs reorganization. |
Thinking about it, given that CMOV is complicated to catch everything, why not tackle only the 'easy' cases (when there is certainty to have performance improvements) and provide also an intrisic version in case you need in any of the |
@russellhadley My experimental branch is here: https://github.com/mikedn/coreclr/commits/if-convert2 |
@redknightlois I'm not sure how an intrinsic would help. Once it works for local variables it's up to you to ensure that the code has the right pattern to make use of CMOV. Or create a method like |
@mikedn Looks like the experimental branch was deleted. Do you have a draft somewhere? |
@sdmaclea See the "If conversion" commit in this branch https://github.com/mikedn/coreclr/commits/relop-lower-2 |
@mikedn, as for providing an intrinsic (a bit late, I know), I didn't think the runtime had a rule stating that a variable that is passed directly to a method (without being stored to a local in the interim) has to be evaluated (that's just the way it happens to work today, since we don't expose very many, if any, "intrinsics" like this). That is, it should be entirely possible for the runtime to expose a method in Corlib that has a special contract where That being said, just ensuring we recognize |
@tannergooding I don't quite understand what you are saying. In particular, the "Select(c, a[i], b[i]) will only evaluate a[i] or b[i] based on the result of c" makes no sense. The whole point of adding the Select method I suggested is to allow people to clearly state that they are OK with both
It will never recognize that. It's simply impossible to express the semantics of such an expression using CMOV. |
@svick, yes. I was indicating that, when locals are involved, additional analysis for those values has to be done (something the JIT probably doesn't have time to do), so the transformation would not occur and the evaluation would happen before the method call. (There would need to be an post-compilation, but pre JIT/AOT IL optimization tool for these transformations). However, for the cases when no locals are involved, the JIT should be able to determine the value is passed directly to the method and can then attempt the appropriate transforms. |
This is starting to go pretty far off the rails. Let's back up and answer @tannergooding's high-level question:
This transformation cannot be performed in the way that you suggest because it changes the behavior of the program. In the first case, the evaluation of |
It's in the part of I.12.6.4 that you quoted. Adding emphasis (note "and exceptions"):
The JIT must guarantee that the |
@pgavlin, @JosephTremoulet. Thanks. Am I correctly understanding the subsequent section on |
@tannergooding, by my reading, that would still run afoul of:
|
@JosephTremoulet. Hmm, it seems that the further explanation of relaxations in Annex F specifically calls out the scenario I am interested in:
|
In either case. Thanks for helping me understand this better. I'm going to create a new issue so I can continue the discussion without polluting this thread anymore. I think, even if the runtime does not "technically" allow it today (although I think it might), it should support these types of optimizations. |
@tannergooding, translating "caller" and "M" to your example, that's saying that (if both are relaxed) a fault in |
static int Test1(int x)
{
return x == 42 ? 1000 : 2000;
}
static int Test2(int x, int a, int b)
{
return x == 42 ? a : b;
} ; Method Tests:Test1(int):int
G_M56601_IG01:
G_M56601_IG02:
83F92A cmp ecx, 42
B8E8030000 mov eax, 0x3E8
BAD0070000 mov edx, 0x7D0
0F45C2 cmovne eax, edx
G_M56601_IG03:
C3 ret
; Total bytes of code: 17
; Method Tests:Test2(int,int,int):int
G_M50938_IG01:
G_M50938_IG02:
83F92A cmp ecx, 42
8BC2 mov eax, edx
410F45C0 cmovne eax, r8d
G_M50938_IG03:
C3 ret
; Total bytes of code: 10 |
This adds support for contained compares in GT_SELECT nodes for xarch. As part of this, also enables if-conversion on xarch. Fix dotnet#6749
Conditional jumps, especially those that are hard to predict, are fairly expensive, so they should be avoided if possible. One way to avoid them is to use conditional moves and similar instructions (like
sete
). As far as I can tell, RuyJit never does this and I think it should.For example, take these two methods:
For both of them, RyuJit generates a conditional jump:
For comparison, here are the same methods compiled using Clang and GCC with
-O1
(by Compiler Explorer):GCC 6.2:
Clang 3.9.0:
category:cq
theme:basic-cq
skill-level:expert
cost:large
impact:small
The text was updated successfully, but these errors were encountered: