Misc LSRA throughput improvements #85842
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
The TP regressions are surprising. We probably need to compare the assembly before vs. after to see which individual change might be causing them.
I didn't realize that we do not use intrinsics for popcount, which is why we are seeing a lot of regressions. This is yet another example of why cross-compilation comparison for TP might not always be accurate. cc: @jakobbotsch @BruceForstall
runtime/src/coreclr/jit/utils.cpp Lines 2788 to 2808 in 8854509
newRefposition:
buildInternalregisterusage
I will revert the popcount change.
This reverts commit a75a7da.
It would be worthwhile to profile the current function versus popcount; we should probably switch to popcount.
It's a good point, but also a place where we should actively ensure we have parity regardless of the compiler we are using. It seems unfortunate that we are regressing either MSVC-produced code or Clang-produced code.
Looking at the diffs for minopts benchmarks_run windows-x64, I see a slight regression:
But looking at the code, this should still be profitable, because we are eliminating a condition:
Fixed.
GCC/Clang generate the exact code that was codified for MSVC, they just do it implicitly via the
This was likely one of the many cases where we "execute more instructions" but the code was actually faster in practice.
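For readers following the popcount discussion, here is a minimal sketch of the kind of portable, multiplication-based population count being compared against a hardware popcount instruction. This is the generic SWAR technique, not a copy of the code in utils.cpp; the function name `PortablePopCount64` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: portable (SWAR) population count for a 64-bit value.
// It executes more instructions than a single hardware popcount, but needs
// no CPU feature check and no compiler-specific intrinsic.
inline unsigned PortablePopCount64(uint64_t value)
{
    // Count bits in each 2-bit group.
    value = value - ((value >> 1) & 0x5555555555555555ull);
    // Sum adjacent 2-bit counts into 4-bit groups.
    value = (value & 0x3333333333333333ull) + ((value >> 2) & 0x3333333333333333ull);
    // Sum adjacent 4-bit counts into bytes.
    value = (value + (value >> 4)) & 0x0F0F0F0F0F0F0F0Full;
    // The multiply accumulates all byte counts into the top byte.
    return static_cast<unsigned>((value * 0x0101010101010101ull) >> 56);
}
```

Whether this form or an intrinsic is faster depends on the compiler and target, which is the parity concern raised above.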
you mean when building clrjit using clang, right?
Agree. btw, I do see that with VC++ |
I'm confused, are you saying the MSVC multiplication code was faster than using a single |
This reverts commit 0b3da21.
@jakobbotsch: No, rather In order for us to emit
Odd that it shows x64 as a slight regression.
While working on consecutive-registers, I realized a few things that could help with throughput:
1. We pass around `RegisterType` in various methods, but that parameter is only used for `TARGET_ARM`. So wrap the parameter in `ARM_ARG`. (Done separately in #86016.)
2. `updateAssignedInterval()` is called frequently, but more than half of the time we pass `interval == nullptr`, which is essentially clearing the interval. Introduced `clearAssignedInterval()` for that purpose.
3. Use `BitOperations::PopCount()` in a method that is used for the `IsSingleRegister()` check. (Done separately as part of #85944.)
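To illustrate the single-register check mentioned in the last item, here is a minimal sketch of two equivalent ways to test whether a register mask has exactly one bit set. The type alias and function names (`RegMaskSketch`, `PopCountSketch`, `IsSingleRegisterBitTrick`, `IsSingleRegisterPopCount`) are hypothetical stand-ins, and the loop-based count is only a placeholder for whatever popcount the build actually uses.

```cpp
#include <cstdint>

// Hypothetical stand-in for the JIT's register mask type.
typedef uint64_t RegMaskSketch;

// Placeholder population count: clears the lowest set bit until none remain.
inline unsigned PopCountSketch(RegMaskSketch mask)
{
    unsigned count = 0;
    while (mask != 0)
    {
        mask &= (mask - 1); // clear the lowest set bit
        count++;
    }
    return count;
}

// Bit-trick form: nonzero, and clearing the lowest set bit leaves zero.
inline bool IsSingleRegisterBitTrick(RegMaskSketch mask)
{
    return (mask != 0) && ((mask & (mask - 1)) == 0);
}

// Popcount form: exactly one bit is set.
inline bool IsSingleRegisterPopCount(RegMaskSketch mask)
{
    return PopCountSketch(mask) == 1;
}
```

Both forms return the same answer for every mask; which one is cheaper depends on whether the popcount compiles down to a single instruction on the target.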