Optimize IM_NORMALIZE2F_OVER_ZERO #4091
base: master
Thanks @wolfpld, we'll try to merge this soon (currently in the middle of construction work). I used this as another occasion to test the perf tools @rokups we've been working on. Pardon taking a large

Average of 8 runs over a dozen tests:

Individual runs

I noticed things are slower in Debug:

Release X64

Debug X64

That's fine for me as I am going to take the occasion to enable some of the "reduce aggressive stack bound checks" macros around some key low-level functions. Gets us more toward those: (green is Debug X64 with patch + some pragma to reduce stack bound checks on a few functions)
I think these measurements are very interesting and appropriate to talk about here.
This is understandable, given that MSVC is instrumenting the code for use in the debugger. In the assembly output you can see multiple instances of seemingly nonsensical behavior, such as:
movss dword ptr [rbp+114h], xmm0
movss xmm0, dword ptr [rbp+114h]
This sequence stores the
I have another similar optimization queued up: wolfpld/tracy@0bd6479
Same PR separate commit is good.
Already pushed some of the changes to remove the stack-bound check on small low level functions so the numbers will be back to saner levels and we can benefit from this even in typical “debug” settings.
This function calculates the result of 1/x. On SSE1 capable platforms this is performed as an approximation, using rcpps.
Ok, added new changes and rebased to current master.
I merged the first part as 4c9f0ce (with minor amends). First part in red, second part (ImRecip) in green:

In Debug, the overhead of the non-inlined function call sort of hinders it for now. I'll investigate using macros later.
I tried various variants (with/without macros) and I couldn't find a satisfying setup where the ImRecip change seems very worthy. It's a very minor gain in "Release" mode and often a more noticeable loss in "Debug" mode. Did you measure meaningful changes when doing the ImRecip() change on your side?
BTW, I have measured that functions such as
const auto wpos = ImGui::GetWindowPos();
...
AddLine( wpos + v1, wpos + v2 );
So I can just add (0.5, 0.5) to
You can use this if SSE is not available. rsqrtss is a lookup-table instruction that gives a low-accuracy (16-bit or 20-bit) inverse square root.
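The snippet this comment refers to did not survive in the thread. As an assumption about the kind of code meant, here is a sketch of one well-known SSE-free approach: a bit-level initial guess followed by a single Newton-Raphson step. The function name SketchRsqrtNoSSE is hypothetical, and this is not necessarily what the commenter posted.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical SSE-free approximation of 1/sqrt(x) for positive, finite x.
static inline float SketchRsqrtNoSSE(float x)
{
    // Reinterpret the float's bits as an integer (memcpy avoids aliasing issues).
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    // Classic magic-constant initial guess for 1/sqrt(x).
    bits = 0x5f3759dfu - (bits >> 1);

    float y;
    std::memcpy(&y, &bits, sizeof y);

    // One Newton-Raphson step: y = y * (1.5 - 0.5 * x * y * y).
    return y * (1.5f - 0.5f * x * y * y);
}
```

With a single refinement step the relative error stays below roughly 0.2%, so this is only suitable where an approximate length is acceptable.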
I took a peek at
Release builds benefit a bit from it; the effect of using a macro instead of an inline function can be written off as statistically insignificant. Debug builds, however, still suffer more than they should. Using a macro does not help much.
push rbp
mov rbp, rsp
sub rsp, 16
movss xmm0, DWORD PTR .LC0[rip]
movss DWORD PTR [rbp-4], xmm0
movss xmm1, DWORD PTR [rbp-4]
movss xmm0, DWORD PTR .LC1[rip]
divss xmm0, xmm1
pxor xmm2, xmm2
cvtss2sd xmm2, xmm0
movq rax, xmm2
movq xmm0, rax
mov edi, OFFSET FLAT:.LC2
mov eax, 1
call printf
mov eax, 0
leave
ret

push rbp
mov rbp, rsp
sub rsp, 48
movss xmm0, DWORD PTR .LC0[rip]
movss DWORD PTR [rbp-40], xmm0
movss xmm0, DWORD PTR [rbp-40]
movss DWORD PTR [rbp-36], xmm0
mov eax, DWORD PTR [rbp-36]
movd xmm0, eax
movaps XMMWORD PTR [rbp-32], xmm0
rcpps xmm0, XMMWORD PTR [rbp-32]
nop
movaps XMMWORD PTR [rbp-16], xmm0
movss xmm0, DWORD PTR [rbp-16]
pxor xmm1, xmm1
cvtss2sd xmm1, xmm0
movq rax, xmm1
movq xmm0, rax
mov edi, OFFSET FLAT:.LC1
mov eax, 1
call printf
mov eax, 0
leave
ret

Conclusion: SSE for release builds and a macro with simple
The SSE version has lower precision than division, so the results will differ between release and debug builds. This can turn someone's debugging session into an exercise in frustration if they are not aware of the difference.
Just my 2 cents here. The relative error of rcpps is quite high, which will become a problem on high-resolution displays (like 5K ones). I'd suggest adding one Newton-Raphson iteration to increase accuracy.
Currently Dear ImGui uses 1/sqrt() in IM_NORMALIZE2F_OVER_ZERO(). This produces the following instruction sequence in ImDrawList::AddPolyline():

Notice that vsqrtss and vdivss both have a high sample hit count (visible in dependent instructions). This is caused by the high latency of both instructions:

This PR introduces a new function, ImRsqrt(), which also has an alternate implementation if SSE1 is available. Note that SSE2 support is mandated in the x64 architecture. With these changes the following assembly is generated:

With the new code only one instruction is emitted, vrsqrtss, which produces an approximate result, but the reduced precision shouldn't matter in this case. Notice that the relative cost of calling IM_NORMALIZE2F_OVER_ZERO() has dropped significantly. This is caused by the much lower latency of vrsqrtss across all uarchs:

ARM NEON has a similar instruction, FRSQRTE, which may or may not be beneficial to use.

For context, ImDrawList::AddPolyline() in certain conditions can be the hottest function in Tracy.

I'm not sure if including immintrin.h is the right solution. It works for me, but it may be problematic on older compilers, in which case a more targeted header inclusion should be used.
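For readers without the diff at hand, here is a minimal sketch of the approach described above, not the PR's actual code. The names SketchRsqrt and SketchNormalize2fOverZero and the detection macro are hypothetical. The sketch assumes that xmmintrin.h, the SSE1 header, is enough for _mm_rsqrt_ss, which would be one candidate for the more targeted include mentioned above.

```cpp
#include <cmath>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE1 intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32
#define RSQRT_SKETCH_HAS_SSE 1  // hypothetical macro, for this sketch only
#endif

// Hypothetical stand-in for ImRsqrt(): returns (approximately) 1/sqrt(x).
static inline float SketchRsqrt(float x)
{
#ifdef RSQRT_SKETCH_HAS_SSE
    // Single approximate instruction instead of sqrt followed by a division.
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    // Portable fallback: full-precision sqrt and division.
    return 1.0f / std::sqrt(x);
#endif
}

// Function form of the normalization idea behind IM_NORMALIZE2F_OVER_ZERO():
// scale (vx, vy) to unit length, leaving a zero vector untouched.
static inline void SketchNormalize2fOverZero(float& vx, float& vy)
{
    const float d2 = vx * vx + vy * vy;
    if (d2 > 0.0f)
    {
        const float inv_len = SketchRsqrt(d2);
        vx *= inv_len;
        vy *= inv_len;
    }
}
```

On the SSE path the resulting vector is only approximately unit length, which matches the reduced precision discussed above.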