-
Notifications
You must be signed in to change notification settings - Fork 610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for ifloor, fast_ifloor, floatfrac #1766
Conversation
Say it together, boys and girls: "Untested code is buggy code!" In this episode, I realize that the ifloor() function in fmath.h, and also the inconsistently-named floori() from simd.h, both of which have been used for many years without complaint, have a rather startling bug: they give the wrong answer for negative integers! The formal specification is that it returns the equivalent of int(floor(x)), but the implementation tried to execute faster by using a bit twiddling trick to avoid the 'floor' part of the computation, just cast to int and then correct for negative values. I don't even remember how I stumbled across this, but something caused me to make a unit test for ifloor, and to my surprise, it failed for exact negative integers. (For reasons that are head-smackingly obvious once you think about what the code is doing.) Now, the obvious fix is to just implement as int(floor(x)), but benchmarking shows that to be 25-50% slower! It may fail for negative integers, but I can imagine cases where you are darned sure you will only fed it positive values, OR maybe there are situations where it's ok if the negative integer values are wrong. So I "fixed" ifloor(), but I also made a fast_ifloor() that is the old code -- wrong for negative integers, but much faster. And so down the rabbit hole we go... * The simd versions have the same problem. Fix them there as well, and also rename (floori -> ifloor) so that the simd and fmath versions have the same naming convention. Add unit tests and benchmarks, and in light of that, tighten up some of the SIMD code paths a bit. * floorfrac uses the same trick. Do I also make fast_floorfrac? NO! Because along the way, I realized that a slightly different idiom was much faster -- even slightly faster than the "fast but buggy" version we used previously. So I took the win, and stuck with "very slightly faster and not buggy," and did not bother confusing the matter with the "buggy and even faster" variant. * Added simd versions of floorfrac, in fmath.h (where the float version was defined). * texturesys.cpp: use the fmath floorfrac (remember, we put simd versions there now) instead of defining it locally, which also lets us remove a redundant quick_floor() that was identical to the old buggy ifloor anyway. * Various unit tests for both the scalar and simd cases of these related functions, which led to... * unittest.h: added an "OIIO_CHECK_EQUAL_APPROX" unit test macro, basically it checks a result and considers it ok if it's within 0.1% of the larger of the quantities being compared. Some extra trickery makes it work with either scalar or simd values.
OIIO_CHECK_EQUAL (ifloor(1.0f), 1); | ||
OIIO_CHECK_EQUAL (ifloor(1.001f), 1); | ||
float fval = 1.1; clobber(fval); | ||
bench ("ifloor", [&](){ return DoNotOptimize(ifloor(fval)); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which architecture did you benchmark on? Because from what I looked up - roundss
is a single cycle op on modern chips. It seems funny the extra comparison and all that would ever win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact I seem to recall that in other places where we were using this trick, switching to plain floorf
got faster as soon as we compiled with sse4.1+
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will double check today by rebuilding/running for every SIMD architecture (on the work machines) and post the results.
I think the floor (which is roundss/roundps underneath) is not the expensive part, it's the cast to int that really hurts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - but both functions cast to int ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the work Linux machines: (all numbers are Mvalues/second, bigger is better)
USE_SIMD=sse2:
float ifloor: 543
float fast_ifloor: 1348
vfloat4 ifloor: 501
vfloat4 fast_ifloor: 6400
vfloat8 ifloor: 501
vfloat8 fast_ifloor: 6394
USE_SIMD=sse4.1
float ifloor: 1066
float fast_ifloor: 1348
vfloat4 ifloor: 4264
vfloat4 fast_ifloor: 6400
vfloat8 ifloor: 4259
vfloat8 fast_ifloor: 6390
USE_SIMD=avx2
float ifloor: 1066
float fast_ifloor: 1348
vfloat4 ifloor: 4264
vfloat4 fast_ifloor: 6400
vfloat8 ifloor: 8510
vfloat8 fast_ifloor: 12800
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
godbolt.org:
#include <cmath>
int ifloor (float x)
{
return (int)floorf(x);
}
int fast_ifloor (float x)
{
return (int) x - (x < 0.0f ? 1 : 0);
}
For sse2, you get:
floorf@plt:
jmpq *0x200bea(%rip) # 601020 <_GLOBAL_OFFSET_TABLE_+0x20>
pushq $0x1
jmpq 400410 <_init+0x20>
.plt.got:
jmpq *0x200bb2(%rip) # 600ff8 <_DYNAMIC+0x200>
xchg %ax,%ax
ifloor(float):
push %rax
callq 400430 <floorf@plt>
cvttss2si %xmm0,%eax
pop %rcx
retq
nopl 0x0(%rax)
fast_ifloor(float):
cvttss2si %xmm0,%eax
xorps %xmm1,%xmm1
xor %ecx,%ecx
ucomiss %xmm0,%xmm1
seta %cl
sub %ecx,%eax
retq
data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
main:
xor %eax,%eax
retq
nopw %cs:0x0(%rax,%rax,1)
nopl (%rax)
For sse4.2:
.plt.got:
jmpq *0x200c12(%rip) # 600ff8 <_DYNAMIC+0x200>
xchg %ax,%ax
ifloor(float):
roundss $0x9,%xmm0,%xmm0
cvttss2si %xmm0,%eax
retq
nopl 0x0(%rax,%rax,1)
fast_ifloor(float):
cvttss2si %xmm0,%eax
xorps %xmm1,%xmm1
xor %ecx,%ecx
ucomiss %xmm0,%xmm1
seta %cl
sub %ecx,%eax
retq
data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
main:
xor %eax,%eax
retq
nopw %cs:0x0(%rax,%rax,1)
nopl (%rax)
And for avx2:
.plt.got:
jmpq *0x200c12(%rip) # 600ff8 <_DYNAMIC+0x200>
xchg %ax,%ax
ifloor(float):
vroundss $0x9,%xmm0,%xmm0,%xmm0
vcvttss2si %xmm0,%eax
retq
nopl 0x0(%rax,%rax,1)
fast_ifloor(float):
vcvttss2si %xmm0,%eax
vxorps %xmm1,%xmm1,%xmm1
xor %ecx,%ecx
vucomiss %xmm0,%xmm1
seta %cl
sub %ecx,%eax
retq
data16 data16 nopw %cs:0x0(%rax,%rax,1)
main:
xor %eax,%eax
retq
nopw %cs:0x0(%rax,%rax,1)
nopl (%rax)
So SSE2 is super slow because it's making a function call into libm, whereas sse4.2 is using roundss.
But for sse4.2 and avx2, I don't understand why fast_ifloor is consistently faster than the two instructions for ifloor. Is it possible ifloor's vroundss followed by a vcvtss2si produces some kind of stall that's worse than fast_ifloor's vcvtss2i followed by several integer ops?
Keep in mind, unless someone does a custom build all distro builds only include the minimum sse level guaranteed for that arch, specifically x86_64 I think gets you SSE2 but no higher... |
That seems fairly conclusive. LGTM! |
Chris kind of convinced me -- the benchmarks are squirrelly, very hard to definitively show a large speedup -- that maybe the best bet is just to forget fast_ifloor entirely and concentrate on merely fixing ifloor to be correct. So I just pushed a new commit to this PR, removing fast_ifloor. I think that's probably best. |
LGTM |
Say it together, boys and girls: "Untested code is buggy code!"
In this episode, I realize that the ifloor() function in fmath.h,
and also the inconsistently-named floori() from simd.h, both of which
have been used for many years without complaint, have a rather
startling bug: they give the wrong answer for negative integers!
The formal specification is that it returns the equivalent of
int(floor(x)), but the implementation tried to execute faster by using a
bit twiddling trick to avoid the 'floor' part of the computation, just
cast to int and then correct for negative values.
I don't even remember how I stumbled across this, but something caused
me to make a unit test for ifloor, and to my surprise, it failed for
exact negative integers. (For reasons that are head-smackingly obvious
once you think about what the code is doing.)
Now, the obvious fix is to just implement as int(floor(x)), but
benchmarking shows that to be 25-50% slower! It may fail for negative
integers, but I can imagine cases where you are darned sure you will
only fed it positive values, OR maybe there are situations where it's ok
if the negative integer values are wrong. So I "fixed" ifloor(), but I
also made a fast_ifloor() that is the old code -- wrong for negative
integers, but much faster.
And so down the rabbit hole we go...
The simd versions have the same problem. Fix them there as well, and
also rename (floori -> ifloor) so that the simd and fmath versions
have the same naming convention. Add unit tests and benchmarks, and in
light of that, tighten up some of the SIMD code paths a bit.
floorfrac uses the same trick. Do I also make fast_floorfrac? NO!
Because along the way, I realized that a slightly different idiom
was much faster -- even slightly faster than the "fast but buggy"
version we used previously. So I took the win, and stuck with
"very slightly faster and not buggy," and did not bother confusing
the matter with the "buggy and even faster" variant.
Added simd versions of floorfrac, in fmath.h (where the float version
was defined).
texturesys.cpp: use the fmath floorfrac (remember, we put simd versions
there now) instead of defining it locally, which also lets us remove
a redundant quick_floor() that was identical to the old buggy
ifloor anyway.
Various unit tests for both the scalar and simd cases of these related
functions, which led to...
unittest.h: added an "OIIO_CHECK_EQUAL_APPROX" unit test macro,
basically it checks a result and considers it ok if it's within
0.1% of the larger of the quantities being compared. Some extra
trickery makes it work with either scalar or simd values.