Fixes for ifloor, fast_ifloor, floatfrac #1766

lgritz · 2017-09-13T08:47:18Z

Say it together, boys and girls: "Untested code is buggy code!"

In this episode, I realize that the ifloor() function in fmath.h,
and also the inconsistently-named floori() from simd.h, both of which
have been used for many years without complaint, have a rather
startling bug: they give the wrong answer for negative integers!

The formal specification is that it returns the equivalent of
int(floor(x)), but the implementation tried to execute faster by using a
bit twiddling trick to avoid the 'floor' part of the computation, just
cast to int and then correct for negative values.

I don't even remember how I stumbled across this, but something caused
me to make a unit test for ifloor, and to my surprise, it failed for
exact negative integers. (For reasons that are head-smackingly obvious
once you think about what the code is doing.)

Now, the obvious fix is to just implement as int(floor(x)), but
benchmarking shows that to be 25-50% slower! It may fail for negative
integers, but I can imagine cases where you are darned sure you will
only fed it positive values, OR maybe there are situations where it's ok
if the negative integer values are wrong. So I "fixed" ifloor(), but I
also made a fast_ifloor() that is the old code -- wrong for negative
integers, but much faster.

And so down the rabbit hole we go...

The simd versions have the same problem. Fix them there as well, and
also rename (floori -> ifloor) so that the simd and fmath versions
have the same naming convention. Add unit tests and benchmarks, and in
light of that, tighten up some of the SIMD code paths a bit.
floorfrac uses the same trick. Do I also make fast_floorfrac? NO!
Because along the way, I realized that a slightly different idiom
was much faster -- even slightly faster than the "fast but buggy"
version we used previously. So I took the win, and stuck with
"very slightly faster and not buggy," and did not bother confusing
the matter with the "buggy and even faster" variant.
Added simd versions of floorfrac, in fmath.h (where the float version
was defined).
texturesys.cpp: use the fmath floorfrac (remember, we put simd versions
there now) instead of defining it locally, which also lets us remove
a redundant quick_floor() that was identical to the old buggy
ifloor anyway.
Various unit tests for both the scalar and simd cases of these related
functions, which led to...
unittest.h: added an "OIIO_CHECK_EQUAL_APPROX" unit test macro,
basically it checks a result and considers it ok if it's within
0.1% of the larger of the quantities being compared. Some extra
trickery makes it work with either scalar or simd values.

Say it together, boys and girls: "Untested code is buggy code!" In this episode, I realize that the ifloor() function in fmath.h, and also the inconsistently-named floori() from simd.h, both of which have been used for many years without complaint, have a rather startling bug: they give the wrong answer for negative integers! The formal specification is that it returns the equivalent of int(floor(x)), but the implementation tried to execute faster by using a bit twiddling trick to avoid the 'floor' part of the computation, just cast to int and then correct for negative values. I don't even remember how I stumbled across this, but something caused me to make a unit test for ifloor, and to my surprise, it failed for exact negative integers. (For reasons that are head-smackingly obvious once you think about what the code is doing.) Now, the obvious fix is to just implement as int(floor(x)), but benchmarking shows that to be 25-50% slower! It may fail for negative integers, but I can imagine cases where you are darned sure you will only fed it positive values, OR maybe there are situations where it's ok if the negative integer values are wrong. So I "fixed" ifloor(), but I also made a fast_ifloor() that is the old code -- wrong for negative integers, but much faster. And so down the rabbit hole we go... * The simd versions have the same problem. Fix them there as well, and also rename (floori -> ifloor) so that the simd and fmath versions have the same naming convention. Add unit tests and benchmarks, and in light of that, tighten up some of the SIMD code paths a bit. * floorfrac uses the same trick. Do I also make fast_floorfrac? NO! Because along the way, I realized that a slightly different idiom was much faster -- even slightly faster than the "fast but buggy" version we used previously. So I took the win, and stuck with "very slightly faster and not buggy," and did not bother confusing the matter with the "buggy and even faster" variant. * Added simd versions of floorfrac, in fmath.h (where the float version was defined). * texturesys.cpp: use the fmath floorfrac (remember, we put simd versions there now) instead of defining it locally, which also lets us remove a redundant quick_floor() that was identical to the old buggy ifloor anyway. * Various unit tests for both the scalar and simd cases of these related functions, which led to... * unittest.h: added an "OIIO_CHECK_EQUAL_APPROX" unit test macro, basically it checks a result and considers it ok if it's within 0.1% of the larger of the quantities being compared. Some extra trickery makes it work with either scalar or simd values.

fpsunflower · 2017-09-13T15:42:24Z

src/libutil/fmath_test.cpp

+    OIIO_CHECK_EQUAL (ifloor(1.0f), 1);
+    OIIO_CHECK_EQUAL (ifloor(1.001f), 1);
+    float fval = 1.1; clobber(fval);
+    bench ("ifloor", [&](){ return DoNotOptimize(ifloor(fval)); });


Which architecture did you benchmark on? Because from what I looked up - roundss is a single cycle op on modern chips. It seems funny the extra comparison and all that would ever win.

In fact I seem to recall that in other places where we were using this trick, switching to plain floorf got faster as soon as we compiled with sse4.1+.

I will double check today by rebuilding/running for every SIMD architecture (on the work machines) and post the results.

I think the floor (which is roundss/roundps underneath) is not the expensive part, it's the cast to int that really hurts.

Right - but both functions cast to int ...

On the work Linux machines: (all numbers are Mvalues/second, bigger is better)

USE_SIMD=sse2:

float ifloor: 543 float fast_ifloor: 1348 vfloat4 ifloor: 501 vfloat4 fast_ifloor: 6400 vfloat8 ifloor: 501 vfloat8 fast_ifloor: 6394

USE_SIMD=sse4.1

float ifloor: 1066 float fast_ifloor: 1348 vfloat4 ifloor: 4264 vfloat4 fast_ifloor: 6400 vfloat8 ifloor: 4259 vfloat8 fast_ifloor: 6390

USE_SIMD=avx2

float ifloor: 1066 float fast_ifloor: 1348 vfloat4 ifloor: 4264 vfloat4 fast_ifloor: 6400 vfloat8 ifloor: 8510 vfloat8 fast_ifloor: 12800

godbolt.org:

#include <cmath> int ifloor (float x) { return (int)floorf(x); } int fast_ifloor (float x) { return (int) x - (x < 0.0f ? 1 : 0); }

For sse2, you get:

floorf@plt: jmpq *0x200bea(%rip) # 601020 <_GLOBAL_OFFSET_TABLE_+0x20> pushq $0x1 jmpq 400410 <_init+0x20> .plt.got: jmpq *0x200bb2(%rip) # 600ff8 <_DYNAMIC+0x200> xchg %ax,%ax ifloor(float): push %rax callq 400430 <floorf@plt> cvttss2si %xmm0,%eax pop %rcx retq nopl 0x0(%rax) fast_ifloor(float): cvttss2si %xmm0,%eax xorps %xmm1,%xmm1 xor %ecx,%ecx ucomiss %xmm0,%xmm1 seta %cl sub %ecx,%eax retq data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1) main: xor %eax,%eax retq nopw %cs:0x0(%rax,%rax,1) nopl (%rax)

For sse4.2:

.plt.got: jmpq *0x200c12(%rip) # 600ff8 <_DYNAMIC+0x200> xchg %ax,%ax ifloor(float): roundss $0x9,%xmm0,%xmm0 cvttss2si %xmm0,%eax retq nopl 0x0(%rax,%rax,1) fast_ifloor(float): cvttss2si %xmm0,%eax xorps %xmm1,%xmm1 xor %ecx,%ecx ucomiss %xmm0,%xmm1 seta %cl sub %ecx,%eax retq data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1) main: xor %eax,%eax retq nopw %cs:0x0(%rax,%rax,1) nopl (%rax)

And for avx2:

.plt.got: jmpq *0x200c12(%rip) # 600ff8 <_DYNAMIC+0x200> xchg %ax,%ax ifloor(float): vroundss $0x9,%xmm0,%xmm0,%xmm0 vcvttss2si %xmm0,%eax retq nopl 0x0(%rax,%rax,1) fast_ifloor(float): vcvttss2si %xmm0,%eax vxorps %xmm1,%xmm1,%xmm1 xor %ecx,%ecx vucomiss %xmm0,%xmm1 seta %cl sub %ecx,%eax retq data16 data16 nopw %cs:0x0(%rax,%rax,1) main: xor %eax,%eax retq nopw %cs:0x0(%rax,%rax,1) nopl (%rax)

So SSE2 is super slow because it's making a function call into libm, whereas sse4.2 is using roundss.

But for sse4.2 and avx2, I don't understand why fast_ifloor is consistently faster than the two instructions for ifloor. Is it possible ifloor's vroundss followed by a vcvtss2si produces some kind of stall that's worse than fast_ifloor's vcvtss2i followed by several integer ops?

hobbes1069 · 2017-09-13T20:38:51Z

Keep in mind, unless someone does a custom build all distro builds only include the minimum sse level guaranteed for that arch, specifically x86_64 I think gets you SSE2 but no higher...

fpsunflower · 2017-09-13T20:42:56Z

That seems fairly conclusive. LGTM!

lgritz · 2017-09-14T03:20:45Z

Chris kind of convinced me -- the benchmarks are squirrelly, very hard to definitively show a large speedup -- that maybe the best bet is just to forget fast_ifloor entirely and concentrate on merely fixing ifloor to be correct.

So I just pushed a new commit to this PR, removing fast_ifloor. I think that's probably best.

fpsunflower · 2017-09-14T15:56:59Z

LGTM

fpsunflower reviewed Sep 13, 2017

View reviewed changes

Remove fast_ifloor

0b85731

lgritz merged commit 5c9dc68 into AcademySoftwareFoundation:master Sep 14, 2017

lgritz deleted the lg-fmath branch September 15, 2017 19:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes for ifloor, fast_ifloor, floatfrac #1766

Fixes for ifloor, fast_ifloor, floatfrac #1766

lgritz commented Sep 13, 2017

fpsunflower Sep 13, 2017

fpsunflower Sep 13, 2017

lgritz Sep 13, 2017

fpsunflower Sep 13, 2017

lgritz Sep 13, 2017

lgritz Sep 13, 2017

hobbes1069 commented Sep 13, 2017

fpsunflower commented Sep 13, 2017

lgritz commented Sep 14, 2017

fpsunflower commented Sep 14, 2017

Fixes for ifloor, fast_ifloor, floatfrac #1766

Fixes for ifloor, fast_ifloor, floatfrac #1766

Conversation

lgritz commented Sep 13, 2017

fpsunflower Sep 13, 2017

Choose a reason for hiding this comment

fpsunflower Sep 13, 2017

Choose a reason for hiding this comment

lgritz Sep 13, 2017

Choose a reason for hiding this comment

fpsunflower Sep 13, 2017

Choose a reason for hiding this comment

lgritz Sep 13, 2017

Choose a reason for hiding this comment

lgritz Sep 13, 2017

Choose a reason for hiding this comment

hobbes1069 commented Sep 13, 2017

fpsunflower commented Sep 13, 2017

lgritz commented Sep 14, 2017

fpsunflower commented Sep 14, 2017