Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for ifloor, fast_ifloor, floatfrac #1766

Merged
merged 2 commits into from
Sep 14, 2017

Conversation

lgritz
Copy link
Collaborator

@lgritz lgritz commented Sep 13, 2017

Say it together, boys and girls: "Untested code is buggy code!"

In this episode, I realize that the ifloor() function in fmath.h,
and also the inconsistently-named floori() from simd.h, both of which
have been used for many years without complaint, have a rather
startling bug: they give the wrong answer for negative integers!

The formal specification is that it returns the equivalent of
int(floor(x)), but the implementation tried to execute faster by using a
bit twiddling trick to avoid the 'floor' part of the computation, just
cast to int and then correct for negative values.

I don't even remember how I stumbled across this, but something caused
me to make a unit test for ifloor, and to my surprise, it failed for
exact negative integers. (For reasons that are head-smackingly obvious
once you think about what the code is doing.)

Now, the obvious fix is to just implement as int(floor(x)), but
benchmarking shows that to be 25-50% slower! It may fail for negative
integers, but I can imagine cases where you are darned sure you will
only fed it positive values, OR maybe there are situations where it's ok
if the negative integer values are wrong. So I "fixed" ifloor(), but I
also made a fast_ifloor() that is the old code -- wrong for negative
integers, but much faster.

And so down the rabbit hole we go...

  • The simd versions have the same problem. Fix them there as well, and
    also rename (floori -> ifloor) so that the simd and fmath versions
    have the same naming convention. Add unit tests and benchmarks, and in
    light of that, tighten up some of the SIMD code paths a bit.

  • floorfrac uses the same trick. Do I also make fast_floorfrac? NO!
    Because along the way, I realized that a slightly different idiom
    was much faster -- even slightly faster than the "fast but buggy"
    version we used previously. So I took the win, and stuck with
    "very slightly faster and not buggy," and did not bother confusing
    the matter with the "buggy and even faster" variant.

  • Added simd versions of floorfrac, in fmath.h (where the float version
    was defined).

  • texturesys.cpp: use the fmath floorfrac (remember, we put simd versions
    there now) instead of defining it locally, which also lets us remove
    a redundant quick_floor() that was identical to the old buggy
    ifloor anyway.

  • Various unit tests for both the scalar and simd cases of these related
    functions, which led to...

  • unittest.h: added an "OIIO_CHECK_EQUAL_APPROX" unit test macro,
    basically it checks a result and considers it ok if it's within
    0.1% of the larger of the quantities being compared. Some extra
    trickery makes it work with either scalar or simd values.

Say it together, boys and girls: "Untested code is buggy code!"

In this episode, I realize that the ifloor() function in fmath.h,
and also the inconsistently-named floori() from simd.h, both of which
have been used for many years without complaint, have a rather
startling bug: they give the wrong answer for negative integers!

The formal specification is that it returns the equivalent of
int(floor(x)), but the implementation tried to execute faster by using a
bit twiddling trick to avoid the 'floor' part of the computation, just
cast to int and then correct for negative values.

I don't even remember how I stumbled across this, but something caused
me to make a unit test for ifloor, and to my surprise, it failed for
exact negative integers. (For reasons that are head-smackingly obvious
once you think about what the code is doing.)

Now, the obvious fix is to just implement as int(floor(x)), but
benchmarking shows that to be 25-50% slower! It may fail for negative
integers, but I can imagine cases where you are darned sure you will
only fed it positive values, OR maybe there are situations where it's ok
if the negative integer values are wrong. So I "fixed" ifloor(), but I
also made a fast_ifloor() that is the old code -- wrong for negative
integers, but much faster.

And so down the rabbit hole we go...

* The simd versions have the same problem. Fix them there as well, and
  also rename (floori -> ifloor) so that the simd and fmath versions
  have the same naming convention. Add unit tests and benchmarks, and in
  light of that, tighten up some of the SIMD code paths a bit.

* floorfrac uses the same trick. Do I also make fast_floorfrac?  NO!
  Because along the way, I realized that a slightly different idiom
  was much faster -- even slightly faster than the "fast but buggy"
  version we used previously. So I took the win, and stuck with
  "very slightly faster and not buggy," and did not bother confusing
  the matter with the "buggy and even faster" variant.

* Added simd versions of floorfrac, in fmath.h (where the float version
  was defined).

* texturesys.cpp: use the fmath floorfrac (remember, we put simd versions
  there now) instead of defining it locally, which also lets us remove
  a redundant quick_floor() that was identical to the old buggy
  ifloor anyway.

* Various unit tests for both the scalar and simd cases of these related
  functions, which led to...

* unittest.h: added an "OIIO_CHECK_EQUAL_APPROX" unit test macro,
  basically it checks a result and considers it ok if it's within
  0.1% of the larger of the quantities being compared. Some extra
  trickery makes it work with either scalar or simd values.
OIIO_CHECK_EQUAL (ifloor(1.0f), 1);
OIIO_CHECK_EQUAL (ifloor(1.001f), 1);
float fval = 1.1; clobber(fval);
bench ("ifloor", [&](){ return DoNotOptimize(ifloor(fval)); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which architecture did you benchmark on? Because from what I looked up - roundss is a single cycle op on modern chips. It seems funny the extra comparison and all that would ever win.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact I seem to recall that in other places where we were using this trick, switching to plain floorf got faster as soon as we compiled with sse4.1+.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will double check today by rebuilding/running for every SIMD architecture (on the work machines) and post the results.

I think the floor (which is roundss/roundps underneath) is not the expensive part, it's the cast to int that really hurts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - but both functions cast to int ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the work Linux machines: (all numbers are Mvalues/second, bigger is better)

USE_SIMD=sse2:

float ifloor:  543
float fast_ifloor:   1348
vfloat4 ifloor:   501
vfloat4 fast_ifloor:   6400
vfloat8 ifloor:   501
vfloat8 fast_ifloor:   6394

USE_SIMD=sse4.1

float ifloor:  1066
float fast_ifloor: 1348
vfloat4 ifloor:  4264
vfloat4 fast_ifloor:  6400
vfloat8 ifloor:  4259
vfloat8 fast_ifloor:  6390

USE_SIMD=avx2

float ifloor: 1066
float fast_ifloor:  1348
vfloat4 ifloor:  4264
vfloat4 fast_ifloor:  6400
vfloat8 ifloor:  8510
vfloat8 fast_ifloor:  12800

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

godbolt.org:

#include <cmath>

int ifloor (float x)
{
    return (int)floorf(x);
}

int fast_ifloor (float x)
{
    return (int) x - (x < 0.0f ? 1 : 0);
}

For sse2, you get:

floorf@plt:
 jmpq   *0x200bea(%rip)        # 601020 <_GLOBAL_OFFSET_TABLE_+0x20>
 pushq  $0x1
 jmpq   400410 <_init+0x20>
.plt.got:
 jmpq   *0x200bb2(%rip)        # 600ff8 <_DYNAMIC+0x200>
 xchg   %ax,%ax
ifloor(float):
 push   %rax
 callq  400430 <floorf@plt>
 cvttss2si %xmm0,%eax
 pop    %rcx
 retq   
 nopl   0x0(%rax)
fast_ifloor(float):
 cvttss2si %xmm0,%eax
 xorps  %xmm1,%xmm1
 xor    %ecx,%ecx
 ucomiss %xmm0,%xmm1
 seta   %cl
 sub    %ecx,%eax
 retq   
 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
main:
 xor    %eax,%eax
 retq   
 nopw   %cs:0x0(%rax,%rax,1)
 nopl   (%rax)

For sse4.2:

.plt.got:
 jmpq   *0x200c12(%rip)        # 600ff8 <_DYNAMIC+0x200>
 xchg   %ax,%ax
ifloor(float):
 roundss $0x9,%xmm0,%xmm0
 cvttss2si %xmm0,%eax
 retq   
 nopl   0x0(%rax,%rax,1)
fast_ifloor(float):
 cvttss2si %xmm0,%eax
 xorps  %xmm1,%xmm1
 xor    %ecx,%ecx
 ucomiss %xmm0,%xmm1
 seta   %cl
 sub    %ecx,%eax
 retq   
 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
main:
 xor    %eax,%eax
 retq   
 nopw   %cs:0x0(%rax,%rax,1)
 nopl   (%rax)

And for avx2:

.plt.got:
 jmpq   *0x200c12(%rip)        # 600ff8 <_DYNAMIC+0x200>
 xchg   %ax,%ax
ifloor(float):
 vroundss $0x9,%xmm0,%xmm0,%xmm0
 vcvttss2si %xmm0,%eax
 retq   
 nopl   0x0(%rax,%rax,1)
fast_ifloor(float):
 vcvttss2si %xmm0,%eax
 vxorps %xmm1,%xmm1,%xmm1
 xor    %ecx,%ecx
 vucomiss %xmm0,%xmm1
 seta   %cl
 sub    %ecx,%eax
 retq   
 data16 data16 nopw %cs:0x0(%rax,%rax,1)
main:
 xor    %eax,%eax
 retq   
 nopw   %cs:0x0(%rax,%rax,1)
 nopl   (%rax)

So SSE2 is super slow because it's making a function call into libm, whereas sse4.2 is using roundss.

But for sse4.2 and avx2, I don't understand why fast_ifloor is consistently faster than the two instructions for ifloor. Is it possible ifloor's vroundss followed by a vcvtss2si produces some kind of stall that's worse than fast_ifloor's vcvtss2i followed by several integer ops?

@hobbes1069
Copy link
Contributor

Keep in mind, unless someone does a custom build all distro builds only include the minimum sse level guaranteed for that arch, specifically x86_64 I think gets you SSE2 but no higher...

@fpsunflower
Copy link
Contributor

That seems fairly conclusive. LGTM!

@lgritz
Copy link
Collaborator Author

lgritz commented Sep 14, 2017

Chris kind of convinced me -- the benchmarks are squirrelly, very hard to definitively show a large speedup -- that maybe the best bet is just to forget fast_ifloor entirely and concentrate on merely fixing ifloor to be correct.

So I just pushed a new commit to this PR, removing fast_ifloor. I think that's probably best.

@fpsunflower
Copy link
Contributor

LGTM

@lgritz lgritz merged commit 5c9dc68 into AcademySoftwareFoundation:master Sep 14, 2017
@lgritz lgritz deleted the lg-fmath branch September 15, 2017 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants