qmath: micro-optimize the BoxOnPlaneSide function #1142
About this:
    bool bit0 = dist[ 0 ] >= p->dist;
    bool bit1 = dist[ 1 ] < p->dist;
    return bit0 | ( bit1 << 1 );
We can also rewrite the last line this way:
    return bit0 + ( bit1 * 2 );
The disassembly is the same in both cases:
    // return bit0 | ( bit1 << 1 );
    lea eax, [rax + rcx*2]
    ret
    // return bit0 + ( bit1 * 2 );
    lea eax, [rax + rcx*2]
    ret
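A minimal sketch of why the two spellings are interchangeable here, assuming both flags are plain bool values (the helper names combineWithOr and combineWithAdd are made up for illustration and are not part of the engine). Since bit0 only occupies bit 0 and the shifted bit1 only occupies bit 1, the bitwise OR and the sum cannot differ:

```cpp
// Hypothetical helpers, for illustration only.
// Combine two boolean flags into a 2-bit mask, written two ways.
constexpr int combineWithOr( bool bit0, bool bit1 )
{
	// Both bools are promoted to int before << and | are applied.
	return bit0 | ( bit1 << 1 );
}

constexpr int combineWithAdd( bool bit0, bool bit1 )
{
	// Same value: bit0 contributes 0 or 1, bit1 contributes 0 or 2.
	return bit0 + ( bit1 * 2 );
}

// The two forms agree on all four input combinations.
static_assert( combineWithOr( false, false ) == combineWithAdd( false, false ), "" );
static_assert( combineWithOr( true, false ) == combineWithAdd( true, false ), "" );
static_assert( combineWithOr( false, true ) == combineWithAdd( false, true ), "" );
static_assert( combineWithOr( true, true ) == combineWithAdd( true, true ), "" );
```

The checks run at compile time, so the snippet only demonstrates value equivalence; it says nothing about which form a given compiler accepts without warnings.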
MSVC prefers: return bit0 + ( bit1 * 2 ); Otherwise it aborts compilation and prints that:
|
The transformation in "BoxOnPlaneSide: return early" appears incorrect. But I think 0-7 are the only possible values anyway. So I'd just start with ASSERT_EQ(signbits & 7, 7).
If you need to convert a bool to int you can use unary +.
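To illustrate the unary + suggestion, here is a small sketch (sideBits and its parameters are hypothetical names, not the engine's). Unary + applies integral promotion, so a bool operand becomes an int with value 0 or 1 without a cast:

```cpp
#include <type_traits>

// Unary + promotes a bool operand to int.
static_assert( std::is_same<decltype( +true ), int>::value,
	"unary + on bool yields int" );

// Hypothetical helper: builds the 2-bit result from two comparisons.
int sideBits( float dist0, float dist1, float planeDist )
{
	// Each comparison yields a bool; unary + turns it into an int (0 or 1).
	int bit0 = +( dist0 > planeDist );
	int bit1 = +( dist1 < planeDist );
	return bit0 + 2 * bit1;
}
```

For example, sideBits( 5.0f, 1.0f, 3.0f ) sets both bits, since 5 > 3 and 1 < 3.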
Ah right, What I understand is that if It looks like I misread the second comparison, I should have done: if ( p->signbits >= 8 )
{
return 2 * ( 0 < p->dist );
} Let's try with |
I did |
No, that last commit breaks the culling. I was storing to a |
d035f34 to 3453828
233b6b8 to ac9768f
Just for fun I tried making an SSE version. I don't have much of an idea when this stuff is faster though, so I wouldn't suggest using it unless you really have a good way of benchmarking.
    int BoxOnPlaneSide(const vec3_t emins, const vec3_t emaxs, const cplane_t* p)
    {
        ASSERT_LT(emins[0], emaxs[0]);
        ASSERT_LT(emins[1], emaxs[1]);
        ASSERT_LT(emins[2], emaxs[2]);
        auto mins = sseLoadVec3Unsafe(emins);
        auto maxs = sseLoadVec3Unsafe(emaxs);
        auto normal = sseLoadVec3Unsafe(p->normal);
        auto prod0 = _mm_mul_ps(maxs, normal);
        auto prod1 = _mm_mul_ps(mins, normal);
        auto pmax = _mm_max_ps(prod0, prod1);
        auto pmin = _mm_min_ps(prod0, prod1);
        ALIGNED(16, vec4_t pmaxv);
        ALIGNED(16, vec4_t pminv);
        _mm_store_ps(pmaxv, pmax);
        _mm_store_ps(pminv, pmin);
        float dist0 = pmaxv[0] + pmaxv[1] + pmaxv[2];
        float dist1 = pminv[0] + pminv[1] + pminv[2];
        return (dist0 > p->dist) + 2 * (dist1 < p->dist);
    }
628f8d3 to 14a21a3
I added another change that helps the compiler to vectorize the multiplication. Actually those lines can be completely removed by rewriting the vec3_t bounds[ 2 ];
VectorCopy( emins, bounds[ 0 ] );
VectorCopy( emaxs, bounds[ 1 ] ); But this would require a more intrusive change modifying many files (even the game code relies on this function). Nevertheless, the compiler still produces fewer instructions. In fact it's possible that the copy is avoided because most engine calls to this function are actually using On the left is before the multiplication vectorization, We can see there is the same number of instructions in the center and on the right, but the instructions are a bit different, the engine can really benefit by always using everywhere a |
My duplication of On the left: before vector multiplication, on the right: after vector multiplication, without |
I believe I cannot do more without going out of the scope of this function. |
Ah great! Once everything optimizing |
I included the SSE code with a The non-SSE code will benefit other architectures (nacl, arm64). |
I also verified the code behaved the same before this branch, with my optimizations, and with slipher's SSE code. What I do for this is that I go inside ATCSHD central room, set |
Cool, the compiler did some further vectorization of the "horizontal add" and got rid of the I propose a documentation update since the function can now return 0 in the unlikely case of a zero-size box. I looked at the callsites and they seem OK as-is - a 0 return won't harm them.
Cleaned-up SSE version with types and some comments:
    int BoxOnPlaneSide(const vec3_t emins, const vec3_t emaxs, const cplane_t* p)
    {
        __m128 mins = sseLoadVec3Unsafe( emins );
        __m128 maxs = sseLoadVec3Unsafe( emaxs );
        __m128 normal = sseLoadVec3Unsafe( p->normal );
        // Element-wise products
        __m128 prod0 = _mm_mul_ps( maxs, normal );
        __m128 prod1 = _mm_mul_ps( mins, normal );
        // Element-wise min/max
        __m128 pmax = _mm_max_ps( prod0, prod1 );
        __m128 pmin = _mm_min_ps( prod0, prod1 );
        ALIGNED( 16, vec4_t pmaxv );
        ALIGNED( 16, vec4_t pminv );
        _mm_store_ps( pmaxv, pmax );
        _mm_store_ps( pminv, pmin );
        // Maximum dot product with p->normal over the 8 box corners
        float dist0 = pmaxv[ 0 ] + pmaxv[ 1 ] + pmaxv[ 2 ];
        // Minimum dot product with p->normal over the 8 box corners
        float dist1 = pminv[ 0 ] + pminv[ 1 ] + pminv[ 2 ];
        return ( dist0 > p->dist ) + 2 * ( dist1 < p->dist );
    }
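For readers without SSE at hand, the same min/max idea can be sketched in portable scalar C++. This is only an illustration under assumptions: Plane is a hypothetical stand-in for cplane_t (only normal and dist are modeled), and boxOnPlaneSideScalar is not the engine's function. Per axis, the larger of the two products goes into the maximum corner distance and the smaller into the minimum one:

```cpp
#include <algorithm>

// Hypothetical stand-in for cplane_t: only the fields used here.
struct Plane
{
	float normal[ 3 ];
	float dist;
};

// Scalar sketch of the min/max approach used by the SSE version.
// Returns bit 0 set if some corner is in front of the plane,
// bit 1 set if some corner is behind it (0 is possible for a
// zero-size box lying exactly on the plane).
int boxOnPlaneSideScalar( const float mins[ 3 ], const float maxs[ 3 ], const Plane& p )
{
	float dist0 = 0.0f; // maximum dot product over the 8 corners
	float dist1 = 0.0f; // minimum dot product over the 8 corners

	for ( int i = 0; i < 3; i++ )
	{
		float prod0 = maxs[ i ] * p.normal[ i ];
		float prod1 = mins[ i ] * p.normal[ i ];
		dist0 += std::max( prod0, prod1 );
		dist1 += std::min( prod0, prod1 );
	}

	return ( dist0 > p.dist ) + 2 * ( dist1 < p.dist );
}
```

With a box spanning 0 to 10 on every axis and the plane x = 5, both bits are set because the box straddles the plane; moving the plane to x = 20 leaves only the "behind" bit.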
src/engine/qcommon/q_math.cpp (outdated)
    float dist[ 2 ];
    int sides, b, i;
    #if idx86_sse
    ASSERT_LT(emins[0], emaxs[0]);
Asserts can be dropped since I checked that the 0-size case is OK.
src/engine/qcommon/q_math.cpp (outdated)
    float dist1 = pminv[0] + pminv[1] + pminv[2];
    return (dist0 > p->dist) + 2 * (dist1 < p->dist);
    #else
    ASSERT( p->signbits < 8 );
Use ASSERT_LT
src/engine/qcommon/q_math.cpp (outdated)
    dist[ !index[ 1 ] ] += rmins[ 1 ];
    dist[ !index[ 2 ] ] += rmins[ 2 ];

    int bit0 = dist[ 0 ] >= p->dist;
This could be > for consistency with the SSE version.
What does it change?
Also: why is it modified in SSE version? Is testing for equality useless?
It doesn't really matter since you can't count on exact float equality, but it enables more culling in the case of exact equality and makes the algorithm the same as the SSE one.
With the SSE one it matters a bit more since in the axis-aligned plane case, exact equality might be somewhat common. Changing >= to > there made the results better agree with the non-SSE algorithm (and enables more culling).
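A small sketch of the exact-equality case described above, using hypothetical helper names (sidesInclusive, sidesStrict) that take the already-computed maximum and minimum corner distances. Consider a box spanning x = 0 to x = 10 and the axis-aligned plane x = 10, so dist0 is exactly 10 and dist1 is exactly 0:

```cpp
// Hypothetical helpers: the only difference is the first comparison.
constexpr int sidesInclusive( float dist0, float dist1, float planeDist )
{
	return ( dist0 >= planeDist ) + 2 * ( dist1 < planeDist );
}

constexpr int sidesStrict( float dist0, float dist1, float planeDist )
{
	return ( dist0 > planeDist ) + 2 * ( dist1 < planeDist );
}

// Box touching the plane from behind: dist0 == planeDist exactly.
// With >= the box is reported as crossing (both bits set).
static_assert( sidesInclusive( 10.0f, 0.0f, 10.0f ) == 3, "" );
// With > it is reported as entirely behind, so it can be culled.
static_assert( sidesStrict( 10.0f, 0.0f, 10.0f ) == 2, "" );
// Away from exact equality the two variants agree.
static_assert( sidesInclusive( 10.0f, 0.0f, 5.0f ) == sidesStrict( 10.0f, 0.0f, 5.0f ), "" );
```

The values are chosen to be exactly representable floats, which is what makes the equality case reproducible here.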
OK. 👍️
I shared that part.
19e5137 to f51fe3c
Along with many 1-to-1 transformations, there are 2 differences being introduced:
- The condition for p->signbits >= 8 is entirely dropped as it is assumed p->signbits can't be > 7. An assert for p->signbits < 8 is added instead.
- The first bit is modified from dist[0] >= p->dist to dist[0] > p->dist. slipher suggested it and said:
> It doesn't really matter since you can't count on exact float equality,
> but it enables more culling in the case of exact equality and makes the
> algorithm the same as the SSE one.
> With the SSE one it matters a bit more since in the axis-aligned plane
> case, exact equality might be somewhat common. Changing >= to > there
> made the results better agree with the non-SSE algorithm (and enables
> more culling).
-- #1142 (comment)
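As a sketch of why p->signbits is expected to stay below 8: assuming signbits follows the usual Quake-style convention of one bit per negative normal component (this thread does not show how the engine computes it, and signbitsForNormal is a hypothetical name), only bits 0 to 2 can ever be set:

```cpp
// Hypothetical helper, assuming the common convention that bit i of
// signbits is set when normal[i] is negative.
int signbitsForNormal( const float normal[ 3 ] )
{
	int bits = 0;

	for ( int i = 0; i < 3; i++ )
	{
		if ( normal[ i ] < 0.0f )
		{
			bits |= 1 << i;
		}
	}

	// Three one-bit flags: the result is always in the range 0-7.
	return bits;
}
```

Under that assumption a value of 8 or more could only come from an uninitialized or corrupted plane, which is the kind of case an assert is suited to catch.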
f51fe3c to 82f994e
I rebased and squashed the step-by-step commits. I have written some useful comment in
This looks ready to me. |
82f994e to a5f794c
LGTM |
Extracted from:
This function is heavily used in CPU culling and milking the maximum of performance from it is welcome.
Patches are meant to be squashed.