qmath: micro-optimize the `BoxOnPlaneSide` function #1142

illwieckz · 2024-05-13T10:16:11Z

Extracted from:

WIP: renderer: micro-optimize CPU culling and other things #1125

This function is heavily used in CPU culling and milking the maximum of performance from it is welcome.

Patches are meant to be squashed.

illwieckz · 2024-05-13T10:24:47Z

About this:

	bool bit0 = dist[ 0 ] >= p->dist;
	bool bit1 = dist[ 1 ] < p->dist;
	return bit0 | ( bit1 << 1 );

We can also rewrite the last line this way:

	return bit0 + ( bit1 * 2 );

The disassembly is the same in both cases:

//	return bit0 | ( bit1 << 1 );
	lea          eax, [rax + rcx*2]
	ret

//	return bit0 + ( bit1 * 2 );
	lea          eax, [rax + rcx*2]
	ret

illwieckz · 2024-05-13T14:27:42Z

MSVC prefers:

	return bit0 + ( bit1 * 2 );

Otherwise it aborts compilation and prints that:

src\engine\qcommon\q_math.cpp(781,23):
  error C2220: the following warning is treated as an error
  [build\engine-lib.vcxproj]
src\engine\qcommon\q_math.cpp(781,23):
  warning C4805: '|': unsafe mix of type 'bool' and type 'int' in operation
  [build\engine-lib.vcxproj]

slipher

The transformation BoxOnPlaneSide: return early appears incorrect. But I think 0-7 are the only possible values anyway. So I'd just start with ASSERT_EQ(signbits & 7, 7).

If you need to convert a bool to int you can use unary +.

illwieckz · 2024-05-13T16:29:06Z

Ah right, dist[ 1 ] < p->dist can be false if p->dist is 0.

What I understand is that if p->signbits >= 8, then dist[ 0 ] = dist[ 1 ] = 0 (as the old comment said), then dist[ 0 ] >= p->dist is false assuming p->dist is always positive, so first bit is 0, so the only remaining test is dist[ 1 ] < p->dist for the second bit.

It looks like I misread the second comparison, I should have done:

	if ( p->signbits >= 8 )
	{
		return 2 * ( 0 < p->dist );
	}

Let's try with ASSERT_EQ( p->signbits & 7, 7 ).

illwieckz · 2024-05-13T16:42:31Z

I did ASSERT( p->signbits < 8 ) instead. I loaded some maps and moved around and haven't caught an assertion.

illwieckz · 2024-05-13T17:06:25Z

I added a commit to encourage the compilator to vectorize the &1 operation. This produced less instructions (on the right):

illwieckz · 2024-05-13T17:23:30Z

No, that last commit breaks the culling. I was storing to a bool…

illwieckz · 2024-05-13T17:31:17Z

Now it works, without breaking the culling:

slipher · 2024-05-14T07:59:43Z

Just for fun I tried making an SSE version. I don't have much of an idea when this stuff is faster though, so I wouldn't suggest using it unless you really have a good way of benchmarking.

int BoxOnPlaneSide(const vec3_t emins, const vec3_t emaxs, const cplane_t* p)
{
	ASSERT_LT(emins[0], emaxs[0]);
	ASSERT_LT(emins[1], emaxs[1]);
	ASSERT_LT(emins[2], emaxs[2]);

	auto mins = sseLoadVec3Unsafe(emins);
	auto maxs = sseLoadVec3Unsafe(emaxs);
	auto normal = sseLoadVec3Unsafe(p->normal);

	auto prod0 = _mm_mul_ps(maxs, normal);
	auto prod1 = _mm_mul_ps(mins, normal);

	auto pmax = _mm_max_ps(prod0, prod1);
	auto pmin = _mm_min_ps(prod0, prod1);

	ALIGNED(16, vec4_t pmaxv);
	ALIGNED(16, vec4_t pminv);
	_mm_store_ps(pmaxv, pmax);
	_mm_store_ps(pminv, pmin);

	float dist0 = pmaxv[0] + pmaxv[1] + pmaxv[2];
	float dist1 = pminv[0] + pminv[1] + pminv[2];
	return (dist0 > p->dist) + 2 * (dist1 < p->dist);
}

illwieckz · 2024-05-14T08:28:21Z

I added another change that helps the compiler to vectorize the multiplication.

Actually those lines can be completely removed by rewriting the BoxOnPlaneSide function to directly take a vec3_t[2] as input:

	vec3_t bounds[ 2 ];
	VectorCopy( emins, bounds[ 0 ] );
	VectorCopy( emaxs, bounds[ 1 ] );

But this would require a more intrusive change modifying many files (even the game code relies on this function).

Nevertheless, the compiler still produces less instructions. In fact it's possible that the copy is avoided because most engine calls to this function are actually using emins and emaxs pairs that are already bounds[ 0 ] and bounds[ 1 ] in the calling code.

On the left is before the multiplication vectorization,
In the center is after the multiplication vectorization,
On the right is after the function has been modified to avoid a copy (patch not included in this branch):

We can see there is the same amount of instructions in the center on the right, but the instructions are a bit different, the engine can really benefit by always using everywhere a bounds_t type containing boths mins and maxs, but that's out of scope of that branch.

illwieckz · 2024-05-14T09:19:23Z

My duplication of p->normal was an attempt to see if the compiler could do a single 8-sized vector multiplication instead of two 4-size vector multiplications, it looks like even with a native build on my modern system it does not. Worst than that, it does more instructions when doing two 4-size vector multiplications on the same data (it could simply drop the copy if useless). So I deduplicated and it seems to not do worse.

On the left: before vector multiplication, on the right: after vector multiplication, wihout normal duplication, note that the lines are shifted up by one because the compiler even saved on one arg compared to prior the vectorization:

illwieckz · 2024-05-14T09:23:28Z

I believe I cannot do more without going out of the scope of this function.

illwieckz · 2024-05-14T09:35:52Z

Just for fun I tried making an SSE version. I don't have much of an idea when this stuff is faster though, so I wouldn't suggest using it unless you really have a good way of benchmarking.

Ah great! Once everything optimizing R_RecursiveWorldNode is merged it will be fun to compare.

illwieckz · 2024-05-14T10:04:00Z

A slightly higher FPS, less CPU usage spent in both R_RecursiveWorldNode and BoxOnPlaneSide functions (comparatively to the rest):

I reproduce it on multiple runs.

I also tested the early returns for the simple cases before the SSE code and it was not faster than just replacing the whole body.

illwieckz · 2024-05-14T10:07:16Z

I included the SSE code with a #ifdef idx86_sse.

The non-SSE code will benefit other architectures (nacl, arm64).

illwieckz · 2024-05-14T10:10:59Z

I also verified the code behaved the same before this branch, with my optimizations, and with slipher's SSE code.

What I do for this is that I go inside ATCSHD central room, set r_lockPVS 1 then navigate the map and look which surfaces are rendered and which aren't. The exact same surfaces are rendered and not rendered with the 3 code variants.

illwieckz · 2024-05-14T10:32:57Z

The generated code comparison between my optimized code and slipher's SSE code 🤪️:

illwieckz · 2024-05-14T11:08:10Z

Another thing that happens with that SSE code is that it is now small enough to be inlined when enabling LTO:

The non-SSE code doesn't get inlined when enabling LTO.

slipher · 2024-05-14T18:48:15Z

Cool, the compiler did some further vectorization of the "horizontal add" and got rid of the _mm_store_ps. (Horizontal add is jargon for adding up the elements of a single SSE vector.) I had looked at some horizontal add code here but wasn't sure that it's worthwhile for summing only 3 (rather than 4) elements.

I propose a documentation update since the function can now return 0 in the unlikely case of a zero-size box. I looked at the callsites and they seem OK as-is - a 0 return won't harm them.

/*
 * ==================
 * BoxOnPlaneSide
 *
 * Function used to rule out nonzero-volume intersection of an AABB and a half-space.
 * Returns a 2-bit mask.
 * 
 * If BIT( 0 ) is clear, the box definitely has zero volume on the "front" side of the 
 *    plane (i.e. the side further in the direction of the normal vector).
 * If BIT( 1 ) is clear, the box definitely has zero volume on the back side of the plane.
 * ==================
 */

Cleaned-up SSE version with types and some comments:

int BoxOnPlaneSide(const vec3_t emins, const vec3_t emaxs, const cplane_t* p)
{
	__m128 mins = sseLoadVec3Unsafe( emins );
	__m128 maxs = sseLoadVec3Unsafe( emaxs );
	__m128 normal = sseLoadVec3Unsafe( p->normal );

	// Element-wise products
	__m128 prod0 = _mm_mul_ps( maxs, normal );
	__m128 prod1 = _mm_mul_ps( mins, normal );

	// Element-wise min/max
	__m128 pmax = _mm_max_ps( prod0, prod1 );
	__m128 pmin = _mm_min_ps( prod0, prod1 );

	ALIGNED( 16, vec4_t pmaxv );
	ALIGNED( 16, vec4_t pminv );
	_mm_store_ps( pmaxv, pmax );
	_mm_store_ps( pminv, pmin );

	// Maximum dot product with p->normal over the 8 box corners
	float dist0 = pmaxv[ 0 ] + pmaxv[ 1 ] + pmaxv[ 2 ];
	// Minimum dot product with p->normal over the 8 box corners
	float dist1 = pminv[ 0 ] + pminv[ 1 ] + pminv[ 2 ];

	return ( dist0 > p->dist ) + 2 * ( dist1 < p->dist );
}

slipher · 2024-05-14T18:50:03Z

src/engine/qcommon/q_math.cpp

-	float dist[ 2 ];
-	int   sides, b, i;
+#if idx86_sse
+	ASSERT_LT(emins[0], emaxs[0]);


Asserts can be dropped since I checked that the 0-size case is OK.

slipher · 2024-05-14T18:50:17Z

src/engine/qcommon/q_math.cpp

+	float dist1 = pminv[0] + pminv[1] + pminv[2];
+	return (dist0 > p->dist) + 2 * (dist1 < p->dist);
+#else
+	ASSERT( p->signbits < 8 );


Use ASSERT_LT

slipher · 2024-05-14T18:53:05Z

src/engine/qcommon/q_math.cpp

+	dist[ !index[ 1 ] ] += rmins[ 1 ];
+	dist[ !index[ 2 ] ] += rmins[ 2 ];
+
+	int bit0 = dist[ 0 ] >= p->dist;


This could be > for consistency with the SSE version

What does it change?

Also: why is it modified in SSE version? Is testing for equality useless?

It doesn't really matter since you can't count on exact float equality, but it enables more culling in the case of exact equality and makes the algorithm the same as the SSE one.

With the SSE one it matters a bit more since in the axis-aligned plane case, exact equality might be somewhat common. Changing >= to > there made the results better agree with the non-SSE algorithm (and enables more culling).

OK. 👍️

I shared that part.

Along many 1-to-1 transformations, there are 2 differences being introduced: - The condition for p->signbits >= 8 is entirely dropped as it is assumed p->signbits can't be > 7. An assert for p->signbits < 8 is added instead. - The first bit is modified from dist[0] >= p->dist to dist[0] > p->dist, slipher suggested it and said: > It doesn't really matter since you can't count on exact float equality, > but it enables more culling in the case of exact equality and makes the > algorithm the same as the SSE one. > With the SSE one it matters a bit more since in the axis-aligned plane > case, exact equality might be somewhat common. Changing >= to > there > made the results better agree with the non-SSE algorithm (and enables > more culling). -- #1142 (comment)

illwieckz · 2024-05-14T20:31:06Z

I rebased and squashed the step-by-step commits.

I have written some useful comment in q_math: micro-optimize the BoxOnPlaneSide function, documenting the two functionnal differences compared to the original code:

the removal of the p->signbits >= 8 condition,
the rewrite of dist[0] >= p->dist into dist[0] > p->dist.

This looks ready to me.

slipher · 2024-05-14T20:35:56Z

LGTM

Along many 1-to-1 transformations, there are 2 differences being introduced: - The condition for p->signbits >= 8 is entirely dropped as it is assumed p->signbits can't be > 7. An assert for p->signbits < 8 is added instead. - The first bit is modified from dist[0] >= p->dist to dist[0] > p->dist, slipher suggested it and said: > It doesn't really matter since you can't count on exact float equality, > but it enables more culling in the case of exact equality and makes the > algorithm the same as the SSE one. > With the SSE one it matters a bit more since in the axis-aligned plane > case, exact equality might be somewhat common. Changing >= to > there > made the results better agree with the non-SSE algorithm (and enables > more culling). -- #1142 (comment)

illwieckz added T-Improvement Improvement for an existing feature A-Renderer labels May 13, 2024

slipher reviewed May 13, 2024

View reviewed changes

illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from d035f34 to 3453828 Compare May 13, 2024 17:28

illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from 233b6b8 to ac9768f Compare May 13, 2024 18:01

illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch 2 times, most recently from 628f8d3 to 14a21a3 Compare May 14, 2024 08:12

slipher reviewed May 14, 2024

View reviewed changes

illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from 19e5137 to f51fe3c Compare May 14, 2024 19:42

illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from f51fe3c to 82f994e Compare May 14, 2024 20:25

q_shared: add the VectorMultiply macro

9051772

illwieckz and others added 2 commits May 14, 2024 22:30

q_math: add an SSE variant of the BoxOnPlaneSide function

a5f794c

illwieckz changed the title ~~qmath: micro-optimize BoxOnPlaneSide function~~ qmath: micro-optimize the BoxOnPlaneSide function May 14, 2024

illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from 82f994e to a5f794c Compare May 14, 2024 20:33

illwieckz merged commit 540b80e into master May 14, 2024
6 checks passed

illwieckz deleted the illwieckz/qmath-BoxOnPlaneSide branch May 14, 2024 21:19

illwieckz mentioned this pull request May 15, 2024

Add USE_CPU_INTRINSICS build option? Unvanquished/Unvanquished#3001

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

qmath: micro-optimize the `BoxOnPlaneSide` function #1142

qmath: micro-optimize the `BoxOnPlaneSide` function #1142

illwieckz commented May 13, 2024

illwieckz commented May 13, 2024 •

edited

Loading

illwieckz commented May 13, 2024

slipher left a comment

illwieckz commented May 13, 2024 •

edited

Loading

illwieckz commented May 13, 2024

illwieckz commented May 13, 2024

illwieckz commented May 13, 2024

illwieckz commented May 13, 2024

slipher commented May 14, 2024

illwieckz commented May 14, 2024 •

edited

Loading

illwieckz commented May 14, 2024 •

edited

Loading

illwieckz commented May 14, 2024

illwieckz commented May 14, 2024

illwieckz commented May 14, 2024

illwieckz commented May 14, 2024 •

edited

Loading

illwieckz commented May 14, 2024 •

edited

Loading

illwieckz commented May 14, 2024

illwieckz commented May 14, 2024

slipher commented May 14, 2024

slipher May 14, 2024

slipher May 14, 2024

slipher May 14, 2024

illwieckz May 14, 2024

illwieckz May 14, 2024

slipher May 14, 2024

illwieckz May 14, 2024

illwieckz commented May 14, 2024

slipher commented May 14, 2024

qmath: micro-optimize the BoxOnPlaneSide function #1142

qmath: micro-optimize the BoxOnPlaneSide function #1142

Conversation

illwieckz commented May 13, 2024

illwieckz commented May 13, 2024 • edited Loading

illwieckz commented May 13, 2024

slipher left a comment

Choose a reason for hiding this comment

illwieckz commented May 13, 2024 • edited Loading

illwieckz commented May 13, 2024

illwieckz commented May 13, 2024

illwieckz commented May 13, 2024

illwieckz commented May 13, 2024

slipher commented May 14, 2024

illwieckz commented May 14, 2024 • edited Loading

illwieckz commented May 14, 2024 • edited Loading

illwieckz commented May 14, 2024

illwieckz commented May 14, 2024

illwieckz commented May 14, 2024

illwieckz commented May 14, 2024 • edited Loading

illwieckz commented May 14, 2024 • edited Loading

illwieckz commented May 14, 2024

illwieckz commented May 14, 2024

slipher commented May 14, 2024

slipher May 14, 2024

Choose a reason for hiding this comment

slipher May 14, 2024

Choose a reason for hiding this comment

slipher May 14, 2024

Choose a reason for hiding this comment

illwieckz May 14, 2024

Choose a reason for hiding this comment

illwieckz May 14, 2024

Choose a reason for hiding this comment

slipher May 14, 2024

Choose a reason for hiding this comment

illwieckz May 14, 2024

Choose a reason for hiding this comment

illwieckz commented May 14, 2024

slipher commented May 14, 2024

qmath: micro-optimize the `BoxOnPlaneSide` function #1142

qmath: micro-optimize the `BoxOnPlaneSide` function #1142

illwieckz commented May 13, 2024 •

edited

Loading

illwieckz commented May 13, 2024 •

edited

Loading

illwieckz commented May 14, 2024 •

edited

Loading

illwieckz commented May 14, 2024 •

edited

Loading

illwieckz commented May 14, 2024 •

edited

Loading

illwieckz commented May 14, 2024 •

edited

Loading