Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qmath: micro-optimize the BoxOnPlaneSide function #1142

Merged
merged 3 commits into from
May 14, 2024

Conversation

illwieckz
Copy link
Member

Extracted from:

This function is heavily used in CPU culling and milking the maximum of performance from it is welcome.

Patches are meant to be squashed.

@illwieckz illwieckz added T-Improvement Improvement for an existing feature A-Renderer labels May 13, 2024
@illwieckz
Copy link
Member Author

illwieckz commented May 13, 2024

About this:

	bool bit0 = dist[ 0 ] >= p->dist;
	bool bit1 = dist[ 1 ] < p->dist;
	return bit0 | ( bit1 << 1 );

We can also rewrite the last line this way:

	return bit0 + ( bit1 * 2 );

The disassembly is the same in both cases:

//	return bit0 | ( bit1 << 1 );
	lea          eax, [rax + rcx*2]
	ret
//	return bit0 + ( bit1 * 2 );
	lea          eax, [rax + rcx*2]
	ret

@illwieckz
Copy link
Member Author

MSVC prefers:

	return bit0 + ( bit1 * 2 );

Otherwise it aborts compilation and prints that:

src\engine\qcommon\q_math.cpp(781,23):
  error C2220: the following warning is treated as an error
  [build\engine-lib.vcxproj]
src\engine\qcommon\q_math.cpp(781,23):
  warning C4805: '|': unsafe mix of type 'bool' and type 'int' in operation
  [build\engine-lib.vcxproj]

Copy link
Member

@slipher slipher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transformation BoxOnPlaneSide: return early appears incorrect. But I think 0-7 are the only possible values anyway. So I'd just start with ASSERT_EQ(signbits & 7, 7).

If you need to convert a bool to int you can use unary +.

@illwieckz
Copy link
Member Author

illwieckz commented May 13, 2024

Ah right, dist[ 1 ] < p->dist can be false if p->dist is 0.

What I understand is that if p->signbits >= 8, then dist[ 0 ] = dist[ 1 ] = 0 (as the old comment said), then dist[ 0 ] >= p->dist is false assuming p->dist is always positive, so first bit is 0, so the only remaining test is dist[ 1 ] < p->dist for the second bit.

It looks like I misread the second comparison, I should have done:

	if ( p->signbits >= 8 )
	{
		return 2 * ( 0 < p->dist );
	}

Let's try with ASSERT_EQ( p->signbits & 7, 7 ).

@illwieckz
Copy link
Member Author

I did ASSERT( p->signbits < 8 ) instead. I loaded some maps and moved around and haven't caught an assertion.

@illwieckz
Copy link
Member Author

I added a commit to encourage the compilator to vectorize the &1 operation. This produced less instructions (on the right):

20240513-185921-001 unvanquished-orbit

@illwieckz
Copy link
Member Author

No, that last commit breaks the culling. I was storing to a bool

@illwieckz illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from d035f34 to 3453828 Compare May 13, 2024 17:28
@illwieckz
Copy link
Member Author

Now it works, without breaking the culling:

20240513-185811-002 unvanquished-orbit

@illwieckz illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from 233b6b8 to ac9768f Compare May 13, 2024 18:01
@slipher
Copy link
Member

slipher commented May 14, 2024

Just for fun I tried making an SSE version. I don't have much of an idea when this stuff is faster though, so I wouldn't suggest using it unless you really have a good way of benchmarking.

int BoxOnPlaneSide(const vec3_t emins, const vec3_t emaxs, const cplane_t* p)
{
	ASSERT_LT(emins[0], emaxs[0]);
	ASSERT_LT(emins[1], emaxs[1]);
	ASSERT_LT(emins[2], emaxs[2]);

	auto mins = sseLoadVec3Unsafe(emins);
	auto maxs = sseLoadVec3Unsafe(emaxs);
	auto normal = sseLoadVec3Unsafe(p->normal);

	auto prod0 = _mm_mul_ps(maxs, normal);
	auto prod1 = _mm_mul_ps(mins, normal);

	auto pmax = _mm_max_ps(prod0, prod1);
	auto pmin = _mm_min_ps(prod0, prod1);

	ALIGNED(16, vec4_t pmaxv);
	ALIGNED(16, vec4_t pminv);
	_mm_store_ps(pmaxv, pmax);
	_mm_store_ps(pminv, pmin);

	float dist0 = pmaxv[0] + pmaxv[1] + pmaxv[2];
	float dist1 = pminv[0] + pminv[1] + pminv[2];
	return (dist0 > p->dist) + 2 * (dist1 < p->dist);
}

@illwieckz illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch 2 times, most recently from 628f8d3 to 14a21a3 Compare May 14, 2024 08:12
@illwieckz
Copy link
Member Author

illwieckz commented May 14, 2024

I added another change that helps the compiler to vectorize the multiplication.

Actually those lines can be completely removed by rewriting the BoxOnPlaneSide function to directly take a vec3_t[2] as input:

	vec3_t bounds[ 2 ];
	VectorCopy( emins, bounds[ 0 ] );
	VectorCopy( emaxs, bounds[ 1 ] );

But this would require a more intrusive change modifying many files (even the game code relies on this function).

Nevertheless, the compiler still produces less instructions. In fact it's possible that the copy is avoided because most engine calls to this function are actually using emins and emaxs pairs that are already bounds[ 0 ] and bounds[ 1 ] in the calling code.

On the left is before the multiplication vectorization,
In the center is after the multiplication vectorization,
On the right is after the function has been modified to avoid a copy (patch not included in this branch):

20240514-094535-000 unvanquished-cutter

We can see there is the same amount of instructions in the center on the right, but the instructions are a bit different, the engine can really benefit by always using everywhere a bounds_t type containing boths mins and maxs, but that's out of scope of that branch.

@illwieckz
Copy link
Member Author

illwieckz commented May 14, 2024

My duplication of p->normal was an attempt to see if the compiler could do a single 8-sized vector multiplication instead of two 4-size vector multiplications, it looks like even with a native build on my modern system it does not. Worst than that, it does more instructions when doing two 4-size vector multiplications on the same data (it could simply drop the copy if useless). So I deduplicated and it seems to not do worse.

On the left: before vector multiplication, on the right: after vector multiplication, wihout normal duplication, note that the lines are shifted up by one because the compiler even saved on one arg compared to prior the vectorization:

20240514-105550-000 unvanquished-cutter

@illwieckz
Copy link
Member Author

I believe I cannot do more without going out of the scope of this function.

@illwieckz
Copy link
Member Author

Just for fun I tried making an SSE version. I don't have much of an idea when this stuff is faster though, so I wouldn't suggest using it unless you really have a good way of benchmarking.

Ah great! Once everything optimizing R_RecursiveWorldNode is merged it will be fun to compare.

@illwieckz
Copy link
Member Author

A slightly higher FPS, less CPU usage spent in both R_RecursiveWorldNode and BoxOnPlaneSide functions (comparatively to the rest):

20240514-114930-000 unvanquished-orbit

I reproduce it on multiple runs.

I also tested the early returns for the simple cases before the SSE code and it was not faster than just replacing the whole body.

@illwieckz
Copy link
Member Author

illwieckz commented May 14, 2024

I included the SSE code with a #ifdef idx86_sse.

The non-SSE code will benefit other architectures (nacl, arm64).

@illwieckz
Copy link
Member Author

illwieckz commented May 14, 2024

I also verified the code behaved the same before this branch, with my optimizations, and with slipher's SSE code.

What I do for this is that I go inside ATCSHD central room, set r_lockPVS 1 then navigate the map and look which surfaces are rendered and which aren't. The exact same surfaces are rendered and not rendered with the 3 code variants.

@illwieckz
Copy link
Member Author

The generated code comparison between my optimized code and slipher's SSE code 🤪️:

20240514-122612-000 unvanquished-cutter

@illwieckz
Copy link
Member Author

Another thing that happens with that SSE code is that it is now small enough to be inlined when enabling LTO:

20240514-130316-000 unvanquished-orbit

The non-SSE code doesn't get inlined when enabling LTO.

@slipher
Copy link
Member

slipher commented May 14, 2024

Cool, the compiler did some further vectorization of the "horizontal add" and got rid of the _mm_store_ps. (Horizontal add is jargon for adding up the elements of a single SSE vector.) I had looked at some horizontal add code here but wasn't sure that it's worthwhile for summing only 3 (rather than 4) elements.

I propose a documentation update since the function can now return 0 in the unlikely case of a zero-size box. I looked at the callsites and they seem OK as-is - a 0 return won't harm them.

/*
 * ==================
 * BoxOnPlaneSide
 *
 * Function used to rule out nonzero-volume intersection of an AABB and a half-space.
 * Returns a 2-bit mask.
 * 
 * If BIT( 0 ) is clear, the box definitely has zero volume on the "front" side of the 
 *    plane (i.e. the side further in the direction of the normal vector).
 * If BIT( 1 ) is clear, the box definitely has zero volume on the back side of the plane.
 * ==================
 */

Cleaned-up SSE version with types and some comments:

int BoxOnPlaneSide(const vec3_t emins, const vec3_t emaxs, const cplane_t* p)
{
	__m128 mins = sseLoadVec3Unsafe( emins );
	__m128 maxs = sseLoadVec3Unsafe( emaxs );
	__m128 normal = sseLoadVec3Unsafe( p->normal );

	// Element-wise products
	__m128 prod0 = _mm_mul_ps( maxs, normal );
	__m128 prod1 = _mm_mul_ps( mins, normal );

	// Element-wise min/max
	__m128 pmax = _mm_max_ps( prod0, prod1 );
	__m128 pmin = _mm_min_ps( prod0, prod1 );

	ALIGNED( 16, vec4_t pmaxv );
	ALIGNED( 16, vec4_t pminv );
	_mm_store_ps( pmaxv, pmax );
	_mm_store_ps( pminv, pmin );

	// Maximum dot product with p->normal over the 8 box corners
	float dist0 = pmaxv[ 0 ] + pmaxv[ 1 ] + pmaxv[ 2 ];
	// Minimum dot product with p->normal over the 8 box corners
	float dist1 = pminv[ 0 ] + pminv[ 1 ] + pminv[ 2 ];

	return ( dist0 > p->dist ) + 2 * ( dist1 < p->dist );
}

float dist[ 2 ];
int sides, b, i;
#if idx86_sse
ASSERT_LT(emins[0], emaxs[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asserts can be dropped since I checked that the 0-size case is OK.

float dist1 = pminv[0] + pminv[1] + pminv[2];
return (dist0 > p->dist) + 2 * (dist1 < p->dist);
#else
ASSERT( p->signbits < 8 );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ASSERT_LT

dist[ !index[ 1 ] ] += rmins[ 1 ];
dist[ !index[ 2 ] ] += rmins[ 2 ];

int bit0 = dist[ 0 ] >= p->dist;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be > for consistency with the SSE version

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: why is it modified in SSE version? Is testing for equality useless?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really matter since you can't count on exact float equality, but it enables more culling in the case of exact equality and makes the algorithm the same as the SSE one.

With the SSE one it matters a bit more since in the axis-aligned plane case, exact equality might be somewhat common. Changing >= to > there made the results better agree with the non-SSE algorithm (and enables more culling).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. 👍️

I shared that part.

@illwieckz illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from 19e5137 to f51fe3c Compare May 14, 2024 19:42
illwieckz added a commit that referenced this pull request May 14, 2024
Along many 1-to-1 transformations, there are 2 differences being introduced:

- The condition for p->signbits >= 8 is entirely dropped as it is assumed
  p->signbits can't be > 7. An assert for p->signbits < 8 is added instead.
- The first bit is modified from dist[0] >= p->dist to dist[0] > p->dist,
  slipher suggested it and said:

> It doesn't really matter since you can't count on exact float equality,
> but it enables more culling in the case of exact equality and makes the
> algorithm the same as the SSE one.
> With the SSE one it matters a bit more since in the axis-aligned plane
> case, exact equality might be somewhat common. Changing >= to > there
> made the results better agree with the non-SSE algorithm (and enables
> more culling).
-- #1142 (comment)
@illwieckz illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from f51fe3c to 82f994e Compare May 14, 2024 20:25
illwieckz and others added 2 commits May 14, 2024 22:30
Along many 1-to-1 transformations, there are 2 differences being introduced:

- The condition for p->signbits >= 8 is entirely dropped as it is assumed
  p->signbits can't be > 7. An assert for p->signbits < 8 is added instead.
- The first bit is modified from dist[0] >= p->dist to dist[0] > p->dist,
  slipher suggested it and said:

> It doesn't really matter since you can't count on exact float equality,
> but it enables more culling in the case of exact equality and makes the
> algorithm the same as the SSE one.
> With the SSE one it matters a bit more since in the axis-aligned plane
> case, exact equality might be somewhat common. Changing >= to > there
> made the results better agree with the non-SSE algorithm (and enables
> more culling).
-- #1142 (comment)
@illwieckz illwieckz changed the title qmath: micro-optimize BoxOnPlaneSide function qmath: micro-optimize the BoxOnPlaneSide function May 14, 2024
@illwieckz
Copy link
Member Author

I rebased and squashed the step-by-step commits.

I have written some useful comment in q_math: micro-optimize the BoxOnPlaneSide function, documenting the two functionnal differences compared to the original code:

  • the removal of the p->signbits >= 8 condition,
  • the rewrite of dist[0] >= p->dist into dist[0] > p->dist.

This looks ready to me.

@illwieckz illwieckz force-pushed the illwieckz/qmath-BoxOnPlaneSide branch from 82f994e to a5f794c Compare May 14, 2024 20:33
@slipher
Copy link
Member

slipher commented May 14, 2024

LGTM

@illwieckz illwieckz merged commit 540b80e into master May 14, 2024
6 checks passed
illwieckz added a commit that referenced this pull request May 14, 2024
Along many 1-to-1 transformations, there are 2 differences being introduced:

- The condition for p->signbits >= 8 is entirely dropped as it is assumed
  p->signbits can't be > 7. An assert for p->signbits < 8 is added instead.
- The first bit is modified from dist[0] >= p->dist to dist[0] > p->dist,
  slipher suggested it and said:

> It doesn't really matter since you can't count on exact float equality,
> but it enables more culling in the case of exact equality and makes the
> algorithm the same as the SSE one.
> With the SSE one it matters a bit more since in the axis-aligned plane
> case, exact equality might be somewhat common. Changing >= to > there
> made the results better agree with the non-SSE algorithm (and enables
> more culling).
-- #1142 (comment)
@illwieckz illwieckz deleted the illwieckz/qmath-BoxOnPlaneSide branch May 14, 2024 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Renderer T-Improvement Improvement for an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants