[3.x] Shadow volume culling and tighter shadow caster culling #82584

Merged
merged 1 commit into from
Jan 29, 2024


lawnjelly
Member

@lawnjelly lawnjelly commented Sep 30, 2023

Existing shadow caster culling using the BVH takes no account of the camera. This PR adds the highly encapsulated class VisualServerLightCuller which can cut down the casters in the shadow volume to only those which can cast shadows on the camera frustum.

This is used to:

  • More accurately defer dirty updates to shadows when the shadow volume does not intersect the camera frustum.
  • Cull shadow casters more tightly to the view frustum.

The dirty state of lights is now managed automatically:

  • Continuous (tighter caster culling)
  • Static (all casters are rendered)

Explanation

You can see roughly how it works in this old video of mine (ignore the rooms and portals, that is a separate system):
https://www.youtube.com/watch?v=1WT5AXZlsDc

The blue lines from the light sources to the camera frustum show the extra culling planes.

How does it work?

At runtime, the routine checks each of the six planes of the camera frustum and determines whether it faces towards or away from the light (0 or 1). These six bits form a 6-bit number, which is used as the index into a lookup table.
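As a rough illustration of the index computation described above, here is a minimal sketch for a point light. The `Vec3` and `Plane` types are simplified stand-ins rather than the engine's own classes, and both the bit order and the "positive side means bit set" convention are assumptions; the actual plane ordering in the PR may differ.

```cpp
#include <array>
#include <cstdint>

// Simplified stand-ins for engine math types (hypothetical, not Godot's classes).
struct Vec3 {
    float x, y, z;
};

struct Plane {
    Vec3 normal; // Assumed to point out of the frustum.
    float d;     // Plane equation: dot(normal, p) - d = 0.

    float distance_to(const Vec3 &p) const {
        return normal.x * p.x + normal.y * p.y + normal.z * p.z - d;
    }
};

// Builds a 6-bit index: bit i is set when the light position lies on the
// positive side of frustum plane i. The bit order is an assumption.
uint32_t compute_lookup_index(const std::array<Plane, 6> &planes, const Vec3 &light_pos) {
    uint32_t index = 0;
    for (uint32_t i = 0; i < 6; i++) {
        if (planes[i].distance_to(light_pos) > 0.0f) {
            index |= 1u << i;
        }
    }
    return index; // Always in the range 0..63.
}
```

Six bits give 64 possible combinations, which matches the 64 rows of the generated table. With this convention, a light inside the frustum yields index 0.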

The lookup tells us a list of corner points from the camera frustum which form a silhouette, which can be used to generate culling planes together with the light origin (3 points form a culling plane).
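The plane construction can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the silhouette is treated as a closed loop of corner indices, and which side of each plane counts as "inside" depends on the winding of that loop, which the description above does not specify.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Simplified stand-in math types (hypothetical, not the engine's classes).
struct Vec3 {
    float x, y, z;
};

static Vec3 sub(const Vec3 &a, const Vec3 &b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(const Vec3 &a, const Vec3 &b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(const Vec3 &a, const Vec3 &b) {
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

struct Plane {
    Vec3 normal;
    float d; // Plane equation: dot(normal, p) - d = 0.
};

float distance_to(const Plane &p, const Vec3 &pt) { return dot(p.normal, pt) - p.d; }

// Plane through three points; the normal direction follows the a -> b -> c winding.
Plane plane_from_points(const Vec3 &a, const Vec3 &b, const Vec3 &c) {
    Vec3 n = cross(sub(b, a), sub(c, a));
    const float len = std::sqrt(dot(n, n));
    assert(len > 0.0f); // The three points must not be collinear.
    n = { n.x / len, n.y / len, n.z / len };
    return { n, dot(n, a) };
}

// One culling plane per silhouette edge: the light origin plus two
// consecutive silhouette corners (assumes the silhouette is a closed loop).
std::vector<Plane> build_culling_planes(const Vec3 &light_origin,
        const std::vector<Vec3> &frustum_corners,
        const std::vector<int> &silhouette) {
    std::vector<Plane> planes;
    const std::size_t n = silhouette.size();
    for (std::size_t i = 0; i < n; i++) {
        const Vec3 &a = frustum_corners[silhouette[i]];
        const Vec3 &b = frustum_corners[silhouette[(i + 1) % n]];
        planes.push_back(plane_from_points(light_origin, a, b));
    }
    return planes;
}
```

A caster would then be rejected if it lies entirely on the outer side of any of these planes, in the same way as an ordinary frustum test.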

References:

http://lspiroengine.com/?p=153
http://www.terathon.com/gdc06_lengyel.pdf

Performance

In tests with the TPS demo, without GI and using only shadows, this halves the number of draw calls and the vertex count in many areas, and in some cases reduces draw calls by a factor of 10. This can lead to a 10-300% increase in FPS (the increase depends on the settings used: if fill rate is the bottleneck, improving shadows has a less dramatic effect, and vice versa).

In WroughtFlesh, which uses a directional light only, I get a more modest improvement in FPS of 10% or so, due to the tighter caster culling with the directional light. So it seems the benefits are higher the more omnis / spots are used.

Notes

  • This is based on a resurrection of [WIP] Tighter shadow caster culling #33340. That PR's approach wasn't compatible with the dirty flag optimization for omnis and spots, but this PR automatically manages dynamic lights to handle tighter culling, while reverting to rendering all casters when a light is used in a static manner.
  • This PR also adds the major optimization that shadow maps outside the view frustum don't need to be updated at all. There was some existing culling based on AABBs but this PR is far more accurate. This can lead to major performance gains where a lot of shadowed lights are in use (e.g. TPS demo).
  • The project settings are initially for testing. Changes are detected at runtime (e.g. if you change the project setting from a script). If it works correctly there may be no need to make it switchable, as it should virtually always be a win: the calculations are very cheap and outweighed by any gains.

Tighter caster culling and Multiple Cameras

There is one more situation in which tighter caster culling is problematic: when multiple viewports are in use, and the shadow volume intersects multiple cameras.
In this situation tighter caster culling will work - it will do a tight cull on the first camera, and a full render for the second camera. The problem is that it will do 2 shadow renders per frame instead of one.
The answer used here is to detect this situation (in detect_light_intersects_multiple_cameras()) and switch to a different mode light_intersects_multiple_cameras.
This reverts to the legacy approach of doing a full render on the first update. However, we still want to detect the situation where it changes back to a single camera. This is done by means of a timeout after a certain number of frames without a double update.
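The detection and timeout logic might look roughly like the sketch below. All names here (`LightShadowState`, `on_shadow_update_requested`, `on_frame_end`) and the timeout value are hypothetical and chosen for illustration; only the general idea of flagging a double update and clearing the flag after a number of quiet frames comes from the description above.

```cpp
#include <cstdint>

// Hypothetical per-light bookkeeping, for illustration only.
struct LightShadowState {
    uint64_t last_update_frame = UINT64_MAX; // Frame of the latest update request.
    uint32_t frames_since_double_update = 0;
    bool intersects_multiple_cameras = false;
};

// Assumed value; the timeout used in the PR is not stated here.
constexpr uint32_t MULTI_CAMERA_TIMEOUT_FRAMES = 100;

// Called whenever a camera requests a shadow update for this light.
// Returns true if the legacy full render should be used.
bool on_shadow_update_requested(LightShadowState &s, uint64_t frame) {
    if (s.last_update_frame == frame) {
        // Second request within the same frame: more than one camera
        // is intersecting the shadow volume.
        s.intersects_multiple_cameras = true;
        s.frames_since_double_update = 0;
    }
    s.last_update_frame = frame;
    return s.intersects_multiple_cameras;
}

// Called once per frame after all cameras have been processed.
void on_frame_end(LightShadowState &s) {
    if (!s.intersects_multiple_cameras) {
        return;
    }
    if (++s.frames_since_double_update > MULTI_CAMERA_TIMEOUT_FRAMES) {
        // No double update for a while: go back to tight culling.
        s.intersects_multiple_cameras = false;
        s.frames_since_double_update = 0;
    }
}
```

In this sketch the first frame in which a double update occurs still pays for two renders; from the next frame on, the single full render is shared by all cameras until the timeout expires.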

Directional Lights

Directional lights are handled separately in 3.x: they are always updated, and use different shadow maps if multiple cameras are used with viewports. Therefore they can always do the tighter caster cull.

Further work

There is one important further optimization which I have not used yet here. A shadowmap update is triggered by either an object that is paired with a light moving, or the light itself moving. However, if the object / objects moving that trigger the update are culled by tighter shadow casting, there is actually no need to update the shadow map at all, unless it is a full update. This could be significant in some cases, if there is e.g. a moving object that doesn't cast on the frustum that is triggering the whole process.
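A possible shape for that check is sketched below. It is not part of this PR, and the struct and function names are hypothetical. The `in_last_shadow_render` flag is an extra assumption on my side: an object that was drawn into the current shadow map and then moves out of the tight cull region presumably still needs an update to remove its stale shadow.

```cpp
#include <vector>

// Hypothetical per-caster flags, for illustration only.
struct PairedCaster {
    bool moved;                 // Moved since the last shadow map update.
    bool passes_tight_cull;     // Survives the tighter cull for the current camera.
    bool in_last_shadow_render; // Was drawn into the shadow map currently in use.
};

// Returns true if the shadow map needs to be re-rendered.
bool shadow_update_needed(bool light_moved, bool full_update,
        const std::vector<PairedCaster> &casters) {
    if (light_moved || full_update) {
        return true;
    }
    for (const PairedCaster &c : casters) {
        // A moving caster only matters if it is visible to the tight cull now,
        // or if it left a shadow behind in the existing map.
        if (c.moved && (c.passes_tight_cull || c.in_last_shadow_render)) {
            return true;
        }
    }
    return false;
}
```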

@Calinou
Member

Calinou commented Oct 1, 2023

It is highly possible we can use some AI approach to change omnis and spots from their regular "dirty optimization" mode to continuous mode using tighter shadow casting. Which they use will depend on detecting moving objects within their volume. This could be e.g. a timer, if no moving objects after 10 frames, change to regular mode, else change to continuous mode.

If we mark light shadows as static or dynamic for each light, this could be decided based on whether the light is declared to be static or dynamic.
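The timer idea could be sketched like this. The enum, struct and function names are hypothetical; the 10-frame threshold is the example figure from the comment above, not a tuned value.

```cpp
// Hypothetical names, for illustration only.
enum class ShadowMode {
    REGULAR,    // "Dirty optimization" mode: all casters rendered, updated only when dirty.
    CONTINUOUS, // Tighter caster culling, updated continuously.
};

struct LightModeTracker {
    ShadowMode mode = ShadowMode::REGULAR;
    int quiet_frames = 0; // Consecutive frames without a moving caster.
};

constexpr int QUIET_FRAMES_BEFORE_REGULAR = 10;

// Called once per frame for each omni / spot light.
ShadowMode update_mode(LightModeTracker &t, bool any_caster_moved) {
    if (any_caster_moved) {
        t.quiet_frames = 0;
        t.mode = ShadowMode::CONTINUOUS;
    } else if (t.mode == ShadowMode::CONTINUOUS &&
            ++t.quiet_frames >= QUIET_FRAMES_BEFORE_REGULAR) {
        t.mode = ShadowMode::REGULAR;
    }
    return t.mode;
}
```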

@lawnjelly lawnjelly force-pushed the lightcull_23 branch 7 times, most recently from 9e3c8cb to 5818a1b Compare October 1, 2023 14:11
@lawnjelly lawnjelly marked this pull request as ready for review October 1, 2023 16:58
@lawnjelly lawnjelly requested review from a team as code owners October 1, 2023 16:58
@jams3223

jams3223 commented Oct 1, 2023

Could we cherry-pick this for 4.x ?

@lawnjelly
Member Author

lawnjelly commented Oct 18, 2023

Rendering meeting today:

  • We are fine with this PR, and it should be fine to port to 4.x (although the actual shadow map render may be deferred).
  • I'll try to include the code used to generate the lookup table (perhaps in a long comment) for future maintenance / debugging, and in case the order of the frustum planes changes.

UPDATE:
Now includes the lookup table generation code. This does double the PR size, but the generation code is compiled out unless VISUAL_SERVER_LIGHT_CULLER_CALCULATE_LUT is defined.

The lookup generation prints the LUT to the standard output, and this can be copied directly into the C++ source.

```
LIGHT VOLUME TABLE BEGIN

Copy this to LUT_entry_sizes:

{0, 4, 4, 0, 4, 6, 6, 8, 4, 6, 6, 8, 6, 6, 6, 6, 4, 6, 6, 8, 0, 8, 8, 0, 6, 6, 6, 6, 8, 6, 6, 4, 4, 6, 6, 8, 6, 6, 6, 6, 0, 8, 8, 0, 8, 6, 6, 4, 6, 6, 6, 6, 8, 6, 6, 4, 8, 6, 6, 4, 0, 4, 4, 0, }

Copy this to LUT_entries:

{0, 0, 0, 0, 0, 0, 0, },
{7, 6, 4, 5, 0, 0, 0, },
{1, 0, 2, 3, 0, 0, 0, },
{0, 0, 0, 0, 0, 0, 0, },
{1, 5, 4, 0, 0, 0, 0, },
{1, 5, 7, 6, 4, 0, 0, },
{4, 0, 2, 3, 1, 5, 0, },
{5, 7, 6, 4, 0, 2, 3, },
{0, 4, 6, 2, 0, 0, 0, },
{0, 4, 5, 7, 6, 2, 0, },
{6, 2, 3, 1, 0, 4, 0, },
{2, 3, 1, 0, 4, 5, 7, },
{0, 1, 5, 4, 6, 2, 0, },
{0, 1, 5, 7, 6, 2, 0, },
{6, 2, 3, 1, 5, 4, 0, },
{2, 3, 1, 5, 7, 6, 0, },
{2, 6, 7, 3, 0, 0, 0, },
{2, 6, 4, 5, 7, 3, 0, },
{7, 3, 1, 0, 2, 6, 0, },
{3, 1, 0, 2, 6, 4, 5, },
{0, 0, 0, 0, 0, 0, 0, },
{2, 6, 4, 0, 1, 5, 7, },
{7, 3, 1, 5, 4, 0, 2, },
{0, 0, 0, 0, 0, 0, 0, },
{2, 0, 4, 6, 7, 3, 0, },
{2, 0, 4, 5, 7, 3, 0, },
{7, 3, 1, 0, 4, 6, 0, },
{3, 1, 0, 4, 5, 7, 0, },
{2, 0, 1, 5, 4, 6, 7, },
{2, 0, 1, 5, 7, 3, 0, },
{7, 3, 1, 5, 4, 6, 0, },
{3, 1, 5, 7, 0, 0, 0, },
{3, 7, 5, 1, 0, 0, 0, },
{3, 7, 6, 4, 5, 1, 0, },
{5, 1, 0, 2, 3, 7, 0, },
{7, 6, 4, 5, 1, 0, 2, },
{3, 7, 5, 4, 0, 1, 0, },
{3, 7, 6, 4, 0, 1, 0, },
{5, 4, 0, 2, 3, 7, 0, },
{7, 6, 4, 0, 2, 3, 0, },
{0, 0, 0, 0, 0, 0, 0, },
{3, 7, 6, 2, 0, 4, 5, },
{5, 1, 0, 4, 6, 2, 3, },
{0, 0, 0, 0, 0, 0, 0, },
{3, 7, 5, 4, 6, 2, 0, },
{3, 7, 6, 2, 0, 1, 0, },
{5, 4, 6, 2, 3, 7, 0, },
{7, 6, 2, 3, 0, 0, 0, },
{3, 2, 6, 7, 5, 1, 0, },
{3, 2, 6, 4, 5, 1, 0, },
{5, 1, 0, 2, 6, 7, 0, },
{1, 0, 2, 6, 4, 5, 0, },
{3, 2, 6, 7, 5, 4, 0, },
{3, 2, 6, 4, 0, 1, 0, },
{5, 4, 0, 2, 6, 7, 0, },
{6, 4, 0, 2, 0, 0, 0, },
{3, 2, 0, 4, 6, 7, 5, },
{3, 2, 0, 4, 5, 1, 0, },
{5, 1, 0, 4, 6, 7, 0, },
{1, 0, 4, 5, 0, 0, 0, },
{0, 0, 0, 0, 0, 0, 0, },
{3, 2, 0, 1, 0, 0, 0, },
{5, 4, 6, 7, 0, 0, 0, },
{0, 0, 0, 0, 0, 0, 0, },

LIGHT VOLUME TABLE END
```

@Calinou
Member

Calinou commented Nov 14, 2023

Tested locally, it works as expected. Visuals look correct too from my testing in various demo projects.

Great work, this likely resolves one of Godot's largest rendering bottlenecks in complex scenes 🙂

Benchmark on tps-demo

OS: Fedora 38
CPU: Intel Core i9-13900K
GPU: GeForce RTX 4090 (NVIDIA 535.113.01)

The project is modified to disable V-Sync. The FPS reported is the highest FPS attained over a period of 10 seconds after loading the level, although I can confirm the average values are always increased in a similar proportion. When CPU-limited, the FPS varies a fair bit over time due to the flying forklift moving in and out of view.

| Type | Before | After |
| --- | --- | --- |
| 4K Maximum GLES3 | 131 FPS (7.63 mspf) | 135 FPS (7.40 mspf) |
| 4K Minimum GLES3 | 325 FPS (3.07 mspf) | 431 FPS (2.32 mspf) |
| 720p Maximum GLES3 | 309 FPS (3.23 mspf) | 398 FPS (2.51 mspf) |
| 720p Minimum GLES3 | 330 FPS (3.03 mspf) | 434 FPS (2.30 mspf) |
| 4K Minimum GLES2 | 181 FPS (5.52 mspf) | 213 FPS (4.69 mspf) |
| 720p Minimum GLES2 | 186 FPS (5.37 mspf) | 221 FPS (4.52 mspf) |

  • Maximum settings has all settings enabled or set to their highest possible value.
    • These tests are largely GPU-bound, in particular for the 4K one.
  • Minimum settings has all settings disabled except shadow mapping, which is left enabled.
    • These tests are largely CPU-bound.

@lawnjelly
Member Author

I pushed some small improvements, but it turns out the bug in the master version occurs because it is being used from multiple threads there, and it isn't thread safe. So the 3.x version should be fine in that respect, and I'll see if I can fix up the master version. 👍

Member

@clayjohn clayjohn left a comment

Looks good to me. I trust the testing that has already been done.

The performance benefits speak for themselves. Let's get this in to 3.6

@akien-mga akien-mga merged commit 6f3c5e6 into godotengine:3.x Jan 29, 2024
13 checks passed
@akien-mga
Member

Thanks!

@Zireael07
Contributor

Will there be equivalent improvements to Vulkan or is this GLES only?

@lawnjelly
Member Author

This is the 3.x PR; I'm just testing the master PR #84745. There are improvements to all backends, as the culling takes place before the backend.
