Optimize base and shadow meshes for vertex cache #94241

Merged
merged 1 commit into from
Aug 16, 2024

zeux
Contributor

@zeux zeux commented Jul 11, 2024

Previously, vertex cache optimization was run for the LOD meshes, but never for the base mesh or for the shadow meshes, including the shadow LOD chain (the shadow LOD chain would sometimes get implicitly optimized for vertex cache as a byproduct of base LOD optimization, but not always). This could significantly affect the rendering performance of geometry-heavy scenes, especially for depth or shadow passes where the fragment load is light.

This PR unconditionally runs the optimization for the base mesh before further processing, and for any generated shadow index buffers; if the meshoptimizer module is not loaded, we silently skip the processing. Note that this is the same algorithm we already use for LOD index buffers.
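
The "run it if the module is loaded, otherwise skip silently" behaviour can be sketched as an optional function-pointer hook. This is a standalone illustration, not the code from this PR: `g_optimize_hook`, `try_optimize_for_cache` and `reverse_triangles` are hypothetical names, and the stand-in optimizer merely reverses triangle order so the effect is visible.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical hook type: writes a reordered copy of `indices` into `destination`.
using OptimizeVertexCacheFunc = void (*)(unsigned int *destination, const unsigned int *indices,
		std::size_t index_count, std::size_t vertex_count);

// Assumption for this sketch: stays null unless an optimizer module registers itself.
static OptimizeVertexCacheFunc g_optimize_hook = nullptr;

// Reorders `indices` in place when a hook is registered. Returns false and leaves
// the buffer untouched when no hook is set or the buffer is not a whole number
// of triangles.
bool try_optimize_for_cache(std::vector<unsigned int> &indices, std::size_t vertex_count) {
	if (g_optimize_hook == nullptr || indices.size() < 3 || indices.size() % 3 != 0) {
		return false;
	}
	std::vector<unsigned int> reordered(indices.size());
	g_optimize_hook(reordered.data(), indices.data(), indices.size(), vertex_count);
	indices.swap(reordered);
	return true;
}

// Stand-in "optimizer" for demonstration only: emits triangles in reverse order,
// keeping the three indices of each triangle as they were.
void reverse_triangles(unsigned int *destination, const unsigned int *indices,
		std::size_t index_count, std::size_t /*vertex_count*/) {
	const std::size_t triangle_count = index_count / 3;
	for (std::size_t t = 0; t < triangle_count; ++t) {
		const std::size_t src = (triangle_count - 1 - t) * 3;
		destination[t * 3 + 0] = indices[src + 0];
		destination[t * 3 + 1] = indices[src + 1];
		destination[t * 3 + 2] = indices[src + 2];
	}
}
```

Callers can ignore the return value, which matches the "silently skip" behaviour described above.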

I generally treat this optimization as "always on, do no harm" - it only changes the order of triangles, which is generally speaking indeterminate on import, and is fairly quick. For a sense of scale, this is ~6x faster than tangent generation and ~25x faster than LOD generation (before my previous optimization PR, so maybe ~10x after?), and consequently should not change the import time much. I've tested this with the DragonAttenuation model (https://github.com/KhronosGroup/glTF-Sample-Models/tree/main/2.0/DragonAttenuation) and didn't see the overall import time change in a statistically measurable way. The appearance of any model should be the same; this only changes the submitted triangle order within each mesh, which has no impact on opaque meshes and should not make transparent meshes worse, since the order of their triangles could not be relied upon anyway.
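
The "only changes the order of triangles" property can be checked mechanically. The sketch below is a hypothetical test helper (`same_triangle_set` is not part of Godot or meshoptimizer): it treats two index buffers as equivalent when they contain the same triangles with the same winding, in any order.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

using Triangle = std::array<unsigned int, 3>;

// Picks the lexicographically smallest rotation of a triangle. Rotations keep
// the winding, so a flipped triangle does not compare equal to the original.
static Triangle canonical(unsigned int a, unsigned int b, unsigned int c) {
	return std::min({ Triangle{ a, b, c }, Triangle{ b, c, a }, Triangle{ c, a, b } });
}

// Collects all triangles of an index buffer in canonical form, sorted.
static std::vector<Triangle> sorted_triangles(const std::vector<unsigned int> &indices) {
	std::vector<Triangle> triangles;
	for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
		triangles.push_back(canonical(indices[i], indices[i + 1], indices[i + 2]));
	}
	std::sort(triangles.begin(), triangles.end());
	return triangles;
}

// True when `after` holds exactly the triangles of `before`, in any order.
bool same_triangle_set(const std::vector<unsigned int> &before, const std::vector<unsigned int> &after) {
	return before.size() == after.size() && sorted_triangles(before) == sorted_triangles(after);
}
```

A check like this passes for any pure reordering and fails if a triangle is dropped, duplicated, or has its winding flipped.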

As with any hardware performance optimization, this is hard to measure well. On a scene with 28 clones of the model above, with some objects closer to the camera (LOD 0) and some further away, my aggregate measurements on an NVIDIA RTX 4090 make that scene ~17% faster in terms of full frame time to render. Most of the gains are just from the shadow mesh optimization (it's something like 11% for shadow mesh optimization and 6% extra on top from base mesh optimization) - depth pre-pass and shadow passes tend to be vertex/raster bound, and the shadow mesh is rendered multiple times, so that makes sense. Note that other meshes may display no performance gains (for example, if a mesh is fairly low-poly, or if the scene has been preprocessed with tools like gltfpack that generate optimal order, the gains will be small to non-existent), and could also display larger performance gains (as the original order can be more pathologically bad depending on the exporter). Realistically, I would not expect a double-digit performance improvement here on realistic scenes, but the gains are free.
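
To make the dependence on the original triangle order concrete, here is a minimal sketch that estimates vertex shader invocations per triangle (often called ACMR) under a simple FIFO cache model. Real GPUs do not use a plain FIFO, and the cache size, the grid mesh and the strided "bad exporter" order are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Vertex cache misses per triangle under a FIFO cache with `cache_size` entries.
double fifo_acmr(const std::vector<unsigned int> &indices, std::size_t cache_size) {
	if (indices.size() < 3) {
		return 0.0;
	}
	std::deque<unsigned int> cache;
	std::size_t misses = 0;
	for (unsigned int index : indices) {
		if (std::find(cache.begin(), cache.end(), index) != cache.end()) {
			continue; // Hit: the vertex is still cached. A FIFO does not refresh on hits.
		}
		++misses;
		cache.push_back(index);
		if (cache.size() > cache_size) {
			cache.pop_front();
		}
	}
	return double(misses) / double(indices.size() / 3);
}

// Index buffer for a grid of width x height quads, two triangles per quad, row by row.
std::vector<unsigned int> grid_indices(unsigned int width, unsigned int height) {
	std::vector<unsigned int> indices;
	const unsigned int row = width + 1; // Vertices per grid row.
	for (unsigned int y = 0; y < height; ++y) {
		for (unsigned int x = 0; x < width; ++x) {
			const unsigned int v00 = y * row + x;
			const unsigned int v10 = v00 + 1;
			const unsigned int v01 = v00 + row;
			const unsigned int v11 = v01 + 1;
			indices.insert(indices.end(), { v00, v10, v01, v10, v11, v01 });
		}
	}
	return indices;
}

// Emits triangles in the order 0, stride, 2 * stride, ... (modulo the triangle count).
// This is a permutation only when stride and the triangle count are coprime.
std::vector<unsigned int> stride_triangles(const std::vector<unsigned int> &indices, std::size_t stride) {
	const std::size_t triangle_count = indices.size() / 3;
	std::vector<unsigned int> out;
	out.reserve(triangle_count * 3);
	for (std::size_t t = 0; t < triangle_count; ++t) {
		const std::size_t src = (t * stride) % triangle_count * 3;
		out.insert(out.end(), indices.begin() + src, indices.begin() + src + 3);
	}
	return out;
}
```

Comparing `fifo_acmr(grid_indices(32, 32), 16)` with `fifo_acmr(stride_triangles(grid_indices(32, 32), 37), 16)` shows the gap: the row-by-row order stays close to one transformed vertex per triangle, while the strided order approaches the worst case of three.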

The measurements quoted above are full-frame FPS with VSync disabled. If we measure the GPU time of the individual passes instead (using Godot's Visual Profiler), the relative gains are more significant - note that I'm using the numbers as displayed by the profiler (2 decimal digits), my GPU is clearly too fast for this 😝:

| Pass | Time (Before) | Time (After) | Improvement (%) |
| --- | --- | --- | --- |
| Depth Pre-Pass | 0.09 ms | 0.06 ms | ~33% |
| Shadows | 0.12 ms | 0.09 ms | ~25% |
| Opaque Pass | 0.18 ms | 0.15 ms | ~16% |
| (Total) 3D Scene | 0.44 ms | 0.34 ms | ~22% |

@zeux zeux requested a review from a team as a code owner July 11, 2024 23:17
@zeux zeux force-pushed the optimize-cache branch 2 times, most recently from 5c1b821 to a6db472 Compare July 11, 2024 23:22
@Calinou
Member

Calinou commented Jul 12, 2024

This could also benefit #94097 when using complex PrimitiveMeshes.

Like create_shadow_mesh(), which is not exposed yet, the method in ImporterMesh may be worth exposing to scripting so that procedural geometry generation scripts can make use of it. In general, it should be possible for procedurally generated meshes to achieve the same level of optimization as pre-authored meshes (assuming you can spend the time doing this processing once when the mesh is first generated).

@zeux
Contributor Author

zeux commented Jul 12, 2024

Would procedural geometry use ImporterMesh or SurfaceTool? Asking because SurfaceTool already exposes optimize_indices_for_cache.

@fire
Member

fire commented Jul 12, 2024

At this point, I expect to use both for procedural generation and... CSG, but I am in favour of this.

Member

@clayjohn clayjohn left a comment

Looks great! I'm glad to see that the performance benefits are so tangible

@clayjohn clayjohn modified the milestones: 4.x, 4.4 Jul 12, 2024
@mrjustaguy
Contributor

#68959 should be tested with this..

@zeux
Contributor Author

zeux commented Jul 12, 2024

@mrjustaguy That issue should not be affected by this change in isolation, for two reasons: 1) this PR only adds the relevant functionality to the glTF import path. Adding it to .obj is a matter of adding the following to the .obj importer, but I'm worried this will cause conflicts with #94108, so I'd rather do that separately or as part of that change if this ends up getting merged first:

```diff
+	for (int i = 0; i < r_meshes.size(); i++) {
+		r_meshes.get(i)->optimize_indices_for_cache();
+	}
```
2) that issue has a geometry file which is basically a lot of cubes. Due to the lack of vertex sharing, models with faceted shading (cubes or otherwise) are in general both inefficient to render and mostly unaffected by vertex cache optimization. That said, I would assume #94108 (Add Generate LODs, Shadow Mesh and Lightmap UV2 options to OBJ mesh import) helps somewhat, as it adds shadow meshes, which should accelerate depth pre-pass/shadow rendering and get a further small boost from this PR.

Edit: yeah, confirmed that shadow meshes help on that file: depth pre-pass drops from 0.45 ms to 0.24 ms on a 4090. Without this change but with shadow mesh creation, depth pre-pass drops to 0.26 ms, so there's a small improvement for shadow meshes from this PR even in this edge case, which is nice.

@mrjustaguy
Contributor

I mean that was really a Stress test to compare Godot 3 with Godot 4 primitive performance..

Though I think that there have been a few optimizations relevant to it since I've last looked at it so IDK how Godot 4 compares to 3 Today in that aspect.

@Calinou
Member

Calinou commented Jul 12, 2024

> Would procedural geometry use ImporterMesh or SurfaceTool? Asking because SurfaceTool already exposes optimize_indices_for_cache.

Procedural geometry generation is done with SurfaceTool, but ImporterMesh also exposes similar functions so that import scripts can make use of it.

LiveTrower pushed a commit to LiveTrower/godot that referenced this pull request Aug 12, 2024
@akien-mga
Member

Needs a rebase to resolve merge conflicts after some initial merges in 4.4.
Then it's in the queue for merging ASAP.

@zeux
Contributor Author

zeux commented Aug 16, 2024

Rebased vs master.

@akien-mga akien-mga merged commit 759d7d4 into godotengine:master Aug 16, 2024
18 checks passed
@akien-mga
Member

Thanks!
