Restructure shadowmap rendering in mobile renderer #76872

Draft · wants to merge 1 commit into master

Conversation

BastiaanOlij
Contributor

This PR attempts to simplify shadow rendering for the mobile renderer, as we're not trying to run things in parallel with GI there.

It also tries out a few recommended performance improvements.

So far this isn't having the desired result, so there is still a lot to be done.

@BastiaanOlij force-pushed the restructure_mobile_shadows branch from 426e5dc to 92f2545 on May 9, 2023 15:01
@BastiaanOlij
Contributor Author

Ok, lots of good feedback after talking with some of the GPU guys at ARM, and I already managed to update a few things.

So, a few things for posterity.

Barriers
So our current implementation for barriers lumps the vertex and fragment shaders together into a BARRIER_MASK_RASTER. This makes total sense on desktop, where the two stages run in lockstep: as the vertex shader finishes the vertices that make up a face, rasterization of that face begins.

But on mobile, the TBDR architecture processes all vertices with the vertex shader first, and then rasterization happens per tile. Our BARRIER_MASK_RASTER and heavy use of BARRIER_MASK_ALL prevented a lot of parallel processing: we had to wait for the fragment shader of the previous render pass to finish before we could start the vertex shader of the next pass, especially with our cubemap omni lights.

I think there are a number of other processes that will benefit from more targeted barriers.

What I've done is introduce BARRIER_MASK_VERTEX and BARRIER_MASK_FRAGMENT enum values and have BARRIER_MASK_RASTER combine those flags. This means the clustered renderer does exactly the same thing it did before, while the mobile renderer gets more control.

When rendering our shadowmap for our cubemap we'll have the following setup:

  • vkCmdCopyBuffer for UBO for cubemap side 1
  • TRANSFER -> VS Barrier
  • RenderPass 1 (Render cubemap side 1)
  • vkCmdCopyBuffer for UBO for cubemap side 2
  • TRANSFER -> VS Barrier
  • RenderPass 2 (Render cubemap side 2)
  • ...
  • vkCmdCopyBuffer for UBO for cubemap side 6
  • TRANSFER -> VS Barrier
  • RenderPass 6 (Render cubemap side 6)
  • FRAG -> FRAG barrier
  • RenderPass 7 (Render cubemap into shadow atlas)

Running the mobile renderer on desktop probably won't benefit much from this split, but since desktop GPUs aren't TBDR anyway, it probably won't make much difference there either.

Uniform buffers
On desktop with dedicated GPUs we need to load data into uniform buffers that reside on GPU memory.
On mobile GPUs (and probably integrated GPUs) however we have unified memory, i.e. the CPU and GPU use the same memory chips.

But we're still creating and updating uniform buffers, some of which are pretty large (like our light buffers) and updated every frame. This means a lot of wasted bandwidth.

On mobile GPUs we can instead make uniform buffers map to our own data structures and use the source data. This does mean that we need to keep that source data around as we often destroy our buffers, and we need to make sure we don't start overwriting data if we're rendering multiple viewports and things like that.

Obviously this needs to be optional logic that detects whether we can map data or must upload it into GPU memory. But if we design the mobile renderer with unified memory in mind, the copy we would otherwise always pay is only introduced on dedicated GPUs.

RenderAreas
Already mentioned this in the OP but it deserves a spotlight. The TBDR architecture means we should be using render areas instead of viewports when rendering to our shadow atlas, or each render pass will push the whole render buffer through the tile system regardless of which tiles are affected.

Strangely, it seems that on desktop the opposite is true.

For now I've added a boolean (set to true for testing) that makes it use render areas; the code for this was already there but commented out, with a remark that the other path is faster on desktop. Further testing and switching logic are required.

Cubemap Shadows
Ok this was one thing that came out of discussing this with Clay. We currently render cubemap shadows to a proper cubemap, and then write a paraboloid representation of this data into our shadow atlas.

That's a lot for mobile and we should investigate alternatives that can be directly rendered into the shadow atlas.

@Calinou
Member

Calinou commented May 9, 2023

Cubemap Shadows
Ok this was one thing that came out of discussing this with Clay. We currently render cubemap shadows to a proper cubemap, and then write a paraboloid representation of this data into our shadow atlas.

That's a lot for mobile and we should investigate alternatives that can be directly rendered into the shadow atlas.

Dual paraboloid mode is still supported for omni lights in 4.0, but the default has been cubemaps since 3.0. This property is set on a per-light basis and is the same on desktop and mobile.

That said, dual paraboloid shadows suffer from lots of distortion if using unsubdivided meshes. Maybe look into tetrahedron shadows (4 faces), which don't suffer from as much distortion but should be faster to render than cubemaps.

Relevant quote from https://github.com/Calinou/tesseract-renderer-design (which only targets desktop hardware, so I think tetrahedral is still worth trying for mobile):

After experimenting with different projection setups for omnidirectional shadows such as tetrahedral (4 faces) or dual-parabolic (2 faces), it was found that the ordinary cubemap (6 faces) layout was best as the larger number of smaller frustums actually provides better opportunities for culling and caching of faces while providing the least amount of projection distortion. However, for multi-tap shadowmap filters, the native cubemap format is insufficient for easily computing the locations of neighboring taps. Also, despite texture arrays allowing for batching of many shadowmaps during a single rendering pass, they do not allow adequate control of sizing of individual shadowmaps and their partitions.

@clayjohn
Member

clayjohn commented May 9, 2023

@Calinou That quote is very interesting. Bastiaan and I were discussing comparing tetrahedral and octahedral shadow maps. Octahedral requires rendering to 8 faces, but the quality is comparable to using cubemaps and the texture lookup is much better than any of the other options.
