Performance issue #3
👍 integrating puffin sounds like a cool idea
I managed to do a small profiling of three interesting functions; in the next few days maybe I'll add some other suspicious functions. To analyze the *.puffin file, you must use puffin_viewer with the following commands: As you can see, some frames last up to 140ms. The integration with Puffin went well. I opted for local server/client communication in order to avoid having to integrate egui within the engine.
nice! I took a look at the profile. puffin seems like a cool little tool. makes sense that the render is taking up the most time, it's the most complicated part of the application and it's very likely that there are some performance bugs in there that I'm unaware of
You have to take into account that for now I record, through puffin, the time of only 3 functions; there is no automatic way to get a flamegraph without inserting a lot of macros in each function. It is still an interesting tool. Now that I think about it, I've noticed quite a few validation errors and warnings from WGPU. I have no idea if they existed before, but they are certainly something to take into account.
that's a good idea, I should look at those validation errors |
gonna do a bit of work on the optimization side in the coming weeks when I find the time. First order of business is the fact that I'm generating a separate buffer and draw call for each mesh that I import from the gltf scene. For example, loading the free_low_poly_forest asset from get_gltf_path() results in ~2500 draw calls happening each frame and my GPU utilization is very low, under 50%. I'm wondering if it's possible to reduce the number of draw calls in that type of situation. Or maybe I just need to optimize the asset in blender so it only exports 1 single mesh. Then once that work is done I'll see what the next biggest bottleneck is. Hopefully at the end of it we can get the engine running at 60fps on your computer :D
I see, I think you have to use a vertex buffer object (VBO): it is a buffer that remains resident in GPU memory and does not need updates at every frame. Basically, you can load all the static information (walls and models) into the GPU, while the dynamic information (where the models move) is updated at every frame. Unfortunately I don't know exactly how to configure a VBO, but I guess you need to use layouts, bind groups and indexes to tell shaders where to find what. This link may be helpful: In the next few days I will publish the branch I'm working on for Puffin integration so you can experiment a bit if you want.
So I am already making vertex buffers in the gltf_loader::build_geometry_buffers function, so once the gltf file is loaded into a bunch of Thanks for the do's and don'ts page, I forgot it existed :) The "group resource bindings by the change frequency, start from the lowest" part seems interesting, I'm going to try that out and see if it helps. If you'd be interested in helping out, you could maybe try running that free_low_poly_forest branch (just checkout and run commit 0594a7), then try opening up the gltf file in blender and see if you can merge all the separate objects/meshes into one big single object, export it into a new gltf file and see if it runs more smoothly after that. I wrote out some debug info that logs the number of meshes that get created (around 2500 for that gltf file) in the build_scene function.
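Merging the meshes at load time (instead of in Blender) is another option. Here's a minimal sketch of the idea, assuming hypothetical `Vertex` and `Mesh` types standing in for the engine's real ones: concatenate the vertex data and rebase each mesh's indices by the running vertex count, so everything can be submitted in a single draw call.

```rust
// Sketch: merge many small meshes into one vertex/index buffer pair so a
// single draw call can render them all. `Vertex` and `Mesh` are stand-ins
// for the engine's real types.
#[derive(Clone, Copy, Debug)]
struct Vertex {
    position: [f32; 3],
}

struct Mesh {
    vertices: Vec<Vertex>,
    indices: Vec<u32>,
}

fn merge_meshes(meshes: &[Mesh]) -> Mesh {
    let mut merged = Mesh { vertices: Vec::new(), indices: Vec::new() };
    for mesh in meshes {
        // Indices of this mesh are offset by the number of vertices
        // already accumulated in the merged buffer.
        let base = merged.vertices.len() as u32;
        merged.vertices.extend_from_slice(&mesh.vertices);
        merged.indices.extend(mesh.indices.iter().map(|i| i + base));
    }
    merged
}
```

The caveat is that this only works cleanly for meshes sharing the same material/pipeline; otherwise you still need one draw call per material group.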
oops, it's 1250 meshes, not 2500
Another thing to check out is whether the vertex buffers are using the correct type of memory. In vulkan it should be gpu-only memory, not host visible |
did some more careful inspection yesterday. found out there are quite a few slow parts in the renderer with the free_low_poly_forest scene loaded:
Weird. I remember that among the validation (or error) messages that wgpu (probably naga actually) gave were some regarding skin/bone indexes.
Oops! ^_^ Well, in the end this is good news: probably once you figure out how to optimize that part, everything will go much smoother.
Afaik shadows are pretty hard to do right and fast, so this part will probably keep us busy. From what I've read, it seems Godot3 implements a dynamic shadow mapping system. These are all things I barely know and could be wrong about, but I got the idea that dynamic shadow mapping could be used for shadows that only change in-game occasionally, not continuously frame by frame. In our context, continuously recreating the atlas with shadows every frame might cost a lot (but I could be wrong about how they are created and used). I was thinking about Carmack's Reverse technique but, after a little research, I discovered that it is patented (DOOM 3) and that it is probably slow for scenes with many vertices. Maybe there is some kind of usable Shadow Volume technique, but I'm starting to think that if Godot3 uses that method there is a good reason. It should be taken into account that Godot 4 uses a very advanced dynamic global lighting system, SDFGI: I had come across an atlas of (static) shadow maps in a bevy jam game, here it is:
Ok, reading a bit about the shadow atlas in Godot 3, I discovered that it allows rendering multiple lights in a single pass. Shadow Volume techniques, by contrast, must calculate every single light in the scene, which is only feasible if there are few vertices and/or lights. Godot3's shadow atlas has accuracy and size issues, which they corrected in Godot4:
ah right, I still need to take a look at those validation messages. I keep forgetting. Either way, I fixed the problem with get_all_bone_data, it was another double-nested looping situation. I really need to be more careful to avoid those :D. next I'll do RendererState::update.
Yes, it's made to generate a new shadow map once per frame to account for moving objects. I would like the shadow implementation to support moving objects but also be efficient. It should be doable! It's been standard in the gaming industry for a long time now, if I'm not mistaken. I agree godot3 is pretty old and has limitations, so maybe looking at godot's latest shadow implementation is a better idea. Would be cool to look at bevy's too. I've never heard of Carmack's Reverse before, I'm gonna read about it, sounds cool :).
Oh yeah, I've seen that video before, it's pretty amazing. Global illumination could be something to look at in the future; I don't know anything about that topic but am very interested in learning about it, as well as dynamic ambient occlusion and reflections. For GI I think a simpler method is voxel cone tracing, which is what they were doing in godot3, so maybe that's an easier starting point for learning before SDFGI, not sure tho. Confluence of Futility seems to be on Godot version 4 or higher, so the engine comes with built-in dynamic shadows. I think the folder
Could you enable the repository Discussions tab? I would like to continue the discussion about graphics there so as to keep this PR for performance issues. |
Should be done |
@CatCode79 could you check out the branch? Oh, I should add some profile macros to that branch too so you can show me what you get in puffin!
This is the Puffin log I recorded. I had to disable Bloom and Shadows programmatically because it's so slow that I'm not even sure if I pressed the keys correctly during the game. In general I can't tell you if it's better, since the scene is not the same as the master branch. But I updated the LunarG Vulkan SDK and played with the Vulkan Configurator: among the validation settings there is the possibility to choose a "Reduced-overhead preset", which more or less doubled the performance. Another particular thing: looking at a boulder, I noticed that the material is a bit transparent, allowing me to see the stars in the skybox. I don't know if it's a problem with the original model or not, but given how much transparency can cost, it must be taken into account.
More or less with this scene, i.e. the forest, I get an average of 9 FPS. With the master branch scene, I averaged 14 FPS. Maybe it's useful for you to know that puffin also has a macro to profile just a scope, and not only the whole function:
Damn, I was worried about it going too slow to press the keys, thanks for taking the time to check that. You're on windows right? Could you try running it with dx12? It's weird that the validation layers are enabled. For me on Linux they stay disabled in a release build unless I set the environment variable (forgot the name of it). And on windows it automatically runs dx12 for me, so I'm pretty confused. Yeah, I noticed that the sky shows up. I think it's due to the material on the object causing the skybox to be reflected on it. I might have made a mistake with the default fallback materials that get picked when there's no material in the gltf file.
it's almost as if somehow the project is getting compiled in debug mode except the dependencies are in release mode or something? or maybe puffin is misbehaving? very confused!
I will test this on my older pc tomorrow, I think its cpu is slower than yours, maybe I made a mistake |
Yes I'm on windows 10. I can't reproduce the problem so probably I'm just confused and the layers have always been there, just the configuration with less overhead makes the frames go better.
Yes, I tried again with some tests in both debug and release mode, and on my laptop the physics step is in the order of hundreds of microseconds (and not milliseconds like on your pc). My CPU is:
I think you need to set these lines in cargo.toml: But do you actually mean that there is a mix of crates compiled in debug and others in release? Weird, I'd do a cargo clean, cargo update, and cargo build --release to be sure. All this talk made me remember what I read here:
it could be useful if you use nalgebra for rapier. |
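The exact cargo.toml lines weren't included in the comment above, but given the context (dependencies seemingly optimized differently from the main crate), this is a hedged guess at the kind of profile tweaks usually meant in this situation:

```toml
# Hypothetical profile tweaks (not from the original comment): optimize
# dependencies even in dev builds, so a debug build of the engine doesn't
# drag heavy crates like wgpu or rapier down to opt-level 0.
[profile.dev.package."*"]
opt-level = 3

# Keep debug symbols in release builds so profilers like puffin can
# resolve function names.
[profile.release]
debug = true
```

This would also explain the "debug mode except the dependencies are in release mode" feeling: with the first section set, dependencies run optimized even when the project itself is a debug build.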
Yes, cargo clean and rebuild might be a good idea. I'd also double check your system environment variables to see if you have set a vulkan env var and forgot about it. https://youtu.be/_qRA-WS0aLA I also know that dx12 performs better than vulkan on windows, so it would be important to check if it runs well on dx12 on your laptop. If it runs badly, another thing we can check is running MSI Afterburner, or maybe just checking the task manager, to see if the cpu/gpu are reaching high utilization % values and whether they are reaching temperatures above ~75 degrees.
just to double check, did you modify the number of physics_balls in the scene? (
no, i just changed enable_shadows to false, enable_bloom to false. However from what little I could see (which at half frame per second is a challenge :D) there are no balls in the forest scene (optimization branch)
true, you're right, my fan makes so much noise sometimes that it makes me "uncomfortable". Personally I suspect that there are hidden render draws that kill performance, but it must be taken into account that perhaps we have run into a particular case in which the barriers self-managed by wgpu have a bug and degrade performance. That shouldn't be the case, since from wgpu version 0.13 they have improved performance and barrier handling a lot, but I know they still have more to do. There's still that open job of merging the gltf models into one for the forest, isn't there? It could improve the performance of the rendering part considerably. I'm also tempted to integrate tracy_full to have a full view of the messages bouncing between cpu and gpu; I suspect all the time is spent in barriers waiting for work from the gpu, or vice versa. Now I'll try to test with dx12; it should actually improve but, unless there are bugs in the vulkan drivers, only by a few percentage points.
DX12 backend, interesting result: Running Caused by: ', C:\Users\Gatto\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-0.14.0\src\backend\direct.rs:2403:5
@CatCode79 do you get the same error in a release build? I see you are running the debug build: target\debug\wgpu-sandbox.exe
this is definitely a problem though, I might have some ideas for how to mitigate |
this is a great idea btw! would be interested in seeing
In a real game yes that's what should be done, but I actually think it's good to leave the gltf file as-is because it makes for a good benchmark |
Same error on release build. The most useful WGPU issue I've found about it is this: Now that I remember, right now I'm compiling with nightly; the thiserror crate gives me a strange error telling me that it is using a nightly-only feature, and I worked around it by avoiding stable. But is wgpu-sandbox meant to be compiled on nightly or on stable? Tomorrow I'll try to integrate tracy_full, so we remove a lot of doubts once we have the complete picture.
wgpu supports stable. I think it'd be a good idea to try running it on stable |
ok, I fixed it and compiled both in debug and in release with stable; I always get the same error
ok it looks like this scene is allocating 2gb of vram LOL. I'll have to look into this ASAP |
turns out vulkan uses 1.8gb and dx12 uses 2gb, so it's just on the edge of crashing for your gpu |
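A quick sanity check on why textures dominate VRAM: an uncompressed RGBA8 texture costs width × height × 4 bytes, and a full mip chain adds roughly a third on top (each level is a quarter of the previous one, so the series converges to ~4/3 of the base). A tiny sketch of that arithmetic (hypothetical helper, not engine code):

```rust
// Back-of-the-envelope VRAM cost of an uncompressed RGBA8 texture.
// A full mip chain converges to ~4/3 of the base level's size.
fn texture_bytes_rgba8(width: u64, height: u64, mipmapped: bool) -> u64 {
    let base = width * height * 4; // 4 bytes per texel (R, G, B, A)
    if mipmapped { base * 4 / 3 } else { base }
}
```

At 4096×4096 that is about 85 MiB per texture with mips, so a couple dozen large textures are enough to reach the ~2 GB seen here, and block compression (BC/ASTC, e.g. via ktx2) would cut that by 4-8x.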
hey, so it turns out most of the memory usage is from the many large textures in the scene. I've disabled them in a test scene that you can get from this commit: f194d56 (optimization_disable_textures branch). Let me know if you get a chance to test that out, I wonder if the memory usage was the problem this whole time |
testing the optimization_disable_textures branch I get the same error as before with dx12; with vulkan instead I get this: thread 'main' panicked at 'attempt to subtract with overflow', src\game.rs:679:27 Note that before the crash I get a lot of these messages: [2022-11-29T09:51:10Z INFO wgpu_core::device] Created buffer Valid((6278, 1, Vulkan)) with BufferDescriptor { label: Some("GpuBuffer"), size: 5760, usage: INDEX, mapped_at_creation: true } They take up most of the game's loading time.
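For context, "attempt to subtract with overflow" is the debug-build behavior of unsigned subtraction in Rust when the result would be negative (release builds wrap silently instead, which is why the crash only shows up in debug). A small sketch of the usual fixes, independent of the actual game.rs code (the function names here are illustrative):

```rust
// `a - b` on unsigned integers panics in debug builds when b > a and
// silently wraps in release builds. Explicit alternatives keep both
// modes well-defined.

// Returns None instead of panicking when the subtraction would underflow.
fn delta_checked(a: u32, b: u32) -> Option<u32> {
    a.checked_sub(b)
}

// Clamps at zero; useful for timers/counters that must not go negative.
fn delta_saturating(a: u32, b: u32) -> u32 {
    a.saturating_sub(b)
}
```

If the quantity can genuinely be negative, casting both operands to a signed type before subtracting is the third option.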
2022-11-29_optimization_disable_textures.zip Wait! I did a cargo update and ran in release mode, and it works! Frames are between 45 and 50 per second. All tests were done with Shadows and Bloom turned off.
Amazing news! Thank you very much for the help :). I'll check out debug mode and your Puffin file to see if I can catch any more issues |
Big party tonight! ^_^ It occurred to me that there is this crate that implements a compression algorithm for textures (apparently the classic image compression algorithms work fine for storage but not for the gpu)
glad to hear it :) yup, texture compression is something we need to add! I never heard of ktx2, sounds like a good option. |
average of 1.3ms for the cpu time on ur machine now, I'm pretty happy with that! I think texture compression and shadow optimization are the two next low-hanging fruit |
Nice! Just today I received the Graphics Programming newsletter, which contains a couple of articles on the topic of (de)compression and ASTC: Another thing of interest: wgpu 0.14.2 is out with a bug fix. I don't think this bug affected us, but I would do a cargo update anyway.
oh nice, will definitely make the 0.14.2 update asap |
damn, 6 microseconds still seems way too slow to compute a node's transform, gotta find a better solution there I think. at least it's no longer a bottleneck. I wonder if get_next_texure is actually the problem or if it's just waiting on some kind of lock from the graphics API; we might need more fine-grained detail to understand what's happening there
Here's something interesting about it: |
I'm getting closer and closer with the frustum culling! Playing around with collision detection and an octree. It seems to be helping a lot with the forest scene, sometimes drawing only 200 objects instead of the 1200 😀. Hope to be able to PR soon, although the holidays won't help with that lol.
Cool! I read the discussion on the Discord channel, the gain in terms of performance is impressive. Perhaps a BVH structure in place of the octree could do more, but with the result you've already achieved I'd say stabilize the octree and keep it there until we know exactly whether it's significantly optimizable. They are right to say that everything is very scene-dependent; probably for now it's enough for us to implement naive or simple solutions while waiting to have a functioning game level, at which point we can think about which solution to use for that type of scene. Yes, I too have little time in this period, I'll try to make some small changes here and there.
I discovered, by reading this link, that:
This means that the 4 channels (instead of just 3) for the hdr texture are fine, so there will be no need to modify it anymore! The whole page has some useful tips; they probably apply to other kinds of hardware too, because in some cases they are things I've read about performance in general.
nice. which link was that? |
Oops! I forgot to add it: https://gpuopen.com/performance/
haha, I was wondering if you were in the rust gamedev channel. Add me on discord, would be nice to be able to chat outside github from time to time :) |
this one seems relevant to shadow mapping performance |
We have already talked about it here:
#2
But I prefer to open an issue to better focus on the problem.
Shadows Implementation is suspect number 1, but even disabling it in-game doesn't completely solve the slowness (on my notebook).
I'll try again with Nvidia Nsight, but just today a new version of Puffin was released, and as soon as I find some time I'll try it; if everything works, we will have cpu-side profiling (at least we will know for sure whether the engine is cpu-bound or gpu-bound).