Performance issue #3
👍 integrating puffin sounds like a cool idea
I managed to do a small profiling of three interesting functions; in the next few days maybe I'll add some other suspicious functions. To analyze the *.puffin file, you must use puffin_viewer with the following commands: As you can see, some frames last up to 140ms. The integration with Puffin went well. I opted for local server/client communication in order to avoid having to integrate egui within the engine.
nice! I took a look at the profile. puffin seems like a cool little tool. makes sense that the render is taking up the most time, it's the most complicated part of the application and it's very likely that there are some performance bugs in there that I'm unaware of
You have to take into account that for now I record, through puffin, the time of only 3 functions; there is no automatic way to get a flamegraph without inserting a lot of macros in each function. It is still an interesting tool. Now that I think about it, I've noticed quite a few validation errors and warnings from WGPU. I have no idea if they existed before, but they are certainly something to take into account.
that's a good idea, I should look at those validation errors |
gonna do a bit of work on the optimization side in the coming weeks when I find the time. First order of business is the fact that I'm generating a separate buffer and draw call for each mesh that I import from the gltf scene. For example, loading the free_low_poly_forest asset from get_gltf_path() results in ~2500 draw calls happening each frame and my GPU utilization is very low, under 50%. I'm wondering if it's possible to reduce the number of draw calls in that type of situation. Or maybe I just need to optimize the asset in blender so it only exports 1 single mesh. Then once that work is done I'll see what the next biggest bottleneck is. Hopefully at the end of it we can get the engine running at 60fps on your computer :D
I see, I think you have to use a vertex buffer object (VBO): it is a buffer that remains resident in GPU memory and does not need updates at every frame. Basically, you can load all the static information (walls and models) into the GPU, while the dynamic information (where the models move) is updated at every frame. Unfortunately I don't know exactly how to configure a VBO, but I guess you need to use layouts, bind groups and indexes to tell shaders where to find what. This link may be helpful: In the next few days I will publish the branch I'm working on for Puffin integration so you can experiment a bit if you want.
So I am already making vertex buffers in the gltf_loader::build_geometry_buffers function, so once the gltf file is loaded into a bunch of Thanks for the do's and don'ts page, I forgot it existed :) The "group resource bindings by the change frequency, start from the lowest" part seems interesting, I'm going to try that out and see if it helps. If you'd be interested in helping out, you could maybe try running that free_low_poly_forest branch (just checkout and run commit 0594a7), then try opening up the gltf file in blender and see if you can merge all the separate objects/meshes into one big single object, export it into a new gltf file and see if it runs more smoothly after that. I wrote out some debug info that logs the number of meshes that get created (around 2500 for that gltf file) in the build_scene function.
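Merging the meshes at load time (instead of in Blender) is another option. Here's a minimal sketch of the idea, assuming hypothetical `Vertex` and `Mesh` types standing in for the engine's real ones: concatenate the vertex data and rebase each mesh's indices by the running vertex count, so everything can be submitted in a single draw call.

```rust
// Sketch: merge many small meshes into one vertex/index buffer pair so a
// single draw call can render them all. `Vertex` and `Mesh` are stand-ins
// for the engine's real types.
#[derive(Clone, Copy, Debug)]
struct Vertex {
    position: [f32; 3],
}

struct Mesh {
    vertices: Vec<Vertex>,
    indices: Vec<u32>,
}

fn merge_meshes(meshes: &[Mesh]) -> Mesh {
    let mut merged = Mesh { vertices: Vec::new(), indices: Vec::new() };
    for mesh in meshes {
        // Indices of this mesh are offset by the number of vertices
        // already accumulated in the merged buffer.
        let base = merged.vertices.len() as u32;
        merged.vertices.extend_from_slice(&mesh.vertices);
        merged.indices.extend(mesh.indices.iter().map(|i| i + base));
    }
    merged
}
```

The caveat is that this only works cleanly for meshes sharing the same material/pipeline; otherwise you still need one draw call per material group.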
oops, it's 1250 meshes, not 2500
Another thing to check out is whether the vertex buffers are using the correct type of memory. In vulkan it should be gpu-only memory, not host visible |
did some more careful inspection yesterday. found out there are quite a few slow parts in the renderer with the free_low_poly_forest scene loaded:
Weird. I remember that among the validation (or error) messages that wgpu (probably naga actually) gave were some regarding skin/bone indexes.
Oops! ^_^ Well, in the end this is good news: probably once you figure out how to optimize that part, everything will go much smoother.
Afaik shadows are pretty hard to do right and fast, so this part will probably keep us busy. From what I've read, it seems Godot3 implements a dynamic shadow mapping system. These are all things I barely know and could be wrong about, but I got the idea that dynamic shadow mapping could be used for shadows that only change in-game occasionally, not continuously frame by frame. In our context, continuously recreating the atlas with shadows every frame might cost a lot (but I could be wrong about how they are created and used). I was thinking about Carmack's Reverse technique but, after a little research, I discovered that it is patented (DOOM 3) and that it is probably slow for scenes with many vertices. Maybe there is some kind of usable Shadow Volume technique, but I'm starting to think that if Godot3 uses that method there is a good reason. It should be taken into account that Godot 4 uses a very advanced dynamic global lighting system, SDFGI: I had come across an atlas of (static) shadow maps in a bevy jam game, here it is:
Ok, reading a bit about the shadow atlas in Godot 3, I discovered that it allows rendering multiple lights in a single pass. Shadow Volume techniques, by contrast, must calculate every single light in the scene, which is only feasible if there are few vertices and/or lights. Godot3's shadow atlas has accuracy and size issues, which they corrected in Godot4:
ah right, I still need to take a look at those validation messages. I keep forgetting. Either way, I fixed the problem with get_all_bone_data, it was another double-nested looping situation. I really need to be more careful to avoid those :D. next I'll do RendererState::update.
Yes, it's made to generate a new shadow map once per frame to account for moving objects. I would like the shadow implementation to support moving objects but also be efficient. It should be doable! It's been standard in the gaming industry for a long time now, if I'm not mistaken. I agree godot3 is pretty old and has limitations, so maybe looking at godot's latest shadow implementation is a better idea. Would be cool to look at bevy's too. I've never heard of Carmack's Reverse before, I'm gonna read about it, sounds cool :).
Oh yeah, I've seen that video before, it's pretty amazing. Global illumination could be something to look at in the future; I don't know anything about that topic but am very interested in learning about it, as well as dynamic ambient occlusion and reflections. For GI I think a simpler method is voxel cone tracing, which is what they were doing in godot3, so maybe that's an easier starting point for learning before SDFGI, not sure tho. Confluence of Futility seems to be on Godot version 4 or higher, so the engine comes with built-in dynamic shadows. I think the folder
Could you enable the repository Discussions tab? I would like to continue the discussion about graphics there so as to keep this PR for performance issues. |
Should be done |
@CatCode79 could you check out the branch? Oh, I should add some profile macros to that branch too so you can show me what you get in puffin!
This is the Puffin log I recorded. I had to disable Bloom and Shadows programmatically because it's so slow that I'm not even sure if I pressed the keys correctly during the game. In general I can't tell you if it's better, since the scene is not the same as the master branch. But I updated the LunarG Vulkan SDK and played with the Vulkan Configurator: among the validation settings there is the possibility to choose a "Reduced-overhead preset", which more or less doubled the performance. Another particular thing: looking at a boulder, I noticed that the material is a bit transparent, allowing me to see the stars in the skybox. I don't know if it's a problem with the original model or not, but given how much transparency can cost, it must be taken into account.
More or less with this scene, i.e. the forest, I get an average of 9 FPS. With the master branch scene, I averaged 14 FPS. Maybe it's useful for you to know that puffin also has a macro to profile just a scope, and not only the whole function:
Damn, I was worried about it going too slow to press the keys, thanks for taking the time to check that. You're on windows right? Could you try running it with dx12? It's weird that the validation layers are enabled. For me on Linux they stay disabled in a release build unless I set the environment variable (forgot the name of it). And on windows it automatically runs dx12 for me, so I'm pretty confused. Yeah, I noticed that the sky shows up. I think it's due to the material on the object causing the skybox to be reflected on it. I might have made a mistake with the default fallback materials that get picked when there's no material in the gltf file.
it's almost as if somehow the project is getting compiled in debug mode except the dependencies are in release mode or something? or maybe puffin is misbehaving? very confused!
I will test this on my older pc tomorrow, I think its cpu is slower than yours, maybe I made a mistake |
Yes I'm on windows 10. I can't reproduce the problem so probably I'm just confused and the layers have always been there, just the configuration with less overhead makes the frames go better.
Yes, I tried again with some tests in both debug and release mode, and on my laptop the physics step is in the order of hundreds of microseconds (and not milliseconds like on your pc). My CPU is:
I think you need to set these lines in cargo.toml: But do you actually mean that there is a mix of crates compiled in debug and others in release? Weird, I'd do a cargo clean, cargo update, and cargo build --release to be sure. All this talk made me remember what I read here:
it could be useful if you use nalgebra for rapier. |
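The exact cargo.toml lines weren't included in the comment above, but given the context (dependencies seemingly optimized differently from the main crate), this is a hedged guess at the kind of profile tweaks usually meant in this situation:

```toml
# Hypothetical profile tweaks (not from the original comment): optimize
# dependencies even in dev builds, so a debug build of the engine doesn't
# drag heavy crates like wgpu or rapier down to opt-level 0.
[profile.dev.package."*"]
opt-level = 3

# Keep debug symbols in release builds so profilers like puffin can
# resolve function names.
[profile.release]
debug = true
```

This would also explain the "debug mode except the dependencies are in release mode" feeling: with the first section set, dependencies run optimized even when the project itself is a debug build.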
Yes, cargo clean and rebuild might be a good idea. I'd also double check your system environment variables to see if you have set a vulkan env var and forgot about it. https://youtu.be/_qRA-WS0aLA I also know that dx12 performs better than vulkan on windows, so it would be important to check if it runs well on dx12 on your laptop. If it runs badly, another thing we can check is running MSI Afterburner, or maybe just checking the task manager, to see if the cpu/gpu are reaching high utilization % values and whether they are reaching temperatures above ~75 degrees.
just to double check, did you modify the number of physics_balls in the scene? (
no, i just changed enable_shadows to false, enable_bloom to false. However from what little I could see (which at half frame per second is a challenge :D) there are no balls in the forest scene (optimization branch)
true, you're right, my fan makes so much noise sometimes that it makes me "uncomfortable". Personally I suspect that there are hidden render draws that kill performance, but it must be taken into account that perhaps we have run into a particular case in which the barriers self-managed by wgpu have a bug and degrade performance. That shouldn't be the case, since from wgpu version 0.13 they have improved performance and barrier handling a lot, but I know they still have more to do. There's still that open job of merging the gltf models into one for the forest, isn't there? It could improve the performance of the rendering part considerably. I'm also tempted to integrate tracy_full to have a full view of the messages bouncing between cpu and gpu; I suspect all the time is spent in barriers waiting for work from the gpu, or vice versa. Now I'll try to test with dx12; it should actually improve but, unless there are bugs in the vulkan drivers, only by a few percentage points.
DX12 backend, interesting result: Running Caused by: ', C:\Users\Gatto\.cargo\registry\src\github.com-1ecc6299db9ec823\wgpu-0.14.0\src\backend\direct.rs:2403:5
@CatCode79 do you get the same error in a release build? I see you are running the debug build: target\debug\wgpu-sandbox.exe
this is definitely a problem though, I might have some ideas for how to mitigate |
this is a great idea btw! would be interested in seeing
In a real game yes that's what should be done, but I actually think it's good to leave the gltf file as-is because it makes for a good benchmark |
Same error on release build. The most useful WGPU issue I've found about it is this: Now that I remember, right now I'm compiling with nightly; the thiserror crate gives me a strange error telling me that it is using a nightly-only feature, and I worked around it by avoiding stable. But is wgpu-sandbox meant to be compiled on nightly or on stable? Tomorrow I'll try to integrate tracy_full, so we remove a lot of doubts once we have the complete picture.
wgpu supports stable. I think it'd be a good idea to try running it on stable |
ok, I fixed it and compiled both in debug and in release with stable; I always get the same error
ok it looks like this scene is allocating 2gb of vram LOL. I'll have to look into this ASAP |
turns out vulkan uses 1.8gb and dx12 uses 2gb, so it's just on the edge of crashing for your gpu |
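A quick sanity check on why textures dominate VRAM: an uncompressed RGBA8 texture costs width × height × 4 bytes, and a full mip chain adds roughly a third on top (each level is a quarter of the previous one, so the series converges to ~4/3 of the base). A tiny sketch of that arithmetic (hypothetical helper, not engine code):

```rust
// Back-of-the-envelope VRAM cost of an uncompressed RGBA8 texture.
// A full mip chain converges to ~4/3 of the base level's size.
fn texture_bytes_rgba8(width: u64, height: u64, mipmapped: bool) -> u64 {
    let base = width * height * 4; // 4 bytes per texel (R, G, B, A)
    if mipmapped { base * 4 / 3 } else { base }
}
```

At 4096×4096 that is about 85 MiB per texture with mips, so a couple dozen large textures are enough to reach the ~2 GB seen here, and block compression (BC/ASTC, e.g. via ktx2) would cut that by 4-8x.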
hey, so it turns out most of the memory usage is from the many large textures in the scene. I've disabled them in a test scene that you can get from this commit: f194d56 (optimization_disable_textures branch). Let me know if you get a chance to test that out, I wonder if the memory usage was the problem this whole time |
testing the optimization_disable_textures branch I get the same error as before with dx12; with vulkan instead I get this: thread 'main' panicked at 'attempt to subtract with overflow', src\game.rs:679:27 Note that before the crash I get a lot of these messages: [2022-11-29T09:51:10Z INFO wgpu_core::device] Created buffer Valid((6278, 1, Vulkan)) with BufferDescriptor { label: Some("GpuBuffer"), size: 5760, usage: INDEX, mapped_at_creation: true } They take up most of the game's loading time.
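For context, "attempt to subtract with overflow" is the debug-build behavior of unsigned subtraction in Rust when the result would be negative (release builds wrap silently instead, which is why the crash only shows up in debug). A small sketch of the usual fixes, independent of the actual game.rs code (the function names here are illustrative):

```rust
// `a - b` on unsigned integers panics in debug builds when b > a and
// silently wraps in release builds. Explicit alternatives keep both
// modes well-defined.

// Returns None instead of panicking when the subtraction would underflow.
fn delta_checked(a: u32, b: u32) -> Option<u32> {
    a.checked_sub(b)
}

// Clamps at zero; useful for timers/counters that must not go negative.
fn delta_saturating(a: u32, b: u32) -> u32 {
    a.saturating_sub(b)
}
```

If the quantity can genuinely be negative, casting both operands to a signed type before subtracting is the third option.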
2022-11-29_optimization_disable_textures.zip Wait! I did a cargo update and ran in release mode, and it works! Frames are between 45 and 50 per second. All tests were done with Shadows and Bloom turned off.
Amazing news! Thank you very much for the help :). I'll check out debug mode and your Puffin file to see if I can catch any more issues |
Big party tonight! ^_^ It occurred to me that there is this crate that implements a compression algorithm for textures (apparently the classic image compression algorithms work fine for storage but not for the gpu)
glad to hear it :) yup, texture compression is something we need to add! I never heard of ktx2, sounds like a good option. |
average of 1.3ms for the cpu time on ur machine now, I'm pretty happy with that! I think texture compression and shadow optimization are the two next low-hanging fruit |
Nice! Just today I received the Graphics Programming newsletter, which contains a couple of articles on the topic of (de)compression and ASTC: Another thing of interest: wgpu 0.14.2 is out with a bug fix. I don't think this bug affected us, but I would do a cargo update anyway.
oh nice, will definitely make the 0.14.2 update asap |
damn, 6 microseconds still seems way too slow to compute a node's transform, gotta find a better solution there I think. at least it's no longer a bottleneck. I wonder if get_next_texure is actually the problem or if it's just waiting on some kind of lock from the graphics API; we might need more fine-grained detail to understand what's happening there
Here's something interesting about it: |
I'm getting closer and closer with the frustum culling! Playing around with collision detection and an octree. It seems to be helping a lot with the forest scene, sometimes drawing only 200 objects instead of the 1200 😀. Hope to be able to PR soon, although the holidays won't help with that lol.
Cool! I read the discussion on the Discord channel, the gain in terms of performance is impressive. Perhaps a BVH structure in place of the octree could do more, but with the result you've already achieved I'd say stabilize the octree and keep it there until we know exactly whether it's significantly optimizable. They are right to say that everything is very scene-dependent; probably for now it's enough for us to implement naive or simple solutions while waiting to have a functioning game level, at which point we can think about which solution to use for that type of scene. Yes, I too have little time in this period, I'll try to make some small changes here and there.
I discovered, by reading this link, that:
This means that the 4 channels (instead of just 3) for the hdr texture are fine, so there will be no need to modify it anymore! The whole page has some useful tips; they probably apply to other kinds of hardware too, because in some cases they are things I've read about performance in general.
nice. which link was that? |
Oops! I forgot to add it: https://gpuopen.com/performance/
haha, I was wondering if you were in the rust gamedev channel. Add me on discord, would be nice to be able to chat outside github from time to time :) |
this one seems relevant to shadow mapping performance |
We have already talked about it here:
#2
But I prefer to open an issue to better focus on the problem.
Shadows Implementation is suspect number 1, but even disabling it in-game doesn't completely solve the slowness (on my notebook).
I'll try again with Nvidia Nsight, but just today a new version of Puffin was released, and as soon as I find some time I'll try it; if everything works, we will have cpu-side profiling (at least we will know for sure whether the engine is cpu-bound or gpu-bound).