Declining performance with unreleased 0.3.x vs 0.2.2 #350

John-Nagle · 2022-02-05T19:12:55Z

(Related to #348, but not entirely about memory bloat.)

Frame rate has dropped since 0.2.2.

With Rend3 0.2.2, my program ran at about 58 FPS on this test case, which was fine. I lost about 10 FPS with the new version. Frame rate will drop as low as 23 FPS if peak memory usage exceeds the GPU's memory. It will also drop when other threads are loading textures and meshes.

This is the ideal case for frame rate. The entire scene, all textures and meshes, are all in the GPU. The camera is not moving. The same image is being displayed over and over. No mipmapping. Rend3 Unreleased (pre 0.3.x). Ubuntu 20.04 LTS. Nvidia 3070. Ryzen 5 with 6 cores, 12 hyperthreads.

Here's an initial Tracy profile. Call stacks are not being captured, so this is rather coarse data. My own code is barely doing anything here; it and the window event system are using 32us per frame.

The problems:

Basic refresh is too slow and is CPU-bound.
Adding meshes and textures from other threads impacts the refresh rate severely, although it's not supposed to.
Performance is getting worse as more features go into Rend3.

I'm attempting to build, in Rust, a client for Second Life / Open Simulator, because the existing C++ OpenGL clients are single-thread and too slow. Rend3's performance used to be well above the existing programs, but it's now not much better, and in some cases is worse. This is puzzling, because Rend3 is using Vulkan and has all the data resident in the GPU, while the C++ clients are using OpenGL and making huge numbers of draw calls.

I'm very concerned about this. If Rend3's performance doesn't improve substantially, my whole effort was a waste.

John-Nagle · 2022-02-05T19:15:05Z

cwfitzgerald · 2022-02-05T22:13:36Z

This is a longer form comment of what I posted on the matrix:

I first want to emphasize that it may take a little bit to iron out all these issues, but they are all definitely fixable in one way or another.

"Basic refresh is too slow and is CPU-bound."

This Tracy trace is actually quite helpful, as it shows me clearly that your program has different performance characteristics from the ones I've been testing, and it gives me some hints about what might be causing it.

How many materials and how many textures do you have? Material upload seems to take a while, and I've identified a texture-related bottleneck inside wgpu's run_render_pass before -- which could explain the performance with mipmapping (more mips = more textures to track).

"Adding meshes and textures from other threads impacts the refresh rate severely, although it's not supposed to."

I currently still do the mesh work on the main thread because there would be synchronization issues if I didn't. The mesh upload is not particularly optimized at the moment, and I have plans for how to improve that.

Even if I do split out the mesh work onto another thread, making both this and textures truly multithreaded is blocked on gfx-rs/wgpu#2272. This is something I want to get to, but it is quite a large task.

"Performance is getting worse as more features go into Rend3."

I haven't noticed any performance regressions in my testing, so I want to put together some kind of test case that has similar traits to yours so that I can keep track of performance in this use case. This is also something I want to put together in general as I need to ensure each particular way of using rend3 improves (and doesn't regress) in performance.

"I'm very concerned about this. If Rend3's performance doesn't improve substantially, my whole effort was a waste."

Finally, I do want to say that these problems are all fixable. I can't promise it will get done immediately. I'm currently but one person (though @setzer22 has recently joined the project 👋🏻) and have a ton on my plate, but everything will be fixed. My goal, as progress goes on, is that rend3's performance should end up well above what it was in 0.2. I think this is totally achievable.

cwfitzgerald · 2022-02-05T23:57:18Z

Copying some numbers and conclusions from our discussion on the matrix here so I don't lose them. Scene stats:

Loaded: 14231 meshes, 19765372 vertices, 58364409 triangles, 12574 textures, 311291520 texture bytes.
Reused: meshes: 39054, textures: 32718
Prims: mesh generated: 0, mesh reused: 0
Textures in use: 12018, peak 12018, Texture bytes in use: 308087680, peak 308087680

Todos:

I have a hunch there's a single extra function call I make in 0.3 that could make a big difference for your case, which I want to test.

Backport that PR so you can revert to 0.2 for the time being.

Autogenerate a test scene that has artificially high texture/material counts to replicate the symptoms.

Make material updates sparse -- currently I upload all materials to the GPU every frame, which is a lot of work for both the CPU and GPU with so many. It's non-trivial but not difficult to upload only the diffs.

Improve the texture tracking code in wgpu to be faster. I'm not sure how it'll be done, but it has to be, and I've had my eye on it for a bit.
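
The "upload only the diffs" idea for materials can be sketched with a dirty set over a CPU-side mirror of the material buffer. This is a minimal illustration, not rend3's implementation: `GpuMaterial`, `MaterialTable`, and `take_upload_ranges` are hypothetical names, and the actual GPU buffer write is left out. Each returned range would correspond to one buffer write instead of a full re-upload.

```rust
use std::collections::BTreeSet;

/// Hypothetical fixed-size material record (not rend3's real layout).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct GpuMaterial {
    pub albedo_texture: u32,
    pub roughness: f32,
}

/// CPU-side mirror of the material buffer plus the slots changed since the last frame.
pub struct MaterialTable {
    materials: Vec<GpuMaterial>,
    dirty: BTreeSet<usize>,
}

impl MaterialTable {
    pub fn new() -> Self {
        Self { materials: Vec::new(), dirty: BTreeSet::new() }
    }

    /// Adds a material and marks its slot dirty so it is uploaded once.
    pub fn add(&mut self, material: GpuMaterial) -> usize {
        let index = self.materials.len();
        self.materials.push(material);
        self.dirty.insert(index);
        index
    }

    /// Updates a material in place; a write that changes nothing is not marked dirty.
    pub fn update(&mut self, index: usize, material: GpuMaterial) {
        if self.materials[index] != material {
            self.materials[index] = material;
            self.dirty.insert(index);
        }
    }

    /// Drains the dirty set into contiguous (start, end-exclusive) slot ranges.
    pub fn take_upload_ranges(&mut self) -> Vec<(usize, usize)> {
        let mut ranges: Vec<(usize, usize)> = Vec::new();
        // BTreeSet iterates in ascending order, so adjacent slots can be merged.
        for &index in &self.dirty {
            if let Some(last) = ranges.last_mut() {
                if last.1 == index {
                    last.1 = index + 1;
                    continue;
                }
            }
            ranges.push((index, index + 1));
        }
        self.dirty.clear();
        ranges
    }
}
```

In a static scene like the one described in this issue, `take_upload_ranges` returns an empty list after the first frame, so no material data would need to be sent at all.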

John-Nagle · 2022-02-06T05:33:29Z

"I first want to emphasize that it may take a little bit to iron out all these issues, but they are all definitely fixable in one way or another."

That's good to hear.

"Material upload seems to take a bit"

Most changes to materials already in use are only changes to the texture handles involved. While anything can change, usually, most things don't. If an API call for changing only texture handles would help performance, I could make such calls.

"I currently still do the mesh work on the main thread because there would be synchronization issues if I didn't do this. "

That explains some things. Back in October when I made that video, I was loading all the meshes with no textures, and then turned on concurrent texture loading. Performance looked good back then. Then I started loading meshes from one thread and textures from another, while refreshing from a third thread. Performance dropped to around 20 FPS at times while meshes and textures were being loaded. Loading textures still degrades the frame rate but not, it seems, as badly as loading meshes.

"I haven't noticed any performance regressions in my testing, so I want to put together some kind of test case that has similar traits to yours so that I can keep track of performance in this use case."

All those non-reused textures and meshes are a problem. But that's user-created content for you. If the NFT metaverse crowd ever actually gets 3D worlds going, they'll face that. By the way, Unreal Engine 5's Nanite system is heavily dependent on reusing instances of objects. In their world, a mesh is a directed acyclic graph in which subsections of the mesh are shared. Something like a chain-link fence is represented by a very small number of unique mesh parts shared within a single data structure. It's very clever, but their demos rely heavily on instancing.

cwfitzgerald · 2022-02-07T03:22:29Z

Good news, I can reproduce this with a simple code-based test case. 10k meshes/materials/textures repros nicely.
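
The actual test case is not shown in this thread. As a rough sketch of what "10k meshes/materials/textures" with no sharing could look like, the generator below produces one unique mesh, material, and texture ID per object. `SyntheticObject` and `generate_scene` are hypothetical names, not rend3 types, and the step that hands each object to the renderer is omitted.

```rust
/// Hypothetical descriptor for one generated object; these are not rend3 types.
#[derive(Debug, Clone, PartialEq)]
pub struct SyntheticObject {
    pub mesh_id: usize,
    pub material_id: usize,
    pub texture_id: usize,
    /// Single RGBA texel, unique per object so no texture can be shared.
    pub texel: [u8; 4],
    /// Position on a square grid so objects do not all overlap.
    pub position: [f32; 3],
}

/// Generates `count` objects that share nothing with each other.
pub fn generate_scene(count: usize) -> Vec<SyntheticObject> {
    // Side length of the smallest square grid that holds `count` objects.
    let side = ((count as f64).sqrt().ceil() as usize).max(1);
    (0..count)
        .map(|i| SyntheticObject {
            mesh_id: i,
            material_id: i,
            texture_id: i,
            // Encode the index in the colour channels (unique up to 2^24 objects).
            texel: [
                (i & 0xff) as u8,
                ((i >> 8) & 0xff) as u8,
                ((i >> 16) & 0xff) as u8,
                255,
            ],
            position: [(i % side) as f32, 0.0, (i / side) as f32],
        })
        .collect()
}
```

Calling `generate_scene(10_000)` yields a 100 x 100 grid in which every object needs its own mesh, material, and texture, which is the no-reuse pattern the scene stats above describe.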


"By the way, Unreal Engine 5's Nanite system is heavily dependent on reusing instances of objects."

Interesting, I knew about the rendering tech but I never looked too much into how it's actually stored. That makes sense.

John-Nagle · 2022-02-07T03:24:57Z

Oh, good. A simple test case always helps.
Meanwhile, now that I have Tracy profiling running, I'm looking at how the threads are interacting. I'll have more to say on that soon.

John-Nagle · 2022-02-07T04:34:18Z

Profiling just as texture loading caught up. This shows the difference between frame times while textures are being loaded from other threads, and while they are not. Around 35ms/frame while textures are being loaded, down to 23ms/frame once loading is done.
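
For comparison with the FPS figures quoted elsewhere in this thread, frame time and frame rate are reciprocals of each other. The helper names below are illustrative only.

```rust
/// Converts a frame time in milliseconds to frames per second.
pub fn fps_from_frame_ms(frame_ms: f64) -> f64 {
    1000.0 / frame_ms
}

/// Converts frames per second to a frame time in milliseconds.
pub fn frame_ms_from_fps(fps: f64) -> f64 {
    1000.0 / fps
}
```

With the numbers above, 35 ms/frame is roughly 29 FPS and 23 ms/frame is roughly 43 FPS, so texture loading costs about 12 ms per frame in this capture. The 58 FPS reported for 0.2.2 corresponds to about 17 ms/frame.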

"triage suspected" accounts for some of the difference, but not all of it.

cwfitzgerald · 2022-02-07T15:59:28Z

This problem has totally nerd sniped me. Been faffing about in wgpu trying to get performance improvements, and so far have gotten my demo from 39fps up to 100fps. I still need to upstream the changes, which will require being less hacky with my changes, but that should all happen.

John-Nagle · 2022-02-07T19:26:12Z

That's great! I'm working on profiling my own stuff now.

Tracy isn't showing all my threads, even ones that are using substantial CPU. Not clear why. That capture above should have shown three more threads which do different things, not just the main thread and the multiple asset loader threads. Any ideas? I just started using Tracy and probably missed something.

cwfitzgerald · 2022-02-07T23:25:26Z

Tracy will only show threads that have spans on them, so if you want your threads to show up, you need to annotate the work they do with spans (you can use the profiling crate for this).
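
In real code the annotation would typically be something like `profiling::scope!("load_mesh")` at the top of the work being measured. To show why unannotated threads are invisible, here is a standard-library-only sketch of the same idea: a span guard that reports to a recorder when dropped. `SpanLog`, `Span`, and `run_demo` are hypothetical stand-ins, not the Tracy or profiling API.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a profiler backend: spans grouped by thread name.
pub type SpanLog = Arc<Mutex<BTreeMap<String, Vec<(&'static str, Duration)>>>>;

/// RAII guard that records a span for the current thread when it is dropped.
pub struct Span {
    name: &'static str,
    start: Instant,
    log: SpanLog,
}

impl Span {
    pub fn enter(name: &'static str, log: &SpanLog) -> Span {
        Span { name, start: Instant::now(), log: Arc::clone(log) }
    }
}

impl Drop for Span {
    fn drop(&mut self) {
        let thread_name = thread::current().name().unwrap_or("unnamed").to_string();
        let mut log = self.log.lock().unwrap();
        log.entry(thread_name).or_default().push((self.name, self.start.elapsed()));
    }
}

fn busy_work() -> u64 {
    (0..100_000u64).sum()
}

/// Runs two busy threads, only one of which opens a span, and returns
/// the names of the threads the recorder knows about afterwards.
pub fn run_demo() -> Vec<String> {
    let log: SpanLog = Arc::new(Mutex::new(BTreeMap::new()));

    let annotated_log = Arc::clone(&log);
    let annotated = thread::Builder::new()
        .name("mesh-loader".into())
        .spawn(move || {
            let _span = Span::enter("load_mesh", &annotated_log);
            busy_work()
        })
        .unwrap();

    // This thread does the same work but never opens a span.
    let silent = thread::Builder::new()
        .name("priority-queue".into())
        .spawn(busy_work)
        .unwrap();

    annotated.join().unwrap();
    silent.join().unwrap();

    let names: Vec<String> = log.lock().unwrap().keys().cloned().collect();
    names
}
```

Both threads consume CPU, but only "mesh-loader" appears in the result, because the recorder only learns about a thread when a span is reported from it.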

John-Nagle · 2022-02-07T23:57:37Z

Ah. That's it. Thanks. More profiling data soon.

John-Nagle · 2022-02-08T04:15:13Z

More profiling data. Large Tracy file:
babbagepalisade01.zip

This is the usual Babbage Palisade scene, from startup through loading to just sitting there refreshing. The part at the end, where the CPU load drops way down, is when the scene is just redrawing without changes.

What all those threads are doing:

Main thread - the window events and refreshing. Nothing else. This is Rend3's main thread.
Mesh loader - reading large JSON files, creating meshes, feeding them to Rend3. If you zoom in far enough, you'll see "Add mesh", which is the actual call to renderer.add_mesh(). It's so fast that it's clear it's queuing an instruction for the main thread, not doing the work in the calling thread.
Asset loader - not doing all that much here, because its main job is to start the loading of textures at various LODs as the camera moves, and in this run, the camera is stationary.
Asset fetcher 0 .. 4. These are loading textures. For this run, they're all in the local file system cache, so there's little network I/O except for some that go out to the servers and get 404 errors. Mostly this is loading .PNG files from the cache. If you zoom in far enough you'll see "Add texture 2D" and "write texture" down in Rend3.
Priority queue - priority queue manager for the asset fetcher threads. Doesn't use much time.
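
The guess about "Add mesh" above, that the call only queues an instruction for the main thread, matches a common pattern that can be sketched with a channel. This is an assumption-laden illustration, not rend3's code: `Instruction`, `RendererHandle`, `RenderThreadState`, and `new_renderer` are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical instruction type; rend3's internal representation may differ.
pub enum Instruction {
    AddMesh { vertex_count: usize },
    AddTexture { bytes: usize },
}

/// Handle given to loader threads: calls only enqueue work and return immediately.
#[derive(Clone)]
pub struct RendererHandle {
    sender: Sender<Instruction>,
}

impl RendererHandle {
    pub fn add_mesh(&self, vertex_count: usize) {
        self.sender
            .send(Instruction::AddMesh { vertex_count })
            .expect("render thread has exited");
    }

    pub fn add_texture(&self, bytes: usize) {
        self.sender
            .send(Instruction::AddTexture { bytes })
            .expect("render thread has exited");
    }
}

/// Owned by the render thread, which does the real work at the start of a frame.
pub struct RenderThreadState {
    receiver: Receiver<Instruction>,
    pub meshes: usize,
    pub vertices: usize,
    pub texture_bytes: usize,
}

impl RenderThreadState {
    /// Drains everything queued so far and returns how many instructions ran.
    /// The cost of the queued work is paid here, on the render thread.
    pub fn process_pending(&mut self) -> usize {
        let mut processed = 0;
        for instruction in self.receiver.try_iter() {
            match instruction {
                Instruction::AddMesh { vertex_count } => {
                    self.meshes += 1;
                    self.vertices += vertex_count;
                }
                Instruction::AddTexture { bytes } => self.texture_bytes += bytes,
            }
            processed += 1;
        }
        processed
    }
}

/// Creates a connected handle/state pair.
pub fn new_renderer() -> (RendererHandle, RenderThreadState) {
    let (sender, receiver) = channel();
    let state = RenderThreadState { receiver, meshes: 0, vertices: 0, texture_bytes: 0 };
    (RendererHandle { sender }, state)
}
```

Under this design the loader thread's call looks nearly free in a profile, while the frame that drains the queue absorbs the cost, which is consistent with frame rate dropping during loading.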

Notes:

Highest CPU usage is 59% over all 6 cores, 12 hyperthreads, so we're not out of compute power. I want to make the asset fetchers run their work at lower priority, but that's not in yet. There might be a priority inversion problem if that was done, and they went into Rend3 at low priority and held locks there.
46-48 FPS in the final stable state where the scene is not changing and all the loading code has gone idle.
Tracy Profiler 0.7.8 can read this, once unzipped.

So that's more detail.

cwfitzgerald · 2022-02-19T01:17:14Z

Just giving an update on the tracking performance improvements -- unfortunately, this was a regression for other wgpu projects, so it couldn't be brought in as a whole. That being said, I have some good ideas for improving both cases. For now, I would stick with 0.2 while these get sorted out. I can't promise they will happen with any speed, given how much is on my plate right now.

Will work on the backport shortly.

John-Nagle · 2022-02-19T07:24:57Z

Thanks.

I've converted over to "unreleased" from a few weeks back, and it's working well, although sluggish on big scenes. I'm working on another part of the system, concurrent mesh loading, and that's keeping me busy. So don't worry about the backport too much. The general speedup is more useful at this point.

cwfitzgerald · 2024-04-23T04:57:22Z

Closing due to #593

cwfitzgerald added the labels client: animats-viewer, module: core, module: routines, and tracking on Feb 5, 2022

John-Nagle mentioned this issue Apr 13, 2022

[WIP] Arcanization gfx-rs/wgpu#2272

Closed

cwfitzgerald mentioned this issue Apr 21, 2022

Bind group deduplication gfx-rs/wgpu#2623

Merged

cwfitzgerald mentioned this issue May 15, 2022

Tracking Optimization and Rewrite gfx-rs/wgpu#2662

Merged

7 tasks

John-Nagle mentioned this issue Mar 18, 2023

New Rend3 4d10795 about 3x slower3 than old Rend3 f2b7df4 on low end GPU #477

Closed

cwfitzgerald closed this as completed Apr 23, 2024
