Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Animation Performance Problems #101494

Open
mrjustaguy opened this issue Jan 13, 2025 · 24 comments
Open

Animation Performance Problems #101494

mrjustaguy opened this issue Jan 13, 2025 · 24 comments

Comments

@mrjustaguy
Copy link
Contributor

mrjustaguy commented Jan 13, 2025

Tested versions

Reproduced 4.4 Dev 7, 4.3 Stable
Not Reproduced 3.6 Stable

System information

Windows 11, i3 10105f

Issue description

I've noticed that Animation Tree and Animation Player seem to have a Ludicrously high cost

For Reference, in my game to have 60 FPS, I can have ~100 AI characters with active animation players and trees, but 3-4 times as many with them disabled.
Each of the AI's is dynamically typed GDScript, that isn't even all that optimized, gathering data of their environment and communicating with each other, moving around with Godot's physics and navigation calculations running in the background of all of that.

The GPU is not the limitation, given it's taking only a couple of milliseconds per frame to render and changing GPU taxing settings does nothing

Steps to reproduce

Open MRP, Observe poor performance, go to the instance, and set process mode of the animation tree and player to disabled, Observe Massive Improvement

Note: The MRP went a bit extreme (some 2.4k instances) to tank the performance even harder for those more powerful systems out there (It's a slide show on mine in the editor with animations activated in the instances)

Minimal reproduction project (MRP)

Animation.zip

@fire
Copy link
Member

fire commented Jan 13, 2025

@mrjustaguy Can you see if there is a version of godot 4 where it has better performance?

We can't fix this in the Godot 4.4 release, but we could look into this for Godot Engine 4.5.

@fire
Copy link
Member

fire commented Jan 13, 2025

Are you expecting Godot Engine to handle 2_400 animated characters?

An older comparison on https://vxtwitter.com/duroxxigar/status/1779234802048053443 says

Test case 500 skeleton meshes playing on loop:

Flax: 132 FPS
Godot Engine: 23 FPS
Unreal Engine (Faster now): 28 FPS

A faster animation subsystem requires some changes to the architecture if you want it to be like 5x faster.

@mrjustaguy
Copy link
Contributor Author

mrjustaguy commented Jan 13, 2025

I'm not expecting 2.4k animated characters, and this skeleton issue is a separate issue, where the skeletons themselves are just expensive, but this has a lot to do with the Rendering Code for them being expensive on the CPU for some reason (Reducing the number of Skeletal Mesh Surfaces rendered a frame heavily helps here)

What I am saying is the Animation system is obscenely heavy for something that's mostly just doing some math blending between various values plugged in, and that much more complicated logic and math (most of it implemented in a non performant oriented language, without attempts at optimizing the logic) is running orders of magnitude faster

I've tested 3.6 now, and It's running MUUUUCH better with the same 2.4k animation player&tree setup, and isn't a slide show like 4.4 is.

Here's the 3.6 project

3.6.zip

For reference I've gotten to 3x the balls in 3.6 before it getting into slide show territory, and I'd likely match 4.4 with 4x-5x the balls but the Editor is having trouble spawning them in 3.6, and it froze when trying to add more :/

@clayjohn
Copy link
Member

This is a nice MRP actually as it highlights the cost of AnimationTree and AnimationPlayer separately from Skeleton3D.

Here is what the profiler shows:

Image

So it looks like AnimationTree is the real culprit here. Drilling down into the function calls, it seems to be some sort of recursize set of calls to AnimationNode::process():

Image

@clayjohn
Copy link
Member

clayjohn commented Jan 13, 2025

I've tested 3.6 now, and It's running MUUUUCH better with the same 2.4k animation player&tree setup, and isn't a slide show like 4.4 is.

Here's the 3.6 project

3.6.zip

Tested on an M2 MBP:
3.6: 6.2 mspf
4.2.2: 11.75 mspf
4.3: 13.5 mspf
4.4-dev7: 11.0 mspf
b3a44e8: 10.8 mspf
#101285: 10.7 mspf

It is approximately twice as fast on my device. There is a lot of room for improvement

@fire
Copy link
Member

fire commented Jan 13, 2025

I'd like to see if we can get the 4.5 to match godot 3.6 on the vrm with an animation player & tree.

Thanks for creating a minimal reproduction project.

@TokageItLab
Copy link
Member

TokageItLab commented Jan 13, 2025

The major change there due to the change from 3.x to 4.0 is the use of time structs NodeTimeInfo, is that relevant?

Also I remember that recently what affects the whole thing is the generation of AnimationInctance, which was implemented in 4.3. Maybe we should compare 4.2 and 4.3.

I also think that the performance impact could be particularly significant with the inclusion of string processing and NodePath scene search, but this has existed since 3.x, so it may not matter much.

For now, there may be some improvements that can be made through cpp language features if it is a problem with struct or other qualifiers. Of course, some manual optimizations such as reducing push_back, caching pointers, etc. should also be made.

@clayjohn
Copy link
Member

The major change there due to the change from 3.x to 4.0 is the use of time structs NodeTimeInfo, is that relevant?

Possibly, I'm seeing a lot of time is spent in get_node_time_info() calls, in String operations (there is a lot of String creation, addition, and deletion happening at run time), and in make_animation_instance()

Overall there appears to be a lot of object creation happening on the heap for stuff that is being created and destroyed regularly. Ideally, the animation system wouldn't allocate from the heap regularly like that. In particular, I see the cost of creating and Vectors and Strings coming up a lot.

I also see a lot of indexing into hash maps. We should investigate using a better data structure wherever we can to avoid the cost of hashing so frequently

@clayjohn
Copy link
Member

clayjohn commented Jan 13, 2025

HashMap hashes

Here is some low hanging fruit. There is a pattern of checking if a hashmap has() some StringName, then either reading or writing to the StringName. This hashes the StringName twice and makes the check twice as expensive overall.

Ref<Animation> AnimationMixer::get_animation(const StringName &p_name) const {
ERR_FAIL_COND_V_MSG(!animation_set.has(p_name), Ref<Animation>(), vformat("Animation not found: \"%s\".", p_name));
const AnimationData &anim_data = animation_set[p_name];
return anim_data.animation;
}

(Even worse, in some places we call has_animation(), then get_animation right after. Which means that we do the hash 3 times to read the value once)

(same problem with can_apply_reset())

Using getptr instead allows us to considate the check and cut the cost in half. stuartcarnie did this optimization in a few places in: #96875

String addition

The String operations are really deadly. Over 10% of the time in the MRP is spent here:

// This is the slowest part of processing, but as strings process in powers of 2, and the paths always exist, it will not result in that many allocations.
if (p_new_parent) {
new_parent = p_new_parent;
new_path = String(node_state.base_path) + String(p_subpath) + "/";
} else {
ERR_FAIL_NULL_V(node_state.parent, NodeTimeInfo());
new_parent = node_state.parent;
new_path = String(new_parent->node_state.base_path) + String(p_subpath) + "/";
}

@Nazarwadim
Copy link
Contributor

You need to disable AnimationMixer active in AnimationPlayer and you will get the same fps as in 3.6.

output.mp4

But while I noticed it, I found RBMap and two hashmaps that I forgot to change to AHashMap, which add 15% to the performance.

@mrjustaguy
Copy link
Contributor Author

I've been unable to reproduce that change on my end

@Nazarwadim
Copy link
Contributor

Nazarwadim commented Jan 14, 2025

If you tested in the editor, then this is a problem of poor performance in the editor.
I also have about 5 fps in the editor. And normal in 3.6.

@mrjustaguy
Copy link
Contributor Author

No I've run the project, as I'm getting the data from Adrenalin which requires the game running at fullscreen, and disabling it didn't change a thing for me, I've even restarted Godot and ran again and it didn't do a thing for me...

@Nazarwadim
Copy link
Contributor

How much fps do you have in 4.4 and 3.6?

@mrjustaguy
Copy link
Contributor Author

in 3.6 I have over 3x better performance vs 4.4

@Nazarwadim
Copy link
Contributor

I will also ask @clayjohn to test it. If the result is the same as mine, then you'll have to do the profiling https://docs.godotengine.org/en/stable/contributing/development/debugging/using_cpp_profilers.html yourself because it's unknown what might be slow in 4.x for you.

@mrjustaguy
Copy link
Contributor Author

What's your CPU? it may be that mine is much more memory bound in 4.4 with like 7 MB of cache, 3200mhz 2x ram

IIRC M2 has quite a bit more memory bandwidth, and also is a different instruction set (ARM vs x86)

@clayjohn
Copy link
Member

I will also ask @clayjohn to test it. If the result is the same as mine, then you'll have to do the profiling https://docs.godotengine.org/en/stable/contributing/development/debugging/using_cpp_profilers.html yourself because it's unknown what might be slow in 4.x for you.

Testing with 4.4 dev7, I get 10.3 mspf with AnimationMixer active set to false. Slightly better, but not close to 3.6

@Nazarwadim
Copy link
Contributor

What's your CPU? it may be that mine is much more memory bound in 4.4 with like 7 MB of cache, 3200mhz 2x ram

IIRC M2 has quite a bit more memory bandwidth, and also is a different instruction set (ARM vs x86)

Godot v4.4.dev7 - Linux Mint 21.3 (Virginia) on X11 - X11 display driver, Multi-window, 1 monitor - Vulkan (Forward+) - dedicated NVIDIA GeForce RTX 3050 Laptop GPU - 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70GHz (12 threads)

@mrjustaguy
Copy link
Contributor Author

mrjustaguy commented Jan 14, 2025

Yeah you have significantly more cache compared to me, so over 3x improvement for 3.6 for me could be just 3.6 being less bandwidth hungry. Though that'd only be so if the 4.4 results vs 3.6 uplift you're seeing is smaller compared to mine like clay's is

As clay is also not getting much of a difference I guess there's something else that's making the mixer cost significantly more for you.

@clayjohn
Copy link
Member

clayjohn commented Jan 15, 2025

Edit: Oh nice, this is already addressed by #101548

Another fun one:

We store the AnimationBlendTree nodes in an RBMap (it requires a O(log n) traversal for each lookup). For convenience, we do that same has() check before indexing. This means we traverse the RBMap twice each time we want to lookup the node

Ref<AnimationNode> AnimationNodeBlendTree::get_node(const StringName &p_name) const {
ERR_FAIL_COND_V(!nodes.has(p_name), Ref<AnimationNode>());
return nodes[p_name].node;
}

5% of the CPU time in the MRP is spent in AnimationNodeBlendTree::get_node(). We can easily cut that in half by avoiding the redundant call. Although realistically, there is probably a better data structure that would reduce that cost to almost nothing

@clayjohn
Copy link
Member

clayjohn commented Jan 15, 2025

@Nazarwadim After some more testing, including your PRs, I think I am actually hitting some type of display sync issue. I can't get process time less than 10 ms even with other optimizations. I think I am capped at 100 FPS for some reason. I will profile on a different machine to see if I can replicate your results

Edit: Tested on a Windows machine with a ryzen 3600 CPU. Applying #101564 and #101548 gives me about a 10% boost (30 mspf -> 27 mspf). Also tested with clayjohn@f5f3138 which reduced it down to about 24 mspf

@Nazarwadim
Copy link
Contributor

I have almost completed my 3rd PR, part of which is reducing the hash calculation.

@clayjohn
Copy link
Member

I have almost completed my 3rd PR, part of which is reducing the hash calculation.

Great! Feel free to copy from clayjohn@f5f3138 if you want to

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Bugs
Development

No branches or pull requests

6 participants