Iterative Transform Propagation #4203
Algorithm looks good, although it's not immediately clear why you'd jump straight to the unsafe version without documenting the performance improvement over the safe version.
Mostly looks good, but I'm still reticent to approve new unsafe code if it's avoidable.
Fwiw, I think the unsafe is UB free, but I don't like it.
(Ideally we'd make all crates above bevy_ecs `forbid(unsafe_code)`, barring some needed for rendering interacting with windows, as is currently required.)
I'm inclined to agree with you on the idea of avoiding unsafe wherever possible, but systems like these are critical to the engine's application-level performance. Unless we're going so deep that the soundness of the unsafe code is difficult to track (see bevy_ecs), we'd be leaving potential perf on the table. I eventually want to wrap this traversal code in its own dedicated HierarchyQuery, where the unsafe code is kept isolated and encapsulated, and we can keep the same performance profile.
I just tested with the
NOTE: The above does not really test hierarchy propagation as it is a flat hierarchy. But it does test with a lot of entities. :)
Good to see that the flat hierarchy + change detection still sees a marginal improvement from the system split. Probably going to need the hierarchy stress test to actually test the changes here.
Ran this using #4170 on main, #4180, this PR (as in with the unsafe pointers), and this PR (using
Median execution times for the systems (5950x, 200+ samples):
Note for this PR, the "flat" case that @superdump tested is negligible (less than 10 microseconds). A depth of 18 is a bit deeper than your typical humanoid rig, so I think this would be a closer approximation of a real scenario than the prior test on many_cubes, and I'm seeing a 35% improvement for this case with this PR over main. Any further improvement will likely require better data layout or parallelism. Moving this out of draft as this clarifies the performance benefit here.
Did another test with
Have we tested the performance of the two vector solution?
That is, one vector of parent `GlobalTransform`s, and one of `pending`s with indices into that vector. That would completely avoid the unsafety, and limit the copies to only one per value.
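For illustration, the two-vector idea could look roughly like the sketch below. It is not code from this PR: `Xf` is a hypothetical one-dimensional stand-in for `GlobalTransform`, entities are plain `usize` indices, and `propagate` and its parameters are assumed names. It also assumes the hierarchy is acyclic; a cycle would make the loop run forever.

```rust
/// Hypothetical 1D stand-in for a transform: `x -> scale * x + offset`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Xf {
    pub scale: f32,
    pub offset: f32,
}

impl Xf {
    /// Compose a parent's global transform with a child's local transform.
    pub fn mul_transform(self, child: Xf) -> Xf {
        Xf {
            scale: self.scale * child.scale,
            offset: self.offset + self.scale * child.offset,
        }
    }
}

/// Iterative depth-first propagation using two vectors and no unsafe code.
/// `parents` holds one copy of each visited node's global transform;
/// `pending` holds (entity, index into `parents`) pairs still to process.
pub fn propagate(roots: &[usize], local: &[Xf], children: &[Vec<usize>], global: &mut [Xf]) {
    let mut parents: Vec<Xf> = Vec::new();
    let mut pending: Vec<(usize, usize)> = Vec::new();
    for &root in roots {
        parents.clear();
        pending.clear();
        // A root's global transform is just its local transform.
        global[root] = local[root];
        parents.push(global[root]);
        pending.extend(children[root].iter().map(|&c| (c, 0)));
        while let Some((entity, parent_idx)) = pending.pop() {
            let g = parents[parent_idx].mul_transform(local[entity]);
            global[entity] = g;
            if !children[entity].is_empty() {
                // Copy the value once; children refer to it by index.
                let idx = parents.len();
                parents.push(g);
                pending.extend(children[entity].iter().map(|&c| (c, idx)));
            }
        }
    }
}
```

The trade-off is that `parents` grows with the number of non-leaf entities in a tree rather than with its depth, so this says nothing about whether it would beat the pointer-based version in practice.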
We also need to rebase, since this doesn't include #4608
// SAFE: The pointers here are generated only during this one traversal
// from one given run of the system.
unsafe {
*global_transform = current.parent.read().mul_transform(*transform);
I believe this may be unsound. Consider a (malformed) entity whose parent is itself:
You'd read the `current.parent`, which would be pointing to the same `GlobalTransform` as `global_transform`. But `global_transform` holds a mutable reference to that `GlobalTransform`, which means you've read something to which a mutable reference already exists.
(Note that in the old code, this case would merely stack overflow)
Yes, that is the "clone" version seen above. It's noticeably faster given the size of the type requiring more to copy the value.
Fair enough. I had naïvely assumed that the clone version put the
Ah no, I misread (stupid 3AM groggy brain, haha). No it's not the same, the naive "clone" method above is as you think. Updated to current main, profiled ~3,000 samples from the updated transform_hierarchy stress test under the humanoid_mixed configuration:
Several observations:
With<Parent>,
>,
children_query: Query<(&Children, Changed<Children>), (With<Parent>, With<GlobalTransform>)>,
// Stack space for the depth-first search of a given hierarchy. Used as a Local to
// avoid reallocating the stack space used here.
Heap space?
It's technically used as a stack, just not the stack.
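To make the distinction concrete, here is a minimal sketch of a heap-allocated `Vec` used as a depth-first stack and kept between runs so its allocation is reused. `DfsScratch` and `count_descendants` are hypothetical names, and entities are plain `usize` indices; in the PR the buffer lives in a `Local`, as the diff comment says.

```rust
/// Scratch buffer that outlives a single traversal. The `Vec` lives on the
/// heap but is used with stack (LIFO) discipline.
#[derive(Default)]
pub struct DfsScratch {
    pub stack: Vec<usize>,
}

/// Count all descendants of `root` with an explicit stack instead of recursion.
pub fn count_descendants(root: usize, children: &[Vec<usize>], scratch: &mut DfsScratch) -> usize {
    // `clear` drops the elements but keeps the allocated capacity,
    // so later runs do not need to reallocate.
    scratch.stack.clear();
    scratch.stack.extend_from_slice(&children[root]);
    let mut count = 0;
    while let Some(entity) = scratch.stack.pop() {
        count += 1;
        scratch.stack.extend_from_slice(&children[entity]);
    }
    count
}
```

Calling `count_descendants` twice with the same `DfsScratch` reuses the buffer grown during the first call.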
# Objective
- Transform propagation could stack overflow when there was a cycle.
- I think #4203 would use all available memory.

## Solution
- Make sure that the child entity's `Parent`s are their parents.

This is also required for when parallelising, although as noted in the comment, the naïve solution would be UB. (The best way to fix this would probably be an `&mut UnsafeCell<T>` `WorldQuery`, or wrapper type with the same effect)
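A rough sketch of the parent check described in that commit message, using plain vectors rather than Bevy queries: `checked_traversal`, `parent`, and `children` are hypothetical names, and skipping a mismatched child silently is an assumption made for this example. Because every entity has at most one parent entry and traversal starts only from entities with none, a back edge or a self-parented entity is never followed.

```rust
/// Visit every entity reachable from a root (an entity with no parent),
/// descending into a child only if its `parent` entry points back at the
/// entity that lists it. Returns the entities in visit order.
pub fn checked_traversal(parent: &[Option<usize>], children: &[Vec<usize>]) -> Vec<usize> {
    let mut visited = Vec::new();
    // Start from every entity that has no parent.
    let mut stack: Vec<usize> = (0..parent.len())
        .filter(|&e| parent[e].is_none())
        .collect();
    while let Some(entity) = stack.pop() {
        visited.push(entity);
        for &child in &children[entity] {
            // Malformed edge: the child does not name this entity as its parent.
            if parent[child] == Some(entity) {
                stack.push(child);
            }
        }
    }
    visited
}
```

With a malformed hierarchy where entity 2 lists the root as its child and entity 3 is its own parent, the traversal visits only 0, 1, and 2 and then stops.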
Unless we come across a distinct case where we exhaust the normal stack via recursion, I'm closing this as I can't seem to get this to actually be an improvement over the current default case. After #4697, it takes about 240us to run propagation on many_foxes on my machine; with this PR in its current state, it averages around 340-370us. There's notable overhead here that just makes all this extra
Objective
Make `transform_propagation_system` faster. Zoom zoom.
Solution
`root` and `transform` queries (which are now nicely symmetrical). Down to only one `Query::get_mut` call per entity!
Existing propagation tests seem to suggest neither soundness nor correctness issues.
Supersedes #4180. About 50% faster on flat hierarchies, and about 35% faster on deeper ones.