Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterative Transform Propagation #4203

Closed

Conversation

james7132
Copy link
Member

@james7132 james7132 commented Mar 14, 2022

Objective

Make transform_propagation_system faster. Zoom zoom.

Solution

  • Remove all queries except for root and transform queries (which are now nicely symmetrical). Down to only one Query::get_mut call per entity!
  • Transition from a recursive to an iterative approach. Using a Vec stack instead of the stack.
  • Split out root transforms that have children from those that don't. Two different systems enables additional parallelism here.
  • Use unsafe pointers. (Can we fenangle a lifetimed read-only reference here?)

Existing propagation tests seem to suggest both no soundness nor correctness issues.

Supercedes #4180. About 50% faster on flat hierarchies, and about 35% faster on deeper ones.

@github-actions github-actions bot added the S-Needs-Triage This issue needs to be labelled label Mar 14, 2022
@james7132 james7132 added C-Performance A change motivated by improving speed, memory usage or compile times A-Transform Translations, rotations and scales and removed S-Needs-Triage This issue needs to be labelled labels Mar 14, 2022
@james7132 james7132 added the M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide label Mar 14, 2022
Copy link
Member

@DJMcNab DJMcNab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Algorithm looks good, although it's not immediately clear why you'd jump straight to the unsafe version without documenting the performance improvement over the safe version.

crates/bevy_transform/src/transform_propagate_system.rs Outdated Show resolved Hide resolved
crates/bevy_transform/src/transform_propagate_system.rs Outdated Show resolved Hide resolved
Copy link
Member

@DJMcNab DJMcNab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, but I'm still reticent to approve new unsafe code if it's avoidable.

Fwiw, I think the unsafe is UB free, but I don't like it.

(Ideally we'd make all crates above bevy_ecs forbid(unsafe_code). Barring some needed for rendering interacting with windows, as is currently required)

crates/bevy_transform/src/transform_propagate_system.rs Outdated Show resolved Hide resolved
@james7132
Copy link
Member Author

I'm inclined to agree with you on the idea of avoiding unsafe wherever possible, but systems like these are critical to the engine's application level performance. Unless we're going deep enough where the soundness of the unsafe code is difficult to track (see bevy_ecs), we'd be leaving potential perf on the table.

I eventually want to wrap this traversal code in its own dedicated HierarchyQuery, where the unsafe code is kept isolated and encapsulated, and we can keep the same performance profile.

@superdump
Copy link
Contributor

I just tested with the many_cubes -- sphere example, which has 160k root entities and I see the following median system execution times:

NOTE: The above does not really test hierarchy propagation as it is a flat hierarchy. But it does test with a lot of entities. :)

@james7132
Copy link
Member Author

Good to see that the flat hierarchy + change detection still sees a marginal improvement from the system split. Probably going to need the hierarchy stress test actually test the changes here.

@james7132
Copy link
Member Author

james7132 commented Mar 18, 2022

Ran this using #4170 on main, #4180, this PR (as in with the unsafe pointers), and this PR (using GlobalTransform::clone instead of using the pointer) using the defaults. The hierarchy generated the following properties (roughly) for each of them:

InsertResult {
    inserted_nodes: 257025,
    active_nodes: 128512,
    drawn_nodes: 0,
    maximum_depth: 18,
}

Median execution times for the systems (5950x, 200+ samples):

Note for this PR, the "flat" case that @superdump tested is negligible (less than 10 microseconds).

A depth of 18 is a bit deeper than your typical humanoid rig, so I think this would be a closer approximation of a real scenario than the prior test on many_cubes, and I'm seeing a 35% improvement for this case with this PR over main. Any further improvement will likely require better data layout or parallelism.

Moving this out of draft as this clarifies the performance benefit here.

@james7132 james7132 marked this pull request as ready for review March 18, 2022 01:28
@james7132 james7132 requested a review from DJMcNab March 18, 2022 01:28
@james7132
Copy link
Member Author

Did another test with GlobalTransform::clone and got 11.90ms as the median across several hundred frames in the same conditions. The unsafe version is about 14.3% faster just to validate that there is a performance gain from being unsafe here.

@alice-i-cecile alice-i-cecile added the X-Controversial There is active debate or serious implications around merging this PR label Apr 22, 2022
Copy link
Member

@DJMcNab DJMcNab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we tested the performance of the two vector solution?
That is, one vector of parent GlobalTransforms, and one of pendings with indices into that vector. That would completely avoid the unsafety, and limit the copies to only one per value.

We also need to rebase, since this doesn't include #4608

// SAFE: The pointers here are generated only during this one traversal
// from one given run of the system.
unsafe {
*global_transform = current.parent.read().mul_transform(*transform);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this may be unsound. Consider a (malformed) entity whose parent is itself:
You'd read the current.parent, which would be pointing to the same GlobalTransform as global_transform
But global_transform holds a mutable reference to that GlobalTransform, which means you've read something to which a mutable reference already exists.
(Note that in the old code, this case would merely stack overflow)

@james7132
Copy link
Member Author

Yes, that is the "clone" version seen above. It's noticeably faster given the size of the type requiring more to copy the value.

@DJMcNab
Copy link
Member

DJMcNab commented May 8, 2022

Fair enough. I had naïvely assumed that the clone version put the GlobalTransform into the pending struct.

@james7132
Copy link
Member Author

Ah no, I misread (stupid 3AM groggy brain, haha). No it's not the same, the naive "clone" method above is as you think.

Updated to current main, profiled ~3,000 samples from the updated transform_hierarchy stress test under the humanoid_mixed configuration:

Cfg {
    test_case: Humanoids {
        active: 2000,
        inactive: 2000,
    },
    update_filter: UpdateFilter {
        min_depth: 0,
        max_depth: 4294967295,
        probability: 1.0,
    },
}

InsertResult {
    inserted_nodes: 272000,
    active_nodes: 134000,
    maximum_depth: 12,
}
Attempt Mean time
main 8.1 ms
This PR (naive clone) 14.75 ms
This PR (pending + index) 16.94 ms
This PR (unsafe) 12.97 ms

Several observations:

With<Parent>,
>,
children_query: Query<(&Children, Changed<Children>), (With<Parent>, With<GlobalTransform>)>,
// Stack space for the depth-first search of a given hierarchy. Used as a Local to
// avoid reallocating the stack space used here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heap space?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's technically used as a stack, just not the stack.

bors bot pushed a commit that referenced this pull request May 16, 2022
# Objective

- Transform propogation could stack overflow when there was a cycle.
- I think #4203 would use all available memory.

## Solution

- Make sure that the child entity's `Parent`s are their parents.

This is also required for when parallelising, although as noted in the comment, the naïve solution would be UB.
(The best way to fix this would probably be an `&mut UnsafeCell<T>` `WorldQuery`, or wrapper type with the same effect)
robtfm pushed a commit to robtfm/bevy that referenced this pull request May 17, 2022
# Objective

- Transform propogation could stack overflow when there was a cycle.
- I think bevyengine#4203 would use all available memory.

## Solution

- Make sure that the child entity's `Parent`s are their parents.

This is also required for when parallelising, although as noted in the comment, the naïve solution would be UB.
(The best way to fix this would probably be an `&mut UnsafeCell<T>` `WorldQuery`, or wrapper type with the same effect)
exjam pushed a commit to exjam/bevy that referenced this pull request May 22, 2022
# Objective

- Transform propogation could stack overflow when there was a cycle.
- I think bevyengine#4203 would use all available memory.

## Solution

- Make sure that the child entity's `Parent`s are their parents.

This is also required for when parallelising, although as noted in the comment, the naïve solution would be UB.
(The best way to fix this would probably be an `&mut UnsafeCell<T>` `WorldQuery`, or wrapper type with the same effect)
@Weibye Weibye added the S-Adopt-Me The original PR author has no intent to complete this work. Pick me up! label Aug 10, 2022
@james7132 james7132 removed the S-Adopt-Me The original PR author has no intent to complete this work. Pick me up! label Oct 20, 2022
@james7132
Copy link
Member Author

james7132 commented Nov 21, 2022

Unless we come across a distinct case where we exhaust the normal stack via recursion, I'm closing this as I can't seem to get this to actually be an improvement over the current default case. After #4697, it takes about ~240us to run propagation on many_foxes on my machine, with this PR in it's current state, it averages around 340-370us. There's notable overhead here that just makes all this extra unsafe not worth it.

@james7132 james7132 closed this Nov 21, 2022
ItsDoot pushed a commit to ItsDoot/bevy that referenced this pull request Feb 1, 2023
# Objective

- Transform propogation could stack overflow when there was a cycle.
- I think bevyengine#4203 would use all available memory.

## Solution

- Make sure that the child entity's `Parent`s are their parents.

This is also required for when parallelising, although as noted in the comment, the naïve solution would be UB.
(The best way to fix this would probably be an `&mut UnsafeCell<T>` `WorldQuery`, or wrapper type with the same effect)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Transform Translations, rotations and scales C-Performance A change motivated by improving speed, memory usage or compile times M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide X-Controversial There is active debate or serious implications around merging this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants