Change Entity::generation from u32 to NonZeroU32 for niche optimization #9907

Contributor

@notverymoe notverymoe commented Sep 23, 2023

Objective

Discussion

Solution

  • Change Entity::generation from u32 to NonZeroU32 to allow for niche optimization.
    • The reason for changing generation rather than index is so that the costs are only encountered on Entity free, instead of on Entity alloc
    • There was some concern about taking bits from generations, because there is some desire to introduce flags. That concern had more to do with the original retirement approach. In practice, even if generations were reduced to 24 bits, we would still have about 16 million generations available before wrapping, and current ideas indicate that we would use closer to 4 bits for flags.
    • Another concern was the representation of relationships, where NonZeroU32 prevents us from using the full address space. After talking with Joy, this seems unlikely to be an issue: most of the time these entity references will be low-index entries (i.e. ChildOf, Owes), which can use fast lookups, and the remainder of the range can use slower lookups to map to the address space.
    • It has the additional benefit of being less visible to most users, since generation is only ever really set through from_bits type methods.
  • EntityMeta was changed to match
  • On free, generation now explicitly wraps:
    • Originally, generation would panic in debug mode and wrap in release mode due to using regular ops.
    • The first attempt at this PR changed the behavior to "retire" slots and remove them from use when generations overflowed. This change was controversial, and likely needs a proper RFC/discussion.
    • Wrapping matches current release behaviour, and should therefore be less controversial.
    • Wrapping also more easily migrates to the retirement approach, as users likely to exhaust the exorbitant supply of generations will code defensively against aliasing and that defensive code is less likely to break than code assuming that generations don't wrap.
    • We use some unsafe code here when wrapping generations, to avoid a branch on NonZeroU32 construction. It's guaranteed to be sound because of how we perform the wrapping, and it results in significantly smaller ASM code.
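
To make the niche concrete, here is a minimal sketch comparing the two layouts. `PlainEntity` and `NicheEntity` are hypothetical stand-ins, not Bevy's actual `Entity` definition, and the sizes reflect how current rustc lays out these types rather than a documented guarantee for arbitrary structs.

```rust
use core::mem::size_of;
use core::num::NonZeroU32;

/// Hypothetical stand-in for the old layout: every bit pattern is a valid value,
/// so `Option` needs extra space for its discriminant.
#[allow(dead_code)]
pub struct PlainEntity {
    index: u32,
    generation: u32,
}

/// Hypothetical stand-in for the new layout: a generation of zero is never valid,
/// so the compiler can use that bit pattern to represent `None`.
#[allow(dead_code)]
pub struct NicheEntity {
    index: u32,
    generation: NonZeroU32,
}

/// Returns (size of Option<PlainEntity>, size of Option<NicheEntity>) in bytes.
pub fn option_sizes() -> (usize, usize) {
    (
        size_of::<Option<PlainEntity>>(),
        size_of::<Option<NicheEntity>>(),
    )
}
```

With current compilers this should report 12 bytes for the plain version and 8 bytes for the niched one, which is the space saving the PR is after.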

Migration

  • Previous bevy_scene serializations have a high likelihood of being broken, as they contain 0th generation entities.

Current Issues

  • Entities::reserve_generations and EntityMapper wrap now, even in debug - although they technically did in release mode already so this probably isn't a huge issue. It just depends if we need to change anything here?

@A-Walrus
Contributor

We might want to write a little migration script that increases the generation of every entity in a scene file by one
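
A rough sketch of the core of such a script, under a loud assumption: it treats each entity ID as a raw 64-bit value with the generation in the high 32 bits and the index in the low 32 bits, which may not match how a given scene file actually stores entities. `bump_generation` is a hypothetical helper, not an existing Bevy function.

```rust
/// Hypothetical helper: returns the entity bits with the generation increased by one.
/// ASSUMPTION: generation lives in the high 32 bits, index in the low 32 bits.
/// Returns `None` if the generation is already `u32::MAX` and cannot be bumped.
pub fn bump_generation(old_bits: u64) -> Option<u64> {
    let index = old_bits & 0xFFFF_FFFF;
    let generation = (old_bits >> 32) as u32;
    // A 0th-generation entity becomes generation 1, the smallest valid NonZeroU32.
    let new_generation = generation.checked_add(1)?;
    Some((u64::from(new_generation) << 32) | index)
}
```

Parsing and rewriting the scene file itself is left out, since that depends on the serialization format in use.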

Contributor

@atlv24 atlv24 left a comment

appears sensible, should probably merge after #9797 though

@james7132 james7132 added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times C-Usability A targeted quality-of-life change that makes Bevy easier to use S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help labels Sep 24, 2023
@notverymoe
Contributor Author

> We might want to write a little migration script that increases the generation of every entity in a scene file by one

Do we have any prior art for this? Be good to look at before I give it a go

@notverymoe
Contributor Author

> appears sensible, should probably merge after #9797 though

Seems like this'll need a rework after that lands, which is fine - I'll keep my eyes out for it. I'll look at doing the benchmarking after that's in and I fix things up.

@Kolsky

Kolsky commented Sep 25, 2023

The wrapping stuff

If you want to do the usual wrapping for addition of two integers, here is how (playground)

Proof that it works: adding two integers will result in exactly 0 or 1 overflows (2 × uN::MAX = (overflow_flag, uN::MAX - 1)). The case with 0 overflows is trivial (just add them normally). If an overflow happens, the result is off by one from the normal wrapping result. There are two extreme cases: if the result was uN::MIN, adding 1 will make it NonZeroUN::MIN. If the result was uN::MAX - 1, adding 1 will make it NonZeroUN::MAX. Thus, neither underflow nor overflow of the NonZero domain is possible. There's also a commented-out simple test for Miri, albeit it needs to be run locally.

Tl;dr: it's as simple as (x + y + overflow_flag)
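
Since the playground link does not survive here, the idea can be sketched as follows. This is a reconstruction based on the `overflowing_add` snippet quoted later in the review, not the exact code from the playground or the PR; the name `nonzero_wrapping_add` is hypothetical, and it uses the safe `NonZeroU32::new` plus `expect` instead of the unchecked constructor the PR description mentions.

```rust
use core::num::NonZeroU32;

/// Adds `rhs` to `lhs`, wrapping within 1..=u32::MAX so the result is never zero.
pub fn nonzero_wrapping_add(lhs: NonZeroU32, rhs: u32) -> NonZeroU32 {
    let (sum, overflowed) = lhs.get().overflowing_add(rhs);
    // Without overflow, sum >= lhs >= 1. With overflow, sum <= u32::MAX - 1,
    // so adding the overflow flag (1) skips zero and cannot overflow again.
    let adjusted = sum + overflowed as u32;
    NonZeroU32::new(adjusted).expect("sum plus overflow flag is never zero")
}
```

The safe constructor reintroduces a check that the optimizer may or may not remove; the PR avoids it with `unsafe`, relying on the argument in the proof above.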

@notverymoe
Contributor Author

@Kolsky Oh, that's really excellent :O Thanks for that!

@notverymoe
Contributor Author

The optimized assembly is also really clean, nice!

@notverymoe
Contributor Author

notverymoe commented Sep 30, 2023

Still waiting on that other PR to merge, but I ran bench with 857fb9c:
https://gist.github.com/notverymoe/52b3d4605ccac8aaead2a730387f6e9d

Cut down table excluding < 5% changes grepped for "entit":

group                                           main_ecs                niche_ecs
-----                                           --------                ---------
busy_systems/01x_entities_03_systems            1.00     23.1±0.83µs    1.06     24.4±2.05µs
busy_systems/01x_entities_06_systems            1.12     48.0±9.59µs    1.00     42.7±9.09µs
busy_systems/01x_entities_09_systems            1.26    79.9±19.15µs    1.00     63.7±8.43µs
busy_systems/01x_entities_12_systems            1.38   122.1±15.87µs    1.00    88.3±15.01µs
busy_systems/01x_entities_15_systems            1.00   105.9±26.01µs    1.10   116.2±23.13µs
busy_systems/02x_entities_06_systems            1.14    84.9±20.00µs    1.00     74.4±9.35µs
busy_systems/02x_entities_09_systems            1.00   104.0±15.44µs    1.23   128.0±11.44µs
busy_systems/02x_entities_15_systems            1.15   230.6±32.33µs    1.00   200.9±44.35µs
busy_systems/03x_entities_06_systems            1.00   116.5±27.66µs    1.13   132.0±23.04µs
busy_systems/03x_entities_09_systems            1.00   168.3±40.31µs    1.11   186.2±40.31µs
busy_systems/03x_entities_12_systems            1.00   200.4±38.06µs    1.12   224.2±43.29µs
busy_systems/04x_entities_03_systems            1.20    77.4±13.84µs    1.00     64.8±7.06µs
busy_systems/04x_entities_12_systems            1.21   291.6±56.49µs    1.00   240.2±40.80µs
busy_systems/05x_entities_03_systems            1.00    80.4±13.44µs    1.06    85.0±14.62µs
contrived/01x_entities_03_systems               1.00     18.1±4.14µs    1.28     23.2±6.56µs
contrived/01x_entities_06_systems               1.00     29.1±6.59µs    1.17     34.1±6.44µs
contrived/01x_entities_12_systems               1.23    64.4±14.72µs    1.00    52.2±12.26µs
contrived/02x_entities_03_systems               1.19     31.4±6.97µs    1.00     26.4±6.17µs
contrived/02x_entities_06_systems               1.00     37.0±5.99µs    1.21    44.7±14.30µs
contrived/02x_entities_09_systems               1.00     52.3±9.52µs    1.19    62.3±18.32µs
contrived/02x_entities_12_systems               1.32    92.1±23.38µs    1.00    69.8±12.67µs
contrived/03x_entities_03_systems               1.08     39.6±9.53µs    1.00     36.5±6.89µs
contrived/03x_entities_09_systems               1.18    85.5±22.96µs    1.00    72.4±16.48µs
contrived/03x_entities_12_systems               1.00    90.6±18.26µs    1.09    99.1±25.23µs
contrived/04x_entities_09_systems               1.00   102.1±26.46µs    1.19   121.4±24.96µs
contrived/04x_entities_12_systems               1.23   141.9±31.90µs    1.00   115.5±21.08µs
contrived/05x_entities_03_systems               1.52     59.9±7.86µs    1.00     39.5±7.64µs
contrived/05x_entities_06_systems               1.00    72.0±13.10µs    1.20    86.6±18.86µs
contrived/05x_entities_12_systems               1.09   172.2±44.12µs    1.00   158.2±31.92µs
spawn_world/10000_entities                      1.00   473.3±42.80µs    1.10   520.0±84.80µs
spawn_world/1000_entities                       1.00     47.3±4.68µs    1.08     51.0±9.35µs
spawn_world/100_entities                        1.00      4.8±0.52µs    1.06      5.1±0.93µs
world_query_for_each/50000_entities_sparse      1.05     43.1±0.20µs    1.00     41.0±0.15µs
world_query_get/50000_entities_table_wide       1.00    119.4±0.39µs    1.12    133.3±0.64µs
world_query_iter/50000_entities_sparse          1.00     50.4±0.34µs    1.12     56.2±1.15µs

@notverymoe
Contributor Author

notverymoe commented Sep 30, 2023

I'm not too familiar with those numbers, but they seem rather inconsistent between different magnitudes of test:
AMD 5950X, 32GiB ram, Running EndeavourOS with Linux 6.5.4, rustc 1.72.1

Shut everything down on the machine except the terminal running the test.

@alice-i-cecile
Member

Yeah, the benchmarks are definitely a bit noisy, both here and in general 🤔 I can't make clear sense of this unfortunately.

@notverymoe
Contributor Author

Elabajaba on Discord pointed out that since I'm on a Zen 3 CPU, I'm probably getting different boosts on every core. I'll try disabling that and running the tests again to see if I get something more stable and less noisy.

@notverymoe
Contributor Author

notverymoe commented Oct 12, 2023

Yep, that was it. I disabled Precision Boost and locked my CPU scaling to 2.2GHz on Linux. The results are much more stable, with less swing between tests, though still a little noisy. I think I may need to run a few more tests, because the results are a little strange in places (30% gain on events_iter??)

Ran against bb13d06 on main, and merged into the branch, threshold of 5%:

group                                           main                    niche_ecs
-----                                           ------                  -----------
added_archetypes/archetype_count/100            1.35     59.2±1.25µs    1.00     43.8±0.33µs
added_archetypes/archetype_count/1000           1.14   1059.4±5.62µs    1.00    930.6±3.45µs
added_archetypes/archetype_count/10000          1.18     12.8±0.53ms    1.00     10.9±0.27ms
added_archetypes/archetype_count/200            1.24    108.4±2.56µs    1.00     87.4±4.70µs
added_archetypes/archetype_count/2000           1.13      2.2±0.02ms    1.00   1957.9±9.80µs
added_archetypes/archetype_count/500            1.22   413.1±29.71µs    1.00   338.7±22.49µs
added_archetypes/archetype_count/5000           1.16      5.9±0.03ms    1.00      5.1±0.02ms
build_schedule/1000_schedule                    1.08       3.6±0.02s    1.00       3.3±0.04s
build_schedule/1000_schedule_noconstraints      1.20     32.0±0.27ms    1.00     26.5±0.35ms
build_schedule/100_schedule                     1.09     22.0±0.06ms    1.00     20.3±0.04ms
build_schedule/500_schedule                     1.09   687.7±10.20ms    1.00    633.1±7.57ms
busy_systems/01x_entities_06_systems            1.00     72.9±4.26µs    1.08     78.4±4.59µs
busy_systems/01x_entities_12_systems            1.00    131.8±3.45µs    1.05    138.9±3.91µs
busy_systems/01x_entities_15_systems            1.00    164.4±4.84µs    1.06    173.7±7.20µs
busy_systems/02x_entities_06_systems            1.00    120.8±2.33µs    1.06    127.5±3.24µs
busy_systems/05x_entities_03_systems            1.00    139.9±1.11µs    1.13    158.4±3.67µs
busy_systems/05x_entities_06_systems            1.00    256.1±1.77µs    1.16    296.3±1.77µs
busy_systems/05x_entities_09_systems            1.00    379.2±2.08µs    1.15    437.7±6.34µs
busy_systems/05x_entities_12_systems            1.00    502.6±6.63µs    1.15    575.8±7.26µs
busy_systems/05x_entities_15_systems            1.00    622.2±8.26µs    1.15    713.9±8.03µs
contrived/01x_entities_06_systems               1.12     58.7±4.38µs    1.00     52.4±2.42µs
contrived/01x_entities_15_systems               1.08    117.7±9.58µs    1.00    109.3±8.53µs
contrived/03x_entities_03_systems               1.00     54.0±3.66µs    1.11     60.2±4.75µs
contrived/03x_entities_09_systems               1.08   134.5±10.70µs    1.00    124.6±3.17µs
contrived/04x_entities_06_systems               1.06    113.1±5.34µs    1.00    106.6±3.01µs
contrived/05x_entities_09_systems               1.06   192.8±11.52µs    1.00    182.3±7.92µs
empty_systems/002_systems                       1.25     18.5±1.74µs    1.00     14.8±0.36µs
empty_systems/025_systems                       1.05     34.8±2.43µs    1.00     33.1±1.79µs
empty_systems/030_systems                       1.00     37.5±1.44µs    1.07     40.3±5.48µs
empty_systems/035_systems                       1.00     42.7±5.36µs    1.25    53.6±11.26µs
empty_systems/040_systems                       1.33    64.1±12.02µs    1.00     48.0±7.18µs
empty_systems/045_systems                       1.33    65.8±13.64µs    1.00     49.5±0.81µs
empty_systems/050_systems                       1.12    69.1±12.46µs    1.00    61.8±11.82µs
empty_systems/055_systems                       1.13    78.0±16.51µs    1.00    69.3±11.69µs
empty_systems/060_systems                       1.08    80.5±13.07µs    1.00     74.5±9.37µs
empty_systems/065_systems                       1.12    86.8±13.46µs    1.00     77.7±9.74µs
empty_systems/070_systems                       1.00    95.0±15.66µs    1.07   101.4±13.63µs
empty_systems/075_systems                       1.09    97.3±14.35µs    1.00    89.4±11.86µs
empty_systems/080_systems                       1.11   114.7±16.54µs    1.00   103.2±19.28µs
empty_systems/085_systems                       1.00     97.7±7.42µs    1.17   114.0±18.74µs
events_iter/size_16_events_100                  1.45    146.6±0.13ns    1.00    100.8±0.51ns
events_iter/size_16_events_1000                 1.50   1380.8±5.29ns    1.00    920.5±0.44ns
events_iter/size_16_events_10000                1.50     13.7±0.03µs    1.00      9.1±0.00µs
events_iter/size_16_events_50000                1.50     68.4±0.25µs    1.00     45.7±0.50µs
events_iter/size_4_events_100                   1.46    146.7±0.17ns    1.00    100.8±0.53ns
events_iter/size_4_events_1000                  1.50  1381.9±30.73ns    1.00    920.4±0.59ns
events_iter/size_4_events_10000                 1.50     13.7±0.05µs    1.00      9.1±0.01µs
events_iter/size_4_events_50000                 1.50     68.6±0.11µs    1.00     45.6±0.14µs
events_send/size_16_events_100                  1.00    238.7±0.42ns    1.14    271.7±0.57ns
events_send/size_512_events_100                 1.00      2.6±0.00µs    1.06      2.8±0.01µs
iter_fragmented/base                            1.00   799.9±24.48ns    1.29  1030.2±24.57ns
iter_fragmented_sparse/wide                     1.13     85.2±6.57ns    1.00    75.5±18.44ns
iter_simple/system                              1.64     22.9±0.03µs    1.00     14.0±0.07µs
iter_simple/wide                                1.00     65.1±0.88µs    1.13     73.8±4.25µs
query_get/50000_entities_table                  1.08    476.5±1.61µs    1.00    440.5±4.31µs
query_get_component/50000_entities_table        1.07   1039.1±7.86µs    1.00    966.6±8.49µs
query_get_many_2/50000_calls_table              1.07    835.6±7.38µs    1.00    780.8±1.99µs
run_condition/no/021_systems                    1.00     14.9±0.15µs    1.08     16.1±1.63µs
run_condition/no/026_systems                    1.00     15.7±1.47µs    1.10     17.3±1.16µs
run_condition/no/041_systems                    1.00     16.5±0.90µs    1.06     17.5±1.65µs
run_condition/no/066_systems                    1.00     17.2±1.37µs    1.14     19.7±0.58µs
run_condition/no/071_systems                    1.08     20.1±0.63µs    1.00     18.6±1.81µs
run_condition/no/076_systems                    1.00     17.1±0.64µs    1.23     21.1±2.06µs
run_condition/no/081_systems                    1.00     19.0±1.54µs    1.06     20.2±1.68µs
run_condition/no/086_systems                    1.00     20.0±1.60µs    1.06     21.3±0.41µs
run_condition/no/096_systems                    1.21     22.3±0.73µs    1.00     18.4±0.67µs
run_condition/yes/016_systems                   1.13     31.1±4.65µs    1.00     27.6±2.89µs
run_condition/yes/031_systems                   1.00     39.3±5.34µs    1.07     42.0±7.37µs
run_condition/yes/036_systems                   1.05     45.1±3.84µs    1.00     42.8±2.24µs
run_condition/yes/051_systems                   1.00     63.0±7.82µs    1.05    66.4±12.62µs
run_condition/yes/056_systems                   1.00    70.1±10.10µs    1.15    80.7±14.06µs
run_condition/yes/061_systems                   1.00     74.6±8.93µs    1.07    80.1±15.08µs
run_condition/yes/066_systems                   1.15   104.7±15.10µs    1.00    90.7±16.92µs
run_condition/yes/071_systems                   1.00    89.3±11.59µs    1.13   100.6±22.13µs
run_condition/yes/076_systems                   1.00     89.4±9.15µs    1.12   100.1±14.48µs
run_condition/yes/081_systems                   1.06    99.5±11.61µs    1.00     93.9±8.29µs
run_condition/yes/086_systems                   1.40   142.7±14.29µs    1.00   102.3±10.79µs
run_condition/yes/091_systems                   1.12   150.2±16.58µs    1.00   133.8±20.46µs
run_condition/yes/096_systems                   1.00   126.7±17.67µs    1.19   150.3±17.23µs
run_condition/yes_using_query/016_systems       1.07     28.6±2.33µs    1.00     26.8±0.42µs
run_condition/yes_using_query/031_systems       1.00     40.6±6.55µs    1.07     43.7±7.94µs
run_condition/yes_using_query/036_systems       1.11     47.2±7.30µs    1.00     42.7±0.89µs
run_condition/yes_using_query/046_systems       1.12    64.8±13.27µs    1.00     58.0±6.96µs
run_condition/yes_using_query/051_systems       1.00     64.4±7.90µs    1.07    69.0±12.70µs
run_condition/yes_using_query/061_systems       1.00    82.2±11.63µs    1.06    87.5±14.79µs
run_condition/yes_using_query/066_systems       1.08    88.3±13.49µs    1.00     81.7±9.15µs
run_condition/yes_using_query/071_systems       1.00    91.4±12.91µs    1.08    98.4±11.83µs
run_condition/yes_using_query/081_systems       1.00   105.3±19.28µs    1.07   112.2±14.51µs
run_condition/yes_using_query/086_systems       1.29   130.5±22.39µs    1.00    101.2±7.04µs
run_condition/yes_using_query/091_systems       1.15   126.8±18.31µs    1.00   110.2±12.02µs
run_condition/yes_using_query/101_systems       1.08   138.4±18.16µs    1.00   128.7±21.49µs
run_condition/yes_using_resource/036_systems    1.00     49.7±9.44µs    1.12    55.6±11.58µs
run_condition/yes_using_resource/046_systems    1.00     58.2±7.93µs    1.15    67.0±13.45µs
run_condition/yes_using_resource/051_systems    1.14    67.9±11.02µs    1.00     59.7±5.49µs
run_condition/yes_using_resource/056_systems    1.00    70.1±10.68µs    1.16    81.1±14.84µs
run_condition/yes_using_resource/061_systems    1.00     72.3±6.30µs    1.10    79.4±11.07µs
run_condition/yes_using_resource/066_systems    1.08    93.1±16.91µs    1.00    86.1±12.60µs
run_condition/yes_using_resource/076_systems    1.14    117.5±9.96µs    1.00   103.3±18.74µs
run_condition/yes_using_resource/081_systems    1.06   118.9±17.52µs    1.00   112.4±14.63µs
run_condition/yes_using_resource/086_systems    1.15   124.9±17.57µs    1.00   108.7±16.91µs
run_condition/yes_using_resource/096_systems    1.00   119.5±15.52µs    1.17   139.9±18.14µs
sized_commands_0_bytes/2000_commands            1.17      7.5±0.01µs    1.00      6.4±0.00µs
sized_commands_0_bytes/4000_commands            1.14     14.6±0.01µs    1.00     12.8±0.01µs
sized_commands_0_bytes/6000_commands            1.16     22.3±0.60µs    1.00     19.3±0.02µs
sized_commands_0_bytes/8000_commands            1.14     29.6±0.03µs    1.00     25.9±0.23µs
world_query_get/50000_entities_sparse_wide      1.00    389.1±0.51µs    1.07   415.6±40.24µs
world_query_iter/50000_entities_table           1.34     91.3±0.19µs    1.00     68.3±0.03µs

Seems like gains on added_archetypes, contrived, and events_iter, and losses on busy_systems and events_send. There is definitely a lot of noise on run_condition; it swings back and forth from test to test.

@alice-i-cecile alice-i-cecile removed the S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help label Oct 12, 2023
@alice-i-cecile alice-i-cecile added this to the 0.13 milestone Oct 12, 2023
@scottmcm scottmcm mentioned this pull request Nov 12, 2023
@@ -147,7 +149,7 @@ mod tests {
  let mut world = World::new();
  let mut mapper = EntityMapper::new(&mut map, &mut world);

- let mapped_ent = Entity::new(FIRST_IDX, 0);
+ let mapped_ent = Entity::new(FIRST_IDX, 1).unwrap();
Contributor

Looks like this file could just use from_raw instead of the cfg(test)-only function? And that'd save some unwrap clutter too:

Suggested change
- let mapped_ent = Entity::new(FIRST_IDX, 1).unwrap();
+ let mapped_ent = Entity::from_raw(FIRST_IDX);

Contributor Author

Excellent point, I can also make this change in the lib.rs tests in a few places.

@@ -156,7 +158,9 @@ mod tests {
  "should persist the allocated mapping from the previous line"
  );
  assert_eq!(
- mapper.get_or_reserve(Entity::new(SECOND_IDX, 0)).index(),
+ mapper
+     .get_or_reserve(Entity::new(SECOND_IDX, 1).unwrap())
Contributor

Suggested change
- .get_or_reserve(Entity::new(SECOND_IDX, 1).unwrap())
+ .get_or_reserve(Entity::from_raw(SECOND_IDX))

@@ -191,7 +196,7 @@ impl Entity {
  pub const fn from_raw(index: u32) -> Entity {
      Entity {
          index,
-         generation: 0,
+         generation: NonZeroU32::MIN,
Contributor

Note that this function's rustdoc says "and a generation of 0", so that probably needs to be updated.

@@ -174,7 +178,7 @@ mod tests {
  let mut world = World::new();

  let dead_ref = EntityMapper::world_scope(&mut map, &mut world, |_, mapper| {
-     mapper.get_or_reserve(Entity::new(0, 0))
+     mapper.get_or_reserve(Entity::new(0, 1).unwrap())
Contributor

Suggested change
- mapper.get_or_reserve(Entity::new(0, 1).unwrap())
+ mapper.get_or_reserve(Entity::from_raw(0))

github-merge-queue bot pushed a commit that referenced this pull request Nov 14, 2023
(This is my first PR here, so I've probably missed some things. Please
let me know what else I should do to help you as a reviewer!)

# Objective

Due to rust-lang/rust#117800, the `derive`'d
`PartialEq::eq` on `Entity` isn't as good as it could be. Since that's
used in hashtable lookup, let's improve it.

## Solution

The derived `PartialEq::eq` short-circuits if the generation doesn't
match. However, having a branch there is sub-optimal, especially on
64-bit systems like x64 that could just load the whole `Entity` in one
load anyway.

Due to complications around `poison` in LLVM and the exact details of
what unsafe code is allowed to do with references in Rust
(rust-lang/unsafe-code-guidelines#346), LLVM
isn't allowed to completely remove the short-circuiting. `&Entity` is
marked `dereferenceable(8)` so LLVM knows it's allowed to *load* all 8
bytes -- and does so -- but it has to assume that the `index` might be
undef/poison if the `generation` doesn't match, and thus while it finds
a way to do it without needing a branch, it has to do something slightly
more complicated than optimal to combine the results. (LLVM is allowed
to change non-short-circuiting code to use branches, but not the other
way around.)

Here's a link showing the codegen today:
<https://rust.godbolt.org/z/9WzjxrY7c>
```rust
#[no_mangle]
pub fn demo_eq_ref(a: &Entity, b: &Entity) -> bool {
    a == b
}
```
ends up generating the following assembly:
```asm
demo_eq_ref:
        movq    xmm0, qword ptr [rdi]
        movq    xmm1, qword ptr [rsi]
        pcmpeqd xmm1, xmm0
        pshufd  xmm0, xmm1, 80
        movmskpd        eax, xmm0
        cmp     eax, 3
        sete    al
        ret
```
(It's usually not this bad in real uses after inlining and LTO, but it
makes a strong demo.)

This PR manually implements `PartialEq::eq` *without* short-circuiting,
and because that tells LLVM that neither the generations nor the index
can be poison, it doesn't need to be so careful and can generate the
"just compare the two 64-bit values" code you'd have probably already
expected:
```asm
demo_eq_ref:
        mov     rax, qword ptr [rsi]
        cmp     qword ptr [rdi], rax
        sete    al
        ret
```
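
As a rough sketch of what a non-short-circuiting comparison can look like (using a
hypothetical `DemoEntity` stand-in rather than the real `Entity`, and not necessarily
the exact code in this PR), both fields can be packed into one `u64` and compared in
a single operation:

```rust
/// Hypothetical stand-in for `Entity`; the field order and bit packing are assumptions.
#[derive(Clone, Copy, Debug)]
pub struct DemoEntity {
    pub index: u32,
    pub generation: u32,
}

impl DemoEntity {
    /// Packs both fields into one 64-bit value (generation high, index low).
    pub const fn to_bits(self) -> u64 {
        ((self.generation as u64) << 32) | self.index as u64
    }
}

impl PartialEq for DemoEntity {
    fn eq(&self, other: &Self) -> bool {
        // Both fields are always read on both sides, so there is no early exit
        // for the optimizer to preserve.
        self.to_bits() == other.to_bits()
    }
}

impl Eq for DemoEntity {}
```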

Since this doesn't change the representation of `Entity`, if it's
instead passed by *value*, then each `Entity` is two `u32` registers,
and the old and the new code do exactly the same thing. (Other
approaches, like changing `Entity` to be `[u32; 2]` or `u64`, affect
this case.)

This should hopefully merge easily with changes like
#9907 that also want to change
`Entity`.

## Benchmarks

I'm not super-confident that I got my machine fully consistent for
benchmarking, but whether I run the old or the new one first I get
reasonably consistent results.

Here's a fairly typical example of the benchmarks I added in this PR:

![image](https://github.com/bevyengine/bevy/assets/18526288/24226308-4616-4082-b0ff-88fc06285ef1)

Building the sets seems to be basically the same. It's usually reported
as noise, but sometimes I see a few percent slower or faster.

But lookup hits in particular -- since a hit checks that the key is
equal -- consistently shows around 10% improvement.

`cargo run --example many_cubes --features bevy/trace_tracy --release --
--benchmark` showed as slightly faster with this change, though if I had
to bet I'd probably say it's more noise than meaningful (but at least
it's not worse either):

![image](https://github.com/bevyengine/bevy/assets/18526288/58bb8c96-9c45-487f-a5ab-544bbfe9fba0)

This is my first PR here -- and my first time running Tracy -- so please
let me know what else I should run, or run things on your own more
reliable machines to double-check.

---

## Changelog

(probably not worth including)

Changed: micro-optimized `Entity::eq` to help LLVM slightly.

## Migration Guide

(I really hope nobody was using this on uninitialized entities where
sufficiently tortured `unsafe` could technically notice that this
has changed.)
@scottmcm
Contributor

> 30% gain on events_iter??

While I have no idea if that change is meaningful, note that in a microbenchmark dealing with iterators where the item type is Entity, a fairly substantial change isn't impossible. If it's dealing with .next() calls that are returning Option<Entity>, today that's an ABI like

define void @next1(ptr noalias nocapture noundef writeonly sret(%"core::option::Option<Entity1>") align 4 dereferenceable(12) %_0, ptr noalias nocapture noundef readonly align 4 dereferenceable(4) %it) unnamed_addr #0 {

where

%"core::option::Option<Entity1>" = type { i32, [2 x i32] }

vs with the niche it's a simple pair that can be returned directly (without going through stack)

define { i32, i32 } @next2(ptr noalias nocapture noundef readonly align 4 dereferenceable(4) %it) unnamed_addr #1 {

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=1019a1da83891162de90a4472f1c1b47

Hopefully inlining and LTO and such would remove the effects of those differences, but sometimes zero-cost abstractions aren't ☹️

(2-variant enums are sometimes worse than one would like -- see rust-lang/rust#85133 (comment) -- but once niched that stops happening.)

Comment on lines +915 to +916
let (lo, hi) = lhs.get().overflowing_add(rhs);
let ret = lo + hi as u32;
Contributor

Ohh, clever! This codegens really elegantly; nice work 👍

Contributor Author

@Kolsky up above came up with it:
#9907 (comment)

I was really surprised at its elegance; I'm definitely going to be using it in one or two places in my personal projects now that I know about it. I think, unfortunately, given my reading of:
#9797 (comment)

it means we won't be able to keep it (I think)

Contributor

Actually, I think we will be able to do something similar, but we'd need to wrap on however many bits we have available for the generation segment. At the moment, that would be 31 bits, so you'd have to do something like:

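use core::num::NonZeroU32;

pub const fn nonzero_wrapping_high_increment(value: NonZeroU32) -> NonZeroU32 {
    let next_value = value.get().wrapping_add(1);
    // Isolate bit 31, which becomes set when a 31-bit generation overflows
    let overflowed = (next_value & 0x8000_0000) >> 31;

    // Remove the overflow bit from the next value, but then add it back in as 1.
    // SAFETY (assumes `value` fits in 31 bits): `next_value` is then in
    // 2..=0x8000_0000, so either the masked value is non-zero or `overflowed` is 1.
    unsafe { NonZeroU32::new_unchecked((next_value & 0x7FFF_FFFF) + overflowed) }
}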

As long as we know we are only incrementing by one each time, then it should still output fairly terse asm: https://rust.godbolt.org/z/PnYTPfGb6 Basically the same principle as before, just applied to wrapping on 31 bits (or whatever amount of bits we need later)
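
To check the edge cases of the 31-bit wrap, here is a safe variant of the same idea as a sketch. The name `wrapping_increment_31` and the `GENERATION_MASK` constant are hypothetical, and the function assumes its input already fits in 31 bits; it trades the unchecked constructor for `NonZeroU32::new`, so it may not compile down to code as terse as the version above.

```rust
use core::num::NonZeroU32;

/// Hypothetical constant: the low 31 bits assumed to hold the generation.
pub const GENERATION_MASK: u32 = 0x7FFF_FFFF;

/// Increments a 31-bit non-zero generation by one, wrapping from
/// 0x7FFF_FFFF back to 1 instead of to 0.
pub fn wrapping_increment_31(value: NonZeroU32) -> NonZeroU32 {
    // ASSUMPTION: callers never pass a value with bit 31 set.
    debug_assert!(value.get() <= GENERATION_MASK);
    let next = value.get().wrapping_add(1);
    // Bit 31 is set exactly when the 31-bit counter overflowed.
    let overflowed = (next & 0x8000_0000) >> 31;
    // Drop the overflow bit, then add 1 if we overflowed so we land on 1, not 0.
    NonZeroU32::new((next & GENERATION_MASK) + overflowed)
        .expect("non-zero as long as the input fits in 31 bits")
}
```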

Contributor Author

@notverymoe notverymoe Nov 22, 2023

Oh I see, that's interesting 🤔 We can definitely use that for the regular increment, it was just nice that this version worked for both slot incrementing and Entities::reserve_generations (which requires an arbitrary increment, but honestly might not even be correct).

Contributor Author

Tried a version with checked_add, it has a similar instruction count, but is probably more expensive.
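
The `checked_add` version itself is not shown in the thread. A guess at its shape for the increment-by-one case, with the hypothetical name `nonzero_checked_increment`, might look like this:

```rust
use core::num::NonZeroU32;

/// Increments by one, falling back to the smallest non-zero value on overflow.
pub fn nonzero_checked_increment(value: NonZeroU32) -> NonZeroU32 {
    // `checked_add` returns `None` only when `value` is `u32::MAX`.
    value.checked_add(1).unwrap_or(NonZeroU32::MIN)
}
```

Unlike the `overflowing_add` approach, this does not generalize to arbitrary increments, because the amount of overflow is lost when `checked_add` returns `None`.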

@JoJoJet JoJoJet self-requested a review December 13, 2023 18:34
Contributor

@maniwani maniwani left a comment

This looks good to me. I don't really have anything to add since we discussed most concerns in the linked Discord discussion. It's cool that we even avoided adding a branch to free.

@alice-i-cecile alice-i-cecile added this pull request to the merge queue Jan 8, 2024
@alice-i-cecile alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Jan 8, 2024
Merged via the queue into bevyengine:main with commit b257fff Jan 8, 2024
23 checks passed
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
(This is my first PR here, so I've probably missed some things. Please
let me know what else I should do to help you as a reviewer!)

# Objective

Due to rust-lang/rust#117800, the `derive`'d
`PartialEq::eq` on `Entity` isn't as good as it could be. Since that's
used in hashtable lookup, let's improve it.

## Solution

The derived `PartialEq::eq` short-circuits if the generation doesn't
match. However, having a branch there is sub-optimal, especially on
64-bit systems like x64 that could just load the whole `Entity` in one
load anyway.

Due to complications around `poison` in LLVM and the exact details of
what unsafe code is allowed to do with references in Rust
(rust-lang/unsafe-code-guidelines#346), LLVM
isn't allowed to completely remove the short-circuiting. `&Entity` is
marked `dereferenceable(8)` so LLVM knows it's allowed to *load* all 8
bytes -- and does so -- but it has to assume that the `index` might be
undef/poison if the `generation` doesn't match, and thus while it finds
a way to do it without needing a branch, it has to do something slightly
more complicated than optimal to combine the results. (LLVM is allowed
to change non-short-circuiting code to use branches, but not the other
way around.)

Here's a link showing the codegen today:
<https://rust.godbolt.org/z/9WzjxrY7c>
```rust
#[no_mangle]
pub fn demo_eq_ref(a: &Entity, b: &Entity) -> bool {
    a == b
}
```
ends up generating the following assembly:
```asm
demo_eq_ref:
        movq    xmm0, qword ptr [rdi]
        movq    xmm1, qword ptr [rsi]
        pcmpeqd xmm1, xmm0
        pshufd  xmm0, xmm1, 80
        movmskpd        eax, xmm0
        cmp     eax, 3
        sete    al
        ret
```
(It's usually not this bad in real uses after inlining and LTO, but it
makes a strong demo.)

This PR manually implements `PartialEq::eq` *without* short-circuiting,
and because that tells LLVM that neither the generations nor the index
can be poison, it doesn't need to be so careful and can generate the
"just compare the two 64-bit values" code you'd have probably already
expected:
```asm
demo_eq_ref:
        mov     rax, qword ptr [rsi]
        cmp     qword ptr [rdi], rax
        sete    al
        ret
```
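
As a rough sketch of what a non-short-circuiting comparison can look like, the following uses a hypothetical `EntitySketch` type with two `u32` fields. The type, its field layout, and the `to_bits` packing are assumptions made for illustration, and this is not necessarily the exact code in this PR:

```rust
/// Hypothetical stand-in for `Entity`; the field layout is an assumption.
#[derive(Clone, Copy, Debug)]
pub struct EntitySketch {
    pub generation: u32,
    pub index: u32,
}

impl EntitySketch {
    /// Packs both fields into a single 64-bit value.
    pub fn to_bits(self) -> u64 {
        (u64::from(self.generation) << 32) | u64::from(self.index)
    }
}

impl PartialEq for EntitySketch {
    fn eq(&self, other: &Self) -> bool {
        // Both fields of both operands are read unconditionally, so the
        // comparison does not short-circuit on a generation mismatch.
        self.to_bits() == other.to_bits()
    }
}

impl Eq for EntitySketch {}
```

Because every field is read before any comparison happens, the optimizer is free to treat the whole value as a single 64-bit integer.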

Since this doesn't change the representation of `Entity`, if it's
instead passed by *value*, then each `Entity` is two `u32` registers,
and the old and the new code do exactly the same thing. (Other
approaches, like changing `Entity` to be `[u32; 2]` or `u64`, affect
this case.)

This should hopefully merge easily with changes like
bevyengine#9907 that also want to change
`Entity`.

## Benchmarks

I'm not super-confident that I got my machine fully consistent for
benchmarking, but whether I run the old or the new one first I get
reasonably consistent results.

Here's a fairly typical example of the benchmarks I added in this PR:

![image](https://github.com/bevyengine/bevy/assets/18526288/24226308-4616-4082-b0ff-88fc06285ef1)

Building the sets seems to be basically the same. It's usually reported
as noise, but sometimes I see a few percent slower or faster.

But lookup hits in particular -- since a hit checks that the key is
equal -- consistently show around 10% improvement.

`cargo run --example many_cubes --features bevy/trace_tracy --release --
--benchmark` showed as slightly faster with this change, though if I had
to bet I'd probably say it's more noise than meaningful (but at least
it's not worse either):

![image](https://github.com/bevyengine/bevy/assets/18526288/58bb8c96-9c45-487f-a5ab-544bbfe9fba0)

This is my first PR here -- and my first time running Tracy -- so please
let me know what else I should run, or run things on your own more
reliable machines to double-check.

---

## Changelog

(probably not worth including)

Changed: micro-optimized `Entity::eq` to help LLVM slightly.

## Migration Guide

(I really hope nobody was using this on uninitialized entities where
sufficiently tortured `unsafe` could technically notice that this
has changed.)
github-merge-queue bot pushed a commit that referenced this pull request Jan 22, 2024
Since #9907 the generation starts at `1` instead of `0` so
`Entity::to_bits` now returns `4294967296` (ie. `u32::MAX + 1`) as the
lowest number instead of `0`.

Without this change scene loading fails with this error message:
`ERROR bevy_asset::server: Failed to load asset
'scenes/load_scene_example.scn.ron' with asset loader
'bevy_scene::scene_loader::SceneLoader': Could not parse RON: 8:6:
Invalid generation bits`
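
To make the arithmetic concrete, here is a small sketch assuming the generation occupies the high 32 bits of the packed value and the index the low 32 bits. `GenerationalId` and `try_from_bits` are hypothetical names used for illustration, not Bevy's actual API:

```rust
use std::num::NonZeroU32;

/// Hypothetical stand-in for `Entity` with a non-zero generation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct GenerationalId {
    pub index: u32,
    pub generation: NonZeroU32,
}

impl GenerationalId {
    /// Packs the generation into the high 32 bits and the index into the low 32 bits.
    pub fn to_bits(self) -> u64 {
        (u64::from(self.generation.get()) << 32) | u64::from(self.index)
    }

    /// Rejects bit patterns whose generation half is zero.
    pub fn try_from_bits(bits: u64) -> Option<Self> {
        let generation = NonZeroU32::new((bits >> 32) as u32)?;
        // Truncating cast keeps only the low 32 bits, i.e. the index.
        Some(Self { index: bits as u32, generation })
    }
}
```

With this layout, index `0` at generation `1` packs to `1 << 32`, and a stored value of `0` no longer decodes, which matches the kind of failure quoted above.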
@Testare
Contributor

Testare commented Feb 19, 2024

Couldn't find information on this in the migration guide

@alice-i-cecile alice-i-cecile added the M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide label Feb 19, 2024
Contributor

It looks like your PR is a breaking change, but you didn't provide a migration guide.

Could you add some context on what users should update when this change gets released in a new version of Bevy?
It will be used to help write the migration guide for that version. Putting it after a `## Migration Guide` heading will help it get automatically picked up by our tooling.

@notverymoe
Contributor Author

I'm deeply sorry for the inconvenience I've caused with this. I'll look into some solutions.

github-merge-queue bot pushed a commit that referenced this pull request Mar 3, 2024
# Objective
Adoption of #2104 and #11843. The `Option<usize>` wastes 3-7 bytes of
memory per potential entry, and represents a scaling memory overhead as
the ID space grows.

The goal of this PR is to reduce memory usage without significantly
impacting common use cases.

Co-Authored By: @NathanSWard 
Co-Authored By: @tygyh 

## Solution
Replace `usize` in `SparseSet`'s sparse array with
`nonmax::NonMaxUsize`. NonMaxUsize wraps a NonZeroUsize, and applies a
bitwise NOT to the value when accessing it. This allows the compiler to
niche the value and eliminate the extra padding used for the `Option`
inside the sparse array, while moving the niche value from 0 to
usize::MAX instead.
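
The idea can be sketched as follows. `NonMaxSketch` is a hypothetical re-implementation written for illustration under the description above; it is not the `nonmax` crate's actual source:

```rust
use std::num::NonZeroUsize;

/// Hypothetical "non-max" integer: stores the bitwise NOT of the value, so
/// the forbidden bit pattern (zero) corresponds to `usize::MAX`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct NonMaxSketch(NonZeroUsize);

impl NonMaxSketch {
    /// Returns `None` for `usize::MAX`, whose bitwise NOT is zero.
    pub fn new(value: usize) -> Option<Self> {
        NonZeroUsize::new(!value).map(Self)
    }

    /// Undoes the NOT applied in `new` to recover the original value.
    pub fn get(self) -> usize {
        !self.0.get()
    }
}
```

Since the wrapper is a transparent newtype over `NonZeroUsize`, `Option<NonMaxSketch>` can use the zero bit pattern for `None` and stay the same size as a plain `usize`, whereas `Option<usize>` needs extra space for its discriminant.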

Checking the [diff in x86 generated
assembly](james7132/bevy_asm_tests@6e4da65),
this change actually results in fewer instructions generated. One
potential downside is that it seems to have moved a load before a
branch, which means we may be incurring a cache miss even if the element
is not there.

Note: unlike #2104 and #11843, this PR only targets the metadata stores
for the ECS and not the component storage itself. Due to #9907 targeting
`Entity::generation` instead of `Entity::index`, `ComponentSparseSet`
storing only up to `u32::MAX` elements would become a correctness issue.

This will come with a cost when inserting items into the `SparseSet`, as
there is now a potential for a panic. These costs are really only
incurred when constructing a new Table, Archetype, or Resource that has
never been seen before by the World. All of these operations are fairly
cold and not on any particular hot path, even for command application.

---

## Changelog
Changed: `SparseSet` now can only store up to `usize::MAX - 1` elements
instead of `usize::MAX`.
Changed: `SparseSet` now uses 33-50% less memory overhead per stored
item.
spectria-limina pushed a commit to spectria-limina/bevy that referenced this pull request Mar 9, 2024
Labels
A-ECS: Entities, components, systems, and events
C-Performance: A change motivated by improving speed, memory usage or compile times
C-Usability: A targeted quality-of-life change that makes Bevy easier to use
M-Needs-Migration-Guide: A breaking change to Bevy's public API that needs to be noted in a migration guide
S-Ready-For-Final-Review: This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
Status: Merged PR
10 participants