Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bevy_<util/ecs>] Better SparseArray performance #2104

Closed

Conversation

NathanSWard
Copy link
Contributor

@NathanSWard NathanSWard commented May 5, 2021

Problem:

  • bevy_ecs::SparseArray uses an Option<T> internally to represent is the
    value is valid or not.
  • This causes extra branching as well as extra memory overhead due to
    Option's discriminatior.

Solution:

  • Implement a NonMaxUsize class which guarantees Option<NonMaxUsize> is the same size as NoneMaxUsize.

Fixes #1558

@alice-i-cecile alice-i-cecile added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times labels May 5, 2021
@alice-i-cecile
Copy link
Member

I'd love to see benchmarks on this if you have them <3 Not doubting that it's faster, but I'm curious about the impact (and it's nice for release notes).

@NathanSWard NathanSWard force-pushed the nward/sparse-array-option branch from da0235a to 8d1242f Compare May 5, 2021 03:02
@NathanSWard
Copy link
Contributor Author

Note, I'm totally up for bike-shedding on the NoneValue trait name, and it's functions' names.

@BoxyUwU
Copy link
Member

BoxyUwU commented May 5, 2021

Is there a reason we cant use types with niche's for all of our sparseset indexes? I havent had a chance to look through all our keys and see if this is possible but it would be nice to know before doing something like this

Edit: after looking through it seems like everything we use here is either a usize or contains a usize and so could we just use NonZeroUsize instead

@NathanSWard
Copy link
Contributor Author

NathanSWard commented May 5, 2021

Edit: after looking through it seems like everything we use here is either a usize or contains a usize and so could we just use NonZeroUsize instead

What about ArchetypeId

    #[inline]
    pub const fn empty() -> ArchetypeId {
        ArchetypeId(0)
    }

Also, I'm pretty sure we need to allow the values to be 0, since a lot of them are indices into arrays and the first valid index is 0 (see the SparseSetIndex trait).
So if there was a NonMaxUsize that would be quite convenient....

@cart
Copy link
Member

cart commented May 5, 2021

Yeah the primary purpose of these sparse sets is to be used in the context of arrays, so NonZeroUsize isn't really viable. I don't want to push "dummy data" into the zero index of each array or need to remember to add 1 in the context of sparse sets.

@cart
Copy link
Member

cart commented May 5, 2021

I'd also like to see benches, as the primary motivator here is perf.

@NathanSWard NathanSWard force-pushed the nward/sparse-array-option branch from 8d1242f to 7c9b6d7 Compare May 7, 2021 05:34
@NathanSWard
Copy link
Contributor Author

NathanSWard commented May 7, 2021

Another interesting find.
I used a different implementation that had a NonMaxUsize struct.

#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
pub struct NonMaxUsize(NonZeroUsize);

impl NonMaxUsize {
    #[inline]
    pub const fn new(n: usize) -> Option<Self> {
        Some(Self(NonZeroUsize::new(n ^ usize::MAX)?))
    }

    #[inline]
    pub const unsafe fn new_unchecked(n: usize) -> Self {
        Self(NonZeroUsize::new_unchecked(n ^ usize::MAX))
    }

    #[inline(always)]
    pub const fn get(self) -> usize {
        self.0.get() ^ usize::MAX
    }
}

Granted the big downside is that whenever you want to access the value you have to do a XOR usize::MAX. But with the implementation, I try to limit the number of get() calls.

This only optimizes that path that we have a SparseArray<T, usize> where I replaced usize with NonMaxUsize.
This performance results seemed to have the best outcome thus far.

BEVY-NONMAXUSIZE.txt

You can see the branch/commit I used for testing here.

@BoxyUwU
Copy link
Member

BoxyUwU commented May 8, 2021

n XOR usize::MAX is just bitwise negation right? Could we just do !self.0.get() and !n instead?

@NathanSWard
Copy link
Contributor Author

n XOR usize::MAX is just bitwise negation right? Could we just do !self.0.get() and !n instead?

Yep, they're the same 😄
I would personally prefer the xor since I think its clearer what the implementation is doing, but I don't have really have a strong opinion.

Also in case performance is a concern, they both compile down to the same assembly.

@alice-i-cecile
Copy link
Member

I find the "!" form clearer, but I don't feel strongly about it :p

@NathanSWard NathanSWard force-pushed the nward/sparse-array-option branch from e791424 to 6edd244 Compare May 9, 2021 01:58
@NathanSWard
Copy link
Contributor Author

I just pushed changes changing the implementation to use the NoneMaxUsize impl.
From my benches (as seen in the comments above) this had the best performance.

However, I would like someone to run the benches as well just to confirm :)

@NathanSWard NathanSWard force-pushed the nward/sparse-array-option branch from 6edd244 to e3a6667 Compare May 9, 2021 02:05
@Frizi
Copy link
Contributor

Frizi commented May 13, 2021

I've ran the benchmarks against current main branch. Note that the insertion benchmarks are fairly noisy, but iteration ones is fairly stable.

Table sorted rougly by relative timing difference, from biggest regression first to biggest improvement last. Note that variance quite high in some cases, error bars being bigger than the absolute difference. So take those with a grain of salt.

group                                       main                     sparse-array-option
-----                                       ----                     -------------------
add_remove_component/bevy_sparse_set        1.00  1392.1±142.59µs    1.04  1449.9±358.26µs
add_remove_component_big/bevy_sparse_set    1.00  1529.3±299.08µs    1.02  1553.3±291.35µs
fragmented_iter/bevy                        1.00   460.0±41.16ns     1.01   464.1±33.63ns
fragmented_iter/bevy_foreach                1.00   309.5±24.98ns     1.01   312.4±19.65ns
heavy_compute/bevy                          1.00   1054.9±7.81µs     1.00  1057.9±12.34µs
add_remove_component/bevy_table             1.00  1732.8±149.43µs    1.00  1735.5±156.73µs
simple_iter/bevy_foreach                    1.00     12.6±0.94µs     1.00     12.5±0.57µs
sparse_fragmented_iter/bevy                 1.01     15.1±1.12ns     1.00     15.0±1.27ns
get_component/bevy_system                   1.01  1217.9±52.25µs     1.00  1209.3±105.16µs
add_remove_component_big/bevy_table         1.02      4.2±0.16ms     1.00      4.1±0.10ms
schedule/bevy                               1.03     59.9±2.78µs     1.00     58.4±2.96µs
get_component/bevy                          1.04  1509.7±257.13µs    1.00  1453.1±142.34µs
simple_iter/bevy_sparse_foreach             1.06    62.2±10.27µs     1.00    58.6±11.26µs
simple_iter/bevy                            1.09     23.5±2.21µs     1.00     21.5±1.68µs
simple_insert/bevy                          1.09   678.2±51.43µs     1.00   621.3±30.79µs
sparse_fragmented_iter/bevy_foreach         1.10     13.2±0.87ns     1.00     12.0±1.30ns
simple_iter/bevy_sparse                     1.11     63.1±4.82µs     1.00     56.9±2.38µs
simple_insert/bevy_unbatched                1.20  1963.5±364.22µs    1.00  1638.5±49.03µs

@NathanSWard
Copy link
Contributor Author

ping @cart for opinions/views on the perf tests.

@bjorn3
Copy link
Contributor

bjorn3 commented May 14, 2021

@Frizi If you are on Linux could you try using taskset to limit the benchmark process to a specific core and echo performance | sudo -k tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor to make the frequency scaling more predictable.

@Frizi
Copy link
Contributor

Frizi commented May 14, 2021

@Frizi If you are on Linux could you try using taskset to limit the benchmark process to a specific core and echo performance | sudo -k tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor to make the frequency scaling more predictable.

I'm a Windows desktop user :) Also with HyperV and WSL enabled, thus making the Windows kernel a guest OS. So probably not much I can do about the high variance without drastically changing my setup.

@cart cart added the S-Pre-Relicense This PR was made before Bevy added the Apache license. Cannot be merged or used for other work label Jul 23, 2021
@mockersf mockersf removed the S-Pre-Relicense This PR was made before Bevy added the Apache license. Cannot be merged or used for other work label Jul 24, 2021
@alice-i-cecile
Copy link
Member

This LGTM, but I prefer the approach taken in #3678.

@alice-i-cecile alice-i-cecile added the S-Adopt-Me The original PR author has no intent to complete this work. Pick me up! label Jul 14, 2022
github-merge-queue bot pushed a commit that referenced this pull request Mar 3, 2024
# Objective
Adoption of #2104 and #11843. The `Option<usize>` wastes 3-7 bytes of
memory per potential entry, and represents a scaling memory overhead as
the ID space grows.

The goal of this PR is to reduce memory usage without significantly
impacting common use cases.

Co-Authored By: @NathanSWard 
Co-Authored By: @tygyh 

## Solution
Replace `usize` in `SparseSet`'s sparse array with
`nonmax::NonMaxUsize`. NonMaxUsize wraps a NonZeroUsize, and applies a
bitwise NOT to the value when accessing it. This allows the compiler to
niche the value and eliminate the extra padding used for the `Option`
inside the sparse array, while moving the niche value from 0 to
usize::MAX instead.

Checking the [diff in x86 generated
assembly](james7132/bevy_asm_tests@6e4da65),
this change actually results in fewer instructions generated. One
potential downside is that it seems to have moved a load before a
branch, which means we may be incurring a cache miss even if the element
is not there.

Note: unlike #2104 and #11843, this PR only targets the metadata stores
for the ECS and not the component storage itself. Due to #9907 targeting
`Entity::generation` instead of `Entity::index`, `ComponentSparseSet`
storing only up to `u32::MAX` elements would become a correctness issue.

This will come with a cost when inserting items into the SparseSet, as
now there is a potential for a panic. These cost are really only
incurred when constructing a new Table, Archetype, or Resource that has
never been seen before by the World. All operations that are fairly cold
and not on any particular hotpath, even for command application.

---

## Changelog
Changed: `SparseSet` now can only store up to `usize::MAX - 1` elements
instead of `usize::MAX`.
Changed: `SparseSet` now uses 33-50% less memory overhead per stored
item.
@james7132
Copy link
Member

This has been implemented for non-component sparse sets with #12083, a PR that was (twice) adopted from this one, and it's likely infeasible to implement for component sparse sets without it being a correctness issue for the rest of the ECS. Closing this out.

@james7132 james7132 closed this Mar 4, 2024
spectria-limina pushed a commit to spectria-limina/bevy that referenced this pull request Mar 9, 2024
# Objective
Adoption of bevyengine#2104 and bevyengine#11843. The `Option<usize>` wastes 3-7 bytes of
memory per potential entry, and represents a scaling memory overhead as
the ID space grows.

The goal of this PR is to reduce memory usage without significantly
impacting common use cases.

Co-Authored By: @NathanSWard 
Co-Authored By: @tygyh 

## Solution
Replace `usize` in `SparseSet`'s sparse array with
`nonmax::NonMaxUsize`. NonMaxUsize wraps a NonZeroUsize, and applies a
bitwise NOT to the value when accessing it. This allows the compiler to
niche the value and eliminate the extra padding used for the `Option`
inside the sparse array, while moving the niche value from 0 to
usize::MAX instead.

Checking the [diff in x86 generated
assembly](james7132/bevy_asm_tests@6e4da65),
this change actually results in fewer instructions generated. One
potential downside is that it seems to have moved a load before a
branch, which means we may be incurring a cache miss even if the element
is not there.

Note: unlike bevyengine#2104 and bevyengine#11843, this PR only targets the metadata stores
for the ECS and not the component storage itself. Due to bevyengine#9907 targeting
`Entity::generation` instead of `Entity::index`, `ComponentSparseSet`
storing only up to `u32::MAX` elements would become a correctness issue.

This will come with a cost when inserting items into the SparseSet, as
now there is a potential for a panic. These cost are really only
incurred when constructing a new Table, Archetype, or Resource that has
never been seen before by the World. All operations that are fairly cold
and not on any particular hotpath, even for command application.

---

## Changelog
Changed: `SparseSet` now can only store up to `usize::MAX - 1` elements
instead of `usize::MAX`.
Changed: `SparseSet` now uses 33-50% less memory overhead per stored
item.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times S-Adopt-Me The original PR author has no intent to complete this work. Pick me up!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace SparseArray Option<T> with T::MAX to cut down on branching
8 participants