-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Batched ECS Query #1990
Comments
This is compelling, and I could see this having sizable performance impacts. Implementation will be fairly complex, but I don't think this will be a controversial feature. |
So, thinking about this a little more, to truly get the benefits of an SoA representation for calculations such as Ideally, you'd have the X components stored as [x,x,x,...], Y as [y,y,y...]... and so on. This makes the vectorization straightforward, because you simply load chunks at the native vector width (as OP suggested), and do the math in each SIMD lane. For example, for Px = [x0, x1, ... x7] t is a scalar Velocity (Vx, Vy, Vz) follows same form as position. Now, Px' = Px + tVx These are straightforward assembly instructions and bonus, the data for Px', Py', Pz' are already in the correct form to just replace directly in the storage as-is. References: https://www.intel.com/content/www/us/en/developer/articles/technical/memory-layout-transformations.html (AoS and SoA in general) |
Seems I just can't help myself! Digging around the code a bit, to facilitate this kind of change, we might need to ensure the allocator for the tables are correctly aligned for SIMD if this isn't already done. (I'm not sure what the Rust default is, but if it's just using the system malloc, then usually it's 16 bytes aligned for free). Normally this kind of change is pretty simple to implement and due to the way the ECS is stored, even if you didn't use SIMD, you'd only incur an overhead of a few bytes to achieve the desired "overalignment". For example, AVX2 requires a 32-byte alignment. The next observation I've made is that if the alignment is good, then this could probably be implemented without too much trouble (of course now that I've said that... :) ). It only really makes sense for Components using Dense storage. I would imagine the Sparse storage implementation would just have an "adapter" to allow it to work, and possibly kick out a warning that what you're doing doesn't really make sense. Or, just panic, because that would be a leaky abstraction? Not sure what the Bevy policy is on that one. Maybe there's some compile-time magic that could make it so doing a "batched" query over a Sparsely backed component would fail to compile. The only reason you'd use this API would be for performance, so this is unlikely to affect a beginner. It seems, due to the nice design of the Bevy ECS, that most of the work could be accomplished directly within the Query code. From what I can tell, aside from the alignment concerns I brought up earlier, no changes need to be made to the fundamental parts of the ECS to support this. Regarding the Query API additions, there is a question of what the best way to expose the batches might be. Ideally, since the data is already aligned, we could return a "Batch" type that simply references that data, e.g, For types that don't have a "direct" SIMD representation, we could leave it to the user to define their own rearrangement. Example usage, adapted from example in the OP: //These are just markers for dimensions
struct X;
struct Y;
struct Z;
#[derive(Component)]
struct Position<Dimension>(f32);
#[derive(Component)]
struct Velocity<Dimension>(f32);
// For convenience, something like:
//type PositionBundle = (Position<X>, Position<Y>, Position<Z>)
fn update_positions<Dimension>(ents: Query<(&mut Position<Dimension>, &Velocity<Dimension>)>, time: Res<Time>)
{
// This "8" for the batch size should be provided at compile time, maybe with a bevy_simd::NATIVE_F32_WIDTH constant?
ents.for_each_mut_batch::<8>::(|mut positions, velocities| {
if unsafe { likely(positions.len() == 8) } { // positions.len() == velocities.len()
// I'm being vague with the "simd" type here --
// the only thing that is important is the support for + and scalar multiplication in this case
//This updates the positions in place. The "Batch" type maps the result back to the storage,
// and ideally *is* the storage
positions = (positions.to_simd() + time*velocities.to_simd()).from_simd();
// Note that .to_simd() would only be available for "SIMD" types, e.g., i32, f32... if you had a Vec3 in your batch,
// you'd need to do this conversion yourself. This is the 'rearrangement' I mentioned earlier. And you'd
// need to 'rearrange' this data back, too.
} else {
// Necessary, because there's no guarantee that the query results will be
// a multiple of the batch size. I think this could happen if there's a query filter, or
// if we're at the end of the iteration and the number of remaining elements isn't a multiple
// of the batch size.
// This is a topic that could be explored a bit more later.
for (mut pos, vel) in positions.iter_mut().zip(velocities.iter()) {
pos = pos + t*vel; //Just scalar code
}
}
});
}
// In main
app.add_system(update_positions<X>);
app.add_system(update_positions<Y>);
app.add_system(update_positions<Z>); The reason in this example each coordinate is processed separately is that if you process them together, you're going to incur a ton of cache misses from the data access pattern. It's better to "keep on going" down each coordinate than switch from X to Y to Z, which live in different storages and could be quite far in memory from one another. Obviously this is a lot more complicated than the simple cases we're used to... but again, this is for cases where performance is a concern and we need to bring out the big guns. People should ideally prefer to not write things this way by default. The sample code only really makes sense when you start considering large numbers of entities, etc, that are being iterated over. Iteration could be combined with something like On a final note, I would recommend allowing users to specify the batch size they want to work with, as sometimes you get some benefits by using a multiple of the native vector width (e.g., using a batch size of 16 instead of 8 for AVX) due to pipelining. Sorry about the ramble -- I just find this really interesting, and I think Bevy is in a unique position to capitalize on SIMD this way. EDIT: one last thing, notice that these are now 3 separate systems in the example. So even without something like |
Just curious, is there interest in a prototype implementation of this? Or benchmarks to show potential speedup under (good) circumstances? I believe I could mock up a benchmark with the existing Bevy. I'd be willing to take this on as a fun little project. |
I'd be interested in seeing this! I'd try to steer clear of splitting Transform in the proof of concept, just to avoid derailing. |
Agreed! I'll keep the PoC focused on just a hypothetical custom "Position" and "Vector" component :) |
@alice-i-cecile Not quite ready to share my code yet, but my initial benchmarks are promising: "naive" is the normal Vec3 way of doing position + time*velocity (e.g., AoS representation) (BTW, this was done using the excellent The X axis represents the number of entities in the system. I went up in powers of 2. If there are some ideal numbers to reach here please let me know. The Y axis represents how long it takes to do one iteration of computing So what I did was I "faked" the approach using the existing Bevy ecosystem by having the "batched" version's entities actually manage 4 "virtual entities" each with their own position and velocity. Therefore, for the batched version, the true effective entity count was the same, despite the number of "entities" reported by Bevy being 1/4th of the ones in the other "naive" benchmarks. Now, I fully expect this to improve further when I add in AVX and use |
Those are some promising results! Doesn't the |
Err yes... my bad! I'll edit the post (very sleepy right now). |
While I'm working on some more in-depth benchmarks, I found this excellent post from the nalgebra project about different data layouts and their realization in Rust. The post is a few years old and I'm sure the state of the art has advanced a bit since then in how SIMD is actually realized in Rust, but the general concepts all still apply. There's some potential big wins on the table :) |
Wow, so here's a somewhat unexpected result: I implemented a rough AoSoA representation for comparison (improvements noted and needed -- currently relying on I think to make this benchmark conclusive, I'm going to have to use the Unfortunately, the implementation of AoSoA into Bevy isn't an obvious thing to me at this time. Whereas SoA is straightforward and scales nicely with vector size, AoSoA kind of permeates the system a bit. Things need to be aware of it. It's probably not a huge deal, but it would definitely need some consideration. Still, I want to include some actual, hard, reproducible data to guide whatever decisions may come out of this. |
@InBetweenNames These are some interesting results, thanks for investigating 👍 From what I gather the benchmark is very much leaning into the strengths of SIMD. I don't know that much about this topic, so my question is: For example, I'm thinking about cases where random access happens, which would require 3x the amount of memory fetches for |
For the SoA layout, it will indeed run you more cache misses than compared to AoSoA or AoS layouts for the reasons you've mentioned. For AoSoA, this won't be nearly as bad. I'm curious, does random access occur commonly in Bevy systems? I always figured people would use the Sparse layouts for that (which batching wouldn't apply to). I'm fairly new to Rust and Bevy, so I'm still getting a handle on things here. |
Thanks for clarifying!
I don't have enough experience to give an exact answer to this. But I'd expect it to be not that common since it's also harder to use than straight up iteration. Though, some problems just don't fit this linear access pattern. For example, I'd say anything that traverses hierarchy or some other kind of entity relation (e.g. transform and visibility propagation) ends up performing random* access. At least transform propagation requires access to all transform information (scale, rotation, position). iirc some form of acceleration structure for hierarchies was being discussed but I don't know about the current state or plans for this. * it's obviously not actually "random" but afaik the data storage does not care about the hierarchy order |
Ah, I hadn't even considered hierarchical relations! Thanks for that :) With that in mind, it may make sense to go for an AoSoA approach in the long run. You gain the benefits of efficient random access and SIMD potential. There are some complications on the implementation side of course... It would be super nice to be able to say: #[derive(Component)]
#[component_layout_aosoa(<width>)] //width is probably 4 for x86_64 generic, but if you have AVX2 or something, you might go for 8.
//Or, if unspecified, you might let the compiler choose the width.
struct Position
{
x: f32,
y: f32,
z: f32,
} Which desugars to: //with appropriate alignment, traits
struct PositionAoSoA4
{
x: f32x4
y: f32x4
z: f32x4
}
impl LayoutAoSoA for PositionAoSoA4
{
//number of lanes, metadata, etc
} Later, fn optimal_system(query: Query<...>)
{
//available only for AoSoA?
query.for_each_mut_batched(| batch | { ... }); //batch is generic and will work across a variety of widths
}
//perhaps, a batch might look like this:
enum Batch<T, N>
{
//alignment directives, etc
Partial([T]) //slice type, size < N
Full([T; N]) //annotated with 'likely', either within for_each_mut_batched or somehow on the type.
}
//Now, code that uses Batch does not need to add the likely() check to help the optimizer:
fn optimal_system(query: Query<(&mut Position, &Velocity)>)
{
query.for_each_mut_batched(| batch |
if let Full((mut positions, velocities)) = batch {
// some batch fast path
// Bevy added likely and assume FOR you! Yay!
} else if let Partial((positions, velocities)) = batch {
// slow path: positions.len() == velocities.len() < N
for (mut pos, vel) in positions.iter_mut().zip(velocities.iter()) {
// fallback scalar slow path
}
}
} Code that "doesn't care" about layout can safely ignore it using regular iterators, etc: fn straightforward_system(query: Query<...>)
{
query.for_each(| component | { ... }); // for_each provides the expected semantics independent of the underlying layout
for component in query.iter() { ... }; // iter also provides expected semantics
} which automatically pick out components for particular entities. Random access Aside: The astute reader might ask "can we transform the straightforward code into vectorized code?". I actually had a small bug in my benchmarks today, where the I included a "swizzle" benchmark as well to see what the performance difference would be between rearranging contiguous Vec3s into SIMD form and back. Just for comparison. As I have verified that the generated assembly is indeed using packed SIMD (on applicable benches, though I haven't guaranteed it is 'optimal'), I think the benchmark code is now ready for interested people to play around with. It has some limitations: it is limited to SSE4 due to the vector type in play, and it is not using the Rust equivalent of The benchmark code is available here: https://github.com/InBetweenNames/bevy-simd-bench |
ASIDE: I was a curious why the speedup for the non-AoS layouts wasn't as much as I expected, so I fired up vTune. With 4-wide SIMD, accounting for overhead, I expected somewhere around a 3.5x increase in speed. Instead, we observed only around a 2x increase. In this screenshot, you're seeing the inlined There's even more outside of that range. This could be a good optimization target for the future :) |
Okay so, not all roses as I'd hoped. It looks like more than just the Query types will need to be altered; the Tables themselves might need some notion of batching too. In particular, a query over tuple (&T, &U, &V) with a batch size of N will need to return a something like a (&[T], &[U], &[V]). These would need to be retrieved from the table(s). Queries have filtering functionality too, which it seems could maybe fragment iteration? Need to investigate the implications of that. If it does indeed cause fragmentation, does it make sense to have such a query and try to accelerate it with SIMD? Should this be excluded? Need some guidance on that. Ideally, the alignment of the batch is a multiple of the required SIMD alignment. But this may not always be the case. It would be really nice to use the conditions at compile time and provide a suitable type with appropriate alignment guarantees baked into the type system. This would require the Layouts passed to the allocator to provide said guarantees (for the underlying storage of the Tables). In fact, it might make sense to provide an additional "aligned" method if those alignment guarantees can be given. NOTE; The above case would require rearrangement in this case to get to a SIMD form. It may make sense to restrict the "aligned" method to SIMD-like types, as there's probably not much benefit to alignment in the above case. However, the benefits of alignment would be much more apparent when working with, say, the SoA form of Position defined earlier in this post. From what the benchmarks suggest, it may only make sense to implement the "aligned" batch and fail to compile otherwise (the "rearrangement" benchmark performed worse than even AoS). Since this is a performance-centric feature, it would actually be a leaky abstraction to downgrade to a "slow path" without informing the developer anyway. Next, I'll investigate what changes need to be done to get table storage to allocate to the native SIMD alignment of the worst case for the target platform. This will provide more opportunities for aligned loads and have very little overhead. Hopefully I can isolate that work into a separate PR to prepare for the later changes. NOTE: if the allocator isn't providing memory with the desired allocation for the tables, you there will need to be both a "prologue" and an "epilogue" for each table where a partial batch would be processed. Making the allocator itself give aligned memory to the tables makes only the epilogue necessary. This is outside of "filtering" considerations however, which will cause more of these partial batches if I'm understanding it right. |
About midway through the prototype implementation. There's a lot to be improved for sure, but it's coming along. I wanted to note here that the change ticks are a little tricky to deal with during batched queries. I'll have to think over a good way to handle that over the near future. A low-bloat solution would be ideal. In the meantime, I've needed to make a few additions to
@alice-i-cecile Question: What's a good type to represent a bounded slice with potential overalignment (e.g., a slice of The murky explorations of the ECS continues... |
This lines up with my intuition on the matter, but @james7132 or @bitshifter are better people to ask here. |
I don't really know enough about Bevy internals to answer. If you want to have an aligned array of
and have a If you are loading bytes into SIMD registers, you should be able to load unaligned data without too much of a penalty, so if that's what you are wanting to do you don't strictly need allocations to be aligned. |
Unfortunately, unaligned loads are not portable across architectures. They also have performance implications where they are supported in some cases, especially if they aren't natively supported. Otherwise I agree that on Intel at least, they aren't too bad. With the right design, we can fix this up front for no cost :) |
Made a good amount of progress. There's a few more changes than I originally anticipated but so far, reads are working and Miri isn't complaining. My working batch type definition is: #[repr(align(16), C)] //FIXME: align needs to be configured properly at compile time
#[derive(Clone, Debug)]
pub struct AlignedArray<T, const N: usize>([T; N]);
#[derive(Clone, Debug)]
pub enum QueryBatch<'a, T, const N: usize> {
Full(&'a AlignedArray<T, N>), // Full aligned batch of N
Partial(&'a [T]), //<= N. MAY be aligned... under certain conditions
} (thanks to @bitshifter for the repr(C) pointer with that). Internally, this enum will actually decompose nicely along the code paths involved, emitting aligned moves, etc, as possible. I also made an addition to ThinSlicePtr to enable "batched" retrieval of component elements. This is "smart" in that it will attempt to align your batches if possible by giving you a partial batch. If alignment is not possible, only partial batches will be returned... but honestly, this should probably be a static assert. On that topic, I found a nice little hack to get compile time static assertions in Rust, which are in the prototype. They can be replaced with debug asserts if desired. Hopefully will make some more progress this week. So far, it's looking very good, and I expect when |
Another update. Due to the //FIXME: derive this in a maintainable way
#[cfg(all(any(target_feature = "avx"), not(target_feature = "avx512f")))]
pub const MAX_SIMD_ALIGNMENT: usize = 32;
#[cfg(any(target_feature = "avx512f"))]
pub const MAX_SIMD_ALIGNMENT: usize = 64;
//All platforms get 16-byte alignment on tables guaranteed.
#[cfg(not(any(target_feature = "avx512f")))]
pub const MAX_SIMD_ALIGNMENT: usize = 16; I agree it's not pretty, and it's also not scalable. Is there a better way to do this? Please let me know. The main reason this constant is even in here is for generating "native" binaries. The good news is requiring minimum 16-bytes alignment on the table is hardly a problem. First, the alignment only applies to the table itself -- the elements remain exactly as they were -- so there is only an overhead of a few bytes per table. Second, most "C" allocators will give you a 16-bytes aligned chunk of memory anyway and that's never been a contentious thing (although on my system, the Rust one doesn't). Third, the tables will be quite large relative to the overhead size. (I would actually have chosen MAX_SIMD_ALIGNMENT to be 64 bytes just to have a "one size fits all" solution, but I figured that probably wouldn't get accepted 😆). So, I settled on 16 bytes as a conservative choice. I have writes working as well in a few small test cases. The change ticks were a little tricky to get right, but fortunately, with the alignment situation now "uniform" across tables, that can be vectorized as well. Query tuples are next: with MAX_SIMD_ALIGNMENT, now aligned chunks can be safely obtained from multiple tables for iteration. Also, I'm leaning more towards modifying The implementation is getting to the point now where it's a matter of "taste" on how it looks, etc. I'll migrate this issue to a PR once it's ready. Lots of "playing around" code to clean up first. |
Would disabling avx512f and enabling avx512vl, avx512dw and avx512bw instead work? Or do those have a hard dependency on avx512f? |
Great question. Actually, the behaviour you're after is entirely in the hands of the developer. I'm unsure if you can just straight up disable AVX-512F without also disabling all the extensions, but I have good news to share. Here's a rustc issue describing that the defaults are sane for Rust. The key is the On that note, in this circumstance, tables would still be 64-bytes aligned, in case someone did want to use those explicit intrinsics. I expect that the 64-bytes alignment might be beneficial anyway, as the table would start immediately at the start of a cache line. Actually, that may make a good case for increasing the alignment to 64 on |
We may have to stick with just 16-bytes aligned chunks for the initial commit until #[repr(C, align(16))]
#[derive(Clone, Debug)]
pub struct AlignedArray16<T, const N: usize>([T; N]);
#[repr(C, align(32))]
#[derive(Clone, Debug)]
pub struct AlignedArray32<T, const N: usize>([T; N]);
#[repr(C, align(64))]
#[derive(Clone, Debug)]
pub struct AlignedArray64<T, const N: usize>([T; N]); And select between arrays with some type-level machinery: pub type AlignedArrayT<T, const N: usize, const ALIGN: usize> = ... //one of the three types above I.e., pub type AlignedArray<T, const N: usize> = AlignedArrayT<T, N, { gcd::euclid_usize(core::mem::size_of::<[T;N]>(), MAX_SIMD_ALIGNMENT) }> Now This would be awesome to have... except you need One way forward is to add an query.for_each_batched::<4, 32>(scalar_path, vector_path); //Batches are aligned to 32 bytes and have this parameter be defaulted to 16 (the lowest AlignedArray type defined). This would be checked at compile time for correctness, but isn't the cleanest approach obviously. It does provide a natural upgrade path in the future when the necessary language features are stabilized though. The only risk of setting this parameter lower than it could be would be potentially lower performance on some architectures, but the code would otherwise still be correct. Setting the parameter too high would yield a compile time error. More An alternative would be to avoid this ALIGN parameter entirely and just stick with using an AlignedArray16 for the fast path, even if greater alignment is actually the case. This would provide speedups for SSE4 code (some/most instructions require an aligned memory operand), On AVX and higher, this might not be too bad either, since supporting microarchitectures are generally a lot more performant with unaligned loads anyway. I can't comment substantially on anything other than x86 on the CPU side, unfortunately. EDIT: it looks like the parameter can't be defaulted to 16 as that feature doesn't apply to const generics on functions. |
Just a small update, the code is coming along but I've found a few ways of making the design a bit nicer so I've held off on submitting for the moment. I'm also going to see if the batch alignment can be computed with a macro as a workaround for the above noted issue -- although even if it can't, the other workaround should still work fine. For the curious, the code lives in my fork here: https://github.com/InBetweenNames/bevy/tree/batched-queries but don't expect it to compile -- I mainly committed it just so I'd have a backup available :) EDIT: should have some time in the next 5 days to dedicate a solid day to getting it cleaned up and make a real PR out of it. |
Wow, has it been two weeks already? Time flies. Been a little busy with a few things and vacation, but I got the prototype actually compiling tonight on my
And some nice to haves for future work:
My latest work has been pushed up in case anyone is interested in trying it out. |
A few bugs to work out, I think. Performance is actually better than expected. In the below screenshot, you can see the "soa" representation simulated previously with the actual native SoA representation from the batched_queries branch (labeled It's a little kludgy, but this is how the prototype works so far in practice: As you can see, the desired alignment and batch size is passed in with a const generic parameter. There's a few things I'd like to add to make this cleaner -- there's some "bridge" code in there that I don't want to have in the initial PR (conversion of batches of 4 f32s to Vec4s for example). What I hope this screenshot emphasizes is how the types work out (editor is VSCode with rust-analyzer). It's very exciting to see it all start coming together like this! |
I wrote up a unit test today to comprehensively test things out and the results are computed as expected. This caught a few small bugs, but only of the "off by one" variety (nothing fundamental). From a cursory examination of the generated assembly, I can conclude that empty filters really do optimize out entirely too (e.g. However, there was a major bug caught that accounted for the performance increase: change ticks modification was acciidentally not being done due to a lazy iterator not being traversed. Oops. Results of the bugfix shown below: In this graph, the "simulated" batching is compared against the native batching offered by For native:
For simulated:
And for native, nochange:
The only real difference? The tick code updating. In the "simulated" case, each change tick corresponds to four "virtual entities", whereas with native batching, each entity gets its own change tick. Therefore, the native version is doing "more work". Notice also that ComponentTicks are defined as pairs of
Note the indices here: 0x4, 0xc, 0x14, 0x1c. Sure, this fits in a single cache line, but these are spread out. Not good! I would really like to support fast mutation code with change ticks. To do that, I'm going to try my hand at refactoring Since I refactored component ticks to be in batches of AlignedArray* now, part of this work is already complete. Ordinarily, |
As an FYI, we've been chewing with the idea of splitting change ticks out into their own phantom components. It would help with alignment, as you've seen here, but also make opting out completely much more feasible. |
So far it's been less work than I imagined -- you're right, they really do work like components. I suspect after getting that small refactor done, that the performance hit will be at the expected level. To facilitate this change, I'm adding a new type, I looked at other solutions on crates.io to achieve this and didn't see anything production ready, so hopefully this is OK for now :) Also, with this new model, |
Awesome :D Bias towards small, incremental PRs so they're easy to review! |
Yeah I may have misspoke on the last point 😆 ... enough changes have been made now that I feel the change detection should be a separate PR. However, I will still keep the AlignedVec stuff as it's necessary for keeping the batching invariants for Entities and ComponentTicks, and it's not a lot of code. Given the "fundamental" changes, this will still probably be a bigger PR as it makes small changes in a number of different places. Maybe we could work together on converting ComponentTicks over to two phantom components after this PR is made? |
I'm on board to see what that looks like :) Hard to fully assess without seeing the changelist honestly. |
The PR is finally here: #6161 |
PR bevyengine#6161 Issue bevyengine#1990 * Only Dense queries are accelerated currently * Code refactored to use GATs * Batch type does not encode alignment as per discussion * Simplified calling for_each_{mut_}batched (no longer need _ arguments) * Documentation about SIMD and batching
PR bevyengine#6161 Issue bevyengine#1990 * Only Dense queries are accelerated currently * Code refactored to use GATs * Batch type does not encode alignment as per discussion * Simplified calling for_each_{mut_}batched (no longer need _ arguments) * Documentation about SIMD and batching
PR bevyengine#6161 Issue bevyengine#1990 * Only Dense queries are accelerated currently * Code refactored to use GATs * Batch type does not encode alignment as per discussion * Simplified calling for_each_{mut_}batched (no longer need _ arguments) * Documentation about SIMD and batching * Added an example and an AVX-specific benchmark
PR bevyengine#6161 Issue bevyengine#1990 * Only Dense queries are accelerated currently * Code refactored to use GATs * Batch type does not encode alignment as per discussion * Simplified calling for_each_{mut_}batched (no longer need _ arguments) * Documentation about SIMD and batching * Added an example and an AVX-specific benchmark
PR bevyengine#6161 Issue bevyengine#1990 * Only Dense queries are accelerated currently * Code refactored to use GATs * Batch type does not encode alignment as per discussion * Simplified calling for_each_{mut_}batched (no longer need _ arguments) * Documentation about SIMD and batching * Added an example and an AVX-specific benchmark
PR bevyengine#6161 Issue bevyengine#1990 * Only Dense queries are accelerated currently * Code refactored to use GATs * Batch type does not encode alignment as per discussion * Simplified calling for_each_{mut_}batched (no longer need _ arguments) * Documentation about SIMD and batching * Added an example and an AVX-specific benchmark
Closed PR #6161 during backlog cleanup. Could still be adopted in 2024, as the PR has been updated several times over the years! |
Hard to believe it's been so long since I worked on this. I've since moved onto other things but I still think this is worth doing based on the benchmarks I showed previously. My PR is probably way out of date by now and it's probably best to start fresh. Good luck! |
Yes, someone on Discord said much the same (still worth doing) so it's on my list of issues to open! Thanks for the original work, we'll link it as a reference for anyone looking to implement. |
What problem does this solve or what need does it fill?
More optimizations could be done in systems that could iterate over queries in a batched, packed way rather than individually. A clear example would be SIMD.
What solution would you like?
I'd like to be able to use queries in the following way:
I'm using
Deref
&DerefMut
informally in this example. I imagine it would be some wrapper likeRes
, but I don't want to suggest any names.What alternative(s) have you considered?
Manual buffering to reconstruct this information, but it's hard to implement especially for parallel iteration.
Additional context
For densely stored archetypes, I would like it to return however many spatially contiguous items as it's able to, up to the
batch_size
setting of course. If the total number of compatible rows for a batched query is equal tobatch_size
, but it's spread across different archetypes/tables and therefore not contiguous, I would prefer to get N separate batches underbatch_size
. That is what the&[Component]
API would require anyway.It would also be great if Bevy could prove to the compiler that all of these component-slices are of the same length. That would perhaps improve optimization.
The text was updated successfully, but these errors were encountered: