Adding a Prepare stage for the sprite pipeline #5247

Closed

@ManevilleF ManevilleF commented Jul 7, 2022

Objective

The entire logic for rendering preparation is in the queue stage and should be moved into a prepare stage.

Solution

Add a prepare stage.

  • The separation logic between white and colored sprites is moved to the extraction stage
  • The new prepare stage spawns the sprite batches and computes the vertex data
  • The queue stage only handles bind groups and render phases

I also added some documentation.

Related MRs:

@ManevilleF ManevilleF marked this pull request as ready for review July 7, 2022 18:31
@ManevilleF ManevilleF mentioned this pull request Jul 7, 2022
11 tasks
@ManevilleF
Contributor Author

Output from many_sprites stress test:

This branch:

2022-07-07T18:51:54.244366Z  INFO many_sprites: Sprites: 102400
2022-07-07T18:51:54.244426Z  INFO bevy diagnostic: frame_time                      :    0.047533s (avg 0.047724s)
2022-07-07T18:51:54.244438Z  INFO bevy diagnostic: fps                             :   21.038135  (avg 20.982425)
2022-07-07T18:51:54.244443Z  INFO bevy diagnostic: frame_count                     : 507.000000

Main:

2022-07-07T18:53:37.460995Z  INFO many_sprites: Sprites: 102400
2022-07-07T18:53:37.461047Z  INFO bevy diagnostic: frame_time                      :    0.048746s (avg 0.050301s)
2022-07-07T18:53:37.461059Z  INFO bevy diagnostic: fps                             :   20.514588  (avg 19.940816)
2022-07-07T18:53:37.461063Z  INFO bevy diagnostic: frame_count                     : 504.000000

~1 FPS gain

@ManevilleF ManevilleF force-pushed the feat/sprite_pipeline_rework branch from dbea56a to 725cba2 Compare July 8, 2022 09:24
@alice-i-cecile alice-i-cecile added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times C-Code-Quality A section of code that is hard to understand or change labels Jul 8, 2022
@ManevilleF ManevilleF force-pushed the feat/sprite_pipeline_rework branch from d506de5 to 304798d Compare July 9, 2022 10:53
Comment on lines +287 to +292
};
if sprite.color == Color::WHITE {
    extracted_sprites.sprites.alloc().init(sprite);
} else {
    extracted_sprites.colored_sprites.alloc().init(sprite);
}
Contributor Author

By merging #5103 this kind of code duplication would disappear
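
For readers without the diff open, the split in the snippet under review can be modelled in plain Rust roughly as follows. This is a minimal sketch, not the PR code: `Sprite`, `Extracted` and the `[f32; 4]` color are hypothetical stand-ins for Bevy's types, and plain `Vec`s replace the `alloc().init(..)` calls.

```rust
/// Stand-in for Bevy's `Color::WHITE` (assumption: RGBA as four floats).
pub const WHITE: [f32; 4] = [1.0, 1.0, 1.0, 1.0];

/// Hypothetical, simplified sprite; the real `ExtractedSprite` has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Sprite {
    pub z: f32,
    pub color: [f32; 4],
}

/// Hypothetical container mirroring the two lists used in the reviewed snippet.
#[derive(Debug, Default)]
pub struct Extracted {
    pub sprites: Vec<Sprite>,
    pub colored_sprites: Vec<Sprite>,
}

/// Route each sprite to the plain or the tinted list at extraction time.
pub fn extract(input: impl IntoIterator<Item = Sprite>) -> Extracted {
    let mut out = Extracted::default();
    for sprite in input {
        if sprite.color == WHITE {
            out.sprites.push(sprite);
        } else {
            out.colored_sprites.push(sprite);
        }
    }
    out
}
```

The only decision made here is which list a sprite lands in; nothing in this sketch sorts or batches.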

@ManevilleF
Contributor Author

New output from many_sprites stress test:

This branch:

2022-07-09T11:27:08.678547Z  INFO many_sprites: Sprites: 102400
2022-07-09T11:27:08.678605Z  INFO bevy diagnostic: frame_time                      :    0.047834s (avg 0.042740s)
2022-07-09T11:27:08.678612Z  INFO bevy diagnostic: fps                             :   20.905776  (avg 23.421536)
2022-07-09T11:27:08.678617Z  INFO bevy diagnostic: frame_count                     : 500.000000

main:

2022-07-09T11:28:57.559113Z  INFO many_sprites: Sprites: 102400
2022-07-09T11:28:57.559164Z  INFO bevy diagnostic: frame_time                      :    0.050723s (avg 0.050071s)
2022-07-09T11:28:57.559171Z  INFO bevy diagnostic: fps                             :   19.714747  (avg 20.006181)
2022-07-09T11:28:57.559176Z  INFO bevy diagnostic: frame_count                     : 500.000000

~3 FPS gain

@superdump
Contributor

superdump commented Jul 9, 2022

Cool. :) Could you get set up to use Tracy for profiling and publish the frame time graphs comparing the run from main and from this PR? There is documentation here: https://github.com/bevyengine/bevy/blob/main/docs/profiling.md#backend-trace_tracy and use Tracy 0.8.1 as that works with current bevy main.

  • I have copied one of the CI testing ron files like: .github/example-run/breakout.ron to ../benchmark.ron and edited the number of frames to execute to 1500.
  • Then for each branch of interest, I run the Tracy capture tool outputting to an appropriately-named file like ./capture-release -o TOPIC-BRANCH-$(date +%Y%m%d-%H%M).tracy (renaming TOPIC and BRANCH as appropriate).
  • Then run an example like CI_TESTING_CONFIG=../benchmark.ron cargo run --release --features bevy_ci_testing,trace_tracy --example many_sprites and leave it focused and let it run.
  • Then switch branches and run again.
  • Then open the reference trace and click the compare button at the top and open the other trace, select the frames radio button and screenshot the graph and timing statistics.

Contributor

@superdump superdump left a comment

I haven't taken the time to closely compare the cut and pasted code to make sure it's the same. But this is looking pretty good. Interesting that it brings a performance benefit. :)

Contributor

@superdump superdump left a comment

I have now checked over the cut and pasted code to look for differences. I did find some but I think it will be ultimately the same from a functional perspective, just possibly different performance with a bit more work for the post-phase-sort batching code to do.


for (colored, sprites) in [
    (false, &extracted_sprites.sprites),
    (true, &extracted_sprites.colored_sprites),
Contributor

I was thinking that doing this would be wrong, as previously all sprites, whether coloured or not, were sorted together. But I guess the batches will be split as part of the batching code that runs after the phases are sorted. It could create more work for the batching to do, though. We may need to compare performance for something with a mix of coloured and non-coloured sprites.

Contributor Author

I introduced this separation in 0a0d5c4; previously colored and non-colored sprites were merged together.
I can roll back the change, but the performance improvement would go back to ~1 FPS instead of ~3 FPS.

Contributor

It seems maybe they aren't being split in the batching system... maybe. I'm speculating.

Contributor

I think I may have figured out the problem. Because this processes the tinted and plain sprites separately, it only ends up inserting 2 sprite batches below and I hadn't realised that.

The previous code in queue_sprites adds a SpriteBatch for every sprite because we don't know if they need to be merged or not because there are other things that can be drawn (Mesh2ds with custom materials or so) and in the sprite code we don't know at what z they are, nor that they even exist.

So you can't just add one SpriteBatch per batch we perceive here. At least not with the current batch_phase_system. You have to add one per sprite.

crates/bevy_sprite/src/render/mod.rs Outdated
@ManevilleF
Contributor Author

ManevilleF commented Jul 9, 2022

@superdump

Red: main branch
Yellow: This branch

Screenshot 2022-07-09 at 18 13 16

do you want me to do the same test without the color separation?

It feels like this MR brings perf gains but also a more consistent framerate

@superdump
Contributor

> @superdump
>
> Red: main branch Yellow: This branch
>
> Screenshot 2022-07-09 at 18 13 16

Cool! This is why I like seeing the Tracy graphs of the frame time distribution. It makes the differences much more clear than a sliding window average fps. :)

> do you want me to do the same test without the color separation?

Ideally it would be good to see if the PR in its current state incurs a cost or benefit for the sorting/batching systems in the render app sort stage. I don't have a good gut feeling for that at the moment. And then yes, I suppose it would be good to compare before/after the split. This is performance-sensitive code and is 'hard-won' so it's great that this PR brings a performance benefit, but we should also check that it doesn't introduce a regression in a more varied and real-world-ish test.

> It feels like this MR brings perf gains but also a more consistent framerate

Yup, it's looking very good so far! :) A 10% frame time reduction is awesome!

@cart cart added this to the Bevy 0.8 milestone Jul 13, 2022
@ManevilleF ManevilleF force-pushed the feat/sprite_pipeline_rework branch from 5084583 to ddfbf91 Compare July 14, 2022 09:08
bors bot pushed a commit that referenced this pull request Jul 14, 2022
# Objective

Allow better performance testing for #5247


## Solution

I added color tints to the `many_sprites` example stress test.
@ManevilleF ManevilleF force-pushed the feat/sprite_pipeline_rework branch from ddfbf91 to fdf09f3 Compare July 14, 2022 12:14
@ManevilleF
Contributor Author

@superdump Shocking results imo:

For many_sprites run with the --colored feature we have this:

Screenshot 2022-07-14 at 14 26 18

Yellow: This PR
Red: main

It's very suspicious how much faster this branch is. I wonder if something broke.

@superdump
Contributor

I'll have to try it out. That looks great if true! :)

@ManevilleF
Contributor Author

If the separation I did works correctly I think ExtractedSprites should store a HashMap<Color, Vec<ExtractedSprite>> instead
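
A rough sketch of what such a map could look like in plain Rust, purely to illustrate the data structure. Everything here is an assumption: `Sprite` is a hypothetical stand-in for `ExtractedSprite`, and because `f32` is neither `Eq` nor `Hash`, the sketch keys the map on the channels' bit patterns rather than on a color type directly.

```rust
use std::collections::HashMap;

/// Hypothetical hashable key: the bit patterns of four `f32` RGBA channels.
/// Note that `0.0` and `-0.0` have different bits, so they produce different keys.
pub type ColorKey = [u32; 4];

/// Convert an RGBA color into its hashable key.
pub fn color_key(rgba: [f32; 4]) -> ColorKey {
    rgba.map(f32::to_bits)
}

/// Hypothetical, simplified sprite.
#[derive(Debug, Clone, PartialEq)]
pub struct Sprite {
    pub z: f32,
    pub color: [f32; 4],
}

/// Group sprites by exact color, keeping insertion order inside each group.
pub fn group_by_color(sprites: Vec<Sprite>) -> HashMap<ColorKey, Vec<Sprite>> {
    let mut groups: HashMap<ColorKey, Vec<Sprite>> = HashMap::new();
    for sprite in sprites {
        groups.entry(color_key(sprite.color)).or_default().push(sprite);
    }
    groups
}
```

Whether grouping like this is compatible with how the render phase sorts and batches items is a separate question that this sketch does not answer.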

@inodentry
Contributor

IIRC, sorting all the sprites was one of the major bottlenecks in the sprites renderer. It makes sense to me that splitting the sprites preemptively to sort them as multiple smaller arrays, rather than one huge array, would improve perf.

Potentially, now that there are multiple separate buckets of data, these sorts could also be done in parallel (maybe just if the arrays are large). Maybe that could further improve perf?
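
As a rough illustration of the parallel idea, here is a sketch using only the standard library. It assumes each bucket is just a `Vec<f32>` of sort keys and spawns one scoped thread per bucket; a real implementation would more likely go through a task pool and operate on the render phase's own item types.

```rust
use std::thread;

/// Sort each bucket of z values independently, one scoped thread per bucket.
/// Assumption: buckets are plain `Vec<f32>` sort keys, not Bevy phase items.
pub fn sort_buckets_in_parallel(buckets: &mut [Vec<f32>]) {
    thread::scope(|scope| {
        for bucket in buckets.iter_mut() {
            // Each thread gets exclusive access to exactly one bucket.
            scope.spawn(move || bucket.sort_by(|a, b| a.total_cmp(b)));
        }
        // All spawned threads are joined when the scope ends.
    });
}
```

For small buckets the thread start-up cost would likely outweigh the gain, which matches the "maybe just if the arrays are large" caveat.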


The other major bottleneck for sprites rendering is the memory bandwidth bloat from all the data in vertex attributes. However, afaik, we have all thought hard about that area and haven't found ways to reduce it much, without moving to a different rendering approach using storage buffers (incompatible with WebGL2).

@superdump
Contributor

I responded about sprite performance that is unrelated to this change in #rendering-dev on Discord as it's quite a deep topic: https://discord.com/channels/691052431525675048/743663924229963868/997122417795284993

@superdump
Contributor

Could you update this on top of main please? :)

@ManevilleF
Contributor Author

It's in progress, #5310 brought some conflicts

@alice-i-cecile
Member

@superdump do you think this is worth keeping in the milestone as we draw close to release? Seems great but not critical to me?

const QUAD_VERTEX_POSITIONS: [Vec2; 4] = [
    // Top left
Contributor

@superdump superdump Jul 18, 2022

This is the bottom left. y goes up? And similar for the rest.

Comment on lines +341 to +342
0, 2, 3, // Bottom left triangle
0, 1, 2, // Top right triangle
Contributor

Suggested change
- 0, 2, 3, // Bottom left triangle
- 0, 1, 2, // Top right triangle
+ 0, 2, 3, // Top-left triangle
+ 0, 1, 2, // Bottom-right triangle
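
To double-check the corner and triangle labels, here is a small self-contained sketch. The vertex values are an assumption (a unit quad centred on the origin with y pointing up, starting at the bottom-left corner); only the index order `0, 2, 3, 0, 1, 2` comes from the lines under review.

```rust
/// Quad corners in a y-up coordinate system (assumed values, for illustration).
pub const QUAD_VERTEX_POSITIONS: [[f32; 2]; 4] = [
    [-0.5, -0.5], // 0: bottom left
    [0.5, -0.5],  // 1: bottom right
    [0.5, 0.5],   // 2: top right
    [-0.5, 0.5],  // 3: top left
];

pub const QUAD_INDICES: [usize; 6] = [
    0, 2, 3, // top-left triangle
    0, 1, 2, // bottom-right triangle
];

/// Centroid of the triangle whose three indices start at `first` in `QUAD_INDICES`.
pub fn triangle_centroid(first: usize) -> [f32; 2] {
    let mut sum = [0.0_f32; 2];
    for &i in &QUAD_INDICES[first..first + 3] {
        sum[0] += QUAD_VERTEX_POSITIONS[i][0];
        sum[1] += QUAD_VERTEX_POSITIONS[i][1];
    }
    [sum[0] / 3.0, sum[1] / 3.0]
}
```

With these assumed positions, the centroid of the first triangle has negative x and positive y (top left), and the second has positive x and negative y (bottom right), which agrees with the suggested comments.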

@superdump
Contributor

superdump commented Jul 18, 2022

> @superdump do you think this is worth keeping in the milestone as we draw close to release? Seems great but not critical to me?

It isn't critical, but it is a night and day enormous performance improvement for many_sprites when using many separate batches.

I just tested on an M1 Max and I see similar performance uplift. Here is the distribution of frame times:
Screenshot 2022-07-18 at 16 32 23
Yellow - main at the root of the PR branch at the time of testing.
Red - this PR.

@superdump
Contributor

This does look suspicious though. I've been trying to figure out what I'm missing. I just thought I would compare the statistics for main at the root of the PR for queue_sprites + system commands overhead + main_pass_2d, and then this PR with prepare_sprites + system commands overhead + queue_sprites + system command overhead + main_pass_2d. Everything but prepare_sprites takes very little time, which seems very odd. So now my guess is that the batching logic was broken and everything is being drawn in one batch but we didn't notice. :) This needs to be investigated. @ManevilleF

I did see some z-fighting on one sprite in one run of many_sprites but I thought it was just that they had the same z and the same texture, so their order could vary. I think if the z is the same and the image handle the same, then we should also compare the entity id perhaps? Anyway, that's a separate issue to this PR.
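
A sketch of the tie-break idea, assuming a hypothetical `SortKey` in which `image` and `entity` are plain integers standing in for an image handle id and an entity id. This is not how Bevy's sort key is defined; it only shows that a final tie-breaker makes the order independent of input order.

```rust
use std::cmp::Ordering;

/// Hypothetical sort key; `image` and `entity` stand in for handle and entity ids.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SortKey {
    pub z: f32,
    pub image: u64,
    pub entity: u32,
}

/// Compare by z first, then by image, then by entity as the final tie-breaker.
pub fn compare(a: &SortKey, b: &SortKey) -> Ordering {
    a.z.total_cmp(&b.z)
        .then(a.image.cmp(&b.image))
        .then(a.entity.cmp(&b.entity))
}

/// Because the comparison is total over all three fields, even an unstable
/// sort yields the same order for any permutation of distinct keys.
pub fn sort_deterministically(keys: &mut [SortKey]) {
    keys.sort_unstable_by(compare);
}
```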

@ManevilleF
Contributor Author

I'm not sure I understand, are you saying that this PR is fixing batching logic?
If so I think we should compare bevy 0.7, main and this PR. Maybe the logic was broken on main ?


@superdump
Contributor

> I'm not sure I understand, are you saying that this PR is fixing batching logic? If so I think we should compare bevy 0.7, main and this PR. Maybe the logic was broken on main ?

I was saying that this PR breaks batching logic. See above. :)

@ManevilleF
Contributor Author

So the prepare stage should spawn one batch per ExtractedSprite ? With a range of 1? When are the batches merged together?

@superdump
Contributor

No, it’s rather that you add a SpriteBatch for each sprite with the batch range that is to the best of your knowledge. So in this case the plain sprites would have just 0, then 0 to 1 inclusive, 0 to 2, etc. and the tinted ones would also have something similar. But I think each has its own index within the batch and its own sort key. Then all the plain and tinted and mesh2d and everything get sorted so now they’re interleaved with each other. And then the batch system will merge / split batches based on whether there are other things that are from other batches or that are non-batchable between. So if you have sprite batches with plain index 0 batch 0 to 0, then tinted index 0 batch 0 to 0, then plain index 1 batch 0 to 1, then it would have to split the plain batch because there is a tinted batch item between. I’d suggest to look closely at how the previous code worked.
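
The splitting behaviour described here can be modelled with a deliberately simplified sketch. It is not `batch_phase_system`: `Kind` and `Item` are hypothetical, batch ranges and textures are left out, and the only rule modelled is that a run of mergeable items ends as soon as something of a different kind sorts in between.

```rust
/// Hypothetical kinds of items that can end up in the same 2D phase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    PlainSprite,
    TintedSprite,
    Mesh2d,
}

/// One item queued for the phase, before sorting.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Item {
    pub z: f32,
    pub kind: Kind,
}

/// Sort by z, then collapse each run of adjacent items of the same kind
/// into `(kind, count)`. A run ends as soon as a different kind appears.
pub fn sort_and_batch(mut items: Vec<Item>) -> Vec<(Kind, usize)> {
    items.sort_by(|a, b| a.z.total_cmp(&b.z));
    let mut batches: Vec<(Kind, usize)> = Vec::new();
    for item in items {
        if let Some((kind, count)) = batches.last_mut() {
            if *kind == item.kind {
                *count += 1;
                continue;
            }
        }
        batches.push((item.kind, 1));
    }
    batches
}
```

With plain sprites at z = 0 and z = 2 and a tinted sprite at z = 1, this yields three batches of one item each, which is the "split the plain batch" case from the comment. Emitting a single item for all plain sprites up front would hide that interleaving from the sort.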

@cart cart removed this from the Bevy 0.8 milestone Jul 20, 2022
@ManevilleF ManevilleF closed this Nov 1, 2022
@inodentry
Contributor

I know this PR is closed now, but I just have a comment regarding the phase item batching. @superdump @ManevilleF

It is not strictly necessary to have one phase item per sprite for correct sorting and batching.

As you have said previously, the problem is that other non-sprite items (like Mesh2d or Text2d or whatever) could exist, and if their Z coordinate is in-between sprites, then those batches have to be broken up.

So the thing is, what we actually care about are the Z coordinates. Respecting the Z layering.

So you could preemptively batch things somewhat and spawn a pre-batched phase item of sprites, as long as they have the same Z coordinate. Group the sprites based on Z coordinate. This should still work correctly (allow other things to be rendered sandwiched between sprites) and also reduce the work for the phase item batching stage somewhat, when there are many sprites at the same Z.

@ManevilleF
Contributor Author

I closed this in favor of #6621, which doesn't try to change any logic (since I ended up breaking the batching).
But from what you are saying, there could be a way to spawn the sprite batches in the prepare stage and then handle the phases in the queue stage?
Btw, I'm only doing this in order to implement slicing and tiling (#5213), but I'm in way over my head here, as my understanding of the render pipeline is not great.

@inodentry
Contributor

@ManevilleF

I'm just trying to help you understand why sprites are represented in phase items the way they are. :)

Conceptually, you can think of each phase item as a "thing to draw". Bevy's phase items are the abstraction that allows the engine to generically optimize the entire set of objects to be rendered in the whole scene, without having to know the details of how each one is to be rendered exactly.

The original purpose for all this phase items stuff (and the PhaseSort render stage) is to be able to sort everything in the entire scene (front-to-back for opaque render passes to reduce overdraw, back-to-front for blended render passes for correct transparency / alpha blending).

Everything that needs to be rendered (sprites, meshes, shapes, other custom user things, ...) represents itself as some sort of phase item. This is the job of the Queue systems: to generate "phase items" for the things they want to render, and add them to bevy.

Each phase item contains a "sort key" (typically the Z distance from the camera) and other metadata, as well as the "draw function", which is what contains the code to do the actual rendering (wgpu draw calls).

During the PhaseSort stage, Bevy will sort all the phase items. During the Render stage, it will call their "draw functions", so they can render themselves however they like.
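
In code, the idea could be sketched like this. It is a toy model under stated assumptions, not Bevy's API: `PhaseItem` here is a hypothetical struct holding just a sort key and a function pointer, and "drawing" only appends a label to a log instead of issuing wgpu draw calls.

```rust
/// Hypothetical, stripped-down phase item: a sort key plus a draw function.
pub struct PhaseItem {
    pub sort_key: f32,
    pub draw: fn(&mut Vec<String>),
}

/// Toy draw function for a sprite: records that a sprite was drawn.
pub fn draw_sprite(log: &mut Vec<String>) {
    log.push("sprite".to_string());
}

/// Toy draw function for text: records that text was drawn.
pub fn draw_text(log: &mut Vec<String>) {
    log.push("text".to_string());
}

/// Sort step, then render step: order the items by key (ascending),
/// then call each item's draw function in that order.
pub fn sort_and_render(mut items: Vec<PhaseItem>) -> Vec<String> {
    items.sort_by(|a, b| a.sort_key.total_cmp(&b.sort_key));
    let mut log = Vec::new();
    for item in &items {
        (item.draw)(&mut log);
    }
    log
}
```

The sorting code never needs to know what a sprite or a text item is; it only looks at the key and calls whatever draw function the item carries.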

When sprite batching was implemented in Bevy, PhaseSort/PhaseItems were extended to also do batching as well as sorting. After all the items in the scene are sorted, there is a system that will try to opportunistically batch them together. It will check if successive items are "compatible" and merge them. For now, this is only used for sprites (AFAIK), but it is a general abstraction that could be extended to other things, too.

So, this is how sprites become "batched". If there are many "sprite" phase items one after another, and there is nothing in between, they get merged into a single phase item (whose draw function will then render them all using one draw call).

There can be other 2d things, besides sprites. For example, Mesh2d, Text2d, shapes/tilemaps from 3rd party plugins, some sort of custom things implemented by users, ... the list goes on. These things would also have their own phase items.

Imagine you have 5 sprites at Z = 1.0, and 15 sprites at Z = 3.0. All of these 20 sprites can be batched together.
Now, imagine that the user spawns some Text2d with Z = 2.0. It has to be rendered in-between. Now, Bevy needs to batch the first 5 sprites together, then draw the text, and then batch the remaining 15 sprites. If you draw all 20 sprites as one batch, the result will be incorrect.

If every individual sprite is a separate phase item, all of this "just works". Bevy just iterates through all phase items and merges together what it can, replacing the phase items of the sprites with phase items that draw a whole batch at once.

Now, the question is, could you "pre-batch" things? Could you create these "sprite batch" phase items ahead of time? Instead of the queue system creating one phase item per sprite (which would then be merged by the general "phase batching" system), could it create already-batched phase items to begin with?

And the answer is: yes, but only if it respects all the aforementioned considerations. It must respect the fact that other non-sprite items might exist.

What this really means is that you can group sprites together by Z coordinate, but no more than that. Using the example from above, the queue system could have created a phase item for the 5 sprites at Z=1.0, and another phase item for the 15 sprites at Z=3.0. Bevy will then happily merge those two phase items into one big 20-sprite batch, if it can (when there are no non-sprite items in between). This would have been just as correct as creating one phase item for each individual sprite.
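
The 5 + 15 sprite example can be worked through with a small sketch. All names are hypothetical and the model is heavily simplified (no textures, no colors, no real phase types): the queue step emits one pre-batched item per distinct Z, and a later step sorts everything and merges adjacent sprite batches.

```rust
/// Hypothetical phase item: a batch of `count` sprites sharing one z, or a non-sprite item.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Item {
    SpriteBatch { z: f32, count: usize },
    Other { z: f32 },
}

impl Item {
    fn z(&self) -> f32 {
        match *self {
            Item::SpriteBatch { z, .. } | Item::Other { z } => z,
        }
    }
}

/// Queue step: emit one pre-batched item per distinct z value.
pub fn prebatch_by_z(mut sprite_zs: Vec<f32>) -> Vec<Item> {
    sprite_zs.sort_by(|a, b| a.total_cmp(b));
    let mut items: Vec<Item> = Vec::new();
    for z in sprite_zs {
        if let Some(Item::SpriteBatch { z: last_z, count }) = items.last_mut() {
            if *last_z == z {
                *count += 1;
                continue;
            }
        }
        items.push(Item::SpriteBatch { z, count: 1 });
    }
    items
}

/// Sort/batch step: sort all items by z and merge adjacent sprite batches.
/// Returns the sprite count of each draw call; `None` marks a non-sprite item.
pub fn sort_and_merge(mut items: Vec<Item>) -> Vec<Option<usize>> {
    items.sort_by(|a, b| a.z().total_cmp(&b.z()));
    let mut draws: Vec<Option<usize>> = Vec::new();
    for item in items {
        match item {
            Item::SpriteBatch { count, .. } => {
                if let Some(Some(total)) = draws.last_mut() {
                    *total += count;
                } else {
                    draws.push(Some(count));
                }
            }
            Item::Other { .. } => draws.push(None),
        }
    }
    draws
}
```

With 5 sprites at Z = 1.0 and 15 at Z = 3.0, the queue step produces two items that merge into a single 20-sprite draw. Adding a non-sprite item at Z = 2.0 keeps them apart as 5, the other item, then 15.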

@ManevilleF
Contributor Author

@inodentry Thank you very much ! This explanation is very clear and helps a lot.
