## Objective

Speed up `extract_sprite` on sprite-heavy workloads.

## Solution
I stumbled upon some optimization discussions focusing on the cost of `extract_sprite` in sprite-heavy workloads such as bevymark. It got me curious, so I had a look at bevymark specifically on Linux using perf. Here is the base profile (bevymark): https://share.firefox.dev/3f160Wb
### Assets::get
Note that since it was recorded with perf, threads are sampled only when they are running. This means that if a thread calls only a single function for a millisecond and then sleeps for multiple seconds, that function will get assigned 100% (of the active time) even though it ran for only a small portion of the actual time. Just something to keep in mind to not get confused when interpreting the results.
In that profile you can see `extract_sprite` occupying about 14% of the active time, so it's a visible slice of CPU time. Inside `extract_sprite`, a bit more than half of the time is spent in `Assets::get`, which is visibly not cheap.

The first commit in this PR addresses this by caching the previous atlas and sprite query and reusing it when multiple consecutive queries operate on the same handle. On a synthetic benchmark such as bevymark this pretty much makes `Assets::get` disappear from the profile and halves the cost of `extract_sprite`, since we probably end up executing the expensive getter only once per frame.

In a real game the wins would likely not be as dramatic; however, since reusing the same textures is important for rendering performance, I expect that consecutive accesses to the same handle are frequent. I'd like to have a more realistic sprite-heavy workload to validate this assumption.
I think that this optimization could be lifted into a generic helper in the assets code, and could benefit other areas where consecutive accesses to the same handle are known to be commonplace.
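To illustrate what such a helper could look like, here is a rough, self-contained sketch. The `Handle`, `Assets`, and `CachedGetter` types are simplified stand-ins for illustration only, not the actual Bevy types or the code in this PR:

```rust
// Sketch of caching the last successful lookup so that consecutive `get`
// calls with the same handle skip the expensive map access.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Handle(u64);

struct Assets<T> {
    storage: HashMap<Handle, T>,
}

impl<T> Assets<T> {
    fn get(&self, handle: Handle) -> Option<&T> {
        self.storage.get(&handle)
    }
}

/// Remembers the previous (handle, asset) pair and only hits `Assets::get`
/// again when the handle changes.
struct CachedGetter<'a, T> {
    assets: &'a Assets<T>,
    last: Option<(Handle, &'a T)>,
}

impl<'a, T> CachedGetter<'a, T> {
    fn new(assets: &'a Assets<T>) -> Self {
        Self { assets, last: None }
    }

    fn get(&mut self, handle: Handle) -> Option<&'a T> {
        if let Some((cached, asset)) = self.last {
            // Fast path: same handle as the previous lookup.
            if cached == handle {
                return Some(asset);
            }
        }
        // Slow path: do the real lookup and remember it for next time.
        let assets = self.assets;
        let asset = assets.get(handle)?;
        self.last = Some((handle, asset));
        Some(asset)
    }
}

fn main() {
    let mut storage = HashMap::new();
    storage.insert(Handle(1), "atlas");
    let assets = Assets { storage };

    let mut getter = CachedGetter::new(&assets);
    // The second call reuses the cached reference instead of querying the map.
    assert_eq!(getter.get(Handle(1)), Some(&"atlas"));
    assert_eq!(getter.get(Handle(1)), Some(&"atlas"));
}
```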
Here is a profile after optimizing away the getters: https://share.firefox.dev/3zCUNEF
### `_memcpy*`
Another visible slice of CPU time is spent moving memory via `__memcpy_avx_unaligned_erms` in `Vec::push`. I have seen this pattern a lot in various Rust projects. It happens when pushing large structs (`ExtractedSprite` here) into vectors or other allocating containers.

The code looks like `vector.push(Structure { .. })`, with the structure initialized directly as an argument of `push`. What we would want is for the struct to be initialized directly into the vector's storage. However, `push` may first have to allocate, which can panic, and you can't reorder operations across potentially panicking ones (the reality is probably more nuanced than that, but it's a pretty good mental model). So the struct is first initialized on the stack, then the vector ensures it has room for it, and then a memcpy moves the data. That memcpy ends up being expensive enough to show up as the majority of the remaining CPU time spent in `extract_sprite`.

Thankfully there is a very simple and effective workaround for this: the `copyless` crate. It provides a helper method on `Vec` that separates allocating a slot in the vector from writing data into it. `vector.push(Structure { .. })` becomes `vector.alloc().init(Structure { .. })`, which, while looking very similar, has the particularity of letting us initialize the struct after its spot has been allocated. This lets LLVM reliably optimize away the move.

Sorry, this is a lot of chatter for a small thing, but the trick is very useful and can help in many areas of a typical code base. I've seen a few other places in the profile where it would help.
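To make the before/after concrete, here is a minimal sketch of the pattern (`BigStruct` is a made-up stand-in for `ExtractedSprite`, and the `alloc().init()` calls assume copyless' `VecHelper` trait is in scope):

```rust
// Sketch only: `BigStruct` stands in for a large struct like `ExtractedSprite`.
// Requires the `copyless` crate as a dependency.
use copyless::VecHelper;

struct BigStruct {
    transform: [f32; 16],
    color: [f32; 4],
    rect: [f32; 4],
}

fn fill(out: &mut Vec<BigStruct>) {
    // Before: the struct is built on the stack first, then `push` grows the
    // vector if needed and memcpys the value into place.
    out.push(BigStruct {
        transform: [0.0; 16],
        color: [1.0; 4],
        rect: [0.0; 4],
    });

    // After: the slot is reserved first via `alloc()`, then `init` writes the
    // struct into it, letting LLVM elide the intermediate stack copy.
    out.alloc().init(BigStruct {
        transform: [0.0; 16],
        color: [1.0; 4],
        rect: [0.0; 4],
    });
}
```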
Note that it's not useful when pushing large values that already have to exist on the stack for other reasons. It's also not necessary to use it for very small data moved with memcpy.
I think that the remaining costs in `extract_sprite`, like clearing the vector when items have a `Drop` impl, were already discussed and have PRs open or on the way from other people.