Feature request (post-Stageless): "inline" flag for systems #7208

Closed
inodentry opened this issue Jan 15, 2023 · 6 comments
Labels
A-ECS Entities, components, systems, and events C-Feature A new feature, making something new possible C-Performance A change motivated by improving speed, memory usage or compile times

Comments


inodentry commented Jan 15, 2023

This is just an idea I had that I'd like to document somewhere so it does not get lost. It will only really become actionable after Stageless is implemented.

What problem does this solve or what need does it fill?

Spawning a task onto the multithreaded executor for every system is currently a major source of overhead in Bevy. Ironically, many Bevy projects perform better if their schedule is set to single-threaded execution.

In Bevy, it is considered "idiomatic" (and we often encourage users to do it) to have many small systems, many of which do little or no work on a given frame. We say this "improves parallelism opportunities". I don't think this is bad advice, and I don't think we should change what we consider "idiomatic". It also leads to clean code.

However, in practice, this leads to bad performance. A typical trace is full of "bubbles" and "system execution overhead": Bevy prepares and spawns tasks for every system, only for many of them to do little or no actual work when run. We need to address this overhead.

There are two separate fronts to attack here:

  • Improving the quality of our implementation to reduce the perf overhead of running systems with multithreading. There are many great efforts and PRs for this, and this is not the purpose of this issue.
  • Giving users the APIs/tools/mechanisms they need to control the runtime/execution of their systems and be able to save themselves the overhead when possible. This is the scope of this issue.

Bevy needs better APIs for allowing users to control and reduce the overheads of multithreaded execution. One major tool ("run conditions") will come with Stageless. This issue proposes one more such mechanism.

Additional Context: How will Stageless improve things?

Stageless "Run Conditions" (which I helped design with the same perf considerations as I described above) give users the ability to tell Bevy when to run or not run their systems. Run Conditions are evaluated by the executor itself, without spawning a task (and therefore without the associated perf overhead), and a task for running the actual system will only be spawned if the system should run. Unlike our legacy "Run Criteria", users can easily add and compose multiple "Run Conditions", allowing for flexibility and precise control.

After Stageless, a common workflow to optimize a Bevy game would be: "if you know when your system shouldn't run, go add some run conditions to control it!". Users will have the power to prevent systems that have nothing to do on a given frame from being run by Bevy and slowing things down.
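For illustration, here is what this looks like with the run-condition API as it eventually shipped (a sketch against a recent Bevy release; the `Paused` resource and `movement` system are made up):

```rust
use bevy::prelude::*;

#[derive(Resource, Default)]
struct Paused(bool);

// A run condition is just a read-only system that returns `bool`.
fn not_paused(paused: Res<Paused>) -> bool {
    !paused.0
}

fn movement(mut query: Query<&mut Transform>) {
    for mut transform in &mut query {
        transform.translation.x += 1.0;
    }
}

fn main() {
    App::new()
        .add_plugins(DefaultPlugins)
        .init_resource::<Paused>()
        // The executor evaluates `not_paused` itself; a task for
        // `movement` is only spawned when the condition returns true.
        .add_systems(Update, movement.run_if(not_paused))
        .run();
}
```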

What solution would you like?

I would like to propose one more mechanism for giving users more control over the overhead of parallel system execution: the ability to tell Bevy to not spawn a task for a given system, and instead run it inside the executor itself (similar to how Run Conditions and exclusive systems work).

This only makes sense if a user knows that a certain system, even when it has work to do, does very little and will always finish quickly; in other words, if the user determines that the overhead of parallel execution outweighs the actual work the system is doing.

Like other optimizations (such as ECS table/sparse-set storage), it is a trade-off and a balancing act. It is a lever that users can pull in niche scenarios when they consider it worth doing.

Setting this property on a system will remove the task spawning overhead for that system, but may bottleneck the executor task if the system takes too long to run (holding it up from running other systems that may have become ready).
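To make the proposal concrete, the API could look something like this (the `inline()` adapter is hypothetical and does not exist):

```rust
use bevy::prelude::*;

fn tick_cooldowns() {
    // Tiny, predictable amount of work every frame.
}

fn main() {
    App::new()
        // Hypothetical API: `.inline()` tells the executor to run this
        // system inside its own task, the way run conditions are
        // evaluated, instead of spawning a dedicated task for it.
        .add_systems(Update, tick_cooldowns.inline())
        .run();
}
```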

What alternative(s) have you considered?

Using exclusive systems. They are run as I described above ("inline" inside the executor, without spawning a task onto the thread pool), and always on the main thread (which is not necessary for the "inline" systems proposed here). Therefore, they can also accomplish the goal of being efficient and not introducing runtime overhead.

However, by definition, they do not allow other systems to run alongside them; they completely disable parallel system execution.

The solution proposed by this issue is better because it allows for a "middle ground": a system marked with the proposed "inline execution flag" runs inside the executor, but still allows other systems (those run as tasks) to proceed in parallel.
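For contrast, a minimal exclusive system using Bevy's real API; the `&mut World` parameter is what marks it as exclusive, and it blocks every other system for as long as it runs (the `FrameStats` resource is made up):

```rust
use bevy::prelude::*;

#[derive(Resource)]
struct FrameStats {
    entity_count: u32,
}

// Exclusive system: full mutable access to the `World`, but nothing
// else can run in parallel with it.
fn collect_stats(world: &mut World) {
    let entity_count = world.entities().len();
    world.insert_resource(FrameStats { entity_count });
}

fn main() {
    App::new()
        .add_systems(Update, collect_stats)
        .run();
}
```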

@inodentry inodentry added C-Feature A new feature, making something new possible S-Needs-Triage This issue needs to be labelled A-ECS Entities, components, systems, and events and removed S-Needs-Triage This issue needs to be labelled labels Jan 15, 2023
@alice-i-cecile alice-i-cecile added the X-Controversial There is active debate or serious implications around merging this PR label Jan 15, 2023
alice-i-cecile commented Jan 15, 2023

> Bevy needs better APIs for allowing users to control and reduce the overheads of multithreaded execution.

Definitely agree here: this is an area that I'd like to see improve. Atomic schedule groups were my suggestion along these lines: bevyengine/rfcs#46, allowing you to "group" systems into a single access set that's run in a single task.

I definitely think we should avoid the "inline" terminology; like you proposed on Discord, calling these systems "single-threaded" is much clearer.

Long-term, I'd really like to have tools to automate the analysis of these schedule optimizations, but this seems like a sensible primitive tuning lever to have access to.

@alice-i-cecile alice-i-cecile added C-Performance A change motivated by improving speed, memory usage or compile times and removed X-Controversial There is active debate or serious implications around merging this PR labels Jan 15, 2023

Guvante commented Jan 15, 2023

I think it would be important to clarify whether the intent here is to require that the system run inline or to allow the system to be run inline.

It could become the case that the overhead of parallelism is negligible at times. If so, treating this flag the way compilers often treat inline (as a suggestion) could allow doing useful work on the main thread in a safer fashion (since the work being done is lightweight) while the system is working hard, but still parallelizing the work when it isn't.

Of course, the dilemma if that is done is that the version proposed here has an implicit bonus feature: it can control the order in which systems execute. If that feature is relied upon and the inlining becomes optional, conflicts can occur.
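One hypothetical way to surface that distinction in the API (both variants are made up for illustration):

```rust
/// Hypothetical: whether inlining is a hard requirement or a hint.
enum InlineMode {
    /// Always run inside the executor task; implicitly also pins
    /// where this system falls in the execution order.
    Require,
    /// Inline only when the executor judges it profitable; ordering
    /// is not guaranteed, avoiding the conflict described above.
    Hint,
}
```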


hymm commented Jan 16, 2023

I think this makes sense to try. The overhead of spawning a task is probably an order of magnitude higher than running a simple system.

The question that has to be answered is whether running a system "inline" is worth blocking the executor from running more systems in parallel. I think the actual call to spawn a task is fast; most of the extra overhead of running systems comes from having to move tasks between threads. So I'm unsure whether a system merely being fast makes this worth it. It probably depends on how many other systems are available to run.

Besides that, I have concerns about usability. I worry it'll be hard for the average user to know when it's worth it to inline a system. It's a balance between how fast the system runs and how much parallelism is available in the system graph. It's also potentially very footgunny if the performance profile of your system changes.


maniwani commented Jan 19, 2023

I want to note that exclusive systems are not run "inline" (yet).

Because (1) those are the only systems that can insert or remove thread-local (!Send) resources, and (2) we can't choose which thread a system runs on unless it's the local thread, we have to run exclusive systems on the local thread.

However, the multi-threaded executor's main logic is not locked to the local thread. It can move to other threads so that it doesn't block systems with NonSend params, so we have to spawn tasks using spawn_on_scope to force exclusive systems onto the local thread.
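Roughly what that looks like with `bevy_tasks` (a simplified sketch, not the executor's actual code):

```rust
use bevy::tasks::ComputeTaskPool;

fn run_schedule_sketch() {
    ComputeTaskPool::get().scope(|scope| {
        // Ordinary systems: spawned onto the pool, free to be
        // work-stolen by any thread.
        scope.spawn(async {
            // ... run a parallel system ...
        });

        // Exclusive / NonSend systems: `spawn_on_scope` pins the task
        // to the thread that called `scope`, i.e. the local thread.
        scope.spawn_on_scope(async {
            // ... run an exclusive system ...
        });
    });
}
```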

This feature can't be implemented cleanly until we split !Send resources from the World in a way that means we no longer have to care which thread a system runs on. (We are working on that, so hopefully it won't take long.)


james7132 commented Feb 15, 2023

Note: if you're currently using Tracy for profiling, the largest overhead for running a system is currently creating the spans for profiling. The actual overhead in a normal run of an app is much smaller.

IMO we should attempt to lower the overhead as much as we can before exposing too many fine tuning knobs publicly. Performance should be the default, not the exception. Brainstorming a few approaches to this:


I've been contemplating a measurement-based flow where we track the last X executions of each system via a rolling average, and opportunistically inline them into the multithreaded executor task when the average is significantly lower than the average task start time.

We'd need to buffer the inlined systems to execute after each wave of spawned tasks, to avoid blocking other systems while they're running, and an escape hatch for kicking off another executor task if our guess was wrong and one of the systems ends up taking longer than usual to run.
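A sketch of the bookkeeping this would imply (names, window size, and threshold are all made up):

```rust
use std::time::Duration;

/// Rolling window over a system's recent run times, used to decide
/// whether inlining it would beat spawning a task for it.
struct RunTimeStats {
    samples: [Duration; 8],
    next: usize,
    filled: usize,
}

impl RunTimeStats {
    fn new() -> Self {
        Self { samples: [Duration::ZERO; 8], next: 0, filled: 0 }
    }

    fn record(&mut self, run_time: Duration) {
        self.samples[self.next] = run_time;
        self.next = (self.next + 1) % self.samples.len();
        self.filled = (self.filled + 1).min(self.samples.len());
    }

    fn average(&self) -> Duration {
        if self.filled == 0 {
            return Duration::ZERO;
        }
        self.samples[..self.filled].iter().sum::<Duration>() / self.filled as u32
    }

    /// Inline only once the window is full and the system reliably
    /// finishes in a small fraction of the average task start time
    /// (the 4x margin is arbitrary).
    fn should_inline(&self, avg_task_start: Duration) -> bool {
        self.filled == self.samples.len() && self.average() * 4 < avg_task_start
    }
}
```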


Alternatively, we can tackle the problem at the source. I've been repeatedly saying that we should look into more latency-oriented alternatives to async_executor. While it is by far the simplest option that universally works, its current design leaves a lot to be desired, and the devs don't seem interested in optimizations that come at the cost of code simplicity.


james7132 commented Mar 23, 2024

In a way, #11906 resolves this in the opposite direction: upon completing, each system now immediately runs the executor (or tries, and fails, to grab the mutex on it), so any new systems that can run are immediately scheduled. With optimizations like local-queue scheduling in async_executor (which hasn't landed in a non-yanked release yet), this can easily yield the same performance benefits without explicit annotation, and it still preserves the task scheduler's ability to work-steal system tasks onto other threads when possible.

Thinking on this more, I'm leaning heavily against putting system execution on the hot path of the executor, as any non-cooperative system could block numerous other systems from even starting, even without an explicit before/after relationship.
