Feature request (post-Stageless): "inline" flag for systems #7208
Comments
Definitely agree here: this is an area that I'd like to see improved. Atomic schedule groups were my suggestion along these lines (bevyengine/rfcs#46), allowing you to "group" systems into a single access set that's run in a single task. I definitely think we should avoid the "inline" terminology; as you proposed on Discord, calling these systems "single-threaded" is much clearer. Long-term, I'd really like to have tools that automate the analysis of these schedule optimizations, but this seems like a sensible primitive tuning lever to have access to.
I think it would be important to clarify whether the intent here is to require that the system run inline or to allow the system to be run inline. It could become the case that the overhead of parallelism is negligible at times. If so, treating this akin to how compilers often treat `inline` (as a suggestion) could allow doing useful work on the main thread in a safer fashion (since the work being done is lightweight) while the rest of the schedule is working hard, but still parallelizing the work when it isn't. Of course, the dilemma is that the version proposed here has an implicit bonus feature: it can control the order in which systems execute. If that feature is used and the inlining becomes optional, conflicts can occur.
I think this makes sense to try. The overhead of spawning a task is probably an order of magnitude higher than running a simple system. The question that has to be answered is whether running a system "inline" is worth blocking the executor from running more systems in parallel. I think the actual call to spawn a task is fast; most of the extra overhead from running systems comes from having to move tasks between threads. So I'm unsure whether a system merely being fast is enough to justify it. It probably depends on how many other systems are available to run.

Besides that, I have concerns about usability. I worry it'll be hard for the average user to know when it's worth it to inline a system or not. It's a balance between how fast the system runs and how much parallelism is available in the system graph. It's also potentially very footgunny if the performance profile of your system changes.
I want to note that exclusive systems are not run "inline" (yet), because (1) those are the only systems that can insert or remove thread-local (…).

However, the multi-threaded executor's main logic is not locked to the local thread. It can move to other threads so that it doesn't block systems with (…).

This feature is not possible to implement cleanly until we split (…).
Note: if you're currently using Tracy for profiling, the largest overhead for running a system is currently creating the spans for profiling. The actual overhead in a normal run of an app is much smaller. IMO we should attempt to lower the overhead as much as we can before exposing too many fine-tuning knobs publicly. Performance should be the default, not the exception. Brainstorming a few approaches to this:

I've been contemplating a measurement-based flow where we track the last X executions of the system via a rolling average, and opportunistically inline them into the multithreaded executor task when the average is significantly lower than the average task start time. We'd need to buffer the inlined systems to execute after each wave of spawned tasks to avoid blocking other systems while they're running, plus an escape hatch for kicking off another executor task if our guess was wrong and one of the systems ends up taking longer than usual to run.

Alternatively, we can tackle the problem at the source. I've been repeatedly saying that we should look into more latency-oriented alternatives to async_executor. While it is by far the simplest option that universally works, its current design leaves a lot to be desired, and the devs don't seem to be interested in optimizations that come at the cost of code simplicity.
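The measurement-based flow described above could be sketched roughly like this (purely illustrative: `SystemTimings`, the window size, and the 4x threshold are all invented here, not Bevy internals):

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Rolling window of a system's recent run times (hypothetical type,
/// not part of Bevy).
struct SystemTimings {
    window: VecDeque<Duration>,
    capacity: usize,
}

impl SystemTimings {
    fn new(capacity: usize) -> Self {
        Self {
            window: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Record the latest execution time, evicting the oldest sample
    /// once the window is full.
    fn record(&mut self, sample: Duration) {
        if self.window.len() == self.capacity {
            self.window.pop_front();
        }
        self.window.push_back(sample);
    }

    /// Rolling average over the window (zero if no samples yet).
    fn average(&self) -> Duration {
        if self.window.is_empty() {
            return Duration::ZERO;
        }
        self.window.iter().sum::<Duration>() / self.window.len() as u32
    }

    /// Opportunistically inline only when the rolling average is well
    /// below the measured average task start time (the 4x safety margin
    /// is an arbitrary choice for this sketch).
    fn should_inline(&self, avg_task_start: Duration) -> bool {
        !self.window.is_empty() && self.average() * 4 < avg_task_start
    }
}
```

The escape hatch mentioned above would then trigger when an inlined system's actual run time blows past this estimate.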
In a way, #11906 resolves this in the opposite direction: a completing system now immediately runs the executor (or tries and fails to grab the mutex on it), and any new systems that can run are immediately scheduled. With optimizations like local-queue scheduling in async_executor (which hasn't landed in a non-yanked release yet), this can easily yield the same performance benefits without needing explicit annotation, and it still preserves the task scheduler's ability to work-steal system tasks onto other threads when possible.

Thinking on this more, I'm leaning heavily against putting system execution on the hot path of the executor, as any non-cooperative system may block numerous other systems from even starting, without an explicit before/after relationship.
This is just an idea I had, that I'd like to document somewhere so it does not get lost. It will only really be actionable after Stageless is implemented.
What problem does this solve or what need does it fill?
Spawning tasks onto the multithreaded executor runtime to run each system is currently a major source of overhead for Bevy. Ironically, many Bevy projects perform better if their schedule is set to single-threaded execution.
It is considered "idiomatic" (and we often encourage users to do it) to have many small systems that often do little to no work in Bevy. We say this to "improve parallelism opportunities". I don't think this is bad advice and I don't think we should change what we consider "idiomatic". It also leads to clean code.
However, in practice, this leads to bad performance. A typical trace is full of "bubbles" and "system execution overhead" from Bevy preparing and spawning tasks for all systems, only for many of them to do nothing or very little actual work when run. We need to address this overhead.
There are two separate fronts to attack here:
Bevy needs better APIs for allowing users to control and reduce the overheads of multithreaded execution. One major tool ("run conditions") will come with Stageless. This issue proposes one more such mechanism.
Additional Context: How will Stageless improve things?
Stageless "Run Conditions" (which I helped design with the same perf considerations as I described above) give users the ability to tell Bevy when to run or not run their systems. Run Conditions are evaluated by the executor itself, without spawning a task (and therefore without the associated perf overhead), and a task for running the actual system will only be spawned if the system should run. Unlike our legacy "Run Criteria", users can easily add and compose multiple "Run Conditions", allowing for flexibility and precise control.
After Stageless, a common development workflow to optimize a Bevy game would be: "if you know when your system shouldn't run, go and add some run conditions to control it!" Users will have the power to prevent systems that have nothing to do that frame from being run by Bevy and slowing things down.
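As a rough sketch of that workflow in the API that later shipped (the `Score` resource and `update_score_display` system are invented for illustration, and the exact condition helper names have shifted between Bevy versions):

```rust
use bevy::prelude::*;

#[derive(Resource)]
struct Score(u32);

fn update_score_display(score: Res<Score>) {
    // Redraw the score HUD...
}

fn main() {
    App::new()
        // The condition is evaluated by the executor itself; a task for
        // `update_score_display` is only spawned on frames where the
        // `Score` resource exists and has actually changed.
        .add_systems(
            Update,
            update_score_display.run_if(resource_exists_and_changed::<Score>),
        )
        .run();
}
```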
What solution would you like?
I would like to propose one more mechanism for giving users more control over the overhead of parallel system execution: the ability to tell Bevy to not spawn a task for a given system, and instead run it inside the executor itself (similar to how Run Conditions and exclusive systems work).
This only makes sense if a user knows that a certain system, even when it has work to do, does very little and will always finish quickly: that is, if the user determines that the overhead of parallel execution exceeds the actual work the system is doing.
Like other optimizations (such as ECS table/sparse-set storage), it is a trade-off and a balancing act. It is a lever that users can pull in niche scenarios when they consider it worth doing.
Setting this property on a system will remove the task spawning overhead for that system, but may bottleneck the executor task if the system takes too long to run (holding it up from running other systems that may have become ready).
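A purely hypothetical sketch of what such an API could look like (the `inline()` adapter below does not exist; it is invented here to illustrate the proposal, and the system names are made up):

```rust
// Hypothetical API sketch: mark known-cheap systems to run inside the
// executor itself, skipping the task-spawning overhead entirely.
app.add_systems(
    Update,
    (
        // Cheap bookkeeping system: almost always finishes in
        // microseconds, so spawning a task costs more than running it.
        tick_cooldowns.inline(),
        // Heavy system: stays a normal task so it can run in parallel.
        simulate_physics,
    ),
);
```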
What alternative(s) have you considered?
Using exclusive systems. They are run as described above ("inline" inside the executor, without spawning a task onto the thread pool), and always on the main thread (which would not be required for the "inline" systems proposed here). They can therefore also accomplish the goal of being efficient and avoiding runtime overhead.
However, they, by definition, do not allow other parallel systems to run alongside them. They completely disable parallel system execution.
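For reference, an exclusive system is simply one that takes `&mut World` (the system name and body here are invented for illustration):

```rust
use bevy::prelude::*;

// Exclusive system: `&mut World` grants full mutable access, so the
// executor runs it inline on the main thread and no other system can
// run in parallel with it.
fn reset_level(world: &mut World) {
    world.clear_entities();
}
```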
The solution proposed by this issue is better, because it allows for a "middle ground". A system marked with the proposed "inline execution flag", while it runs inside the executor, still allows for other systems (that were run using tasks) to run in parallel.