
Add default feature for pinning TaskPool threads to CPU cores #6471

Closed
james7132 wants to merge 6 commits

Conversation

@james7132 (Member) commented Nov 5, 2022

Objective

Fixes #4803. When the OS moves a thread to a different physical CPU core, the context switch effectively flushes the CPU cache, forcing the thread to refetch its entire execution context from main memory. This can be a significant performance killer in data-oriented programming, which Bevy and ECS in general heavily leverage. The current situation also allows threads within a process to be assigned to the same physical core, which can cause an app's threads to fight over the same physical resources.

Solution

Use core_affinity to forcibly pin OS threads to specific CPU cores. This isn't a silver bullet for dealing with the impact of context switches, but it trades away some of the scheduling flexibility the OS normally has in exchange for guaranteeing that a task on a given thread cannot be moved to a different physical core.

This PR fetches a list of all of the physical core IDs and pins threads 1:1 to each one. This is controlled by an optional, default-enabled feature on bevy_tasks.
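As a minimal sketch of that approach, using the `core_affinity` crate's `get_core_ids`/`set_for_current` API (the worker-spawning loop below is a simplified stand-in for TaskPool's internals, not the PR's actual code):

```rust
use std::thread;

fn main() {
    // Enumerate the cores this process is allowed to run on.
    let core_ids = core_affinity::get_core_ids().expect("failed to enumerate cores");

    // Spawn one worker per core, pinned 1:1.
    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|core_id| {
            thread::spawn(move || {
                // Restrict this thread to its assigned core;
                // `set_for_current` returns false if pinning failed.
                if core_affinity::set_for_current(core_id) {
                    // ... run the task pool's executor loop here ...
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```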

This is currently a draft, as the current configuration will cause all 3 TaskPools (Compute, AsyncCompute, and IO) to be pinned to the same threads, which is not ideal. This would be best merged after #4740 is completed.

@james7132 james7132 marked this pull request as draft November 5, 2022 00:23
@james7132 james7132 added C-Performance A change motivated by improving speed, memory usage or compile times A-Tasks Tools for parallel and async work labels Nov 5, 2022
@james7132 james7132 requested a review from hymm November 5, 2022 00:32
@alice-i-cecile alice-i-cecile added the S-Blocked This cannot move forward until something else changes label Nov 5, 2022
@alice-i-cecile (Member)

At work, we've been seeing anomalous lag spikes without a corresponding gameplay reason.
[screenshot: profiler capture of the lag spikes]

This seems like a very plausible candidate for this problem.

@james7132 (Member, Author) commented Nov 5, 2022

> At work, we've been seeing anomalous lag spikes without a corresponding gameplay reason.
>
> This seems like a very plausible candidate for this problem.

I'm not sure that's the case. While the preemptive scheduler on all of the OSes we support can make large-scale thread interruptions, CPU pinning alone won't solve that. The system itself may be momentarily oversubscribed, and there's not much we can do about that in userspace. For that, we'd need to set thread priority to prevent the app's threads from being preempted by the scheduler. I'm much less confident we should make that move, though, since raising process/thread priorities without good reason can easily choke a system by preventing it from being interrupted by high-priority realtime tasks (OS-level IO, etc.).

One potential option in this space is temporarily setting a thread to a higher priority while executing a system or a par_for_each task, to avoid preemption inside that block, but that might add more overhead due to the system calls needed to set and reset the priority. Worth testing, though.
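A rough illustration of that idea (not code from this PR; it assumes the `thread_priority` crate, and the `run_boosted` helper is hypothetical):

```rust
use thread_priority::{set_current_thread_priority, ThreadPriority};

/// Run a task with temporarily boosted thread priority.
/// Hypothetical helper, not part of this PR.
fn run_boosted(task: impl FnOnce()) {
    // Each priority change is a syscall; this per-task cost is the
    // overhead being weighed against the reduced preemption.
    let _ = set_current_thread_priority(ThreadPriority::Max);
    task();
    // A real implementation would save and restore the previous
    // priority; dropping to Min here just keeps the sketch simple.
    let _ = set_current_thread_priority(ThreadPriority::Min);
}

fn main() {
    run_boosted(|| {
        // ... execute the system or par_for_each batch here ...
    });
}
```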

One other well-known area that could definitely benefit from this is audio processing, which gains a lot from lower preemption rates and higher cache coherency, so we could also scope that out as a focus area for either of these two config options.

Ultimately, CPU pinning should be viewed as a performance knob exposed to developers (and maybe end users via a config file) that makes an explicit tradeoff between the time to schedule the app's threads and the cache coherency of normal operation. It may be a perf win or a perf loss depending on the use case.

@james7132 james7132 marked this pull request as ready for review November 29, 2022 18:05
@james7132 james7132 removed the S-Blocked This cannot move forward until something else changes label Nov 29, 2022
@hymm (Contributor) commented Dec 16, 2022

Using Tracy, I can see this is working on my Windows machine.

[Tracy screenshot showing pinned threads]

Initial investigation suggests this is performing worse, but something weird is going on and it deserves more investigation.

The frame times are worse on many_foxes:

[screenshot: frame time comparison]

and frames are randomly getting stalled out:

[screenshot: stalled frames]

I see it mostly during the prepare stage, but also sometimes in other systems like transform propagation and check visibility. No idea why it might be happening.

@james7132 (Member, Author)

Ah, I think I know why. Setting the core affinity means the thread can only be scheduled on those cores, so the OS will defer scheduling the thread until the core is available again, including when the thread willingly yields to the OS. There's probably some other process holding the core for longer that is forcing that stall. The only real way we can prevent this is by altering thread priority, and that might not be what we want.

@hymm (Contributor) commented Dec 16, 2022

That makes sense. It looks like the time slice a process is given on Windows is around 20ms, which is about the stall time I'm seeing. And it's happening mostly to the prepare systems because those park the thread while the GPU is busy.

@richchurcher (Contributor)

Backlog cleanup: this mentions #4740 (which was closed without merging) as best merged first. Given that, and the inactivity since Dec 2022, I suspect it's reasonable to close this one too.
