
scx_bpfland: server workload improvements #1094

Merged: 6 commits merged into main from bpfland-server-profile on Dec 11, 2024

Conversation

@arighi (Contributor) commented Dec 11, 2024

A set of improvements for scx_bpfland to enhance performance in server-oriented workloads.

When interactive task classification is disabled (-c 0), scx_bpfland can be suitable for more server-oriented scenarios, such as build servers. It can also serve as a "server profile" scheduler when used with tools like cachyos-kernel-manager in CachyOS.

With these changes and this "server profile" enabled, I'm able to speed up parallel kernel builds by ~2-3% (the improvement seems to be consistent on both AMD and Intel systems).

When the nvcsw-based task weight is disabled (-c 0), always consider the
static task weight (p->scx.weight).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
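
A rough sketch of the idea (not the actual diff; the task_ctx layout,
the dyn_weight field and the nvcsw_max_thresh global are illustrative
assumptions):

/* Sketch only: pick the weight used for vruntime / slice scaling. */
static u64 task_weight(const struct task_struct *p,
		       const struct task_ctx *tctx)
{
	/*
	 * nvcsw_max_thresh == 0 means -c 0: the nvcsw-based heuristic
	 * is disabled, so always use the static weight that the
	 * sched_ext core maintains in p->scx.weight.
	 */
	if (!nvcsw_max_thresh)
		return p->scx.weight;

	/* Otherwise favor the dynamic, nvcsw-derived weight if higher. */
	return tctx->dyn_weight > p->scx.weight ?
	       tctx->dyn_weight : p->scx.weight;
}
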
If we wake up an idle CPU when the system is busy, there is a chance of
picking a CPU that is already awake (CPU idle selection is not perfectly
race-safe).

Therefore always use SCX_KICK_IDLE to kick the CPU to prevent
unnecessary overhead (especially when the system is busy).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
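
In kfunc terms the change is tiny; a hedged sketch, assuming the usual
scx_bpf_pick_idle_cpu() selection path:

	/*
	 * Idle-CPU selection can race with the CPU waking up on its own.
	 * SCX_KICK_IDLE turns the kick into a no-op if the target CPU is
	 * no longer idle, so no pointless resched IPI is sent.
	 */
	s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
	if (cpu >= 0)
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE); /* instead of flags = 0 */
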
When the system is not completely busy, tasks are always dispatched
directly from ops.select_cpu(), unless they can only run on a single
CPU; in that case ops.select_cpu() is skipped entirely.

Therefore, in ops.enqueue(), there is no need to aggressively wake up
idle CPUs, since most of the time the system is already busy, so
searching for an idle CPU would only introduce unnecessary overhead.

A better approach is to wake up an idle CPU from ops.enqueue() only for
tasks that are limited to run on a single CPU, as they didn't get an
opportunity to be dispatched from ops.select_cpu().

Signed-off-by: Andrea Righi <arighi@nvidia.com>
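
A sketch of the resulting enqueue-side policy (DSQ bookkeeping omitted;
not the actual bpfland code):

void BPF_STRUCT_OPS(bpfland_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* ... insert the task into the shared DSQ (omitted) ... */

	/*
	 * Only tasks pinned to a single CPU skipped ops.select_cpu(),
	 * so they are the only ones that may still need an idle CPU
	 * woken up from here.
	 */
	if (p->nr_cpus_allowed == 1)
		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
}
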
If a per-CPU kthread runs on the wakee's previous CPU, treat the wakeup
as a sync wakeup.

A typical example of this behavior is IO completion: a task queues work
for the per-CPU kthread; once the kthread finishes, it can hand control
back to the task.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
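
A sketch of how such a wakeup can be detected from ops.select_cpu(),
where the waker is the current task (the helper name is made up):

static bool is_pcpu_kthread_sync_wakeup(s32 prev_cpu)
{
	const struct task_struct *waker = (void *)bpf_get_current_task_btf();

	/*
	 * Treat the wakeup as sync if the waker is a per-CPU kthread
	 * running on the wakee's previous CPU (e.g. an IO completion
	 * handing control back to the task that queued the work).
	 */
	return (waker->flags & PF_KTHREAD) &&
	       waker->nr_cpus_allowed == 1 &&
	       (s32)bpf_get_smp_processor_id() == prev_cpu;
}
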
ops.select_cpu() is skipped for tasks that can only run on a single CPU,
so they never get a chance to be dispatched directly.

Give them a chance to be dispatched directly from ops.enqueue() if their
assigned CPU is idle.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
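
A sketch of that direct-dispatch path in ops.enqueue(), using the
scx_bpf_dsq_insert() kfunc (named scx_bpf_dispatch() in older trees);
slice_ns is a placeholder for the assigned time slice:

	if (p->nr_cpus_allowed == 1) {
		s32 cpu = scx_bpf_task_cpu(p);

		/*
		 * Pinned tasks never went through ops.select_cpu(): if
		 * their only CPU is idle, dispatch them straight to its
		 * local DSQ and kick it (no-op if it's already awake).
		 */
		if (scx_bpf_test_and_clear_cpu_idle(cpu)) {
			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu,
					   slice_ns, enq_flags);
			scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
			return;
		}
	}
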
If nvcsw_max_thresh is disabled (via `-c 0`), interactivity becomes less
of a concern, allowing us to focus more on optimizing batch workloads.

A particularly effective approach is to prioritize per-CPU tasks and
sync wakeups, by dispatching these tasks directly to their local CPU.

However, this comes at the cost of reduced fairness and increased
susceptibility to bouncy scheduling behavior, making this configuration
more suitable for batch-oriented workloads on systems that are not
massively overcommitted.

With this change, running `scx_bpfland -c 0` seems to consistently
improve parallel kernel build time by approximately 2-3%.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
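
A hedged sketch of what such a fast path may look like in
ops.select_cpu() (again not the actual diff; nvcsw_max_thresh and the
default slice are assumptions):

s32 BPF_STRUCT_OPS(bpfland_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	/*
	 * -c 0: trade some fairness for throughput by running sync
	 * wakeups directly on the waker's CPU while its cache is warm.
	 */
	if (!nvcsw_max_thresh && (wake_flags & SCX_WAKE_SYNC)) {
		s32 cpu = bpf_get_smp_processor_id();

		if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
			return cpu;
		}
	}

	/* ... otherwise fall back to the regular idle-CPU selection ... */
	return prev_cpu;
}
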
/*
* Update task vruntime charging the weighted used time slice.
*/
p->scx.dsq_vtime += scale_inverse_fair(p, tctx, slice);
@hodgesds (Contributor) commented Dec 11, 2024

Maybe this is a silly question, but if bpfland were to implement a yield handler would the same vtime scaling need to be done there as well? (trying to get the correct mental model of how bpfland scales task vtime)

@arighi (Contributor, Author)

@hodgesds if I'm not mistaken, ops.stopping() is also called when a task calls sched_yield(), and since yield is implemented by setting the task's time slice to zero, it should follow the same pattern as a task that has used all of its assigned time slice.

And since we measure the used time as the difference between the timestamp taken at ops.stopping() and the one taken at ops.running(), we should get an accurate used time regardless of how much of p->scx.slice has been consumed.

Does it make sense? Am I missing something?
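
For reference, a minimal sketch of that timing model (the helper
try_lookup_task_ctx() and the last_run_at field are placeholder names):

void BPF_STRUCT_OPS(bpfland_running, struct task_struct *p)
{
	struct task_ctx *tctx = try_lookup_task_ctx(p);

	/* Record when the task starts running on a CPU. */
	if (tctx)
		tctx->last_run_at = bpf_ktime_get_ns();
}

void BPF_STRUCT_OPS(bpfland_stopping, struct task_struct *p, bool runnable)
{
	struct task_ctx *tctx = try_lookup_task_ctx(p);
	u64 slice;

	if (!tctx)
		return;

	/*
	 * The used time is simply stopping - running, so a sched_yield()
	 * (which zeroes p->scx.slice) is charged exactly like a task that
	 * exhausted its slice: no extra handling is needed.
	 */
	slice = bpf_ktime_get_ns() - tctx->last_run_at;
	p->scx.dsq_vtime += scale_inverse_fair(p, tctx, slice);
}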

@hodgesds (Contributor)

That makes sense, thanks for the explanation!

@arighi added this pull request to the merge queue Dec 11, 2024
Merged via the queue into main with commit e280e6b Dec 11, 2024
69 checks passed
@arighi deleted the bpfland-server-profile branch December 11, 2024 20:46
vnepogodin added a commit to CachyOS/scx that referenced this pull request Dec 11, 2024