
scx_bpfland: server workload improvements #1094

Merged: 6 commits merged into main from bpfland-server-profile on Dec 11, 2024

Conversation

@arighi (Contributor) commented Dec 11, 2024

A set of improvements for scx_bpfland to enhance performance in server-oriented workloads.

When interactive task classification is disabled (-c 0), scx_bpfland can be suitable for more server-oriented scenarios, such as build servers. It can also serve as a "server profile" scheduler when used with tools like cachyos-kernel-manager in CachyOS.

With these changes and this "server profile" enabled, I'm able to speed up parallel kernel builds by ~2-3% (the improvement seems to be consistent on both AMD and Intel systems).

When the nvcsw-based task weight is disabled (-c 0), always consider the
static task weight (p->scx.weight).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
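
A rough sketch of the idea (not the actual diff; the task_ctx layout,
the dyn_weight field and the nvcsw_max_thresh global are illustrative
assumptions):

/* Sketch only: pick the weight used for vruntime / slice scaling. */
static u64 task_weight(const struct task_struct *p,
		       const struct task_ctx *tctx)
{
	/*
	 * nvcsw_max_thresh == 0 means -c 0: the nvcsw-based heuristic
	 * is disabled, so always use the static weight that the
	 * sched_ext core maintains in p->scx.weight.
	 */
	if (!nvcsw_max_thresh)
		return p->scx.weight;

	/* Otherwise favor the dynamic, nvcsw-derived weight if higher. */
	return tctx->dyn_weight > p->scx.weight ?
	       tctx->dyn_weight : p->scx.weight;
}
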
If we wake up an idle CPU when the system is busy, there is a chance of
picking a CPU that is already awake (CPU idle selection is not perfectly
race-safe).

Therefore always use SCX_KICK_IDLE to kick the CPU to prevent
unnecessary overhead (especially when the system is busy).

Signed-off-by: Andrea Righi <arighi@nvidia.com>
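
In kfunc terms the change is tiny; a hedged sketch, assuming the usual
scx_bpf_pick_idle_cpu() selection path:

	/*
	 * Idle-CPU selection can race with the CPU waking up on its own.
	 * SCX_KICK_IDLE turns the kick into a no-op if the target CPU is
	 * no longer idle, so no pointless resched IPI is sent.
	 */
	s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
	if (cpu >= 0)
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE); /* instead of flags = 0 */
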
When the system is not completely busy, tasks are always dispatched
directly from ops.select_cpu(), unless they can only run on a single
CPU; in that case ops.select_cpu() is skipped entirely.

Therefore, in ops.enqueue(), there is no need to aggressively wake up
idle CPUs, since most of the time the system is already busy, so
searching for an idle CPU would only introduce unnecessary overhead.

A better approach is to wake up an idle CPU from ops.enqueue() only for
tasks that are limited to run on a single CPU, as they didn't get an
opportunity to be dispatched from ops.select_cpu().

Signed-off-by: Andrea Righi <arighi@nvidia.com>
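
A sketch of the resulting enqueue-side policy (DSQ bookkeeping omitted;
not the actual bpfland code):

void BPF_STRUCT_OPS(bpfland_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* ... insert the task into the shared DSQ (omitted) ... */

	/*
	 * Only tasks pinned to a single CPU skipped ops.select_cpu(),
	 * so they are the only ones that may still need an idle CPU
	 * woken up from here.
	 */
	if (p->nr_cpus_allowed == 1)
		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
}
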
If a per-CPU kthread runs on the wakee's previous CPU, treat the wakeup
as a sync wakeup.

A typical example of this behavior is IO completion: a task queues work
for the per-CPU kthread; once the kthread finishes, it can hand control
back to the task.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
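
A sketch of how such a wakeup can be detected from ops.select_cpu(),
where the waker is the current task (the helper name is made up):

static bool is_pcpu_kthread_sync_wakeup(s32 prev_cpu)
{
	const struct task_struct *waker = (void *)bpf_get_current_task_btf();

	/*
	 * Treat the wakeup as sync if the waker is a per-CPU kthread
	 * running on the wakee's previous CPU (e.g. an IO completion
	 * handing control back to the task that queued the work).
	 */
	return (waker->flags & PF_KTHREAD) &&
	       waker->nr_cpus_allowed == 1 &&
	       (s32)bpf_get_smp_processor_id() == prev_cpu;
}
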
ops.select_cpu() is skipped for tasks that can only run on a single CPU,
so they never get a chance to be dispatched directly.

Give them a chance to be dispatched directly from ops.enqueue() if their
assigned CPU is idle.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
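
A sketch of that direct-dispatch path in ops.enqueue(), using the
scx_bpf_dsq_insert() kfunc (named scx_bpf_dispatch() in older trees);
slice_ns is a placeholder for the assigned time slice:

	if (p->nr_cpus_allowed == 1) {
		s32 cpu = scx_bpf_task_cpu(p);

		/*
		 * Pinned tasks never went through ops.select_cpu(): if
		 * their only CPU is idle, dispatch them straight to its
		 * local DSQ and kick it (no-op if it's already awake).
		 */
		if (scx_bpf_test_and_clear_cpu_idle(cpu)) {
			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu,
					   slice_ns, enq_flags);
			scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
			return;
		}
	}
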
If nvcsw_max_thresh is disabled (via `-c 0`), interactivity becomes less
of a concern, allowing us to focus more on optimizing batch workloads.

A particularly effective approach is to prioritize per-CPU tasks and
sync wakeups, by dispatching these tasks directly to their local CPU.

However, this comes at the cost of reduced fairness and increased
susceptibility to bouncy scheduling behavior, making this configuration
more suitable for batch-oriented workloads on systems that are not
massively overcommitted.

With this change, running `scx_bpfland -c 0` seems to consistently
improve parallel kernel build time by approximately 2-3%.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
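
A hedged sketch of what such a fast path may look like in
ops.select_cpu() (again not the actual diff; nvcsw_max_thresh and the
default slice are assumptions):

s32 BPF_STRUCT_OPS(bpfland_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	/*
	 * -c 0: trade some fairness for throughput by running sync
	 * wakeups directly on the waker's CPU while its cache is warm.
	 */
	if (!nvcsw_max_thresh && (wake_flags & SCX_WAKE_SYNC)) {
		s32 cpu = bpf_get_smp_processor_id();

		if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
			return cpu;
		}
	}

	/* ... otherwise fall back to the regular idle-CPU selection ... */
	return prev_cpu;
}
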
/*
* Update task vruntime charging the weighted used time slice.
*/
p->scx.dsq_vtime += scale_inverse_fair(p, tctx, slice);
@hodgesds (Contributor) commented Dec 11, 2024

Maybe this is a silly question, but if bpfland were to implement a yield handler would the same vtime scaling need to be done there as well? (trying to get the correct mental model of how bpfland scales task vtime)

@arighi (Contributor, Author)

@hodgesds if I'm not mistaken, ops.stopping() is also called when a task calls sched_yield(), and since yield is implemented by setting the task's time slice to zero, it should follow the same pattern as a task that has used all of its assigned time slice.

And since we measure the used time as the difference between the timestamp taken at ops.stopping() and the one taken at ops.running(), we should get an accurate used time regardless of how much of p->scx.slice has been consumed.

Does it make sense? Am I missing something?
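
For reference, a minimal sketch of that timing model (the helper
try_lookup_task_ctx() and the last_run_at field are placeholder names):

void BPF_STRUCT_OPS(bpfland_running, struct task_struct *p)
{
	struct task_ctx *tctx = try_lookup_task_ctx(p);

	/* Record when the task starts running on a CPU. */
	if (tctx)
		tctx->last_run_at = bpf_ktime_get_ns();
}

void BPF_STRUCT_OPS(bpfland_stopping, struct task_struct *p, bool runnable)
{
	struct task_ctx *tctx = try_lookup_task_ctx(p);
	u64 slice;

	if (!tctx)
		return;

	/*
	 * The used time is simply stopping - running, so a sched_yield()
	 * (which zeroes p->scx.slice) is charged exactly like a task that
	 * exhausted its slice: no extra handling is needed.
	 */
	slice = bpf_ktime_get_ns() - tctx->last_run_at;
	p->scx.dsq_vtime += scale_inverse_fair(p, tctx, slice);
}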

@hodgesds (Contributor)

That makes sense, thanks for the explanation!

@arighi added this pull request to the merge queue Dec 11, 2024
Merged via the queue into main with commit e280e6b Dec 11, 2024
69 checks passed
@arighi deleted the bpfland-server-profile branch December 11, 2024 20:46
vnepogodin added a commit to CachyOS/scx that referenced this pull request Dec 11, 2024