scx_bpfland: server workload improvements #1094
Conversation
When the nvcsw-based task weight is disabled (-c 0), always use the static task weight (p->scx.weight). Signed-off-by: Andrea Righi <arighi@nvidia.com>
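A minimal sketch of the idea (hypothetical helper, not the actual scx_bpfland code; task_ctx and dyn_weight are placeholder names):

/*
 * Sketch: with -c 0 the nvcsw-based dynamic weight is disabled, so fall
 * back to the static weight that sched_ext derives from the task's nice
 * level / cgroup settings (p->scx.weight).
 */
static u64 task_weight(const struct task_struct *p, const struct task_ctx *tctx)
{
	if (!nvcsw_max_thresh)
		return p->scx.weight;

	/* dyn_weight is a placeholder for the nvcsw-based dynamic weight */
	return tctx->dyn_weight;
}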
If we wake up an idle CPU when the system is busy, there is a chance of picking a CPU that is already awake (CPU idle selection is not perfectly "race-safe"). Therefore, always use SCX_KICK_IDLE when kicking the CPU to prevent unnecessary overhead (especially when the system is busy). Signed-off-by: Andrea Righi <arighi@nvidia.com>
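In practice this boils down to kicking the selected CPU with the SCX_KICK_IDLE flag (a sketch; cpu is the CPU returned by the idle-selection logic):

/*
 * Idle-CPU selection can race with the CPU leaving idle: SCX_KICK_IDLE
 * makes the kick (IPI) a no-op if the target CPU is no longer idle.
 */
scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);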
When the system is not completely busy, tasks are always dispatched directly from ops.select_cpu(), unless they can only run on a single CPU, in which case ops.select_cpu() is skipped. Therefore, in ops.enqueue(), there is no need to aggressively wake up idle CPUs: most of the time the system is already busy, and searching for an idle CPU would only introduce unnecessary overhead. A better approach is to wake up an idle CPU from ops.enqueue() only for tasks that are limited to run on a single CPU, as they didn't get an opportunity to be dispatched from ops.select_cpu(). Signed-off-by: Andrea Righi <arighi@nvidia.com>
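Sketched in ops.enqueue() terms (queueing to the shared DSQ omitted; names are illustrative, not the actual patch):

void BPF_STRUCT_OPS(bpfland_enqueue_sketch, struct task_struct *p, u64 enq_flags)
{
	/* ... queue @p to the shared DSQ (omitted) ... */

	/*
	 * Only proactively kick an idle CPU for tasks pinned to a single
	 * CPU: everything else already had a chance in ops.select_cpu().
	 */
	if (p->nr_cpus_allowed == 1)
		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
}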
If a per-CPU kthread runs on the wakee's previous CPU, treat the wakeup as a sync wakeup. A typical example of this behavior is IO completion: a task had queued work for the per-CPU kthread, and once the kthread has finished it can give control back to the task. Signed-off-by: Andrea Righi <arighi@nvidia.com>
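A rough sketch of such a check (assuming the usual scx BPF headers; PF_KTHREAD value copied from include/linux/sched.h):

#define PF_KTHREAD 0x00200000	/* from include/linux/sched.h */

static bool is_wake_sync(s32 prev_cpu, u64 wake_flags)
{
	const struct task_struct *waker = bpf_get_current_task_btf();

	if (wake_flags & SCX_WAKE_SYNC)
		return true;

	/*
	 * Waker is a per-CPU kthread running on the wakee's previous CPU
	 * (e.g. IO completion handing control back): treat it as sync.
	 */
	return (waker->flags & PF_KTHREAD) && waker->nr_cpus_allowed == 1 &&
	       (s32)bpf_get_smp_processor_id() == prev_cpu;
}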
ops.select_cpu() is skipped for tasks that can only run on a single CPU, so they never get a chance to be dispatched directly. Give them a chance to be dispatched directly from ops.enqueue() if their assigned CPU is idle. Signed-off-by: Andrea Righi <arighi@nvidia.com>
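Sketch of that direct-dispatch path in ops.enqueue() (assumes the scx_bpf_dsq_insert() kfunc, named scx_bpf_dispatch() on older kernels; slice_ns is a placeholder for the assigned time slice):

	if (p->nr_cpus_allowed == 1) {
		s32 cpu = scx_bpf_task_cpu(p);

		/* If the pinned CPU is idle, dispatch @p straight to it */
		if (scx_bpf_test_and_clear_cpu_idle(cpu)) {
			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu,
					   slice_ns, enq_flags);
			scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
			return;
		}
	}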
If nvcsw_max_thresh is disabled (via `-c 0`), interactivity becomes less of a concern, allowing us to focus more on optimizing batch workloads. A particularly effective approach is to prioritize per-CPU tasks and sync wakeups, by dispatching these tasks directly to their local CPU. However, this comes at the cost of reduced fairness and increased susceptibility to bouncy scheduling behavior, making this configuration more suitable for batch-oriented workloads on systems that are not massively overcommitted. With this change, running `scx_bpfland -c 0` seems to consistently improve parallel kernel build time by approximately 2-3%. Signed-off-by: Andrea Righi <arighi@nvidia.com>
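Conceptually the new -c 0 fast path looks like this (illustrative fragment from a select-CPU-style path; is_wake_sync() refers to the sketch above, slice_ns is a placeholder):

	/*
	 * Batch mode (-c 0): favor throughput by sending per-CPU tasks and
	 * sync wakeups straight to the local DSQ of the chosen CPU,
	 * trading some fairness for fewer queueing hops.
	 */
	if (!nvcsw_max_thresh &&
	    (p->nr_cpus_allowed == 1 || is_wake_sync(prev_cpu, wake_flags))) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice_ns, 0);
		return cpu;
	}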
/*
 * Update task vruntime charging the weighted used time slice.
 */
p->scx.dsq_vtime += scale_inverse_fair(p, tctx, slice);
Maybe this is a silly question, but if bpfland were to implement a yield handler, would the same vtime scaling need to be done there as well? (Trying to get the correct mental model of how bpfland scales task vtime.)
@hodgesds if I'm not mistaken, ops.stopping() is also called when a task calls sched_yield(), and since yield is implemented by setting the task's time slice to zero, it should follow the same pattern as a task that used all of its assigned time slice.
And since we measure the used time as the difference between the timestamp at ops.stopping() and the timestamp at ops.running(), we should get an accurate used time regardless of how much of p->scx.slice has been consumed.
Does it make sense? Am I missing something?
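For reference, the accounting described here boils down to something like this (a sketch with illustrative names; try_lookup_task_ctx() and last_run_at are placeholders, not the actual code):

void BPF_STRUCT_OPS(bpfland_running_sketch, struct task_struct *p)
{
	struct task_ctx *tctx = try_lookup_task_ctx(p);

	if (tctx)
		tctx->last_run_at = bpf_ktime_get_ns();
}

void BPF_STRUCT_OPS(bpfland_stopping_sketch, struct task_struct *p, bool runnable)
{
	struct task_ctx *tctx = try_lookup_task_ctx(p);
	u64 slice;

	if (!tctx)
		return;

	/*
	 * Charge the time actually used (stop - start), so the result is the
	 * same whether the task exhausted p->scx.slice or yielded early.
	 */
	slice = bpf_ktime_get_ns() - tctx->last_run_at;
	p->scx.dsq_vtime += scale_inverse_fair(p, tctx, slice);
}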
That makes sense, thanks for the explanation!
A set of improvements for scx_bpfland to enhance performance in server-oriented workloads.
When interactive task classification is disabled (-c 0), scx_bpfland can be suitable for more server-oriented scenarios, such as build servers. It can also serve as a "server profile" scheduler when used with tools like cachyos-kernel-manager in CachyOS.
With these changes and this "server profile" enabled, I'm able to speed up parallel kernel builds by ~2-3% (this seems to be consistent on both AMD and Intel systems).