In the following a short tutorial for creating a minimal scheduler written with sched_ext in C. This scheduler uses a global scheduling queue from which every CPU gets its tasks to run for a time slice. The scheduler order is First-In-First-Out. So it essentially implements a round-robin scheduler:
This short tutorial covers the basics; to learn more, visit the resources from the scx wiki.
We need a 6.12 kernel or a patched 6.11 kernel to build a custom scheduler. You can get a kernel patched with the scheduler extensions on Ubuntu 24.10 from here, or you can use CachyOS and install a patched kernel from there.
Furthermore, you also need
- a recent
clang
for compilation bpftool
for attaching the scheduler
On Ubuntu, for example, you can run: apt install clang linux-tools-common linux-tools-$(uname -r)
.
Nothing more is needed to run it, and you can find the code of this tutorial in the minimal-scheduler repository. I would advise to just cloning it:
git clone https://github.com/parttimenerd/minimal-scheduler
cd minimal-scheduler
The scheduler lives in the sched_ext.bpf.c
file, but before we take a look at it,
I want to show you how you can use this scheduler:
In short, we only need two steps:
# build the scheduler binary
./build.sh
# start the scheduler
sudo ./start.sh
# do something ...
# stop the scheduler
sudo ./stop.sh
I'll show you later what's in these scripts, but first, let's get to the scheduler code:
We assume that you have some experience writing eBPF programs. If not, Liz Rice's book Learning eBPF is a good starting point.
The scheduler code only depends on the Linux bpf kernel headers and sched-ext. So here is the code from sched_ext.bpf.c
:
// This header is autogenerated, as explained later
#include <vmlinux.h>
// The following two headers come from the Linux headers library
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
// Define a shared Dispatch Queue (DSQ) ID
// We use this as our global scheduling queue
#define SHARED_DSQ_ID 0
// Two macros that make the later code more readable
// and place the functions in the correct sections
// of the binary file
#define BPF_STRUCT_OPS(name, args...) \
SEC("struct_ops/"#name) BPF_PROG(name, ##args)
#define BPF_STRUCT_OPS_SLEEPABLE(name, args...) \
SEC("struct_ops.s/"#name) \
BPF_PROG(name, ##args)
// Initialize the scheduler by creating a shared dispatch queue (DSQ)
s32 BPF_STRUCT_OPS_SLEEPABLE(sched_init) {
// All scx_ functions come from vmlinux.h
return scx_bpf_create_dsq(SHARED_DSQ_ID, -1);
}
// Enqueue a task to the shared DSQ that wants to run,
// dispatching it with a time slice
int BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags) {
// Calculate the time slice for the task based on the number of tasks in the queue
// This makes the system slightly more responsive than a basic round-robin
// scheduler, which assigns every task the same time slice all the time
// The base time slice is 5_000_000ns or 5ms
u64 slice = 5000000u / scx_bpf_dsq_nr_queued(SHARED_DSQ_ID);
scx_bpf_dispatch(p, SHARED_DSQ_ID, slice, enq_flags);
return 0;
}
// Dispatch a task from the shared DSQ to a CPU,
// whenever a CPU needs something to run, usually after it is finished
// running the previous task for the allotted time slice
int BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev) {
scx_bpf_consume(SHARED_DSQ_ID);
return 0;
}
// Define the main scheduler operations structure (sched_ops)
SEC(".struct_ops.link")
struct sched_ext_ops sched_ops = {
.enqueue = (void *)sched_enqueue,
.dispatch = (void *)sched_dispatch,
.init = (void *)sched_init,
// There are more functions available, but we'll focus
// on the important ones for a minimal scheduler
.flags = SCX_OPS_ENQ_LAST | SCX_OPS_KEEP_BUILTIN_IDLE,
// A name that will appear in
// /sys/kernel/sched_ext/root/ops
// after we attached the scheduler
// The name has to be a valid C identifier
.name = "minimal_scheduler"
};
// All schedulers have to be GPLv2 licensed
char _license[] SEC("license") = "GPL";
We can visualize the interaction of all functions in the scheduler with the following diagram:
Now, after you've seen the code,
run the build.sh
script to generate the vmlinux.h
BPF header
and then compile the scheduler code to BPF bytecode:
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
clang -target bpf -g -O2 -c sched_ext.bpf.c -o sched_ext.bpf.o -I.
Then run the start.sh
script as a root user to attach the scheduler using the bpftool
:
bpftool struct_ops register sched_ext.bpf.o /sys/fs/bpf/sched_ext
The custom scheduler is now the scheduler of this system. You can check this
by accessing the /sys/kernel/sched_ext/root/ops
file:
> cat /sys/kernel/sched_ext/root/ops
minimal_scheduler
And by checking dmesg | tail
:
> sudo dmesg | tail
# ...
[32490.366637] sched_ext: BPF scheduler "minimal_scheduler" enabled
Play around with your system and see how it behaves.
If you're done, you can detach the scheduler by running the stop.sh
script
using root privileges. This removes the /sys/fs/bpf/sched_ext/sched_ops
file.
Now that you know what a basic scheduler looks like, you can start modifying it. Here are a few suggestions:
How does your system behave when you increase or decrease the time slice?
For example, try a time slice of 1s. Do you see any difference in how your cursor moves? Or try a small time slice of 100us and run a program like that that does some computation, do you see a difference in its performance?
How does your system behave when you change the scheduler to use the same time slice, ignoring the number of enqueued processes?
Try, for example, to create load on your system and see how it behaves.
How does your system behave if the scheduler only schedules to specific CPUs?
Try, for example, to make your system effectively single-core by only consuming tasks
on CPU 0 in sched_dispatch
(Hint: the cpu
parameter is the CPU id).
How does your system behave with multiple scheduling queues for different CPUs and processes?
Try, for example, to create two scheduling queues, with one scheduling queue only
for a process with a specific id (Hint: task_struct#tgid
gives you the process id)
which is scheduled on half of your CPUs.
Look into the linux/sched.h header to learn more about task_struct
.
If you already know how to write basic eBPF programs,
use bpf_trace_printk
and the running
and stopping
hooks.
The running
hook is called whenever a task starts running on a CPU; get the current CPU id via smp_processor_id()
:
int BPF_STRUCT_OPS(sched_running, struct task_struct *p) {
// ...
return 0; // there are no void functions in eBPF
}
The stopping
hook is called whenever a task stops running:
int BPF_STRUCT_OPS(sched_stopping, struct task_struct *p, bool runnable) {
// ...
return 0;
}
You can use this to create visualizations of the scheduling order.
To do even more, you can look at the collected resources in the scx wiki, especially the well-documented sched-ext code in the kernel.
If you're interested in how to use it in Rust, take a look at scx and scx_rust_scheduler, and for Java at hello-ebpf.
GPLv2