sched-sim is a "microbenchmark" for scheduling. It is designed both to be used as an actual benchmark (to measure the effects of schedulers on artificial workloads in a repeatable manner) and to provide "background competition" for testing the effect of schedulers on real-world workloads or other benchmarks.
The basic problem with testing schedulers is that schedulers are only really needed when there's not enough cpu to go around. So to really test the effectiveness of a scheduler, you need to see how workloads compete with each other.
But normal workloads -- even benchmarks -- typically vary how much cpu they use over time; which means that when you run several workloads together, how they happen to align can make a dramatic difference in how they end up performing -- much more so than the inherent performance of the scheduler itself.
Moreover, different aspects of the workload can get lost in the noise, making it difficult to see how changes in the scheduler affect a single aspect of scheduling.
The basic idea of schedbench is to have artificial workloads whose cpu utilization properties can be parametrized to isolate specific aspects of workloads, and which are constant over time; and then to have a controller which will start up a number of them, collect their performance results, and report on those results.
The Xen version is compiled against rumpkernels. This works fairly well in general, but the NetBSD kernel's nanosleep system call quantizes sleep times to the system timer tick rate, which defaults to 100Hz; meaning the shortest amount of time you can sleep is 10ms -- far too long for the kinds of testing we want to do. The rumpkernel guys were kind enough to give me a hack to override the NetBSD system call with one which will actually sleep for the exact time requested, without quantization.
So to build, you need to download my branch:
https://github.com/gwd/rumprun out/nanosleep-fix/v1
Then see the Rumprun Build Tutorial for building rumprun.
Then after building, add the rumprun binary directory to your path, thus:
export PATH="$PATH:/path/to/rumprun.git/rumprun/bin/"
NB also that the rumprun build system seems to build things in different sub-directories based on the branch name you're on (if the branch is not `master`); so you may want to add a symbolic link from `rumprun.git/rumprun` to whatever directory it ends up making.
The xen code also builds against libxl. At the moment this is hardcoded to the directory where I have libxl built on my dev box. On your system you can either edit `Makefile/controller`, putting your built xen path in `XENLIB_PATH`, or remove both `XENLIB_PATH` and the two runes which reference it (leaving the library names intact).
To use `schedbench`, first copy the included `sample.bench` and modify it as appropriate. Then run the following four commands on your Xen host in order:

- `schedbench [-t template] [-f filename] plan`: Initialize a "plan" for the benchmark in the benchmark file (default: `test.bench`). If `template` is given, then a new plan will be made in `filename` which is identical to the one found in `template`.
- `schedbench [-f filename] run`: Run the runs in the benchmark file which haven't been completed yet.
- `schedbench [-f filename] [-v N] report`: Collate the data and give a text report to stdout with verbosity `N`.
- `schedbench [-f filename] htmlreport`: Collate the data into a self-contained html document to stdout.
`schedbench` is compiled statically, so the report / plan side should run even on a system that doesn't have libxl installed (such as, perhaps, your dev box).
This is `sample.bench`:
{
"Input": {
"WorkerPresets": {
"A": { "Args": [ "burnwait", "70", "200000" ] },
"B": { "Args": [ "burnwait", "10", "300000",
"burnwait", "20", "300000",
"burnwait", "10", "300000",
"burnwait", "10", "300000",
"burnwait", "10", "300000",
"burnwait", "10", "300000",
"burnwait", "30", "300000" ] }
},
"SimpleMatrix": {
"Schedulers": [
"credit",
"credit2"
],
"Workers": [ "A", "B" ],
"Count": [ 1, 2, 4, 8, 16 ],
"NumaDisable": [ true, false ]
}
},
"WorkerType": 1,
"RunConfig": {
"Pool": "schedbench",
"Cpus": [ 12, 13, 14, 15 ]
}
}
The `WorkerPresets` section defines two workers, `A` and `B`. See below for a description of the workers.
The `SimpleMatrix` section tells `schedbench` to make a plan that includes baseline runs for both `A` and `B`, and then runs that include `Count` numbers of both `A` and `B`. In the default case, it will run one `A` and one `B`, then two `A` workers and two `B` workers, then four `A` workers and four `B` workers, and so on.
The `Schedulers` list tells `SimpleMatrix` to add the listed schedulers into its matrix; i.e., run all the tests with `credit`, then run all the tests with `credit2`.
By default, libxl will do automatic NUMA placement for guests with no cpu affinity specified. Inside `SimpleMatrix`, you can also include a `NumaDisable` list, of the form `"NumaDisable": [ true, false ]`. For runs with `NumaDisable` set to `true`, the soft affinity of all workers will be set equal to the cpumask of the pool (thus disabling libxl's automatic NUMA placement). For `NumaDisable` set to `false` (the default), no soft affinity will be set.
`RunConfig` contains global configuration inherited by each run if none is given. If you specify a `Pool` name, it will try to run all the workers in that pool. If no name is given, it defaults to `Pool-0`. You can also specify `Cpus`, which is a list of cpus that should be in the target pool.
When `schedbench` runs each test, it will check to see if the specified `RunConfig` configuration items match the pool to run the VMs in. If everything matches, then it runs the test.
If things don't match, and `schedbench` is able to create the pool, then it creates the pool with the new parameters. `schedbench` is able to create a pool if the pool is not `Pool-0` and if the `Cpus` are specified.
If the parameters don't match, and `schedbench` is not able to create the pool, then it skips the test.
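To make the decision concrete, here is a minimal sketch of that logic in C. The helper names (`pool_matches`, `recreate_pool`, and so on) are hypothetical, for illustration only; they are not schedbench's actual code:

```c
#include <stdbool.h>
#include <string.h>

struct runconfig {
    const char *pool;   /* pool name; NULL means the default, Pool-0 */
    bool cpus_given;    /* whether Cpus was specified */
};

/* Assumed helpers, for illustration only. */
bool pool_matches(const struct runconfig *rc);
void recreate_pool(const struct runconfig *rc);
void run_test(void);
void skip_test(void);

void maybe_run(const struct runconfig *rc)
{
    const char *pool = rc->pool ? rc->pool : "Pool-0";

    if (pool_matches(rc)) {
        run_test();            /* everything matches: just run */
    } else if (strcmp(pool, "Pool-0") && rc->cpus_given) {
        recreate_pool(rc);     /* can only (re)create non-default pools,
                                  and only if Cpus is specified */
        run_test();
    } else {
        skip_test();           /* mismatch we can't fix: skip */
    }
}
```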
This allows you either to let `schedbench` manage all the cpupool operations (by specifying a `Pool` other than the default pool, and `Cpus`), or to use a pool but manage it yourself (by specifying `Pool` but not `Cpus`), or to simply use the default cpupool (by not specifying `Pool`).
Note that if you don't specify `Pool`, but you do specify `Cpus`, `schedbench` will check to make sure that the default pool contains the selected cpus, and skip the tests if not. Specifying `Cpus` when using the default cpupool is recommended, to make sure that you haven't forgotten anything.
This is definitely a work-in-progress. My initial goal is just to get a basic framework up to speed so that others can add to it.
For short-term work items, see [TODO.md](TODO.md).
The workers take a queue of "work" and do it. At the moment the only type of work is called "burnwait", which takes two parameters: kilo-ops and wait time in nanoseconds. Each burnwait, when run, will:
- "Burn" cpu by doing the specified number of memory operations on a page of memory
- Queue up another iteration of the current worker for NOW() + the specified wait time.
And you can specify several of these; work is done in a sequential non-preemptible fashion.
The key thing about #2 from the scheduling perspective is that adding itself to the queue happens after doing the specified amount of work. This means that if the guest is preempted (or delayed) while doing the work, it will take longer for the next bit of work to start.
Say that your burn time was 100us, and your sleep time was 100us. You start at t0us, burn for 100us (now at t100us), then set a timer for 100us and sleep; you wake up at t200us, do 100us of work (now at t300us), then set a timer for 100us and sleep.
Say now that after running for 50us, you get preempted for 100us. At t0us you start burning; at t50us you're interrupted; at t150us you start burning again; at t200us you finish burning and set a timer for 100us, waking up at t300us.
Having several "threads" going on mitigates this: the particular thread you interrupted is delayed, but the timers of the threads which have already completed aren't. So a worker configured to have a single 50us burn cycle will be much more sensitive to scheduling decisions than a worker configured with five 10us burn cycles.
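To make the mechanics concrete, here is a minimal single-threaded sketch of the burnwait loop in C, assuming a POSIX-ish environment; the names and the event-queue layout are illustrative, not schedbench's actual implementation:

```c
#include <stdint.h>
#include <time.h>

#define NSEC 1000000000ULL

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * NSEC + ts.tv_nsec;
}

struct burnwait {
    uint64_t kops;    /* work per iteration, in kilo-ops */
    uint64_t wait_ns; /* delay before the next iteration */
    uint64_t next;    /* absolute due time of the next iteration */
};

/* "Burn" cpu by doing kops * 1000 memory operations on one page. */
static void burn(volatile uint8_t *page, uint64_t kops)
{
    for (uint64_t i = 0; i < kops * 1000; i++)
        page[i % 4096]++;
}

static void run(struct burnwait *w, int n, volatile uint8_t *page)
{
    for (;;) {
        /* Pick the work item that is due soonest. */
        int next = 0;
        for (int i = 1; i < n; i++)
            if (w[i].next < w[next].next)
                next = i;

        /* Sleep until it is due, then run it to completion:
         * work is sequential and non-preemptible. */
        if (now_ns() < w[next].next) {
            struct timespec ts = { .tv_sec  = w[next].next / NSEC,
                                   .tv_nsec = w[next].next % NSEC };
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
        }
        burn(page, w[next].kops);

        /* The key point from the text: the next iteration is queued
         * only *after* the burn completes, so any preemption during
         * the burn pushes that thread's whole schedule back. */
        w[next].next = now_ns() + w[next].wait_ns;
    }
}
```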
Each benchmark does a range of 'runs'; each 'run' starts a fixed number of workers, collects their throughput metrics, and measures how much cpu they're getting. Workers report their total throughput about every second; actual throughput is measured by comparing consecutive cumulative reports, as described below.
I'm running this on kodo2, an Intel box with 2 sockets, 8 cores, and hyperthreading enabled (so 16 logical cpus). And I'm running the test in a cpupool with 4 threads, with dom0 in a separate pool.
Data is collected and reported in averages:

- Each worker reports cumulative time and cumulative operations once per second; the controller collects cumulative cpu utilization once per second
- This is used to calculate throughput (ops / second) and utilization (%) for individual report "windows"
- Maximum, minimum, and stddev of throughput are calculated over each worker's windows
- Average throughput for the whole run is collected for each worker
- For each type of worker, that average is then fed into stats across all workers -- an "average average", "max / min average", and "stdev average"
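As a rough sketch (again in C, with illustrative names rather than schedbench's actual code), the per-window and cross-worker calculations look something like this:

```c
#include <math.h>

/* Throughput for one report "window": the difference between two
 * consecutive cumulative reports, divided by the elapsed time. */
double window_tput(double ops_prev, double ops_cur,
                   double t_prev, double t_cur)
{
    return (ops_cur - ops_prev) / (t_cur - t_prev); /* kops/sec */
}

/* "Average average" and stdev across the per-worker run averages. */
void cross_worker_stats(const double *avg, int n,
                        double *avgavg, double *stdev)
{
    double sum = 0, sumsq = 0;
    for (int i = 0; i < n; i++) {
        sum += avg[i];
        sumsq += avg[i] * avg[i];
    }
    *avgavg = sum / n;
    double var = sumsq / n - *avgavg * *avgavg;
    *stdev = sqrt(var > 0 ? var : 0);
}
```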
A good scheduler will be consistent within a run -- it will have a small range between max and min and have a small standard deviation. It will also be consistent between workers -- workers of the same type should have similar averages; i.e., the (max-min) and stdev of averages between workers should be small.
Additionally, a scheduler should be "fair" between different kinds of workloads. Not all workloads will want to run 100% of the time; but a workload should get either 1) the maximum fair share of time it's entitled to, or 2) as much as it wants, whichever is less. How much it 'wants' we can tell by running it on an empty system.
So suppose a workload by itself wants 50%, and it's run on a 4-cpu system alongside 5 other workloads. Its "fair share" of cpu is 67% (4 cpus / 6 workloads = 0.67), so it should ideally get the full 50% it wants. If instead it's run alongside 9 other workloads, its "fair share" is 40% (4 / 10 = 0.4), so it should ideally get 40%.
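Expressed as a (hypothetical) helper function, the expected utilization is simply the lesser of what the workload wants and its fair share:

```c
/* Hypothetical helper, for illustration: expected utilization of a
 * workload under a fair scheduler.
 *   want:       fraction of one cpu it uses on an idle system
 *   ncpus:      cpus in the pool
 *   nworkloads: total competing workloads (including this one) */
double expected_util(double want, int ncpus, int nworkloads)
{
    double fair_share = (double)ncpus / nworkloads;
    return want < fair_share ? want : fair_share;
}

/* expected_util(0.50, 4,  6) == 0.50  (fair share 0.67 > want)
 * expected_util(0.50, 4, 10) == 0.40  (fair share 0.40 < want) */
```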
Each run will have a report that looks something like this:
== RUN 4a+4b ==
Set 0: kHZ 2261062 burnwait 70 200000
Set 1: kHZ 2261062 burnwait 10 300000 burnwait 20 300000 burnwait 10 300000 burnwait 10 300000 burnwait 10 300000 burnwait 10 300000 burnwait 30 300000
set ttotal tavgavg tstdev tavgmax tavgmin ttotmax ttotmin utotal uavgavg ustdev uavgmax uavgmin utotmax utotmin
0 297265.05 74316.26 199.38 74631.67 74134.55 76383.90 73778.25 1.99 0.50 0.00 0.50 0.49 0.50 0.44
1 301779.89 75444.97 852.14 76621.31 74213.47 84898.68 65949.73 1.96 0.49 0.01 0.50 0.48 0.52 0.39
The first line is the title -- "4a+4b" means there are 4 workers of type 'a' and 4 of type 'b'; so in a 4-cpu pool, this is 2x overcommitted, and the "fair share" of each workload will be 50%.
The second is the configuration of each worker. "Set 0" is worker 'a', which has a single "thread" that will burn for 70 kilo-ops, then wait for 200us (200,000 nanoseconds). "Set 1" is worker 'b', which has 7 threads, which burn for various amounts (10, 20, 10, 10, 10, 10, and 30 kops respectively) and each of which sleeps for 300us.
Next we have a slew of summary data:
- `set`: The set number
- `ttotal`: Throughput total for this set. This is the throughputs (i.e., kops/sec) of all the workers of the set added together.
- `tavgavg`: Average of the average throughputs of each individual worker in this set
- `tstdev`, `tavgmax`, `tavgmin`: Standard deviation, maximum, and minimum of average throughputs
- `ttotmax`, `ttotmin`: Maximum and minimum throughput seen in any "window" of any worker of this set
- `utotal`: Total average utilization of all workers in this set. This is the individual utilizations of all workers added together
- `uavgavg`, `ustdev`, `uavgmax`, `uavgmin`: Average, stdev, max, and min of the average utilization of all the workers in the set.
- `utotmax`, `utotmin`: Maximum and minimum utilization seen in any "window" of any worker in this set.
Scheduler performance in this case is not bad overall: the system is fully loaded (total utilization around 4.0); the aggregate throughput of each type of worker is very close to 300Mops/sec; and the range of averages is pretty tight.