blkcg: implement blk-iocost
This patchset implements an IO cost model based work-conserving
proportional controller.

While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others.  In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.

One challenge of controlling IO resources is the lack of trivially
observable cost metric.  The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern.  However, the cost isn't a complete mystery.  Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.

The function which determines the cost of a given IO is the IO cost
model for the device.  This controller distributes IO capacity based
on the costs estimated by such model.  The more accurate the cost
model the better but the controller adapts based on IO completion
latency and as long as the relative costs across different IO
patterns are consistent and sensible, it'll adapt to the actual
performance of the device.

Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well.  All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.

Please see the top comment in blk-iocost.c and documentation for
more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
htejun authored and axboe committed Aug 29, 2019
1 parent 6f816b4 commit 7caa471
Showing 7 changed files with 2,656 additions and 0 deletions.
94 changes: 94 additions & 0 deletions Documentation/admin-guide/cgroup-v2.rst
@@ -1435,6 +1435,100 @@ IO Interface Files
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021

io.cost.qos
A read-write nested-keyed file which exists only on the root
cgroup.

This file configures the Quality of Service of the IO cost
model based controller (CONFIG_BLK_CGROUP_IOCOST) which
currently implements "io.weight" proportional control. Lines
are keyed by $MAJ:$MIN device numbers and not ordered. The
line for a given device is populated on the first write for
the device on "io.cost.qos" or "io.cost.model". The following
nested keys are defined.

====== =====================================
enable Weight-based control enable
ctrl "auto" or "user"
rpct Read latency percentile [0, 100]
rlat Read latency threshold
wpct Write latency percentile [0, 100]
wlat Write latency threshold
min Minimum scaling percentage [1, 10000]
max Maximum scaling percentage [1, 10000]
====== =====================================

The controller is disabled by default and can be enabled by
setting "enable" to 1. "rpct" and "wpct" parameters default
to zero and the controller uses internal device saturation
state to adjust the overall IO rate between "min" and "max".

When a better control quality is needed, latency QoS
parameters can be configured. For example::

8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0

shows that on sdb, the controller is enabled, will consider
the device saturated if the 95th percentile of read completion
latencies is above 75ms or that of write completion latencies
is above 150ms, and will adjust the overall IO issue rate
between 50% and 150% accordingly.
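
As a rough illustration of the example line above, the following Python sketch parses a nested-keyed line and applies a percentile-based saturation check. It is a user-space sketch, not the kernel implementation: the function names, the nearest-rank percentile, and the sample latencies are all hypothetical, and the thresholds are assumed to be in microseconds because the text equates 75000 with 75ms.

```python
import math


def parse_nested_keyed_line(line):
    """Split '$MAJ:$MIN key=val ...' into ((major, minor), {key: val})."""
    dev, *pairs = line.split()
    major, minor = (int(part) for part in dev.split(":"))
    params = dict(pair.split("=", 1) for pair in pairs)
    return (major, minor), params


def percentile(samples, pct):
    """Nearest-rank percentile; an illustrative choice, not the kernel's method."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def is_saturated(params, read_lat_us, write_lat_us):
    """Hypothetical check: True if a configured latency percentile exceeds its threshold."""
    checks = (("rpct", "rlat", read_lat_us), ("wpct", "wlat", write_lat_us))
    for pct_key, lat_key, samples in checks:
        pct = float(params[pct_key])
        # A zero percentile means no latency target is configured for that direction.
        if pct > 0 and samples and percentile(samples, pct) > float(params[lat_key]):
            return True
    return False


dev, qos = parse_nested_keyed_line(
    "8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0"
)
```

With these assumptions, a read sample set whose 95th percentile is 80000 would count as saturated, while one whose 95th percentile stays at 1000 would not.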

The lower the saturation point, the better the latency QoS at
the cost of aggregate bandwidth. The narrower the allowed
adjustment range between "min" and "max", the more conformant
to the cost model the IO behavior. Note that the IO issue
base rate may be far off from 100% and setting "min" and "max"
blindly can lead to a significant loss of device capacity or
control quality. "min" and "max" are useful for regulating
devices which show wide temporary behavior changes - e.g. an
SSD which accepts writes at the line speed for a while and
then completely stalls for multiple seconds.

When "ctrl" is "auto", the parameters are controlled by the
kernel and may change automatically. Setting "ctrl" to "user"
or setting any of the percentile and latency parameters puts
it into "user" mode and disables the automatic changes. The
automatic mode can be restored by setting "ctrl" to "auto".

io.cost.model
A read-write nested-keyed file which exists only on the root
cgroup.

This file configures the cost model of the IO cost model based
controller (CONFIG_BLK_CGROUP_IOCOST) which currently
implements "io.weight" proportional control. Lines are keyed
by $MAJ:$MIN device numbers and not ordered. The line for a
given device is populated on the first write for the device on
"io.cost.qos" or "io.cost.model". The following nested keys
are defined.

===== ================================
ctrl "auto" or "user"
model The cost model in use - "linear"
===== ================================

When "ctrl" is "auto", the kernel may change all parameters
dynamically. When "ctrl" is set to "user" or any other
parameters are written to, "ctrl" becomes "user" and the
automatic changes are disabled.

When "model" is "linear", the following model parameters are
defined.

============= ========================================
[r|w]bps The maximum sequential IO throughput
[r|w]seqiops The maximum 4k sequential IOs per second
[r|w]randiops The maximum 4k random IOs per second
============= ========================================

From the above, the builtin linear model determines the base
costs of a sequential and random IO and the cost coefficient
for the IO size. While simple, this model can cover most
common device classes acceptably.
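
One plausible reading of how the three parameters yield a per-size coefficient and two base costs can be sketched in Python. This is an assumption-laden illustration, not the code in blk-iocost.c: costs are expressed in seconds of device time, and the names `linear_coefs` and `io_cost` as well as the parameter values are hypothetical.

```python
PAGE_BYTES = 4096  # the "4k" unit used by the seqiops/randiops parameters


def linear_coefs(bps, seqiops, randiops):
    """Derive (page_cost, seq_base, rand_base) in seconds of device time."""
    # Cost attributable purely to transferring one 4k page at full throughput.
    page_cost = PAGE_BYTES / bps
    # Whatever a 4k IO costs beyond its transfer time is treated as the base cost.
    seq_base = max(1.0 / seqiops - page_cost, 0.0)
    rand_base = max(1.0 / randiops - page_cost, 0.0)
    return page_cost, seq_base, rand_base


def io_cost(size_bytes, is_random, coefs):
    """Linear model: base cost by access pattern plus a size-proportional term."""
    page_cost, seq_base, rand_base = coefs
    base = rand_base if is_random else seq_base
    return base + (size_bytes / PAGE_BYTES) * page_cost


# Made-up device: 1000 pages/s of throughput, 500 sequential and 100 random 4k IOPS.
coefs = linear_coefs(bps=4096 * 1000, seqiops=500, randiops=100)
```

Under this reading a 4k random IO costs exactly 1/randiops, so issuing them back to back reproduces the configured maximum random IOPS.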

The IO cost model isn't expected to be accurate in an absolute
sense and is scaled to the device behavior dynamically.

io.weight
A read-write flat-keyed file which exists on non-root cgroups.
The default is "default 100".
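
To illustrate what a share of the overall weight distribution could mean for a nested cgroup, here is a minimal Python sketch. It assumes the share is the product of weight over sibling-weight-sum at each level from the top down; the function name and the input layout are hypothetical, and the zero-sum guard is only an illustration of the kind of case the v2 note mentions, not the kernel's fix.

```python
def hierarchical_share(path):
    """path: (own_weight, sum_of_sibling_weights_including_own) per level, top to leaf."""
    share = 1.0
    for weight, level_sum in path:
        if level_sum <= 0:
            # No weight in use at this level: avoid dividing by zero.
            return 0.0
        share *= weight / level_sum
    return share


# A cgroup with weight 50 among siblings summing to 100, whose parent has
# weight 100 among siblings summing to 300.
share = hierarchical_share([(100, 300), (50, 100)])
```

In this made-up hierarchy the leaf would be entitled to one sixth of the device's IO capacity.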
10 changes: 10 additions & 0 deletions block/Kconfig
@@ -135,6 +135,16 @@ config BLK_CGROUP_IOLATENCY

Note, this is an experimental interface and could be changed someday.

config BLK_CGROUP_IOCOST
bool "Enable support for cost model based cgroup IO controller"
depends on BLK_CGROUP=y
select BLK_RQ_ALLOC_TIME
---help---
Enabling this option enables the .weight interface for cost
model based proportional IO control. The IO controller
distributes IO capacity between different groups based on
their share of the overall weight distribution.

config BLK_WBT_MQ
bool "Multiqueue writeback throttling"
default y
1 change: 1 addition & 0 deletions block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o