Reduce latency effects of non-interactive I/O.

While investigating the influence of scrub (especially sequential scrub)
on random read latency, I noticed that on some HDDs a single 4KB read
may take up to 4 seconds!  Deeper investigation showed that many HDDs
heavily prioritize sequential reads even when those are submitted with
a queue depth of 1.

This patch addresses the latency from two sides:
 - by using _min_active queue depths for non-interactive requests while
   interactive request(s) are active, and for a few requests after;
 - by throttling them further if no interactive request has completed
   while a configured number of non-interactive requests did.
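
The two mechanisms above can be sketched as a small standalone model.
This is only an illustration, assuming the accounting shown in the
vdev_queue.c hunks of this commit; the sketch_* names and the two
static variables are hypothetical stand-ins, not the real ZFS symbols.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for zfs_vdev_nia_delay and zfs_vdev_nia_credit. */
static int nia_delay = 5;
static int nia_credit = 5;

struct sketch_queue {
	int ia_active;	/* interactive I/Os currently in flight */
	int credit;	/* non-interactive I/O credit */
};

static int
imin(int a, int b)
{
	return (a < b ? a : b);
}

/* Depth limit for one non-interactive class (scrub, removal, ...). */
static int
sketch_nia_max_active(const struct sketch_queue *q, int min_active,
    int max_active)
{
	if (q->ia_active > 0)		/* interactive I/O in flight */
		return (imin(q->credit, min_active));
	if (q->credit < nia_delay)	/* interactive I/O may return soon */
		return (min_active);
	return (max_active);
}

/* Called when an I/O is sent to the device. */
static void
sketch_issue(struct sketch_queue *q, bool interactive)
{
	if (interactive) {
		if (q->ia_active++ == 0)
			q->credit = 1;
	} else if (q->ia_active > 0) {
		q->credit--;
	}
}

/* Called when an I/O completes. */
static void
sketch_complete(struct sketch_queue *q, bool interactive)
{
	if (interactive) {
		if (--q->ia_active == 0)
			q->credit = 0;
		else
			q->credit = nia_credit;
	} else if (q->ia_active == 0) {
		q->credit++;
	}
}
```

In this model the credit counts down while interactive I/O is in
flight and counts up while the device is serving only non-interactive
I/O, so the full _max_active depth is reached only after nia_delay
non-interactive completions with nothing interactive outstanding.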

While there, I've also modified vdev_queue_class_to_issue() to give
more chances to schedule at least _min_active requests to the lowest
priorities.  This should reduce starvation if several non-interactive
processes are running at the same time as some interactive ones, and I
think it should make it possible to set zfs_vdev_max_active as low as 1.
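
The scheduling change can be pictured as a round-robin scan that
starts just after the class served last, instead of always starting
from the highest priority.  The following is a minimal sketch of that
idea, assuming a fixed number of classes; struct rr_queue,
NUM_CLASSES and rr_pick_below_min() are hypothetical names used only
for illustration.

```c
#define	NUM_CLASSES	4	/* hypothetical number of priority classes */

struct rr_queue {
	int last;			/* class that was issued last */
	int queued[NUM_CLASSES];	/* I/Os waiting in each class */
	int active[NUM_CLASSES];	/* I/Os in flight in each class */
	int min_active[NUM_CLASSES];	/* per-class minimum depth */
};

/*
 * Pick a class that has queued I/O and is still below its minimum
 * depth.  The scan starts after the last issued class and wraps
 * around.  Returns NUM_CLASSES if no class qualifies.
 */
static int
rr_pick_below_min(struct rr_queue *q)
{
	int p = (q->last == NUM_CLASSES - 1) ? 0 : q->last + 1;

	for (int n = NUM_CLASSES; n > 0; n--) {
		if (q->queued[p] > 0 && q->active[p] < q->min_active[p]) {
			q->last = p;
			return (p);
		}
		p = (p == NUM_CLASSES - 1) ? 0 : p + 1;
	}
	return (NUM_CLASSES);
}
```

Because the starting point rotates, a low-priority class that is
below its minimum gets a turn even when the total queue depth is
small and higher classes always have work queued.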

I've benchmarked this change with 4KB random reads from a ZVOL with 16KB
block size on a newly written non-fragmented pool.  On a fragmented pool
I also saw improvements, though less dramatic.  Below are log2 histograms
of the random read latency in milliseconds for different devices:

4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before:
0, 0, 2,  1,  12,  21,  19,  18, 10, 15, 17, 21
after:
0, 0, 0, 24, 101, 195, 419, 250, 47,  4,  0,  0
i.e. a maximum latency reduction from 2s to 500ms.

4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before:
0, 0,  2,  31,  38,  28,  18,  12, 17, 20, 24, 10, 3
after:
0, 0, 55, 247, 455, 470, 412, 181, 36,  0,  0,  0, 0
i.e. from 4s to 250ms.

1 SAS HDD SEAGATE ST14000NM0048 before:
0,  0,  29,   70, 107,   45,  27, 1, 0, 0, 1, 4, 19
after:
1, 29, 681, 1261, 676, 1633,  67, 1, 0, 0, 0, 0,  0
i.e. from 4s to 125ms.

1 SAS SSD SEAGATE XS3840TE70014 before (microseconds):
0, 0, 0, 0, 0, 0, 0, 0,  70, 18343, 82548, 618
after:
0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844,  90
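
The maximum latencies quoted for the HDDs can be read off the
histograms mechanically.  The helper below is a sketch under one
assumption that is not stated in the text: that bucket i (counting
from 0) tops out at 2^i units.  This reading is consistent with the
2s/500ms, 4s/250ms and 4s/125ms figures above.  hist_max_latency() is
a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the upper bound of the highest non-empty bucket, assuming
 * bucket i covers latencies up to 2^i units (milliseconds for the
 * HDD rows, microseconds for the SSD row).  Returns 0 if every
 * bucket is empty.
 */
static uint64_t
hist_max_latency(const unsigned *buckets, size_t n)
{
	for (size_t i = n; i > 0; i--) {
		if (buckets[i - 1] != 0)
			return ((uint64_t)1 << (i - 1));
	}
	return (0);
}
```

For the WD20EFRX rows this gives 2048 before and 512 after, matching
the quoted reduction from about 2s to about 500ms.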

I've also measured scrub time during the test and on idle pools.  On an
idle fragmented pool scrub got a few percent faster due to the use of
QD3 instead of the previous QD2.  On an idle non-fragmented pool I
measured no difference.  On a busy non-fragmented pool scrub time
increased by about 1.5-1.7x, while the IOPS increase reached 5-9x.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
amotin committed Nov 6, 2020
1 parent 52e585a commit 2cddbdb
Showing 4 changed files with 113 additions and 20 deletions.
3 changes: 3 additions & 0 deletions include/sys/vdev_impl.h
@@ -148,6 +148,9 @@ struct vdev_queue {
avl_tree_t vq_write_offset_tree;
avl_tree_t vq_trim_offset_tree;
uint64_t vq_last_offset;
zio_priority_t vq_last_prio; /* Last sent I/O priority. */
int32_t vq_ia_active; /* Active interactive I/Os. */
int32_t vq_nia_credit; /* Non-interactive I/Os credit. */
hrtime_t vq_io_complete_ts; /* time last i/o completed */
hrtime_t vq_io_delta_ts;
zio_t vq_io_search; /* used as local for stack reduction */
4 changes: 3 additions & 1 deletion include/sys/zio_priority.h
@@ -27,15 +27,17 @@ typedef enum zio_priority {
ZIO_PRIORITY_SYNC_WRITE, /* ZIL */
ZIO_PRIORITY_ASYNC_READ, /* prefetch */
ZIO_PRIORITY_ASYNC_WRITE, /* spa_sync() */
ZIO_PRIORITY_TRIM, /* trim I/O (discard) */
ZIO_PRIORITY_SCRUB, /* asynchronous scrub/resilver reads */
ZIO_PRIORITY_REMOVAL, /* reads/writes for vdev removal */
ZIO_PRIORITY_INITIALIZING, /* initializing I/O */
ZIO_PRIORITY_TRIM, /* trim I/O (discard) */
ZIO_PRIORITY_REBUILD, /* reads/writes for vdev rebuild */
ZIO_PRIORITY_NUM_QUEUEABLE,
ZIO_PRIORITY_NOW, /* non-queued i/os (e.g. free) */
} zio_priority_t;

#define ZIO_PRIORITY_MAX_INTERACTIVE ZIO_PRIORITY_TRIM

#ifdef __cplusplus
}
#endif
32 changes: 32 additions & 0 deletions man/man5/zfs-module-parameters.5
@@ -2165,6 +2165,38 @@ See the section "ZFS I/O SCHEDULER".
Default value: \fB1\fR.
.RE

.sp
.ne 2
.na
\fBzfs_vdev_nia_delay\fR (int)
.ad
.RS 12n
To reduce effects of non-interactive I/O on interactive I/O latency
the first are limited to *_min_active while there are second active,
plus at least this number of I/Os after in case interactive return.
See the section "ZFS I/O SCHEDULER".
.sp
Default value: \fB5\fR.
.RE

.sp
.ne 2
.na
\fBzfs_vdev_nia_credit\fR (int)
.ad
.RS 12n
Some HDDs tend to prioritize sequential I/O so high, that concurrent
random I/O latency reaches several seconds. On some HDDs it happens
even if sequential I/Os are submitted one at a time, and so setting
*_max_active to 1 does not help. To handle this in case of scrub
and other non-interactive I/O this tunable limits the number of their
I/Os that can be sent until at least one interactive I/O completes
without the enforced wait, making the HDD to stop the spree.
See the section "ZFS I/O SCHEDULER".
.sp
Default value: \fB5\fR.
.RE

.sp
.ne 2
.na
94 changes: 75 additions & 19 deletions module/zfs/vdev_queue.c
@@ -151,7 +151,7 @@ uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 2;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;
uint32_t zfs_vdev_scrub_max_active = 3;
uint32_t zfs_vdev_removal_min_active = 1;
uint32_t zfs_vdev_removal_max_active = 2;
uint32_t zfs_vdev_initializing_min_active = 1;
@@ -171,6 +171,24 @@ uint32_t zfs_vdev_rebuild_max_active = 3;
int zfs_vdev_async_write_active_min_dirty_percent = 30;
int zfs_vdev_async_write_active_max_dirty_percent = 60;

/*
* To reduce effects of non-interactive I/O on interactive I/O latency
* the first are limited to *_min_active while there are second active,
* plus at least this number of I/Os after in case interactive return.
*/
int zfs_vdev_nia_delay = 5;

/*
* Some HDDs tend to prioritize sequential I/O so high, that concurrent
* random I/O latency reaches several seconds. On some HDDs it happens
* even if sequential I/Os are submitted one at a time, and so setting
* *_max_active to 1 does not help. To handle this in case of scrub
* and other non-interactive I/O this tunable limits the number of their
* I/Os that can be sent until at least one interactive I/O completes
* without the enforced wait, making the HDD to stop the spree.
*/
int zfs_vdev_nia_credit = 5;

/*
* To reduce IOPs, we aggregate small adjacent I/Os into one large I/O.
* For read I/Os, we also aggregate across small adjacency gaps; for writes
@@ -261,7 +279,7 @@ vdev_queue_timestamp_compare(const void *x1, const void *x2)
}

static int
vdev_queue_class_min_active(zio_priority_t p)
vdev_queue_class_min_active(vdev_queue_t *vq, zio_priority_t p)
{
switch (p) {
case ZIO_PRIORITY_SYNC_READ:
@@ -272,16 +290,22 @@ vdev_queue_class_min_active(zio_priority_t p)
return (zfs_vdev_async_read_min_active);
case ZIO_PRIORITY_ASYNC_WRITE:
return (zfs_vdev_async_write_min_active);
case ZIO_PRIORITY_TRIM:
return (zfs_vdev_trim_min_active);
case ZIO_PRIORITY_SCRUB:
return (zfs_vdev_scrub_min_active);
#define M(X) if (vq->vq_ia_active > 0) { \
return (MIN(vq->vq_nia_credit, \
zfs_vdev_##X##_min_active)); \
} \
return (zfs_vdev_##X##_min_active)
M(scrub);
case ZIO_PRIORITY_REMOVAL:
return (zfs_vdev_removal_min_active);
M(removal);
case ZIO_PRIORITY_INITIALIZING:
return (zfs_vdev_initializing_min_active);
case ZIO_PRIORITY_TRIM:
return (zfs_vdev_trim_min_active);
M(initializing);
case ZIO_PRIORITY_REBUILD:
return (zfs_vdev_rebuild_min_active);
M(rebuild);
#undef M
default:
panic("invalid priority %u", p);
return (0);
@@ -337,7 +361,7 @@ vdev_queue_max_async_writes(spa_t *spa)
}

static int
vdev_queue_class_max_active(spa_t *spa, zio_priority_t p)
vdev_queue_class_max_active(spa_t *spa, vdev_queue_t *vq, zio_priority_t p)
{
switch (p) {
case ZIO_PRIORITY_SYNC_READ:
@@ -348,16 +372,23 @@ vdev_queue_class_max_active(spa_t *spa, zio_priority_t p)
return (zfs_vdev_async_read_max_active);
case ZIO_PRIORITY_ASYNC_WRITE:
return (vdev_queue_max_async_writes(spa));
case ZIO_PRIORITY_TRIM:
return (zfs_vdev_trim_max_active);
case ZIO_PRIORITY_SCRUB:
return (zfs_vdev_scrub_max_active);
#define M(X) if (vq->vq_ia_active > 0) { \
return (MIN(vq->vq_nia_credit, \
zfs_vdev_##X##_min_active)); \
} else if (vq->vq_nia_credit < zfs_vdev_nia_delay) \
return (zfs_vdev_##X##_min_active); \
return (zfs_vdev_##X##_max_active);
M(scrub);
case ZIO_PRIORITY_REMOVAL:
return (zfs_vdev_removal_max_active);
M(removal);
case ZIO_PRIORITY_INITIALIZING:
return (zfs_vdev_initializing_max_active);
case ZIO_PRIORITY_TRIM:
return (zfs_vdev_trim_max_active);
M(initializing);
case ZIO_PRIORITY_REBUILD:
return (zfs_vdev_rebuild_max_active);
M(rebuild);
#undef M
default:
panic("invalid priority %u", p);
return (0);
@@ -372,17 +403,22 @@ static zio_priority_t
vdev_queue_class_to_issue(vdev_queue_t *vq)
{
spa_t *spa = vq->vq_vdev->vdev_spa;
zio_priority_t p;
zio_priority_t p, n;

if (avl_numnodes(&vq->vq_active_tree) >= zfs_vdev_max_active)
return (ZIO_PRIORITY_NUM_QUEUEABLE);

/* find a queue that has not reached its minimum # outstanding i/os */
for (p = 0; p < ZIO_PRIORITY_NUM_QUEUEABLE; p++) {
p = vq->vq_last_prio;
p = (p == ZIO_PRIORITY_NUM_QUEUEABLE - 1) ? 0 : p + 1;
for (n = ZIO_PRIORITY_NUM_QUEUEABLE; n > 0; n--) {
if (avl_numnodes(vdev_queue_class_tree(vq, p)) > 0 &&
vq->vq_class[p].vqc_active <
vdev_queue_class_min_active(p))
vdev_queue_class_min_active(vq, p)) {
vq->vq_last_prio = p;
return (p);
}
p = (p == ZIO_PRIORITY_NUM_QUEUEABLE - 1) ? 0 : p + 1;
}

/*
@@ -392,8 +428,10 @@ vdev_queue_class_to_issue(vdev_queue_t *vq)
for (p = 0; p < ZIO_PRIORITY_NUM_QUEUEABLE; p++) {
if (avl_numnodes(vdev_queue_class_tree(vq, p)) > 0 &&
vq->vq_class[p].vqc_active <
vdev_queue_class_max_active(spa, p))
vdev_queue_class_max_active(spa, vq, p)) {
vq->vq_last_prio = p;
return (p);
}
}

/* No eligible queued i/os */
@@ -502,6 +540,11 @@ vdev_queue_pending_add(vdev_queue_t *vq, zio_t *zio)
ASSERT(MUTEX_HELD(&vq->vq_lock));
ASSERT3U(zio->io_priority, <, ZIO_PRIORITY_NUM_QUEUEABLE);
vq->vq_class[zio->io_priority].vqc_active++;
if (zio->io_priority <= ZIO_PRIORITY_MAX_INTERACTIVE) {
if (vq->vq_ia_active++ == 0)
vq->vq_nia_credit = 1;
} else if (vq->vq_ia_active > 0)
vq->vq_nia_credit--;
avl_add(&vq->vq_active_tree, zio);

if (shk->kstat != NULL) {
@@ -520,6 +563,13 @@ vdev_queue_pending_remove(vdev_queue_t *vq, zio_t *zio)
ASSERT(MUTEX_HELD(&vq->vq_lock));
ASSERT3U(zio->io_priority, <, ZIO_PRIORITY_NUM_QUEUEABLE);
vq->vq_class[zio->io_priority].vqc_active--;
if (zio->io_priority <= ZIO_PRIORITY_MAX_INTERACTIVE) {
if (--vq->vq_ia_active == 0)
vq->vq_nia_credit = 0;
else
vq->vq_nia_credit = zfs_vdev_nia_credit;
} else if (vq->vq_ia_active == 0)
vq->vq_nia_credit++;
avl_remove(&vq->vq_active_tree, zio);

if (shk->kstat != NULL) {
@@ -1065,6 +1115,12 @@ ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, rebuild_max_active, INT, ZMOD_RW,
ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, rebuild_min_active, INT, ZMOD_RW,
"Min active rebuild I/Os per vdev");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, nia_credit, INT, ZMOD_RW,
"Number of non-interactive I/Os to allow in sequence");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, nia_delay, INT, ZMOD_RW,
"Number of non-interactive I/Os before _max_active");

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, queue_depth_pct, INT, ZMOD_RW,
"Queue depth percentage for each top-level vdev");
/* END CSTYLED */
