Scrub heavily impacting application IO performance #10253
After some more investigation, I found that the very low scrub performance was not directly related to the new zfs scan mode, but due to the interaction of
If the first point was my fault (well, I set it as
Let's see the output of
While I really like the new scrub/resilver performance, I think we need an "escape hatch" to throttle scrubbing when application IO should be affected as little as possible. |
An update: I considered restoring some form of delay, taking it from the 0.7.x branch. However, I found that limiting
Finally, a scrub can be stopped/paused during work hours. @behlendorf feel free to close the ticket. I am not closing it now only because I don't know if you (or other maintainers) want to track the problem described above. Thanks. |
@behlendorf @ahrens (I do not remember who contributed the sequential scrub code, please feel free to add the right person) I would like to add another datapoint. Short summary:
Please note how the HDDs are overwhelmed by pending ZFS scrub requests: while the scrub itself is very fast, it completely saturates the HDDs, with very bad resulting performance for the running VMs. Setting
Any idea on what can be done to further decrease scrub load? |
Well, I did an interesting discovery: setting
I got curious and tested a disk (WD Gold 2 TB) in isolation. I can replicate the issue by concurrently running the following two
While the first
As a side note, an older WD Green did not show any issue. I am leaving this issue open for some days only because I don't know if someone wants to comment and/or share other relevant experiences. Anyway, feel free to close it. Thanks. |
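For reference, a generic way to exercise this kind of mixed workload against a raw disk (illustrative only, not necessarily the exact commands used above; /dev/sdX is a placeholder, and both jobs are read-only):
# large sequential reads, similar to what a sorted scrub issues
fio --name=seq --filename=/dev/sdX --direct=1 --rw=read --bs=1M --runtime=60 --time_based &
# concurrent small random reads, standing in for application IO
fio --name=rand --filename=/dev/sdX --direct=1 --rw=randread --bs=4k --runtime=60 --time_based
On the affected drives, the random job's latency is reported to collapse unless queue_depth is set to 1.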
That's really interesting. It definitely sounds like an issue with the WD Gold drives, and it's not something I would have expected from an enterprise branded drive. You might want to check if there's a firmware update available. Thanks for posting a solution for anyone else who may encounter this. |
I was able to reproduce this on a different model of Western Digital hard drives: WD Red 10 TB (WD100EFAX). I am using 6 of these drives in a zpool made of 3 mirrors. See #10535. My experience closely matches @shodanshok's: following the steps to reproduce in his original post, with default settings (
Now, if only Western Digital could fix their firmware… stalling all random reads when sequential reads are in flight sounds pretty bad. One can easily imagine such behaviour causing problems, with production services becoming unresponsive just because some random user decided to scan the contents of a file. |
Is it time for the ZFS wiki or related documentation to make a "known bad" list of drives/firmwares that have been definitively identified as interacting badly with ZFS? |
@gdevenyi Rather than a list (which will become outdated pretty fast), I suggest inserting a note in the hardware/performance page stating that if excessive performance degradation is observed during scrub, disabling NCQ is a possible workaround (maybe even linking to this issue). |
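For anyone needing the workaround right away, a minimal sketch of the one-off form (sdX is a placeholder for the affected drive; setting the queue depth to 1 effectively disables NCQ for that device until reboot):
echo 1 > /sys/block/sdX/device/queue_depth
The udev rule posted further down makes the same change persistent across reboots.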
I was actually planning to add a "Queue depth" section on the Performance tuning OpenZFS wiki page to describe this problem, but that page doesn't seem to have open edit access. |
…and for reference, I used the following udev rule to automatically apply the workaround to all my affected disks:
|
On Jul 11, 2020, at 3:38 AM, Etienne Dechamps ***@***.***> wrote:
…and for reference, I used the following udev rule to automatically apply the workaround to all my affected disks:
DRIVER=="sd", ATTR{model}=="WDC WD100EFAX-68", ATTR{queue_depth}="1"
This has long been a behaviour seen by HDDs, with some firmware better than others.
You might find queue_depth=2 works better, but higher queue depths are worse. For
some background, see
http://blog.richardelling.com/2012/03/iops-and-latency-are-not-related-hdd.html
-- richard
|
@richardelling Unfortunately, for the specific case of WD Gold disks (and I suppose @dechamps WD Red too), using anything over 1 causes the read starvation issue described above. |
with WD Gold disks, disabling the disk scheduler (using |
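Presumably this refers to switching the device to the none scheduler; a minimal sketch (sdX is again a placeholder):
echo none > /sys/block/sdX/queue/scheduler
cat /sys/block/sdX/queue/scheduler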
To clarify, in my case, |
@misterbigstuff In my case, using |
notably i'm using the SATA revision of these devices, which have different firmware from the SAS counterpart. |
@misterbigstuff Interesting: I also have multiple SATA WD Gold drives, but these drives show the described issue unless I set
Can you share your disk model/firmware version? Did you try the reproducer involving concurrently running these two
|
@misterbigstuff
I am also using SATA, so that shouldn't make a difference. Here are the details of one of my drives:
OS details: Debian Unstable/Sid, Linux 5.7.0-1, ZFS/SPL 0.8.4-1.
@misterbigstuff One thing that might be different in your case is that you might be running an older Linux kernel - you keep mentioning the
|
i am also having this issue. zpool scrub seems to run without any throttle at all and thus impacts io latency and overall system load.
with zfs 0.7.* i throttled scrubs with those parameters:
zfs_scrub_delay=60
zfs_top_maxinflight=6
zfs_scan_idle=150
this slowed down scrubs without impacting the application (much). with zfs 0.8 these parameters do not exist anymore. i've been reading the zfs module parameters man page and began playing around with these parameters, but i am unable to slow down the scrub at all:
zfs_vdev_scrub_max_active
zfs_scan_strict_mem_lim
zfs_scan_mem_lim_soft_fact
zfs_scan_mem_lim_fact
zfs_scan_vdev_limit
zfs_scrub_min_time_ms
zfs_no_scrub_prefetch
i also made sure that the system parameters for queue depth and io scheduler are set as seen above.
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/device/queue_depth ; done | sort | uniq
1
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/queue/scheduler ; done | sort | uniq
[none] mq-deadline
system configuration:
dell md3060e enclosure
sas hba
12 raidz1 pools of 5 nl-sas hdds (4TB) (manufacturer toshiba, seagate, hgst)
os: debian buster
zfs version: 0.8.4-1~bpo10+1
kernel version: 4.19.0-9-amd64
graphs from the prometheus node exporter. i did stop the scrub after some time:
https://user-images.githubusercontent.com/29410350/89903397-e2f72f00-dbe7-11ea-9e79-312406462f24.png
https://user-images.githubusercontent.com/29410350/89903469-f86c5900-dbe7-11ea-8dd0-db06605b6759.png
i could use some help on how to go on with this. which other parameters might be helpful in decreasing the scrub speed? what else can i try? |
At Delphix, we have investigated reducing the impact of scrub by having it run at a reduced i/o rate. Several years back, one of our interns prototyped this. It would be wonderful if we took this discussion as motivation to complete that work with a goal of having scrub on by default in more deployments of ZFS! If anyone is interested in working on that, I can dig up the design documents and any code. |
@wildente from the graphs you posted, it seems the pools had almost no load excluding the scrub itself. Did you scrub all your pools at the same time? Can you set
@ahrens excluding bad interactions with hardware queues, setting |
NB, zpool wait time is the time I/Os are not issued to physical devices. So if you have a scrub ongoing and
zfs_vdev_scrub_max_active is small (default=2), then it is expected to see high wait time at the zpool level.
To make this info useful, you'll need to look at the wait time per queue. See `zpool iostat -l` (though I'm not
convinced zpool iostat -l is as advertised, but that is another discussion)
-- richard
… On Aug 11, 2020, at 6:38 AM, Wildente ***@***.***> wrote:
i am also having this issue. zpool scrub seems runs without any throttle at all and thus
impacting io latency and overall system load.
with zfs 0.7.* i throttled scrubs with those parameters:
zfs_scrub_delay=60
zfs_top_maxinflight=6
zfs_scan_idle=150
this slowed down scrubs without impacting the application (much).
with zfs 0.8 these parameter do not exist anymore. i've been reading the zfs module parameters man page and began playing around with these parameters, but i am unable to slow down the scrub at all:
zfs_vdev_scrub_max_active
zfs_scan_strict_mem_lim
zfs_scan_mem_lim_soft_fact
zfs_scan_mem_lim_fact
zfs_scan_vdev_limit
zfs_scrub_min_time_ms
zfs_no_scrub_prefetch
i also made sure, that the system parameters for queue depth and io scheduler are set as seen above.
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/device/queue_depth ; done | sort | uniq
1
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/queue/scheduler ; done | sort | uniq
[none] mq-deadline
system configuration:
dell md3060e enclosure
sas hba
12 raidz1 pools of 5 nl-sas hdds (4TB) (manufacturer toshiba, seagate, hgst)
os: debian buster
zfs version: 0.8.4-1~bpo10+1
kernel version: 4.19.0-9-amd64
graphs from the prometheus node exporter. i did stop the scrub after some time:
<https://user-images.githubusercontent.com/29410350/89903397-e2f72f00-dbe7-11ea-9e79-312406462f24.png>
<https://user-images.githubusercontent.com/29410350/89903469-f86c5900-dbe7-11ea-8dd0-db06605b6759.png>
i could use some help on how to go on with this. which other parameters might be helpful in decreasing the scrub speed? what else can i try?
|
I'm wondering, I might be totally off, but some pools we've recently created were created with a bad ashift (=9, when the drives in fact had 4k sectors). They were SSDs in both cases, but accessing the drives with 512B sectors absolutely destroyed any hint of performance the devices might have had. Recreating the pool with
Could you, just to be sure, check the ashift? 1.3k IOPS from a pool with NVMe drives sounds like exactly the situation I'm describing :) |
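A quick way to check this (with tank as a placeholder pool name) is to dump the cached pool configuration and look at the per-vdev ashift value:
zdb -C tank | grep ashift
A value of 9 on 4K-sector drives would point to the misalignment described above.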
Assuming that all i/os are equal, yes. But the per-byte costs can cause scrub i/o's to eat more than 50% of the available performance. I think that scrub i/o's can aggregate up to 1MB (and are likely to, now that we have "sorted scrub"), vs typical i/o's might be smaller.
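As a rough illustration (assuming a 7200 RPM HDD doing ~150 MB/s sequential and ~8 ms per random seek): a 1 MB aggregated scrub read costs a seek plus roughly 7 ms of transfer, about twice the disk time of a 4K random read, so even an even split of queue slots between scrub and application I/O hands scrub roughly two thirds of the device's time.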
I think it could be, if we do it right. For example, we might want finer granularity than whole milliseconds. And we'd want to consider both "metadata scanning" and "issuing scrub i/os" phases. Although maybe we could ignore (not limit) the metadata scanning for this purpose? A deliberate "slow scrub" feature might work by automatically adjusting this kind of knob. |
True.
I suppose the metadata scan does not need special treatment. On the other hand, the data scrub phase, being sequential in nature, can really consume a vast amount of bandwidth (and IOPs). |
thank you for the overwhelming number of messages. i'll try to answer.
@ahrens: yes, some more information would be useful. i was using
i also agree that the weight of each io request is relevant for this.
@shodanshok: maybe i should have posted graphs of the read/write ops. i will set `zfs_vdev_scrub_max_active=1` and run your fio test command. and yes, all pools did scrub at the same time.
@richardelling: thanks. i always thought that this is the actual
@snajpa: in my case those pools were created about a year ago, |
@shodanshok i've set zfs_vdev_scrub_max_active=1 and ran the fio command on one of the 12 zpools:
before the start of the scrub, we have ~300-330 read ops. after the start of the scrub, it jumps to 1k-1.7k read ops. i am guessing the write operations in between are checkpoints.
can i provide anything else to help with this issue? |
@wildente so during the scrub,
Your latency numbers seem ok. Can you show, both with and without scrub running, the output of "zpool iostat -q" (to get queue stats)? |
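For example (with tank as a placeholder pool name):
zpool iostat -q tank 5
samples the queue counters every 5 seconds; zpool iostat -l tank 5 does the same for per-queue latencies.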
@shodanshok yes, i will do that on monday morning |
@shodanshok sorry for the long delay. i can reproduce the IOPS from above, but i think that is expected of a raidz with 5 drives.
the layout consists of 12 zpools, each configured as raidz1 with 5 drives:
|
Any news on this? Has there been any explanation as to why scrub throttling was removed? Anyone who is sane will know that system responsiveness is king over scrub speed, as a scrub that kills the server is a scrub that gets disabled. |
I'd like to find an answer for this too. |
There was a new scrub throttle added as of the OpenZFS 2.0 release (PR #11166). It's intended to lessen the impact of non-interactive I/O, like scrub, on the running workload. If the default behavior is still problematic, you can use the following module options to adjust the throttle. From https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html:
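The relevant knobs there are the non-interactive I/O throttle parameters, in particular zfs_vdev_nia_delay, zfs_vdev_nia_credit, and the per-class zfs_vdev_scrub_min_active / zfs_vdev_scrub_max_active limits (parameter names and semantics as I read them from that page). A sketch of experimenting with them at runtime, with illustrative values rather than recommendations:
# make the vdev take longer to be considered idle, so scrub stays at its minimum concurrency while application I/O is active
echo 10 > /sys/module/zfs/parameters/zfs_vdev_nia_delay
# allow fewer non-interactive I/Os while interactive I/Os are outstanding
echo 1 > /sys/module/zfs/parameters/zfs_vdev_nia_credit
# cap concurrent scrub I/Os per vdev
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
echo 2 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active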
|
System information
Describe the problem you're observing
The new scrub code heavily impacts application IO performance when used with HDD-based pools. Application IOPs are reduced by up to a 10x factor.
Using a 4x SAS 15k 300 GB disk test pool which can provide ~250 IOPs for 4K single-thread sync random reads (as measured by fio), starting a scrub degrades application random 4K reads to 20-60 IOPs (so 4-10x lower random read speed).
The older ZFS 0.7.x release had a zfs_scrub_delay parameter which could be used to limit how much scrub "conflicts" with other read/write operations, but this parameter is gone with the new 0.8.x release. The rationale is that management of the different IO classes should be done exclusively via ZIO scheduler tuning, adjusting the relative weight via the *_max_active tunables, but I can't see any meaningful difference even when setting zfs_vdev_scrub_max_active=1 and zfs_vdev_sync_read_max_active=1000.
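For reference, one way to set these at runtime (assuming the zfs module parameters exposed under /sys/module/zfs/parameters) is:
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
echo 1000 > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active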
I think the problem is due to the new scrub code batching reads into very large blocks, leading to a long queue depth on the scrub queue and, finally, on the vdev queue. Indeed, the new scrub code is very fast (reading at 400-500 MB/s on that test array), but this leads to poor random IOPs delivered to the (test) application.
While a faster scrub is great, we need a method to limit its impact on production pools (even if this means a longer scrub time).
Describe how to reproduce the problem
- run fio --name=test --filename=/tank/test.img --rw=randread --size=32G and look at the current IOPs
- start a scrub with zpool scrub tank
- look again at the IOPs reported by fio
NOTE: using a 128k random read (matching the dataset recordsize) will not change the IOPs numbers (only the raw throughput value is higher).