Scrub heavily impacting application IO performance #10253
After some more investigation, I found that the very low scrub performance was not directly related to the new zfs scan mode, but due to the interaction of
If the first point was my fault (well, I set it as
Let's see the output of
While I really like the new scrub/resilver performance, I think we need an "escape hatch" to throttle scrubbing when application IO should be affected as little as possible. |
An update: I considered restoring some form of delay, taking it from the 0.7.x branch. However, I found that limiting
Finally, a scrub can be stopped/paused during work hours. @behlendorf feel free to close the ticket. I am not closing it now only because I don't know if you (or other maintainers) want to track the problem described above. Thanks. |
@behlendorf @ahrens (I do not remember who contributed the sequential scrub code, please feel free to add the right person) I would like to add another datapoint. Short summary:
Please note how the HDDs are overwhelmed by pending ZFS scrub requests: while the scrub itself is very fast, it completely saturates the HDDs, with very bad resulting performance for the running VMs. Setting
Any idea on what can be done to further decrease scrub load? |
Well, I did an interesting discovery: setting
I got curious and tested a disk (WD Gold 2 TB) in isolation. I can replicate the issue by concurrently running the following two
While the first
As a side note, an older WD Green did not show any issue. I am leaving this issue open for some days only because I don't know if someone wants to comment and/or share other relevant experiences. Anyway, feel free to close it. Thanks. |
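For reference, a generic way to exercise this kind of mixed workload against a raw disk (illustrative only, not necessarily the exact commands used above; /dev/sdX is a placeholder, and both jobs are read-only):
# large sequential reads, similar to what a sorted scrub issues
fio --name=seq --filename=/dev/sdX --direct=1 --rw=read --bs=1M --runtime=60 --time_based &
# concurrent small random reads, standing in for application IO
fio --name=rand --filename=/dev/sdX --direct=1 --rw=randread --bs=4k --runtime=60 --time_based
On the affected drives, the random job's latency is reported to collapse unless queue_depth is set to 1.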
That's really interesting. It definitely sounds like an issue with the WD Gold drives, and it's not something I would have expected from an enterprise branded drive. You might want to check if there's a firmware update available. Thanks for posting a solution for anyone else who may encounter this. |
I was able to reproduce this on a different model of Western Digital hard drives: WD Red 10 TB (WD100EFAX). I am using 6 of these drives in a zpool made of 3 mirrors. See #10535. My experience closely matches @shodanshok's: following the steps to reproduce in his original post, with default settings (
Now, if only Western Digital could fix their firmware… stalling all random reads when sequential reads are in flight sounds pretty bad. One can easily imagine such behaviour causing problems, with production services becoming unresponsive just because some random user decided to scan the contents of a file. |
Is it time for the ZFS wiki or related documentation to make a "known bad" list of drives/firmwares that have been definitively identified as interacting badly with ZFS? |
@gdevenyi Rather than a list (which will become outdated pretty fast), I suggest inserting a note in the hardware/performance page stating that if excessive performance degradation is observed during scrub, disabling NCQ is a possible workaround (maybe even linking to this issue). |
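For anyone needing the workaround right away, a minimal sketch of the one-off form (sdX is a placeholder for the affected drive; setting the queue depth to 1 effectively disables NCQ for that device until reboot):
echo 1 > /sys/block/sdX/device/queue_depth
The udev rule posted further down makes the same change persistent across reboots.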
I was actually planning to add a "Queue depth" section on the Performance tuning OpenZFS wiki page to describe this problem, but that page doesn't seem to have open edit access. |
…and for reference, I used the following udev rule to automatically apply the workaround to all my affected disks:
|
On Jul 11, 2020, at 3:38 AM, Etienne Dechamps ***@***.***> wrote:
…and for reference, I used the following udev rule to automatically apply the workaround to all my affected disks:
DRIVER=="sd", ATTR{model}=="WDC WD100EFAX-68", ATTR{queue_depth}="1"
This has long been a behaviour seen by HDDs, with some firmware better than others.
You might find queue_depth=2 works better, but higher queue depths are worse. For
some background, see
http://blog.richardelling.com/2012/03/iops-and-latency-are-not-related-hdd.html
-- richard
|
@richardelling Unfortunately, for the specific case of WD Gold disks (and I suppose @dechamps WD Red too), using anything over 1 causes the read starvation issue described above. |
with WD Gold disks, disabling the disk scheduler (using |
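Presumably this refers to switching the device to the none scheduler; a minimal sketch (sdX is again a placeholder):
echo none > /sys/block/sdX/queue/scheduler
cat /sys/block/sdX/queue/scheduler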
To clarify, in my case, |
@misterbigstuff In my case, using |
notably i'm using the SATA revision of these devices, which have different firmware from the SAS counterpart. |
@misterbigstuff Interesting: I also have multiple SATA WD Gold drives, but these drives show the described issue unless I set
Can you share your disk model/firmware version? Did you try the reproducer involving concurrently running these two
|
@misterbigstuff
I am also using SATA, so that shouldn't make a difference. Here are the details of one of my drives:
OS details: Debian Unstable/Sid, Linux 5.7.0-1, ZFS/SPL 0.8.4-1.
@misterbigstuff One thing that might be different in your case is that you might be running an older Linux kernel - you keep mentioning the
|
i am also having this issue. zpool scrub seems to run without any throttle at all and thus impacts io latency and overall system load.
with zfs 0.7.* i throttled scrubs with those parameters:
zfs_scrub_delay=60
zfs_top_maxinflight=6
zfs_scan_idle=150
this slowed down scrubs without impacting the application (much). with zfs 0.8 these parameters do not exist anymore. i've been reading the zfs module parameters man page and began playing around with these parameters, but i am unable to slow down the scrub at all:
zfs_vdev_scrub_max_active
zfs_scan_strict_mem_lim
zfs_scan_mem_lim_soft_fact
zfs_scan_mem_lim_fact
zfs_scan_vdev_limit
zfs_scrub_min_time_ms
zfs_no_scrub_prefetch
i also made sure that the system parameters for queue depth and io scheduler are set as seen above.
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/device/queue_depth ; done | sort | uniq
1
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/queue/scheduler ; done | sort | uniq
[none] mq-deadline
system configuration:
dell md3060e enclosure
sas hba
12 raidz1 pools of 5 nl-sas hdds (4TB) (manufacturer toshiba, seagate, hgst)
os: debian buster
zfs version: 0.8.4-1~bpo10+1
kernel version: 4.19.0-9-amd64
graphs from the prometheus node exporter. i did stop the scrub after some time:
https://user-images.githubusercontent.com/29410350/89903397-e2f72f00-dbe7-11ea-9e79-312406462f24.png
https://user-images.githubusercontent.com/29410350/89903469-f86c5900-dbe7-11ea-8dd0-db06605b6759.png
i could use some help on how to go on with this. which other parameters might be helpful in decreasing the scrub speed? what else can i try? |
At Delphix, we have investigated reducing the impact of scrub by having it run at a reduced i/o rate. Several years back, one of our interns prototyped this. It would be wonderful if we took this discussion as motivation to complete that work with a goal of having scrub on by default in more deployments of ZFS! If anyone is interested in working on that, I can dig up the design documents and any code. |
@wildente from the graphs you posted, it seems the pools had almost no load excluding the scrub itself. Did you scrub all your pools at the same time? Can you set
@ahrens excluding bad interactions with hardware queues, setting |
NB, zpool wait time is the time I/Os are not issued to physical devices. So if you have a scrub ongoing and
zfs_vdev_scrub_max_active is small (default=2), then it is expected to see high wait time at the zpool level.
To make this info useful, you'll need to look at the wait time per queue. See `zpool iostat -l` (though I'm not
convinced zpool iostat -l is as advertised, but that is another discussion)
-- richard
… On Aug 11, 2020, at 6:38 AM, Wildente ***@***.***> wrote:
i am also having this issue. zpool scrub seems runs without any throttle at all and thus
impacting io latency and overall system load.
with zfs 0.7.* i throttled scrubs with those parameters:
zfs_scrub_delay=60
zfs_top_maxinflight=6
zfs_scan_idle=150
this slowed down scrubs without impacting the application (much).
with zfs 0.8 these parameter do not exist anymore. i've been reading the zfs module parameters man page and began playing around with these parameters, but i am unable to slow down the scrub at all:
zfs_vdev_scrub_max_active
zfs_scan_strict_mem_lim
zfs_scan_mem_lim_soft_fact
zfs_scan_mem_lim_fact
zfs_scan_vdev_limit
zfs_scrub_min_time_ms
zfs_no_scrub_prefetch
i also made sure, that the system parameters for queue depth and io scheduler are set as seen above.
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/device/queue_depth ; done | sort | uniq
1
$ for i in /sys/block/sd* ; do [[ $(cat $i/queue/rotational) == 1 ]] && cat $i/queue/scheduler ; done | sort | uniq
[none] mq-deadline
system configuration:
dell md3060e enclosure
sas hba
12 raidz1 pools of 5 nl-sas hdds (4TB) (manufacturer toshiba, seagate, hgst)
os: debian buster
zfs version: 0.8.4-1~bpo10+1
kernel version: 4.19.0-9-amd64
graphs from the prometheus node exporter. i did stop the scrub after some time:
<https://user-images.githubusercontent.com/29410350/89903397-e2f72f00-dbe7-11ea-9e79-312406462f24.png>
<https://user-images.githubusercontent.com/29410350/89903469-f86c5900-dbe7-11ea-8dd0-db06605b6759.png>
i could use some help on how to go on with this. which other parameters might be helpful in decreasing the scrub speed? what else can i try?
|
I'm wondering, I might be totally off, but some pools we've recently created were created with a bad ashift (=9, when the drives in fact had 4k sectors). They were SSDs in both cases, but accessing the drives with 512B sectors absolutely destroyed any hint of performance the devices might have had. Recreating the pool with
Could you, just to be sure, check the ashift? 1.3k IOPS from a pool with NVMe drives sounds like exactly the situation I'm describing :) |
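A quick way to check this (with tank as a placeholder pool name) is to dump the cached pool configuration and look at the per-vdev ashift value:
zdb -C tank | grep ashift
A value of 9 on 4K-sector drives would point to the misalignment described above.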
Assuming that all i/os are equal, yes. But the per-byte costs can cause scrub i/o's to eat more than 50% of the available performance. I think that scrub i/o's can aggregate up to 1MB (and are likely to, now that we have "sorted scrub"), vs typical i/o's might be smaller.
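As a rough illustration (assuming a 7200 RPM HDD doing ~150 MB/s sequential and ~8 ms per random seek): a 1 MB aggregated scrub read costs a seek plus roughly 7 ms of transfer, about twice the disk time of a 4K random read, so even an even split of queue slots between scrub and application I/O hands scrub roughly two thirds of the device's time.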
I think it could be, if we do it right. For example, we might want finer granularity than whole milliseconds. And we'd want to consider both "metadata scanning" and "issuing scrub i/os" phases. Although maybe we could ignore (not limit) the metadata scanning for this purpose? A deliberate "slow scrub" feature might work by automatically adjusting this kind of knob. |
True.
I suppose the metadata scan does not need special treatment. On the other hand, the data scrub phase, being sequential in nature, can really consume a vast amount of bandwidth (and IOPs). |
thank you for the overwhelming number of messages. i'll try to answer.
@ahrens: yes, some more information would be useful. i was using
i also agree that the weight of each io request is relevant for this.
@shodanshok: maybe i should have posted graphs of the read/write ops. i will set `zfs_vdev_scrub_max_active=1` and run your fio test command. and yes, all pools did scrub at the same time.
@richardelling: thanks. i always thought that this is the actual
@snajpa: in my case those pools were created about a year ago, |
@shodanshok i've set zfs_vdev_scrub_max_active=1 and ran the fio command on one of the 12 zpools:
before the start of the scrub, we have ~300-330 read ops. after the start of the scrub, it jumps to 1k-1.7k read ops. i am guessing the write operations in between are checkpoints.
can i provide anything else to help with this issue? |
@wildente so during the scrub,
Your latency numbers seem ok. Can you show, both with and without scrub running, the output of "zpool iostat -q" (to get queue stats)? |
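For example (with tank as a placeholder pool name):
zpool iostat -q tank 5
samples the queue counters every 5 seconds; zpool iostat -l tank 5 does the same for per-queue latencies.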
@shodanshok yes, i will do that on monday morning |
@shodanshok sorry for the long delay. i can reproduce the IOPS from above, but i think that is expected of a raidz with 5 drives.
the layout consists of 12 zpools, each configured as raidz1 with 5 drives:
|
Any news on this? Has there been any explanation as to why scrub throttling was removed? Anyone who is sane will know that system responsiveness is king over scrub speed, as a scrub that kills the server is a scrub that gets disabled. |
I'd like to find an answer for this too. |
There was a new scrub throttle added as of the OpenZFS 2.0 release (PR #11166). It's intended to lessen the impact of non-interactive I/O, like scrub, on the running workload. If the default behavior is still problematic, you can use the following module options to adjust the throttle. From https://openzfs.github.io/openzfs-docs/man/4/zfs.4.html:
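The relevant knobs there are the non-interactive I/O throttle parameters, in particular zfs_vdev_nia_delay, zfs_vdev_nia_credit, and the per-class zfs_vdev_scrub_min_active / zfs_vdev_scrub_max_active limits (parameter names and semantics as I read them from that page). A sketch of experimenting with them at runtime, with illustrative values rather than recommendations:
# make the vdev take longer to be considered idle, so scrub stays at its minimum concurrency while application I/O is active
echo 10 > /sys/module/zfs/parameters/zfs_vdev_nia_delay
# allow fewer non-interactive I/Os while interactive I/Os are outstanding
echo 1 > /sys/module/zfs/parameters/zfs_vdev_nia_credit
# cap concurrent scrub I/Os per vdev
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
echo 2 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active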
|
System information
Describe the problem you're observing
The new scrub code heavily impacts application IO performance when used with HDD-based pools. Application IOPs are reduced by up to a 10x factor.
Using a 4x SAS 15k 300 GB disk test pool which can provide ~250 IOPs for 4K single-thread sync random reads (as measured by fio), starting a scrub degrades application random 4K reads to 20-60 IOPs (so 4-10x lower random read speed).
The older ZFS 0.7.x release had a zfs_scrub_delay parameter which could be used to limit how much scrub "conflicts" with other read/write operations, but this parameter is gone with the new 0.8.x release. The rationale is that management of the different IO classes should be done exclusively via ZIO scheduler tuning, adjusting the relative weight via the *_max_active tunables, but I can't see any meaningful difference even when setting zfs_vdev_scrub_max_active=1 and zfs_vdev_sync_read_max_active=1000.
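For reference, one way to set these at runtime (assuming the zfs module parameters exposed under /sys/module/zfs/parameters) is:
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
echo 1000 > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active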
I think the problem is due to the new scrub code batching reads into very large blocks, leading to a long queue depth on the scrub queue and, finally, on the vdev queue. Indeed, the new scrub code is very fast (reading at 400-500 MB/s on that test array), but this leads to poor random IOPs delivered to the (test) application.
While a faster scrub is great, we need a method to limit its impact on production pools (even if this means a longer scrub time).
Describe how to reproduce the problem
- run fio --name=test --filename=/tank/test.img --rw=randread --size=32G and look at the current IOPs
- start a scrub with zpool scrub tank
- look again at the IOPs reported by fio
NOTE: using a 128k random read (matching the dataset recordsize) will not change the IOPs numbers (only the raw throughput value is higher).