
Reduce latency effects of non-interactive I/O. #11166

Merged
merged 1 commit into openzfs:master from amotin:nia_throttle on Nov 24, 2020

Conversation

amotin
Member

@amotin amotin commented Nov 6, 2020

While investigating the influence of scrub (especially sequential scrub) on
random read latency, I noticed that on some HDDs a single 4KB read may take
up to 4 seconds! Deeper investigation showed that many HDDs heavily
prioritize sequential reads even when those are submitted with a queue
depth of 1.

This patch addresses the latency from two sides:

  • by using the _min_active queue depths for non-interactive requests while
    interactive request(s) are active, and for a few requests after;
  • by throttling them further if no interactive requests have completed
    while a configured number of non-interactive ones have.

While there, I've also modified vdev_queue_class_to_issue() to give
more chances to schedule at least _min_active requests for the lowest
priorities. It should reduce starvation when several non-interactive
processes run at the same time as interactive ones, and I think it should
make it possible to set zfs_vdev_max_active as low as 1.
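
Conceptually, the resulting limit for a non-interactive class (using scrub as
the example) is chosen roughly like this -- a simplified sketch rather than
the exact patch code, with an illustrative helper name:

/*
 * Simplified sketch only: how many scrub (non-interactive) I/Os may be
 * active on a vdev, depending on interactive activity.  vq_ia_active
 * counts in-flight interactive I/Os; vq_nia_credit tracks non-interactive
 * requests relative to interactive completions.
 */
static uint32_t
scrub_max_active_sketch(vdev_queue_t *vq)
{
	if (vq->vq_ia_active > 0) {
		/*
		 * Interactive I/O is in flight: stay at the minimum, or even
		 * below it once the non-interactive credit is exhausted.
		 */
		return (MIN(vq->vq_nia_credit, zfs_vdev_scrub_min_active));
	} else if (vq->vq_nia_credit < zfs_vdev_nia_delay) {
		/*
		 * Interactive I/O completed recently: keep using the minimum
		 * for a while longer.
		 */
		return (zfs_vdev_scrub_min_active);
	}
	/* Only non-interactive I/O has been seen for a while: open up. */
	return (zfs_vdev_scrub_max_active);
}

The key point is that the ceiling for non-interactive classes is no longer a
constant: it collapses toward (or below) _min_active as soon as interactive
I/O shows up.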

I've benchmarked this change with 4KB random reads from a ZVOL with 16KB
block size on a newly written, non-fragmented pool. On a fragmented pool I
also saw improvements, but not as dramatic. Below are log2 histograms
of the random read latency in milliseconds for different devices:

4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before:
0, 0, 2,  1,  12,  21,  19,  18, 10, 15, 17, 21
after:
0, 0, 0, 24, 101, 195, 419, 250, 47,  4,  0,  0
i.e. maximum latency reduced from 2s to 500ms.

4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before:
0, 0,  2,  31,  38,  28,  18,  12, 17, 20, 24, 10, 3
after:
0, 0, 55, 247, 455, 470, 412, 181, 36,  0,  0,  0, 0
i.e. from 4s to 250ms.

1 SAS HDD SEAGATE ST14000NM0048 before:
0,  0,  29,   70, 107,   45,  27, 1, 0, 0, 1, 4, 19
after:
1, 29, 681, 1261, 676, 1633,  67, 1, 0, 0, 0, 0,  0
i.e. from 4s to 125ms.

1 SAS SSD SEAGATE XS3840TE70014 before (microseconds):
0, 0, 0, 0, 0, 0, 0, 0,  70, 18343, 82548, 618
after:
0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844,  90

I've also measured scrub time during the test and on idle pools. On an
idle fragmented pool I measured the scrub getting a few percent faster
due to the use of QD3 instead of QD2. On an idle non-fragmented pool
I measured no difference. On a busy non-fragmented pool I measured a
scrub time increase of about 1.5-1.7x, while the IOPS increase reached 5-9x.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)


@amotin amotin requested a review from ahrens November 6, 2020 04:13
@amotin amotin self-assigned this Nov 6, 2020
@amotin amotin added Status: Code Review Needed, Status: Design Review Needed and removed Status: Design Review Needed labels Nov 6, 2020
Contributor

@behlendorf behlendorf left a comment

I've definitely seen recent reports of this kind of behavior from some HDDs and I agree the existing tunables aren't sufficient to handle it. Even setting the minimum value to 1 isn't always enough. This provides a pretty nice mechanism to address the issue. It might also be useful as a way to tune how much impact a scrub, resilver, removal, etc. has on the running workload.

@IvanVolosyuk
Contributor

Nice change. I wonder if something similar can happen among interactive read requests. Can long sequential reads (a large file copy) starve small random reads (interactive applications)?

@amotin
Member Author

amotin commented Nov 7, 2020

@IvanVolosyuk In theory I think it is possible, but I was unable to reproduce it when I tried. The data has to be perfectly sequential without holes, the reader should not be distracted by indirect blocks and other metadata and should never lag, and even then it is up to the drive, since you may see that not all of the random requests got the maximum time.

@amotin
Member Author

amotin commented Nov 7, 2020

I've updated the patch following most of @behlendorf's comments.

@amotin amotin force-pushed the nia_throttle branch 3 times, most recently from 2d126a9 to e169cc8 Compare November 7, 2020 20:18
@behlendorf
Contributor

@amotin for some reason it seems the CI didn't test your latest changes. Would you mind rebasing this and updating the PR when you get a chance to kick off a new test run?

@amotin
Member Author

amotin commented Nov 10, 2020

@behlendorf It seems there is nothing new to rebase to, but I've just force-pushed it once more.

@h1z1

h1z1 commented Nov 10, 2020

Curious about this, what kind of IO were you submitting and was readahead disabled? I believe scrub IO is async no? Those drives have quite large caches.

@amotin
Member Author

amotin commented Nov 10, 2020

@h1z1 As I have written, payload was 4KB random read, scrub was sequential. Disk settings were at defaults -- both read-ahead and write cache enabled. Not sure what you mean mentioning async and cache.

@h1z1

h1z1 commented Nov 10, 2020

@h1z1 As I have written, payload was 4KB random read, scrub was sequential. Disk settings were at defaults

What about OS?

both read-ahead and write cache enabled.

Good to know, I didn't see that above. Was readahead also disabled? Otherwise your reads will be amplified, no?

Not sure what you mean mentioning async and cache.

They are async as in they are not expected to return within any set deadline. You can end up in cases where the drive has the data you're about to request in cache but evicts it before your submission. Or it could optimize the operations. I'd expect drives to starve random IO if they're getting blasted with sequential, especially if it doesn't overlap.

@amotin
Member Author

amotin commented Nov 10, 2020

What about OS?

FreeBSD.

both read-ahead and write cache enabled.

Good to know, I didn't see that above. Was readahead also disabled? Otherwise your reads will be amplified, no?

As I have said, everything is at disk defaults. FreeBSD does not change them unless asked. Readahead is enabled by default.

I'd expect drives to starve random IO if they're getting blasted with sequential, especially if it doesn't overlap.

I am not sure queue depth of 1 counts as blasted. I clearly understand why the drive would behave the way it does, it is just very far from fair scheduling.

@amotin
Member Author

amotin commented Nov 10, 2020

@ahrens Before this change scrub was always using a queue depth of 2. I've made it jump between 1 and 3 (which is safer now). So despite the min value being reduced, I don't think it is a step down. I would increase the max even higher, except that on FreeBSD, due to a MAXPHYS of 128KB and ZFS I/O aggregation up to 1MB, it would already stuff the SATA queue completely full at 4 (each 1MB aggregated I/O splits into eight 128KB requests, and 4 x 8 = 32, the NCQ limit).

@h1z1

h1z1 commented Nov 11, 2020

What about OS?

FreeBSD.

o_O And what about Linux?

both read-ahead and write cache enabled.

Good to know, I didn't see that above. Was readahead also disabled? Otherwise your reads will be amplified, no?

As I have said, everything is at disk defaults. FreeBSD does not change them unless asked. Readahead is enabled by default.

You haven't told me anything until I asked and that doesn't explain what the defaults are.

I'd expect drives to starve random IO if they're getting blasted with sequential, especially if it doesn't overlap.

I am not sure queue depth of 1 counts as blasted. I clearly understand why the drive would behave the way it does, it is just very far from fair scheduling.

I don't think you do, tbh, given you're completely ignoring the amplification as a cause and haven't said what IOPS you're testing with. And if the IO is async does FreeBSD still limit the queue to 1 -outstanding- request?

@amotin
Member Author

amotin commented Nov 11, 2020

o_O And what about Linux?

And what about Linux? Is it special?

You haven't told me anything until I asked and that doesn't explain what the defaults are.

"Disk settings were at defaults -- both read-ahead and write cache enabled." Could I be more specific anyhow?

I don't think you do, tbh, given you're completely ignoring the amplification as a cause and haven't said what IOPS you're testing with.

Would you be more specific about what exactly you are accusing me of ignoring? Which amplification are you talking about? "I've benchmarked this change with 4KB random reads from a ZVOL with 16KB block size on a newly written, non-fragmented pool."

And if the IO is async does FreeBSD still limit the queue to 1 -outstanding- request?

No. FreeBSD by default does not do any I/O scheduling, passing everything to the drive as long as it fits.

@amotin amotin force-pushed the nia_throttle branch 2 times, most recently from 87bf8b5 to c3a5468 Compare November 11, 2020 18:20
@h1z1

h1z1 commented Nov 11, 2020

o_O And what about Linux?

And what about Linux? Is it special?

I don't know, what's so special about FBSD? Did you test it at all on Linux?

You haven't told me anything until I asked and that doesn't explain what the defaults are.

"Disk settings were at defaults -- both read-ahead and write cache enabled." Could I be more specific anyhow?

Agreed, you did. After I asked.

I don't think you do, tbh, given you're completely ignoring the amplification as a cause and haven't said what IOPS you're testing with.

Would you be more specific about what exactly you are accusing me of ignoring? Which amplification are you talking about? "I've benchmarked this change with 4KB random reads from a ZVOL with 16KB block size on a newly written, non-fragmented pool."

.. and you're still not saying what your IOPs were nor even what you used to test with. fio? dd? Some custom util? You're using a 16k blocksize against the zvol or device itself??

And if the IO is async does FreeBSD still limit the queue to 1 -outstanding- request?

No. FreeBSD by default does not do any I/O scheduling, passing everything to the drive as long as it fits.

.. to be clear, you're telling me FreeBSD has no IO scheduler?

@amotin
Member Author

amotin commented Nov 11, 2020

I don't know, what's so special about FBSD? Did you test it at all on Linux?

No, I didn't. Are you arguing just to argue, or do you have some reason to think it is different on Linux? If you want to help with testing -- please, be my guest.

.. and you're still not saying what your IOPs were nor even what you used to test with. fio? dd? Some custom util? You're using a 16k blocksize against the zvol or device itself??

Who cares what the tool was if it was a simple QD1 4KB random read? If you are curious, it was fio with this config:

[global]
size=100G
runtime=500
ioengine=psync
iomem_align=2m
blocksize=4k
rw=randread
iodepth=1
zero_buffers
group_reporting

[job]
filename=/dev/zvol/small/vol1
write_hist_log=lsmall
log_hist_msec=60000
log_hist_coarseness=6

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

16KB is a volblocksize. The disk is 512e, and pool has ashift=12.

.. to be clear, you're telling me FreeBSD has no IO scheduler?

There are some specialized ones, but by default there is nothing other than FIFO (for SSDs) or elevator (for HDDs) for requests that don't fit the device cache. We already have one scheduler in the disk, and another inside ZFS. How many more schedulers do we need?

@h1z1

h1z1 commented Nov 17, 2020

Who cares what the tool was if it was a simple QD1 4KB random read? If you are curious, it was fio with this config:

It matters because you've still not said how hard you were "benchmarking" -- IOPS

ioengine=psync
iomem_align=2m
blocksize=4k
rw=randread
iodepth=1

[job]
filename=/dev/zvol/small/vol1

If I understand that correct, a 4k sync read @ QD1, basically worst case for a platter disk.

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

16KB is a volblocksize. The disk is 512e, and pool has ashift=12.

Which is 32 LBA reads at minimum on the platter assuming it's sequential. Your fio test is against a zvol - 4k read balloons to 16k minimum,

.. to be clear, you're telling me FreeBSD has no IO scheduler?

There are some specialized ones, but by default there is nothing other than FIFO (for SSDs) or elevator (for HDDs) for requests that don't fit the device cache. We already have one scheduler in the disk, and another inside ZFS. How many more schedulers do we need?

That would depend on where and what you're aggregating. You stated there was no scheduler at all thus the question.

The behavior you see is not common to all environments. The point of this was to understand what kind of IO you were testing with. The only way I've been able to duplicate this - in Linux - is by intentionally hitting the drive(s) with sub-optimal IO patterns.

@amotin
Member Author

amotin commented Nov 17, 2020

If I understand that correct, a 4k sync read @ QD1, basically worst case for a platter disk.

All this PR is about worst cases. It is easy to be fast when you have deep queue of independent requests, but life is unfair. My goal was to measure and reduce worst case latency.

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

Reported by fio.

16KB is a volblocksize. The disk is 512e, and pool has ashift=12.

Which is 32 LBA reads at minimum on the platter assuming it's sequential. Your fio test is against a zvol - 4k read balloons to 16k minimum,

Yes. And? It is HDD. Read latency for 4KB and 16KB is practically the same -- seek time.

The only way I've been able to duplicate this - in Linux - is by intentionally hitting the drive(s) with sub-optimal IO patterns.

So you've been able to duplicate it. I am happy. Anything more to prove?

@h1z1

h1z1 commented Nov 18, 2020

If I understand that correct, a 4k sync read @ QD1, basically worst case for a platter disk.

All this PR is about worst cases. It is easy to be fast when you have deep queue of independent requests, but life is unfair. My goal was to measure and reduce worst case latency.

But you're assuming a deep queue is always bad, it isn't. Data locality for example.

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

Reported by fio.

..... and the number of IOP/s was???

16KB is a volblocksize. The disk is 512e, and pool has ashift=12.

Which is 32 LBA reads at minimum on the platter assuming it's sequential. Your fio test is against a zvol - 4k read balloons to 16k minimum,

Yes. And? It is HDD. Read latency for 4KB and 16KB is practically the same -- seek time.

Maybe if it were sequential and you're not testing a single IOP.

The only way I've been able to duplicate this - in Linux - is by intentionally hitting the drive(s) with sub-optimal IO patterns.

So you've been able to duplicate it. I am happy. Anything more to prove?

That it wasn't back pressure congestion, either from the controller or some process in between.

@amotin
Member Author

amotin commented Nov 18, 2020

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

Reported by fio.

..... and the number of IOP/s was???

Are you kidding me? Reread my quote 3 lines higher. What else do you want? If you mean whether fio had a target IOPS rate -- no, it didn't, as you can see in the config I've quoted. The numbers above are what was reached without any delays.

That it wasn't ...

@h1z1 , I'm sorry, but I am tired of your assumptions that I am an idiot. I am not. Either show me why I am wrong, or write something better, or just go troll somewhere else.

@behlendorf behlendorf added Status: Accepted and removed Status: Code Review Needed labels Nov 18, 2020
@h1z1

h1z1 commented Nov 19, 2020

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

Reported by fio.

..... and the number of IOP/s was???

Are you kidding me? Reread my quote 3 lines higher. What else do you want? If you mean whether fio had a target IOPS rate -- no, it didn't, as you can see in the config I've quoted. The numbers above are what was reached without any delays.

Yes, what I see are numbers .. without context. I'm not talking about target IOPS, I'm literally asking what - how many - IO's per second you got from fio. Seriously.

That it wasn't ...

@h1z1 , I'm sorry, but I am tired of your assumptions that I am an idiot. I am not. Either show me why I am wrong, or write something better, or just go troll somewhere else.

You posted numbers that didn't make sense, I was looking for clarification. That you're taking offense to that is bizarre for such a change.

And I'm tired of dealing with you. Cheers.

@behlendorf
Contributor

behlendorf commented Nov 20, 2020

Some additional testing results from a different pool configuration. I used a draid2:8d:34c:2s pool config, or in other words a dRAID pool configured with double parity and 8 data disks per RAIDZ stripe constructed from 34 HDDs. Testing was done using fio against a zvol configured with a 32k block size due to the large stripe width (8+2).

Each test was run for 600 seconds, either with no scrub running in the background (to get a baseline) or with a scrub running and a specific zfs_vdev_scrub_min_active value from 1 (the default) up to 3 (the zfs_vdev_scrub_max_active value). The goal was to roughly characterize the effect of the scrub on the random fio workload. For all of the tests run, each HDD was close to its maximum sustained IOPS, so when fio performance decreased the scrub performance increased commensurately.


  • With the master source the zfs_vdev_scrub_min_active option has no real effect; as expected, it effectively always operated at the maximum zfs_vdev_scrub_max_active=3 queue depth.
  • With the patched source fio was able to use a larger share of the available pool IOPS. Increasing the minimum scrub queue depth allowed the scrub to consume a larger fraction (default: zfs_vdev_scrub_min_active=1).


  • Average latency with the patch was 1/2 to 1/3 of the master branch.


  • The worst case latency was reduced by 10x from about 4 seconds to consistently under half a second.

@amotin
Member Author

amotin commented Nov 20, 2020

Thank you @behlendorf for adding color to my boring numbers and reproducing the results. A 10x latency reduction is cool! :)

Comment on lines +152 to +153
uint32_t vq_ia_active; /* Active interactive I/Os. */
uint32_t vq_nia_credit; /* Non-interactive I/Os credit. */
Member

The terminology is a little hard to parse here, e.g. "active interactive" has a lot of "active" :-) What would you think about calling these "foreground"/"background" or "user"/"system"?

Member Author

I'm sorry, but if possible I'd leave the tunables/terminology as-is at this point. This has been in review for two weeks and the tunable names are already used in our software; changing them at this point of the release cycle would be a pain.

Contributor

While this particular comment ends up being a little awkward, I personally like describing these as interactive IOs. So I'm fine with the existing terminology.

Comment on lines +379 to +383
return (MIN(vq->vq_nia_credit,
zfs_vdev_scrub_min_active));
} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
return (zfs_vdev_scrub_min_active);
Member

I'm a little confused about what vq_nia_credit is, conceptually. Is it 2 different ideas depending on whether vq_ia_active == 0?

Member Author

@amotin amotin Nov 20, 2020

Yes. When vq_ia_active == 0, it counts non-interactive requests since the last completed interactive one, to decide when to allow max_active (zfs_vdev_nia_delay). When vq_ia_active > 0, it counts the credit for non-interactive requests since the last completed interactive one.
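
Roughly, the bookkeeping behind those two regimes could be pictured like this.
This is only an illustration of the explanation above, not the patch's actual
code; the helper name and the exact grant/reset values are assumptions, and
zfs_vdev_nia_credit is assumed to be the companion tunable to
zfs_vdev_nia_delay:

/*
 * Illustration only: one way the two roles of vq_nia_credit could be
 * maintained on I/O completion.
 */
static void
nia_account_completion(vdev_queue_t *vq, boolean_t interactive)
{
	if (interactive) {
		if (--vq->vq_ia_active > 0) {
			/*
			 * An interactive completion grants the non-interactive
			 * classes a fresh credit of requests.
			 */
			vq->vq_nia_credit = zfs_vdev_nia_credit;
		} else {
			/*
			 * Last interactive I/O is done: restart the count of
			 * non-interactive requests toward zfs_vdev_nia_delay.
			 */
			vq->vq_nia_credit = 0;
		}
	} else if (vq->vq_ia_active > 0) {
		/*
		 * A non-interactive completion while interactive I/O is active
		 * spends credit; at zero the class stays throttled until an
		 * interactive request completes.
		 */
		if (vq->vq_nia_credit > 0)
			vq->vq_nia_credit--;
	} else if (vq->vq_nia_credit < zfs_vdev_nia_delay) {
		/*
		 * No interactive I/O active: count non-interactive requests
		 * since the last interactive completion.
		 */
		vq->vq_nia_credit++;
	}
}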

@behlendorf behlendorf merged commit 6f5aac3 into openzfs:master Nov 24, 2020
@amotin amotin deleted the nia_throttle branch November 24, 2020 17:47
behlendorf pushed a commit that referenced this pull request Nov 25, 2020
@shodanshok
Contributor

shodanshok commented Nov 26, 2020

This is a much needed patch, thanks. For reference, I tracked down the high-latency cause to NCQ for some SATA HDDs at least.

@amotin
Member Author

amotin commented Nov 26, 2020

For reference, I tracked down the high-latency cause to NCQ for some SATA HDDs at least.

Disabling NCQ moves all responsibility for starvation avoidance from the disk's scheduler to the OS I/O scheduler, which may or may not fix the problem. But it is generally bad for bulk performance, since no OS I/O scheduler can be as efficient as the disk's own; at best it may know more about I/O priorities, etc., as this ZFS one does.

@shodanshok
Contributor

Sure, I only wanted to point out that a bad NCQ implementation can starve random I/O to death when small sequential reads are issued. I've seen quite a few cases of that specific issue since scrub was refactored to issue more sequential reads, so your work is really appreciated!

ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Dec 1, 2020
ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Dec 1, 2020
ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Dec 23, 2020
behlendorf pushed a commit that referenced this pull request Dec 23, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021