
Reduce latency effects of non-interactive I/O. #11166

Merged
merged 1 commit into openzfs:master from amotin:nia_throttle on Nov 24, 2020

Conversation

amotin
Member

@amotin amotin commented Nov 6, 2020

While investigating the influence of scrub (especially sequential scrub) on
random read latency, I noticed that on some HDDs a single 4KB read may take
up to 4 seconds! Deeper investigation showed that many HDDs heavily
prioritize sequential reads even when those are submitted with a queue
depth of 1.

This patch addresses the latency from two sides:

  • by using the _min_active queue depths for non-interactive requests while
    interactive request(s) are active, and for a few requests after;
  • by throttling them further if no interactive requests have completed
    while a configured number of non-interactive ones have.

While there, I've also modified vdev_queue_class_to_issue() to give
more chances to schedule at least _min_active requests for the lowest
priorities. It should reduce starvation when several non-interactive
processes run at the same time as interactive ones, and I think it should
make it possible to set zfs_vdev_max_active as low as 1.
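
Conceptually, the resulting limit for a non-interactive class (using scrub as
the example) is chosen roughly like this -- a simplified sketch rather than
the exact patch code, with an illustrative helper name:

/*
 * Simplified sketch only: how many scrub (non-interactive) I/Os may be
 * active on a vdev, depending on interactive activity.  vq_ia_active
 * counts in-flight interactive I/Os; vq_nia_credit tracks non-interactive
 * requests relative to interactive completions.
 */
static uint32_t
scrub_max_active_sketch(vdev_queue_t *vq)
{
	if (vq->vq_ia_active > 0) {
		/*
		 * Interactive I/O is in flight: stay at the minimum, or even
		 * below it once the non-interactive credit is exhausted.
		 */
		return (MIN(vq->vq_nia_credit, zfs_vdev_scrub_min_active));
	} else if (vq->vq_nia_credit < zfs_vdev_nia_delay) {
		/*
		 * Interactive I/O completed recently: keep using the minimum
		 * for a while longer.
		 */
		return (zfs_vdev_scrub_min_active);
	}
	/* Only non-interactive I/O has been seen for a while: open up. */
	return (zfs_vdev_scrub_max_active);
}

The key point is that the ceiling for non-interactive classes is no longer a
constant: it collapses toward (or below) _min_active as soon as interactive
I/O shows up.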

I've benchmarked this change with 4KB random reads from a ZVOL with 16KB
block size on a newly written, non-fragmented pool. On a fragmented pool I
also saw improvements, but not as dramatic. Below are log2 histograms
of the random read latency in milliseconds for different devices:

4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before:
0, 0, 2,  1,  12,  21,  19,  18, 10, 15, 17, 21
after:
0, 0, 0, 24, 101, 195, 419, 250, 47,  4,  0,  0
i.e. maximum latency reduced from 2s to 500ms.

4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before:
0, 0,  2,  31,  38,  28,  18,  12, 17, 20, 24, 10, 3
after:
0, 0, 55, 247, 455, 470, 412, 181, 36,  0,  0,  0, 0
i.e. from 4s to 250ms.

1 SAS HDD SEAGATE ST14000NM0048 before:
0,  0,  29,   70, 107,   45,  27, 1, 0, 0, 1, 4, 19
after:
1, 29, 681, 1261, 676, 1633,  67, 1, 0, 0, 0, 0,  0
i.e. from 4s to 125ms.

1 SAS SSD SEAGATE XS3840TE70014 before (microseconds):
0, 0, 0, 0, 0, 0, 0, 0,  70, 18343, 82548, 618
after:
0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844,  90

I've also measured scrub time during the test and on idle pools. On an
idle fragmented pool I measured the scrub getting a few percent faster
due to the use of QD3 instead of QD2. On an idle non-fragmented pool
I measured no difference. On a busy non-fragmented pool I measured a
scrub time increase of about 1.5-1.7x, while the IOPS increase reached 5-9x.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)


@amotin amotin requested a review from ahrens November 6, 2020 04:13
@amotin amotin self-assigned this Nov 6, 2020
@amotin amotin added Status: Code Review Needed, Status: Design Review Needed and removed Status: Design Review Needed labels Nov 6, 2020
Contributor

@behlendorf behlendorf left a comment

I've definitely seen recent reports of this kind of behavior from some HDDs and I agree the existing tunables aren't sufficient to handle it. Even setting the minimum value to 1 isn't always enough. This provides a pretty nice mechanism to address the issue. It might also be useful as a way to tune how much impact a scrub, resilver, removal, etc. has on the running workload.

@IvanVolosyuk
Contributor

Nice change. I wonder if something similar can happen among interactive read requests. Can long sequential reads (a large file copy) starve small random reads (interactive applications)?

@amotin
Member Author

amotin commented Nov 7, 2020

@IvanVolosyuk In theory I think it is possible, but I was unable to reproduce it when I tried. The data has to be perfectly sequential without holes, the reader should not be distracted by indirect blocks and other metadata and should never lag, and even then it is up to the drive, since you may see that not all of the random requests got the maximum time.

@amotin
Member Author

amotin commented Nov 7, 2020

I've updated the patch following most of @behlendorf's comments.

@amotin amotin force-pushed the nia_throttle branch 3 times, most recently from 2d126a9 to e169cc8 Compare November 7, 2020 20:18
@behlendorf
Contributor

@amotin for some reason it seems the CI didn't test your latest changes. Would you mind rebasing this and updating the PR when you get a chance to kick off a new test run?

@amotin
Member Author

amotin commented Nov 10, 2020

@behlendorf It seems there is nothing new to rebase to, but I've just force-pushed it once more.

@h1z1

h1z1 commented Nov 10, 2020

Curious about this, what kind of IO were you submitting and was readahead disabled? I believe scrub IO is async no? Those drives have quite large caches.

@amotin
Member Author

amotin commented Nov 10, 2020

@h1z1 As I have written, payload was 4KB random read, scrub was sequential. Disk settings were at defaults -- both read-ahead and write cache enabled. Not sure what you mean mentioning async and cache.

@h1z1

h1z1 commented Nov 10, 2020

@h1z1 As I have written, payload was 4KB random read, scrub was sequential. Disk settings were at defaults

What about OS?

both read-ahead and write cache enabled.

Good to know, I didn't see that above. Was readahead also disabled? Otherwise your reads will be amplified, no?

Not sure what you mean mentioning async and cache.

They are async as in they are not expected to return within any set deadline. You can end up in cases where the drive has the data you're about to request in cache but evicts it before your submission. Or it could optimize the operations. I'd expect drives to starve random IO if they're getting blasted with sequential, especially if it doesn't overlap.

@amotin
Member Author

amotin commented Nov 10, 2020

What about OS?

FreeBSD.

both read-ahead and write cache enabled.

Good to know, I didn't see that above. Was readahead also disabled? Otherwise your reads will be amplified, no?

As I have said, everything is at disk defaults. FreeBSD does not change them unless asked. Readahead is enabled by default.

I'd expect drives to starve random IO if they're getting blasted with sequential, especially if it doesn't overlap.

I am not sure queue depth of 1 counts as blasted. I clearly understand why the drive would behave the way it does, it is just very far from fair scheduling.

@amotin
Member Author

amotin commented Nov 10, 2020

@ahrens Before this change scrub was always using a queue depth of 2. I've made it jump between 1 and 3 (which is safer now). So despite the min value being reduced, I don't think it is a step down. I would increase the max even higher, except that on FreeBSD, due to a MAXPHYS of 128KB and ZFS I/O aggregation up to 1MB, it would already stuff the SATA queue completely full at 4 (each 1MB aggregated I/O splits into eight 128KB requests, and 4 x 8 = 32, the NCQ limit).

@h1z1

h1z1 commented Nov 11, 2020

What about OS?

FreeBSD.

o_O And what about Linux?

both read-ahead and write cache enabled.

Good to know, I didn't see that above. Was readahead also disabled? Otherwise your reads will be amplified, no?

As I have said, everything is at disk defaults. FreeBSD does not change them unless asked. Readahead is enabled by default.

You haven't told me anything until I asked and that doesn't explain what the defaults are.

I'd expect drives to starve random IO if they're getting blasted with sequential, especially if it doesn't overlap.

I am not sure queue depth of 1 counts as blasted. I clearly understand why the drive would behave the way it does, it is just very far from fair scheduling.

I don't think you do, tbh, given you're completely ignoring the amplification as a cause and haven't said what IOPS you're testing with. And if the IO is async does FreeBSD still limit the queue to 1 -outstanding- request?

@amotin
Member Author

amotin commented Nov 11, 2020

o_O And what about Linux?

And what about Linux? Is it special?

You haven't told me anything until I asked and that doesn't explain what the defaults are.

"Disk settings were at defaults -- both read-ahead and write cache enabled." Could I be more specific anyhow?

I don't think you do, tbh, given you're completely ignoring the amplification as a cause and haven't said what IOPS you're testing with.

Would you be more specific about what exactly you are accusing me of ignoring? Which amplification are you talking about? "I've benchmarked this change with 4KB random reads from a ZVOL with 16KB block size on a newly written, non-fragmented pool."

And if the IO is async does FreeBSD still limit the queue to 1 -outstanding- request?

No. FreeBSD by default does not do any I/O scheduling, passing everything to the drive as long as it fits.

@amotin amotin force-pushed the nia_throttle branch 2 times, most recently from 87bf8b5 to c3a5468 Compare November 11, 2020 18:20
@h1z1

h1z1 commented Nov 11, 2020

o_O And what about Linux?

And what about Linux? Is it special?

I don't know, what's so special about FBSD? Did you test it at all on Linux?

You haven't told me anything until I asked and that doesn't explain what the defaults are.

"Disk settings were at defaults -- both read-ahead and write cache enabled." Could I be more specific anyhow?

Agreed, you did. After I asked.

I don't think you do, tbh, given you're completely ignoring the amplification as a cause and haven't said what IOPS you're testing with.

Would you be more specific about what exactly you are accusing me of ignoring? Which amplification are you talking about? "I've benchmarked this change with 4KB random reads from a ZVOL with 16KB block size on a newly written, non-fragmented pool."

.. and you're still not saying what your IOPs were nor even what you used to test with. fio? dd? Some custom util? You're using a 16k blocksize against the zvol or device itself??

And if the IO is async does FreeBSD still limit the queue to 1 -outstanding- request?

No. FreeBSD by default does not do any I/O scheduling, passing everything to the drive as long as it fits.

.. to be clear, you're telling me FreeBSD has no IO scheduler?

@amotin
Member Author

amotin commented Nov 11, 2020

I don't know, what's so special about FBSD? Did you test it at all on Linux?

No, I didn't. Are you arguing just to argue, or do you have some reason to think it is different on Linux? If you want to help with testing -- please, be my guest.

.. and you're still not saying what your IOPs were nor even what you used to test with. fio? dd? Some custom util? You're using a 16k blocksize against the zvol or device itself??

Who cares what the tool was if it was a simple QD1 4KB random read? If you are curious, it was fio with this config:

[global]
size=100G
runtime=500
ioengine=psync
iomem_align=2m
blocksize=4k
rw=randread
iodepth=1
zero_buffers
group_reporting

[job]
filename=/dev/zvol/small/vol1
write_hist_log=lsmall
log_hist_msec=60000
log_hist_coarseness=6

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

16KB is a volblocksize. The disk is 512e, and pool has ashift=12.

.. to be clear, you're telling me FreeBSD has no IO scheduler?

There are some specialized ones, but by default there is nothing other than FIFO (for SSDs) or elevator (for HDDs) for requests that don't fit the device cache. We already have one scheduler in the disk, and another inside ZFS. How many more schedulers do we need?

@h1z1

h1z1 commented Nov 17, 2020

Who cares what the tool was if it was a simple QD1 4KB random read? If you are curious, it was fio with this config:

It matters because you've still not said how hard you were "benchmarking" -- IOPS

ioengine=psync
iomem_align=2m
blocksize=4k
rw=randread
iodepth=1

[job]
filename=/dev/zvol/small/vol1

If I understand that correct, a 4k sync read @ QD1, basically worst case for a platter disk.

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

16KB is a volblocksize. The disk is 512e, and pool has ashift=12.

Which is 32 LBA reads at minimum on the platter assuming it's sequential. Your fio test is against a zvol - 4k read balloons to 16k minimum,

.. to be clear, you're telling me FreeBSD has no IO scheduler?

There are some specialized ones, but by default there is nothing other than FIFO (for SSDs) or elevator (for HDDs) for requests that don't fit the device cache. We already have one scheduler in the disk, and another inside ZFS. How many more schedulers do we need?

That would depend on where and what you're aggregating. You stated there was no scheduler at all thus the question.

The behavior you see is not common to all environments. The point of this was to understand what kind of IO you were testing with. The only way I've been able to duplicate this - in Linux - is by intentionally hitting the drive(s) with sub-optimal IO patterns.

@amotin
Member Author

amotin commented Nov 17, 2020

If I understand that correct, a 4k sync read @ QD1, basically worst case for a platter disk.

All this PR is about worst cases. It is easy to be fast when you have deep queue of independent requests, but life is unfair. My goal was to measure and reduce worst case latency.

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

Reported by fio.

16KB is a volblocksize. The disk is 512e, and pool has ashift=12.

Which is 32 LBA reads at minimum on the platter assuming it's sequential. Your fio test is against a zvol - 4k read balloons to 16k minimum,

Yes. And? It is HDD. Read latency for 4KB and 16KB is practically the same -- seek time.

The only way I've been able to duplicate this - in Linux - is by intentionally hitting the drive(s) with sub-optimal IO patterns.

So you've been able to duplicate it. I am happy. Anything more to prove?

@h1z1

h1z1 commented Nov 18, 2020

If I understand that correct, a 4k sync read @ QD1, basically worst case for a platter disk.

All this PR is about worst cases. It is easy to be fast when you have deep queue of independent requests, but life is unfair. My goal was to measure and reduce worst case latency.

But you're assuming a deep queue is always bad, it isn't. Data locality for example.

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

Reported by fio.

..... and the number of IOP/s was???

16KB is a volblocksize. The disk is 512e, and pool has ashift=12.

Which is 32 LBA reads at minimum on the platter assuming it's sequential. Your fio test is against a zvol - 4k read balloons to 16k minimum,

Yes. And? It is HDD. Read latency for 4KB and 16KB is practically the same -- seek time.

Maybe if it were sequential and you're not testing a single IOP.

The only way I've been able to duplicate this - in Linux - is by intentionally hitting the drive(s) with sub-optimal IO patterns.

So you've been able to duplicate it. I am happy. Anything more to prove?

That it wasn't back pressure congestion, either from the controller or some process in between.

@amotin
Member Author

amotin commented Nov 18, 2020

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

Reported by fio.

..... and the number of IOP/s was???

Are you kidding me? Reread my quote 3 lines higher. What else do you want? If you mean whether fio had a target IOPS rate -- no, it didn't, as you can see in the config I've quoted. The numbers above are what was reached without any delays.

That it wasn't ...

@h1z1 , I'm sorry, but I am tired of your assumptions that I am an idiot. I am not. Either show me why I am wrong, or write something better, or just go troll somewhere else.

@behlendorf behlendorf added Status: Accepted and removed Status: Code Review Needed labels Nov 18, 2020
@h1z1

h1z1 commented Nov 19, 2020

If you need average test IOPS during scrub, then for SEAGATE ST14000NM0048 it increased from 22 to 73, without scrub it was 166.

Reported by zpool iostat or the drive directly? You're referring to one disk but you state that is a 4xMirror above.

Reported by fio.

..... and the number of IOP/s was???

Are you kidding me? Reread my quote 3 lines higher. What else do you want? If you mean whether fio had a target IOPS rate -- no, it didn't, as you can see in the config I've quoted. The numbers above are what was reached without any delays.

Yes, what I see are numbers .. without context. I'm not talking about target IOPS, I'm literally asking what - how many - IO's per second you got from fio. Seriously.

That it wasn't ...

@h1z1 , I'm sorry, but I am tired of your assumptions that I am an idiot. I am not. Either show me why I am wrong, or write something better, or just go troll somewhere else.

You posted numbers that didn't make sense, I was looking for clarification. That you're taking offense to that is bizarre for such a change.

And I'm tired of dealing with you. Cheers.

@behlendorf
Contributor

behlendorf commented Nov 20, 2020

Some additional testing results from a different pool configuration. I used a draid2:8d:34c:2s pool config, or in other words a dRAID pool configured with double parity and 8 data disks per RAIDZ stripe constructed from 34 HDDs. Testing was done using fio against a zvol configured with a 32k block size due to the large stripe width (8+2).

Each test was run for 600 seconds, either with no scrub running in the background (to get a baseline) or with a scrub running and a specific zfs_vdev_scrub_min_active value from 1 (the default) up to 3 (the zfs_vdev_scrub_max_active value). The goal was to roughly characterize the effect of the scrub on the random fio workload. For all of the tests run, each HDD was close to its maximum sustained IOPS, so when fio performance decreased the scrub performance increased commensurately.


  • With the master source the zfs_vdev_scrub_min_active option has no real effect; as expected, it effectively always operated at the maximum zfs_vdev_scrub_max_active=3 queue depth.
  • With the patched source fio was able to use a larger share of the available pool IOPS. Increasing the minimum scrub queue depth allowed the scrub to consume a larger fraction (default: zfs_vdev_scrub_min_active=1).


  • Average latency with the patch was 1/2 to 1/3 of the master branch.


  • The worst case latency was reduced by 10x from about 4 seconds to consistently under half a second.

@amotin
Member Author

amotin commented Nov 20, 2020

Thank you @behlendorf for adding color to my boring numbers and reproducing the results. A 10x latency reduction is cool! :)

Comment on lines +152 to +153
uint32_t vq_ia_active; /* Active interactive I/Os. */
uint32_t vq_nia_credit; /* Non-interactive I/Os credit. */
Member

The terminology is a little hard to parse here, e.g. "active interactive" has a lot of "active" :-) What would you think about calling these "foreground"/"background" or "user"/"system"?

Member Author

I'm sorry, but if possible I'd leave the tunables/terminology as-is at this point. This has been in review for two weeks and the tunable names are already used in our software; changing them at this point of the release cycle would be a pain.

Contributor

While this particular comment ends up being a little awkward, I personally like describing these as interactive IOs. So I'm fine with the existing terminology.

Comment on lines +379 to +383
return (MIN(vq->vq_nia_credit,
zfs_vdev_scrub_min_active));
} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
return (zfs_vdev_scrub_min_active);
Member

I'm a little confused about what vq_nia_credit is, conceptually. Is it 2 different ideas depending on whether vq_ia_active == 0?

Member Author

@amotin amotin Nov 20, 2020

Yes. When vq_ia_active == 0, it counts non-interactive requests since the last completed interactive one, to decide when to allow max_active (zfs_vdev_nia_delay). When vq_ia_active > 0, it counts the credit for non-interactive requests since the last completed interactive one.
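
Roughly, the bookkeeping behind those two regimes could be pictured like this.
This is only an illustration of the explanation above, not the patch's actual
code; the helper name and the exact grant/reset values are assumptions, and
zfs_vdev_nia_credit is assumed to be the companion tunable to
zfs_vdev_nia_delay:

/*
 * Illustration only: one way the two roles of vq_nia_credit could be
 * maintained on I/O completion.
 */
static void
nia_account_completion(vdev_queue_t *vq, boolean_t interactive)
{
	if (interactive) {
		if (--vq->vq_ia_active > 0) {
			/*
			 * An interactive completion grants the non-interactive
			 * classes a fresh credit of requests.
			 */
			vq->vq_nia_credit = zfs_vdev_nia_credit;
		} else {
			/*
			 * Last interactive I/O is done: restart the count of
			 * non-interactive requests toward zfs_vdev_nia_delay.
			 */
			vq->vq_nia_credit = 0;
		}
	} else if (vq->vq_ia_active > 0) {
		/*
		 * A non-interactive completion while interactive I/O is active
		 * spends credit; at zero the class stays throttled until an
		 * interactive request completes.
		 */
		if (vq->vq_nia_credit > 0)
			vq->vq_nia_credit--;
	} else if (vq->vq_nia_credit < zfs_vdev_nia_delay) {
		/*
		 * No interactive I/O active: count non-interactive requests
		 * since the last interactive completion.
		 */
		vq->vq_nia_credit++;
	}
}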

@behlendorf behlendorf merged commit 6f5aac3 into openzfs:master Nov 24, 2020
@amotin amotin deleted the nia_throttle branch November 24, 2020 17:47
behlendorf pushed a commit that referenced this pull request Nov 25, 2020
@shodanshok
Contributor

shodanshok commented Nov 26, 2020

This is a much needed patch, thanks. For reference, I tracked down the high-latency cause to NCQ for some SATA HDDs at least.

@amotin
Member Author

amotin commented Nov 26, 2020

For reference, I tracked down the high-latency cause to NCQ for some SATA HDDs at least.

Disabling NCQ moves all responsibility for starvation avoidance from the disk's scheduler to the OS I/O scheduler, which may or may not fix the problem. But it is generally bad for bulk performance, since no OS I/O scheduler can be as efficient as the disk's own; at best it may know more about I/O priorities, etc., as this ZFS one does.

@shodanshok
Contributor

Sure, I only wanted to point out that a bad NCQ implementation can starve random I/O to death when small sequential reads are issued. I've seen quite a few cases of that specific issue since scrub was refactored to issue more sequential reads, so your work is really appreciated!

ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Dec 1, 2020
ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Dec 1, 2020
ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Dec 23, 2020
behlendorf pushed a commit that referenced this pull request Dec 23, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021