zvol: Support blk-mq for better performance #12664
Conversation
Benchmarks

My benchmarking script is here: https://gist.github.com/tonyhutter/01cf39c44d967e0b176a1d2eeb7f2460#file-benchmark-zvols-sh

All tests were done on two configurations:

- 16-CPU node, 8 NVMe drives, 4x 2-disk mirrors pool
- 32-CPU node, 60 SAS HDDs in JBOD, 6x 10-disk raidz2 pool

Some of these numbers seem unreasonably high, so I'm curious what kind of performance numbers other people get. A proper

Note: To benchmark the "old" zvol code against blk-mq, simply comment out

Be sure not to build with
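As a rough illustration of this style of multi-threaded zvol benchmark (this is not the gist script; the pool/zvol name, sizes, and fio parameters below are placeholders):

```sh
#!/bin/sh
# Hypothetical sketch: drive a zvol with several parallel writers.
# WARNING: this writes directly to the device and destroys its contents.
ZVOL=/dev/zvol/tank/vol    # assumed zvol path, adjust to your pool

# Sequential 1M writes, 8 jobs, O_DIRECT
fio --name=seq-write --filename="$ZVOL" --rw=write --bs=1M \
    --ioengine=libaio --iodepth=16 --numjobs=8 --direct=1 \
    --runtime=30 --time_based --group_reporting

# Random 4k writes, 8 jobs, buffered
fio --name=rand-write --filename="$ZVOL" --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=16 --numjobs=8 --direct=0 \
    --runtime=30 --time_based --group_reporting
```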
I would encourage using a real EPYC system with full NUMA zone configuration to test memory-locality latency concerns.
One of the nice things about blk-mq is that it tries to submit the block IO to the request queue for the same CPU. So if you do a
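As a hypothetical illustration of that same-CPU submission behavior (the device name and sizes are placeholders, not from the original benchmarks):

```sh
# Pin each writer to one CPU so its BIOs are submitted on that CPU's
# blk-mq software queue; each dd writes to its own 1 GiB region.
for cpu in 0 1 2 3; do
    taskset -c "$cpu" dd if=/dev/zero of=/dev/zvol/tank/vol \
        bs=1M count=1024 seek=$((cpu * 1024)) oflag=direct &
done
wait
```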
This is great! I'll make time to do some testing early next week.
I've been test-driving this branch for a couple of days.
... but that sort of stacktrace featuring ZFS was fairly common even without this branch (though the hang wasn't), so it may be a red herring / unrelated. (I suspect that it's a bug to call any variation of wait_for_completion while atomic, and perhaps outside of a lowlatency kernel the opportunity wouldn't arise.)
@adamdmoss thanks for the additional testing. The issue you reported wasn't introduced by this change; I've gone ahead and opened PR #12696 with your stack trace to track it.
@behlendorf thanks for taking a look; I'll make those changes.
@tonyhutter: Thanks for tagging old threads with this - email notifications are handy.

The benchmarks show considerable system CPU usage increases in the non-O_DIRECT case. When MQ was first mainlined, we saw that sort of thing as well and ended up tracking it down to scheduler thrash over where things should be handled (related IOs want to be on the same CPU/queue, but they're not guaranteed to be submitted into the same queue, IIRC). Is the threadpool fixed in your testing, or dynamic?

I may not need to do this anymore, but we boot older (like pre-v7) Xeons using

There is also a dm-crypt optimization which came about with faster SSD/NVMe storage: it de-queues crypto operations and forces serialized execution in the same context. I think it came in around 5.10; Cloudflare wrote it, IIRC. If a similar trick is feasible here, it might help reduce the gaps between O_DIRECT and non-direct, or possibly even help with the system load if it is in fact being caused by queue scheduling contention.

Also, what's the impact of 100 RMW cycles on the benchmark - does it actually help execute other IOs while the slab allocator seeks a viable hole in the extent, or does the RMW penalty overtake the gains here? Unfortunately ZVOLs wear like race slicks :).

Thanks a ton for tackling this, looking forward to faster ZVOLs. Any thoughts on whether this would play nice with the async DMU idea Matt Macy was working on? Async dispatch across multiple queues sounds offhand like it would have cumulative benefits (so long as the async parts are truly evented instead of having to poll all those queues).
Test suite is failing.
There shouldn't be... In the end it's just passing BIOs to the zvol layer as before. Famous last words though...
Correct, but in this case it's a good thing. It's showing that the system is utilizing more of the CPU cores to get the work done quicker. For example, in the 16-CPU benchmark it's basically burning ~18% more CPU cycles for 252% more throughput. There may be some scheduler thrash, as we do have lots of queueing going on. And by that I mean each
I don't know, that would be interesting to see. I used parallel dd's in the benchmark since that's what the developer who originally reported the issue used.
I'm not familiar enough with that code to say one way or another. This PR doesn't change anything lower than the BIO layer, so I'm assuming it would be fine.

@DemiMarie yeah, for some reason it's not running the
@tonyhutter This is a very interesting performance optimization and I would definitely give it a try. Since versions 0.8.6 and 2.0.6 are widely used and seem quite stable now, could you please produce backport patches for 0.8.6 and 2.0.6?
(force-pushed 2102b5f to 8bf8dad)
@samuelxhu this will either go into a future 2.1.x or 2.2.x release, depending on risk/benefit. 2.0.x is only for bugfixes and low-risk patches, and 0.8.6 is no longer developed, so there are no plans to backport it.
It would be nice if the ZTS performance suite had zvol tests. Perhaps a hackathon project ;-)
@tonyhutter: I believe there is a way to restrict the viable schedulers of a block device so as to remove the "intelligently queuing" options from the available list. AFAIK, distros tend to default to things like CFQ for generic use-cases, which stack unfavorably with the IO scheduler underneath the ZVOL. I think there's also a way to set the default scheduler, but I haven't dug around in that part of the kernel in some time.
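For reference, the per-device scheduler can be inspected and changed through the standard Linux sysfs interface (this is not something added by this PR, and /dev/zd0 is just an assumed zvol device name):

```sh
# List the schedulers available for the zvol; the active one is in brackets.
cat /sys/block/zd0/queue/scheduler

# Bypass the extra elevator entirely for this device (blk-mq uses "none").
echo none > /sys/block/zd0/queue/scheduler
```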
Setting the default is for all block devices, unfortunately. ZFS used to set the noop scheduler on the vdevs, but this mechanism was removed from the Linux kernel in torvalds/linux@85c0a03
To quote Homer Simpson: "doh!"
@behlendorf my last push has all your changes included. I need to look into some
(force-pushed 2c5b268 to b1a1fcb)
I may have found an unpleasant bug.
which produces
Some of that stack won't be reproducible by others - this is a 5.15 grsec kernel (hence the RAP hash calls) - but it's the IO which never completes, not a failure of the call/ret site check. Since we build ZFS into the kernel, I updated the code to default to using blk-mq for ZVOLs, and in order to ensure that this is in fact the cause (it happened multiple times in a row), I booted with

I think we need tests where we build other block constructs atop ZVOLs to test for scheduling interactions between layered blockdevs in-kernel.
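A hypothetical reproducer along those lines - an LVM thin pool plus a filesystem layered on a zvol - might look like this (pool, VG, and size names are made up, not the exact setup from this report):

```sh
# Layered-blockdev test: LVM thin pool + XFS on top of a zvol.
zfs create -V 100G tank/lvmtest
pvcreate /dev/zvol/tank/lvmtest
vgcreate vg_zvol /dev/zvol/tank/lvmtest
lvcreate --type thin-pool -L 90G -n tpool vg_zvol
lvcreate --thin -V 80G -n thinvol vg_zvol/tpool
mkfs.xfs /dev/vg_zvol/thinvol
mount /dev/vg_zvol/thinvol /mnt
# ...then drive heavy I/O through the stack and watch dmesg for hung tasks.
```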
Does ZFS do any memory allocations in the I/O path? Doing so can deadlock.
To add injury to insult, since thin pools are fragile things to begin with, this one got toasted. |
I take it that thin_repair wasn’t able to fix the problem? If there was a transaction ID mismatch, that may be fixable by hand-editing the metadata. |
@DemiMarie - thin_repair found nothing wrong, but the XFS atop the volume wouldn't mount or even be detected as XFS by xfsfix. Whatever was written out to the thin pool mangled the FS beyond recognition.
@sempervictus Funny you should mention the lockup - I was just doing an "extract the Linux kernel tarball" test on top of an ext4-formatted blk-mq zvol, and also hit lockups:
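(The lockup trace itself isn't reproduced here.) For context, that test is roughly the following, with placeholder dataset and tarball names:

```sh
# Approximate "extract a kernel tarball onto ext4-on-zvol" test.
zfs create -V 50G tank/ext4test
mkfs.ext4 /dev/zvol/tank/ext4test
mount /dev/zvol/tank/ext4test /mnt
tar -C /mnt -xf linux-5.16.tar.xz    # any large tarball will do
sync
```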
Discussion:

There have been essentially four versions (really, variations) of the blk-mq code that I've tested over the course of this PR:

Ver 1: blk-mq with 'struct requests' on the workqueue (

This was the first iteration I submitted for this PR. It put each

Ver 2: blk-mq with BIOs on the workqueue (

This is the current branch in the PR. It is also the branch that is presently giving us lockups. It puts each BIO from each

Ver 3: blk-mq with the old-school "zvol threads" (

This was an experimental branch that I put together to see if I could rectify the bad random-RW performance by using the existing zvol codepaths with blk-mq. It worked, but gave mostly the same (and sometimes worse) performance as the current, vanilla, non-blk-mq zvol code.

Ver 4: blk-mq with 'struct requests' on the workqueue + "BIO merging" (

This is another experimental branch that takes Ver 1 and adds the ability to merge contiguous BIOs within a request into one big BIO. The idea is to reduce dbuf locking overhead and processing for contiguous RW. I've seen it merge requests with 300+ BIOs into one BIO, for example. It has given me the best performance so far for O_DIRECT sequential writes.

I think the way forward is Ver 4. Or more specifically, I will upload Ver 1 for this PR (which I'm hoping will fix the lockup), and then do a follow-on PR with the BIO merging in it to bring it up to Ver 4.
@tonyhutter: I'm having a fair deal of déjà vu from when @ryao yanked the old version of these block device mechanisms. One of the resulting follow-ups was to implement 5731140, with the following comment:
So really, we're talking about reverting that when we use the MQ pipeline, and leaving it as it was hacked together five years ago when using the existing pipeline. If the Linux-proper way to do this is to feed BIOs (a la v3) into the queues, is what we're seeing some sort of ordering issue? Is there some analog to

EDIT: sorry about the nuisance nitpicking here - I was around the last time this codepath was significantly altered, and in the years since we have lost so much performance relative to the gains in backing storage that it seems like this is something "we must get right, this time" if ZVOLs are to stay relevant in modern compute environments (backing cloud storage for medical research, for example).
Right, we would revert
I don't know if it's the ordering. I suspect the slightly worse performance for V3 may be due to the fact that V3 is basically "do it exactly the same as the old way, but put the BIO in a blk-mq queue first". So there's more overhead from queuing.
Unfortunately, there's no easy way to tell Linux to "merge the iovecs from the BIOs in this request". Maybe it did that in older kernels back in the day, but I don't see where it does it anymore (and I have looked). Note that for the "BIO merging" patch, we actually do merge the iovecs, in that we create a single UIO (which is ZFS's internal representation of a BIO) from multiple BIOs, and then put the BIOs' iovecs into the UIO as an array.
Yea, when all is said and done, we'll basically have:

zvol_use_blk_mq = 0: Use the old zvol code. Best for low-recordsize zvols with mostly small random RW workloads (databases).

Alternatively we could get rid of

Also -
(force-pushed 9545c22 to 193025a)
I can't quite tell how much of the force-push from 3 days ago is this PR and how much is delta in master without a deep dive.
@sempervictus the latest push changes the PR to "Ver 1: blk-mq with 'struct requests' on the workqueue". Hopefully that will fix the crash we were seeing earlier, and also preps the code for the follow-up "BIO merge" PR.
Thank you, sir. Should I wait for the follow-up before throwing this into our kernel tree, since we saw that the struct approach worked before?
@sempervictus yes, feel free to give it a shot. It's not crashing on me anymore when I do my ext4-on-zvol test. Don't forget to enable blk-mq before importing/creating your volumes:
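The exact command was trimmed above; as a sketch, the `zvol_use_blk_mq` module parameter can be set at module load time, for example (the modprobe.d file name is arbitrary):

```sh
# Set the parameter when loading the module...
modprobe zfs zvol_use_blk_mq=1

# ...or persistently via modprobe.d
echo "options zfs zvol_use_blk_mq=1" > /etc/modprobe.d/zfs-blkmq.conf

# Then import/create the pool so the zvols pick up blk-mq request queues.
zpool import tank
```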
Reviewers - this PR is passing ZTS and appears to be ready to go. Can you please take another look?
I've been running the version from ~5 days ago without issue.

EDIT: looks like I am running the correct version, so I will get a VM to run the tests for a bit.
@tonyhutter can you go ahead and rebase this to resolve the kernel.org build issue?
No crashes yet - two identical VMs on grsec 5.15.17 (with RAP and all the goodies) running, one with MQ on and one with it off. I might need new NVMes at some point with all this abuse 😄, but it looks to be working.
@tonyhutter - when copying in-tree but building as a module (normally I build ZFS into the main binary), we hit this little gem:
^^ 5.10.96
Hmm.. that might be a general build bug then - my PR doesn't do anything with that function. Maybe try re-running
Add support for the kernel's block multiqueue (blk-mq) interface in the zvol block driver. blk-mq creates multiple request queues on different CPUs rather than having a single request queue. This can improve zvol performance with multithreaded reads/writes.

This implementation uses the blk-mq interfaces on 4.13 or newer kernels. Building against older kernels will fall back to the older BIO interfaces.

Note that you must set the `zvol_use_blk_mq` module param to enable the blk-mq API. It is disabled by default.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue openzfs#12483
I'm closing this PR in favor of #13148, which has both the blk-mq changes and the "BIO merging" that we've been talking about. I originally thought it would be better to split it into two PRs, but after looking at the new code it doesn't really make sense to do it that way.
Motivation and Context
Increase multi-threaded performance on zvols (Issue #12483)
Description
Add support for the kernel's block multiqueue (blk-mq) interface in the zvol block driver. blk-mq creates multiple request queues on different CPUs rather than having a single request queue. This can improve zvol performance with multithreaded reads/writes.
This implementation uses the blk-mq interfaces on 4.13 or newer kernels. Building against older kernels will fall back to the older BIO interfaces.
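As a quick way to see the per-CPU queues once blk-mq is enabled (assuming the zvol shows up as /dev/zd0; this is the kernel's standard blk-mq sysfs layout, not something added by this PR):

```sh
# Each subdirectory is one hardware context; cpu_list shows which CPUs map to it.
ls /sys/block/zd0/mq/
cat /sys/block/zd0/mq/*/cpu_list
```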
How Has This Been Tested?
Test case added. Also ran benchmarks (results to follow).