Reinstate zvol_taskq to fix aio on zvol #5824
Conversation
@tuxoko, thanks for your PR! By analyzing the history of the files in this pull request, we identified @bprotopopov, @behlendorf and @edillmann to be potential reviewers.
module/zfs/zvol.c
rl_t *rl;
dmu_tx_t *tx;
#ifdef HAVE_GENERIC_IO_ACCT
unsigned long start_jif = jiffies;
nit: I prefer to see the start time jiffies adjacent to the generic_start_io_acct()
indeed, you can even include the generic_start_io_acct() inside the ifdef, for clarity
Yes, that would be nice. You can even now declare the variable there since 4a5d7f8 was merged.
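A minimal sketch of that arrangement, assuming a call site shaped like the current accounting code (the function body and arguments here are illustrative, not the PR's final diff):

```c
static void
zvol_write_sketch(zvol_state_t *zv, struct bio *bio)
{
#ifdef HAVE_GENERIC_IO_ACCT
	/* Declare the start time right next to the accounting call it feeds. */
	unsigned long start_jif = jiffies;

	generic_start_io_acct(WRITE, bio_sectors(bio), &zv->zv_disk->part0);
#endif

	/* ... perform the write ... */

#ifdef HAVE_GENERIC_IO_ACCT
	generic_end_io_acct(WRITE, &zv->zv_disk->part0, start_jif);
#endif
}
```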
@@ -56,9 +56,11 @@
unsigned int zvol_inhibit_dev = 0;
unsigned int zvol_major = ZVOL_MAJOR;
unsigned int zvol_threads = 32;
It seems to me that reflecting the current value of zvol_threads via blk_update_nr_requests() would make this more consistent with other blk devices.
A stretch goal, which is fine to defer as later work, is to allow zvol_threads to be changed via nr_requests in sysfs.
By parallelizing this again with a taskq we're potentially allowing I/Os to be submitted and completed out of order. For example, the request queue might contain a WRITE followed by a READ for the same LBA. That could end up being processed by the taskq as a READ followed by a WRITE, which wouldn't be good.
One possible way to correctly handle this would be to take the required range locks synchronously in zvol_request() prior to dispatching the bio to the taskq. That way we're certain all the operations will be processed in the order in which they were received, and still done in parallel when possible. We may however need to take some care to ensure this can never deadlock due to pending requests in the taskq.
This would also be a great opportunity to investigate implementing the zvol as a multi-queue block device. In theory this should also allow us to improve the sustained IOPS rate for a zvol.
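A hedged sketch of the lock-then-dispatch idea (helper names such as zvol_dispatch_task() and zvol_task_alloc() are invented for illustration; field names follow the current tree but may differ from the final patch):

```c
static void
zvol_request_sketch(zvol_state_t *zv, struct bio *bio)
{
	uint64_t offset = BIO_BI_SECTOR(bio) << 9;
	uint64_t size = BIO_BI_SIZE(bio);
	int rw = bio_data_dir(bio);
	rl_t *rl;

	/*
	 * Take the range lock synchronously, in submission order, so
	 * overlapping I/Os handed to the taskq cannot be reordered.
	 */
	rl = zfs_range_lock(&zv->zv_range_lock, offset, size,
	    (rw == WRITE) ? RL_WRITER : RL_READER);

	/*
	 * The task function performs the read/write/discard, calls
	 * zfs_range_unlock(rl), and ends the bio.
	 */
	if (taskq_dispatch(zvol_taskq, zvol_dispatch_task,
	    zvol_task_alloc(zv, bio, rl), TQ_SLEEP) == 0) {
		/* Dispatch failed: drop the lock and error out the bio. */
		zfs_range_unlock(rl);
		BIO_END_IO(bio, -ENOMEM);
	}
}
```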
module/zfs/zvol.c
zvol_state_t *zv = q->queuedata;
fstrans_cookie_t cookie = spl_fstrans_mark();
uint64_t offset = BIO_BI_SECTOR(bio)<<9;
nit: Let's add single spaces around the << operator.
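For reference, the line with the suggested spacing would read:

```c
uint64_t offset = BIO_BI_SECTOR(bio) << 9;
```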
module/zfs/zvol.c
rl_t *rl;
dmu_tx_t *tx;
#ifdef HAVE_GENERIC_IO_ACCT
unsigned long start_jif = jiffies;
Yes, that would be nice. You can even now declare the variable there since 4a5d7f8 was merged.
I have thought of that, just wasn't sure if we needed it. But I think you're right that we should do this. I don't think there's any way of deadlocking, since the lock dependency doesn't change at all.
I had wanted to do that for a long time, but it would mean we need to support two completely different zvol_request paths due to legacy kernel support. So I haven't brought myself to do it.
I can't think of a scenario either, since we won't be taking any locks in the taskq function.
This PR seems to be hitting the codepath which causes the bug I describe in #4265. I ran the same non-sync-IO tiotests against a fresh ZVOL created before the upgrade, and one recreated after the upgrade, and saw something rather interesting. In one of the test cycles after the upgrade, the linear/random write speed ratio flipped: the linear writes went >2X faster than the random ones, but the random writes slowed down to hell. I have Linux prefetch disabled on ZVOLs using fad4380317, both before and after, and up till now nothing since the ZVOL overhaul has made linear writes work even near the speed of the underlying SSD.

As I mention in the issue, the ZVOL use case is rather important for shared storage, and today it's impractical for significantly loading consumers and/or contending ones. If there is something serializing the write requests into blocking IOs, this PR might be hitting it, and while the results aren't great from the paltry bench I ran (random writes at half the speed, sometimes, maybe), it seems to be in the right ballpark. Here's what I'm seeing using the same non-sync-IO tiotest I ran for the issue benchmarks:
Post patch:
3X loop of tiotest to reproduce:
EDIT: This PR also destroys read speeds. Looking at the same benchmarks, I'm guessing we're getting blocked in the AIO layer before we hit ARC:
Post patch:
Running these workloads at 15% of the volsize 10X (each overwriting the last, supposedly), the performance goes down to 80/80MB for writes, while reads appear capped at that ~250 mark.

On a related note, we need a consistent benchmarking approach which accounts for all the desirable synthetic counters (fio, others?) and real-world applications written by people who don't think about IO models or OS semantics (tiotest, others?), and produces common outputs we can all work with. Different users will be interested in different representations of said benchmarks and their associated tunables, since people backing ASM with ZVOLs and those backing Cinder will probably not have the same IO patterns (till a Cinder user builds a RAC in their cloud).

@tuxoko: This work is obviously of great interest to us, so if you need us to test ideas/options, we can definitely allocate some resources, including bare metal if need be.
@sempervictus Also, I'm still struggling with some performance inconsistencies. I originally tested this patch on a bare-metal pool of 8 spindle disks in mirrored pairs with 256GB RAM. Doing fio 4k randwrite with an iodepth of 32 would cause strange behavior: it would write for 3~5 sec, hang for 0.5~2 min, and repeat. I figured out that reducing zfs_dirty_data_max from 10% of total memory to 1G greatly reduces this behavior. However, when testing on a KVM guest with SSD-backed virtual disks and 8GB RAM, the same test still shows intermittent hangs of 2~4 sec.
@tuxoko: the readahead patch was not helping me here, dropped that, added TRIM support, and numbers are looking a lot better on the same 840 SSD.
Writes in the same conditions:
To see how reuse of the volume space affects this stack, I ran the same test 20X with 8 threads instead of 4 to double up on the amount written. It's a bit "all over the place" in terms of numbers, which is slightly concerning, but it definitely shows improvement and doesn't have a clear trend of degrading performance down to a crawl. Results paste: http://pastebin.com/BvQ9XLnu
It may be worth exploring creating multiple zvol taskqs. Previously there was some concern over contention on the taskq spin lock, which could result in inconsistent behavior. One per multi-queue might work well. On a related note setting
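A rough sketch of how multiple taskqs could look (everything here is hypothetical and not part of this PR): a small array of taskqs selected per submitting CPU, so concurrent submitters don't all contend on a single taskq spin lock.

```c
/* Hypothetical number of dispatch taskqs; could instead track queue count. */
#define	ZVOL_DISPATCH_TASKQS	4

static taskq_t *zvol_taskqs[ZVOL_DISPATCH_TASKQS];

static int
zvol_taskqs_create(void)
{
	int i;

	for (i = 0; i < ZVOL_DISPATCH_TASKQS; i++) {
		/* Error cleanup omitted for brevity. */
		zvol_taskqs[i] = taskq_create("zvol",
		    zvol_threads / ZVOL_DISPATCH_TASKQS, maxclsyspri,
		    1, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC);
		if (zvol_taskqs[i] == NULL)
			return (-ENOMEM);
	}

	return (0);
}

/* Pick a taskq based on the submitting CPU to spread lock contention. */
static inline taskq_t *
zvol_dispatch_taskq(void)
{
	return (zvol_taskqs[CPU_SEQID % ZVOL_DISPATCH_TASKQS]);
}
```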
I just got a test rig set up for this and have run a couple of initial 1 hour write-only tests with this fio configuration:
The pool is 50 spinning disks in a set of 25 2-drive mirrors, so it ought to have pretty good bandwidth and IOPS. I capped the ARC at 32GiB for this test, but the test never used much more than 8GiB of it. One test was run with today's master and the other with the zvol taskq patch rebased to the same master. The test system has 40 cores with 2 threads each (80 threads). Picking just the write bandwidth and IO depth output from current master:
and with the taskq it is:
I'd be more than happy to run any other battery of tests, but this one doesn't show a whole lot of difference.
@dweeezil: Thanks for putting together such a robust rig :). What happens when job concurrency is replaced with thread concurrency? A la numjobs=1 thread=32 with size=8192m (if I remember my fio correctly)?
@sempervictus Actually, thanks to a corporate citizen with an interest in the advancement of OpenZFS. Anyway, I will try the same test with threads, but in my experience it generally doesn't change the results much. I normally run with processes for easier access via strace, etc. I'd sure love to be able to concoct an fio workload that exposes a big problem with the current zvol (req-based) system, but so far I haven't been able to do so.
@dweeezil: I generally use fio to work out how databases and other consumers with synchronous IO requirements will fare, or at least those with predictable access requirements based on their workload. When it comes to dealing with consumers such as the KV indices/datastores everyone's so fond of these days, I actually often find tiotest and fio with direct=0 to be a slightly better indicator of the likely profile some of these chaotic consumers will need.

So far as test profiles go, I'd say we should keep something like what you're currently doing as a baseline, add non-direct tests, and implement some real-world scenarios like everyone's favorite iSCSI export. With AIO working again, SCST might be a viable option (though IIRC it needs SG hooks), but LIO is always there and seems to induce undue iowait on ZVOLs in real-world environments. There's always sync=always if you want to ramp up the hate factor... I've seen strange things like multiple VMs using LVM iSCSI exports from the same large ZVOL and occasionally bringing it to a crawl with some unpleasant order/priority of small IOs which just happen to line up poorly. I've always thought this to be indicative of some misunderstanding between the OS and ZFS IO elevators. There's also ASM/RAC, but that requires a faustian licensing bargain, and potentially the NFS workload, which can become a different kind of mess depending on the NFS conditions in the kernel and between hosts.
The value of this is not in the average, overall performance. The value is to provide an async experience to those consumers who don't handle async themselves. Specifically, a consumer that expects generic_make_request() to return prior to the BIO_END_IO() will be happier. Such consumers exist, but beware of testing with the various load generators because they tend to have pluggable engines with differing policies.
While I agree that the interface for consumers is the point, performance regressions are a legitimate class of bug, and we currently don't test before/after for commits in our test cases. For people running stable releases, an upgrade introducing features and interfaces which murders performance could be a nightmare if other system components now depend on the presence of the interface. Something tells me 99% of users don't do the silly levels of testing performed by those of us commonly commenting on these PRs, and they're the ones usually posting issues about all hell breaking loose after yum or apt changes 100+ packages including zfs...
OK, so this is probably due to how qemu/kvm handles virtual disk cache. I turned off the cache and things look a lot better.
@sempervictus I think we are in agreement. Regression testing is important. In this case, there are only a few consumers that are blocking on this, so the appropriate test is needed to prove the functionality works.

@tuxoko I think an appropriate change is to keep the current, blocking behaviour between what is essentially generic_start_io_acct() and generic_end_io_acct() when zvol_threads == 0. Only create the threads and offload to them when zvol_threads > 1. That way we can take care of the blocking consumers as an exception, rather than the rule. Thoughts?
@richardelling |
@tuxoko as I read the current code, BIO_END_IO completes before zvol_request completes. I'll use the term blocking, because the request blocks until after the I/O completes. It does not matter if the consumer sets a callback. Functionally, this is fine. There appear to be very few consumers that will expose this behaviour.
So it looks like I found some viable tuning knobs in sysfs relating to how the block device writes are handled with this patch.

EDIT: these tests were done with read_ahead_kb set to 0, compression=lz4, on a single 850 Pro. This PR, the TRIM PR, and multi-threaded SPA sync are included in the stack; the build was done yesterday. I can pull the commit ref if needed.

EDIT2: I took a quick look through blkdev_compat.h and it looks like calling blk_queue_set_write_cache as blk_queue_set_write_cache(zv->zv_queue, B_FALSE, B_TRUE); should remove the write cache thing, but QUEUE_FLAG_NOMERGES looks a bit more complicated, and I've not yet even found where we hook the max_readahead_kb control. Benchmark loops in the background are still producing the same results; it seems to slow down ~15% with volblocksize=4k (from 8), but there's no significant change past 8K. Also, does anyone know how we play with the blk_mq vs blk_sg differences, if at all? It seems the MQ changes fit rather well into the idea of worker threads in ZFS.
@sempervictus Sure. I'll add it to my todo list this weekend (which was mainly to get the TRIM stuff refreshed). Presumably my current 4.9.8 upstream kernel will be OK for this testing?
@dweeezil: should build just fine on vanilla, looking forward to your results.
@tuxoko Right now, only reads are synchronous while writes are asynchronous, unless we write a partial uncached record. In the case of a partial write of a cached record or a write of a whole record (cached or uncached), we return to userspace after copying into the kernel, and the DMU then handles things from there. The correct solution is to rework the DMU so that asynchronous dispatch of reads and writes occurs there and then execute a callback to notify userspace when it finishes. There actually is a callback notification mechanism in the ZIO layer, but it needs changes to support reading/writing from a userspace buffer.

That was my original plan to take performance to the next level at some point after that patch. zvol performance became a low priority for me soon afterward due to people at my new job at the time not caring about zvols. Now that I am no longer there, I have time to care again. This definitely has my attention because it is affecting me at the moment, but it is number 4 on my list of things to do right now.
@ryao: any chance you have the bandwidth or inclination to implement or explain the required changes in the DMU? The ZVOL behavior and performance issues have been a thorn for ages, and it would be great to have them behave predictably, play nice with consumers, and perform better than the village floppy disk...
@sempervictus We'll see. It is number 3 on my to-do list.
@tuxoko OK, it's pretty clear this is a change we need to make. When you get a moment, can you rebase this patch against master so we can get a clean current test run? EDIT: @sempervictus does this PR still negatively impact your workloads even with your proposed changes in #5902?
Rebased.
Commit 37f9dac removed the zvol_taskq for processing zvol requests. I imagine this was removed because after we switched to a make_request_fn-based approach, we no longer received requests from interrupt context. However, this also made all bio requests synchronous and caused a serious performance issue, as the bio submitter would wait for every bio it submitted, effectively limiting iodepth to 1. This patch reinstates zvol_taskq, and to make sure overlapping I/Os are ordered properly, we take the range lock in zvol_request and pass it along with the bio to the I/O functions zvol_{write,discard,read}. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
In order to facilitate getting this merged and aid in benchmarking, add a zvol_request_sync module option to switch between sync and async request handling. For the moment, the default behavior remains sync. Additionally, fix a few minor nits. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Based on the discussion above, and what I've seen in other issues, it seems clear that some workloads/configurations benefit from handling bio requests asynchronously while others do better with synchronous semantics. In order to get this merged and facilitate more comprehensive performance testing, I've added a patch to this PR which adds a zvol_request_sync module option for switching between the two behaviors. @ryao @tuxoko @sempervictus @richardelling please comment and review.
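For reference, the knob boils down to something along these lines (a sketch, not the exact merged code; zvol_do_bio() is an invented name for the function that performs the I/O and completes the bio):

```c
static unsigned int zvol_request_sync = 1;	/* default: keep the old sync path */
module_param(zvol_request_sync, uint, 0644);
MODULE_PARM_DESC(zvol_request_sync, "Synchronously handle bio requests");

static void
zvol_submit_sketch(void *task)
{
	/*
	 * Either handle the bio in the submitter's context (sync), or
	 * dispatch it to zvol_taskq (async); fall back to the sync path
	 * if the dispatch fails.
	 */
	if (zvol_request_sync ||
	    taskq_dispatch(zvol_taskq, zvol_do_bio, task, TQ_SLEEP) == 0)
		zvol_do_bio(task);
}
```

Since the parameter is exported with mode 0644, it should be adjustable at runtime via /sys/module/zfs/parameters/zvol_request_sync rather than requiring a module reload.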
@behlendorf: VM testing hasn't produced any crashes in ztest loops, so that's good - with either 1 or 0 for zvol_request_sync. For clarification, when you say "switch between" do you mean to say that runtime changes to the module option will alter behavior? I've been reloading the module to test in the VM just to be safe.
I've been running some benchmark tests on our servers with 0.7.0 RC3 and this PR applied. I've outlined the results in that thread to keep it all together, see #4880
@sempervictus yes exactly. You can safely switch zvol_request_sync at runtime.

@koplover thanks for testing this patch and posting your results. I really appreciate the feedback; the more data points we have regarding performance the better. Since this change has been reviewed, passes all the automated + manual testing, and doesn't change the existing behavior, I'll be merging it to master. What I'd like to determine over the next week or two is what the optimal default values are and which of the changes in #5902 we need to bring in. As always, additional performance analysis is appreciated!
@behlendorf
@tuxoko thanks! I've squashed the commits and merged this.
… on zvol) from branch 0.7.* to 0.6.5.11. reference: openzfs#5824 openzfs@692e55b?diff=unified
Commit 37f9dac removed the zvol_taskq for processing zvol requests. I imagine this was removed because after we switched to a make_request_fn-based approach, we no longer received requests from interrupt context.
However, this also made all bio requests synchronous and caused a serious performance issue, as the bio submitter would wait for every bio it submitted, effectively limiting iodepth to 1.
This patch reinstates zvol_taskq, and refactors zvol_{read,write,discard} to make them take a bio as argument.
Signed-off-by: Chunwei Chen david.chen@osnexus.com