
zvol rework #2484

Closed · wants to merge 11 commits

Conversation

@ryao (Contributor) commented Jul 10, 2014

The Illumos taskq import that I have been developing is incompatible with the current zvol code due to how Linux handles interrupts. This refactors the zvol code to stop processing IOs in interrupt context. See the commit message of ryao/zfs@1d22dbd for a full description of the benefits. To summarize, we see the following benefits from this change:

  1. No per zvol spinlocks.
  2. No redundant IO elevator processing.
  3. Interrupts are disabled only when actually necessary.
  4. No redispatching of IOs under heavy load.
  5. Linux's page out routines will properly block.
  6. Many autotools checks become obsolete.

In addition, this pull request includes ryao/zfs@4ed7d71, which is a fix for a potentially rare deadlock in ZIO processing that became readily reproducible in the presence of this change.
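
For readers following along, the shape of the change is roughly the sketch below. This is a minimal illustration, not the actual patch: it assumes a post-3.2 kernel (void-returning make_request_fn), and the struct layout plus the zvol_read()/zvol_write()/zvol_setup_queue() helpers are hypothetical stand-ins for the real zvol plumbing; zfs_request() is the entry point named in the commit message.

```c
/*
 * Minimal sketch of the approach (not the actual patch): register our
 * own ->make_request() handler so bios are serviced in the submitting
 * thread's context instead of being queued, elevator-sorted, and then
 * completed from an interrupt context.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>

typedef struct zvol_state {
	struct request_queue *zv_queue;	/* the zvol's request queue */
	/* DMU handles, volblocksize, etc. omitted */
} zvol_state_t;

/* Placeholder IO paths; the real code would issue DMU reads/writes. */
static int zvol_write(zvol_state_t *zv, struct bio *bio) { return (0); }
static int zvol_read(zvol_state_t *zv, struct bio *bio) { return (0); }

static void
zfs_request(struct request_queue *q, struct bio *bio)
{
	zvol_state_t *zv = q->queuedata;
	int error;

	/*
	 * Runs synchronously in the submitting thread's context: no
	 * per-zvol spinlock, no IO elevator pass, and page-out writes
	 * naturally block here instead of piling up asynchronously.
	 */
	if (bio_data_dir(bio) == WRITE)
		error = zvol_write(zv, bio);
	else
		error = zvol_read(zv, bio);

	bio_endio(bio, error);
}

static void
zvol_setup_queue(zvol_state_t *zv)
{
	/* Bypass blk_queue_bio(), the kernel's generic handler. */
	blk_queue_make_request(zv->zv_queue, zfs_request);
}
```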

@FransUrbo (Contributor)

So in essence, we're going to get [some,much,huge?] improvements on ZVOLs?

@ryao (Contributor, Author) commented Jul 10, 2014

@FransUrbo I expect zvol performance to improve, but I have not done performance tests to confirm/quantify it. The purpose of these changes was to refactor zvols into something that is compatible with the Illumos taskq import that I am developing. The potential for improved performance is a bonus.

Also, I just pushed a small revision to fix a typo in a commit message. The code in the pull request has not changed.

@kernelOfTruth (Contributor)

This sounds like it might also improve and cut down latency for desktop systems (e.g. during backup jobs which unavoidably need to run while you are working with the system). ZFSOnLinux already feels snappier than Btrfs, which even with the latest code still causes some huge latency spikes and lags.

nice !

@behlendorf (Contributor)

@ryao this looks promising. Let's get some random and sequential performance tests run for 4k/1m block sizes with the old and new code. Then we'll be able to say something about performance. We'll also want to run a stress test against the block device to check correctness. Using the zvol to back another pool or a btrfs filesystem would be a good sanity check.

@ryao (Contributor, Author) commented Jul 10, 2014

@behlendorf I used it to back a UFS2 filesystem, populated it with files, unmounted it and ran fsck. These changes did not cause any problems. I will take some time on the weekend to torture test ZFS on a zvol and then see if it passes a scrub. I will try to include some performance testing with the torture testing.

@ryao (Contributor, Author) commented Jul 10, 2014

@kernelOfTruth If you are not doing anything that involves zvols, this will not have any effect. If you are using zvols, this might improve responsiveness. The current zvol code disables kernel preemption by doing processing in an interrupt context. Disabling kernel preemption is considered to be bad for interactive response, so this change should be a good thing with respect to that.

I expect the biggest change to occur with swap on zvols under memory pressure. Right now, it is possible for swap on zvols to deadlock, and we have not had a clear picture of why. It occurred to me while refactoring the code that the asynchronous nature of the current zvol code is a key difference between ZFSOnLinux and ZFS on Illumos in terms of how swap/page-out on zvols works: on Illumos, blocking operations are used on zvols, while on Linux we have had asynchronous operations on zvols.

The consequence for page out is that the Linux kernel will race through its page out routines when swap is on a zvol by moving pages from userland into the zvol's request queue far faster than they can travel to disk. That has a feedback effect where the page out routines worsen pressure and then respond to the increase in pressure by trying even harder. That is contrary to how pageout is supposed to work.

An analogy would be to imagine a sinking ship where crews are trying to pump water off the ship. The way the current zvol code works, only one team is pumping water off the ship while all the others are pumping into that team's cabin. If the team pumping water out of the ship drowns because the others filled the cabin faster than it could be emptied, everyone drowns.

I believe that the asynchronous zvol operations have a similar effect on page out. This could explain how we manage to deadlock a system under prolonged pressure when swap is on zvols. Specifically, the page out routines gobble up so much of the system's emergency memory that ZIO processing can no longer operate and things deadlock. In addition, the asynchronous zvol operations under memory pressure should schedule far more for page out than necessary, which would worsen lags caused by page faults.

I am already using this code on my production workstation with no ill effect. My feeling is that these changes will make swap on zvols stable enough for production deployment, but I have yet to test this theory because my workstation has 64GB of RAM. I suspect desktop workloads under constrained memory would be a good way of gauging the effect, but it is hard to do that quantitatively and my time is limited. If others beat me to evaluating the effect of these patches on desktop workloads, I would be happy to hear their results.

@IronWolve

I was just going to open a bug about zvols and a runaway spl_kmem_cache thread. I created a zvol, partitioned it, then started to format it. During the format it gets a few inodes done and then the CPU spikes across all CPUs. CentOS 6.5 box, latest patches and ZFS release. I was going to use zvol+FC because my NFS speeds are horrible.

15:23:41 root@uszfs004: ~ # top -b -n1 | head -14
top - 15:24:29 up 42 min, 2 users, load average: 68.05, 33.87, 19.08
Tasks: 758 total, 2 running, 754 sleeping, 1 stopped, 1 zombie
Cpu(s): 0.1%us, 5.0%sy, 0.0%ni, 91.1%id, 3.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132142108k total, 64291620k used, 67850488k free, 20946480k buffers
Swap: 4194296k total, 0k used, 4194296k free, 84260k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3012 root 39 19 0 0 0 R 100.0 0.0 12:26.49 spl_kmem_cache/

@ryao (Contributor, Author) commented Jul 10, 2014

It is worth noting that there are opportunities for further improvement in this area. Specifically, we could implement zero copy by implementing the file_operations and inode_operations handlers. We would still need a copy when there is compression, but copying would be reduced to the absolute minimum necessary. I suspect that we would need to rewrite the underlying ZIO layer to use bvecs instead of slab-backed buffers to do that; otherwise, we would still have double buffering on mmap(). I believe that certain aspects of Linux will still want to use the request handler, so we would need to take great care to ensure that there are no interfaces that would allow them to access it through the buffer cache, in order to maintain cache coherence. That kind of improvement would be a variation on the sgbuf idea that I plan to revisit when I find time.
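
Purely as an illustration of that zero-copy idea (nothing like this exists in the patch; every handler name here is hypothetical), the zvol device node would get its own file_operations so reads and writes could bypass the block layer's buffering:

```c
/*
 * Hypothetical sketch only; not part of this pull request. A
 * zvol-specific file_operations table would intercept read/write
 * before the buffer cache is involved.
 */
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/module.h>

static ssize_t
zvol_fop_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
	/* Zero-copy path: map DMU/ARC buffers to user pages directly. */
	return (-EOPNOTSUPP);	/* placeholder */
}

static ssize_t
zvol_fop_write(struct file *filp, const char __user *buf, size_t len,
    loff_t *ppos)
{
	/* Compressed datasets would still require one copy here. */
	return (-EOPNOTSUPP);	/* placeholder */
}

static const struct file_operations zvol_fops = {
	.owner	= THIS_MODULE,
	.read	= zvol_fop_read,
	.write	= zvol_fop_write,
	/* .mmap would still double-buffer until ZIO uses bvecs. */
};
```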

@ryao (Contributor, Author) commented Jul 10, 2014

@IronWolve Were you using the code in this pull request? If not, then you should file a separate report. It would be helpful if you included a backtrace of the looping thread and instructions on how to reproduce the problem in your report.

@kernelOfTruth (Contributor)

@ryao if I understood http://zfsonlinux.org/example-zvol.html correctly, I'm indeed heavily using zvols (several "virtual block device[s]" on top of my mirrored /home pool, nested within each other; some of them hold subfolders with large files, some with several small files)

probably will give this a try within the next few days

Thanks !

@kernelOfTruth (Contributor)

Currently using this with #2129, #2291, #2408 and #2484.

I had to make some modifications to dmu.c and vdev_disk.c because the patch would otherwise fail to apply; I hope the changes are correct:

http://pastebin.com/yzW0LghN

The whole diff with the two modified files cleanly applied:

http://pastebin.com/ShFW0wze

@ryao (Contributor, Author) commented Jul 14, 2014

It turns out that this is incompatible with older kernels because a typedef was changed:

Here is the typedef before 3.2:

typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);

Here is the typedef in and after 3.2:

typedef void (make_request_fn) (struct request_queue *q, struct bio *bio);

I need to work out a sane way of handling this with autotools before it can be merged.
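
One conventional way to paper over the typedef change is a pair of compatibility macros driven by a configure check. A sketch follows; HAVE_MAKE_REQUEST_FN_RET_INT and the MAKE_REQUEST_FN_* macros are hypothetical names, not something this tree necessarily defines:

```c
/*
 * Sketch of a compatibility shim for the 3.2 typedef change.
 * HAVE_MAKE_REQUEST_FN_RET_INT is a hypothetical symbol that an
 * autotools compile test against the target kernel would define.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>

#ifdef HAVE_MAKE_REQUEST_FN_RET_INT
#define	MAKE_REQUEST_FN_RET		int	/* kernels before 3.2 */
#define	MAKE_REQUEST_FN_RETURN(x)	return (x)
#else
#define	MAKE_REQUEST_FN_RET		void	/* kernels 3.2 and later */
#define	MAKE_REQUEST_FN_RETURN(x)	return
#endif

static MAKE_REQUEST_FN_RET
zfs_request(struct request_queue *q, struct bio *bio)
{
	/* Service the bio as before, then return per the kernel's ABI. */
	bio_endio(bio, 0);
	MAKE_REQUEST_FN_RETURN(0);
}
```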

@ryao (Contributor, Author) commented Jul 15, 2014

@kpande There is something strange happening in your build environment. This pull request removes rq_for_each_segment4(), but your build failure indicates that the code is calling it.

@ryao (Contributor, Author) commented Jul 15, 2014

@kpande Pull from my git. This is in the zvol branch.

@FransUrbo (Contributor)

@ryao Have you noticed/heard of any possible problems with this?

I built latest master with this (and some other pull requests that I want/need):

After creating a new pool from scratch, I then did a send+receive from my dump pool on a FreeNAS9 box to this new pool. It had been running for a couple of days (it should have been finished this morning). When I checked the machine today, I saw a hard lock; I rebooted and have since tried to import+mount, unsuccessfully.

With

options zfs zvol_inhibit_dev=1
options zfs zvol_threads=64
options zfs zfs_autoimport_disable=1

I can import the pool (using -N), but any use of zfs (even something as simple as zfs list) gives that lock again. SysRq doesn't work, there is nothing in the logs, and the screen isn't big enough to show the full trace... I've included a pic of what I could see.

The only merge other than ZoL/master (very latest) that could have anything to do with this is your pull request...

I'm suspecting this PR (unless it's something with master, but then others should have reported problems?) only because I've seen references to zvol_*(); when NOT setting zvol_inhibit_dev, there's a lot of zvol_create_minors() etc.

After a while, I figured I'd remove this PR, but the problems persist. Just for completeness, I've even tried a completely clean ZoL build. It didn't help: a hard lock a few seconds after running zfs list | less.

I'm not completely ruling out a problem with the pool/data. It was sent from my (quite broken) pool to the FreeNAS9 box (which took a couple of weeks because the pool was so broken; FreeNAS9 doesn't seem to have any problems with it though, I can access everything there), and I'm now trying to send it back after recreating it. Unless you can come up with an idea (even a vague one would work :), I'll have to recreate the pool again (clean build) and try again. That will take a few days though, so I'm hoping you have an idea or two :)

[screenshot: partial stack trace]

@ryao (Contributor, Author) commented Aug 10, 2014

@FransUrbo That looks like a pre-existing problem in the send/recv code. I believe that #2338 might help, but whatever is happening is not fully understood right now. Are you certain that this problem consistently occurs with these patches included and never without them?

@FransUrbo (Contributor)

@ryao Ok, thanx.

However:

> Behavior of zfs send -R is not changed.

That's what I used (and I forgot to add -p when I dumped the pool to the FreeNAS9 machine). So what are my options now?

I see that #2338 might be missing part of the Illumos commit. Will that be a problem?

@FransUrbo (Contributor)

> whatever is happening is not fully understood right now.

Would access to my machines help?

> Are you certain that this problem consistently occurs with these patches included and never without them?

What?

@ryao (Contributor, Author) commented Feb 3, 2015

My curiosity got the better of me. I wrote a fix for the 2.6.32 issue and pushed it for buildbot testing. If all goes well, this should be safe to merge. I do not recommend trying this in production until the buildbot results are back, because it has not been subjected to my own regression tests.

@ryao force-pushed the zvol branch 3 times, most recently from 205e9a8 to 1b5ca81 on February 3, 2015
The main impetus for this patch is the Illumos taskq import. The ZoL
zvol code operated in an interrupt context to take advantage of the IO
elevator and the SPL taskq code was (ab)used to form the bottom half.
The SPL taskq code was designed to be callable in an interrupt context
while the Illumos taskq code was designed to expect the assistance of
the thread scheduler. Since Linux's thread scheduler does not
provide that assistance, we must trampoline through the system workqueue
to continue providing the bottom half when using the Illumos taskq code.
This seems unnecessarily complex when we consider that Illumos itself
does not perform this emulation and the recent multiqueue work in Linux
has identified interrupt processing as a major bottleneck on Linux:

http://kernel.dk/systor13-final18.pdf

The entry point for zvol operations is zfs_request(). That is called in
an interrupt context by blk_queue_bio(), which is the kernel's generic
->make_request() handler. Registering our own handler allows us to avoid
this. Comments in the code suggest that this was done to gain the
benefit of having Linux's IO elevator do merging. However, the DMU
already does write merging while arc_read() should implicitly merge read
IOs because only one thread is permitted to fetch the buffer. In addition,
the main consumers of zvols are VMs and iSCSI shares, which have their
own elevators to merge IOs. Moreover, commercial ZFSOnLinux
distributions report that regular files are more performant than zvols.
That strongly suggests that the original rationale for using Linux's
generic ->make_request() handler was invalid.

Some minor refactoring allows us to register zfs_request() as our
->make_request() handler. This eliminates an obstacle to importing the
Illumos taskq code while giving us several additional benefits:

1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs under heavy load.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.

It is worth noting that the previous zvol code would redispatch IOs when
all zio taskq threads were in use, with the consequence that IOs could
be redispatched and, potentially, reordered across barriers. This
explains an obscure IRC report about corruption on zvol-backed iSCSI
shares triggered by a kernel panic during heavy load. It was thought
that the initiator did something wrong, but now it is clear that the
zvol code itself could reorder IOs, which explains the incident.

Signed-off-by: Richard Yao <ryao@gentoo.org>
ryao added 10 commits February 3, 2015 15:08
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
This autotools check was never needed because we can check for the
existence of QUEUE_FLAG_DISCARD in the kernel headers.

Signed-off-by: Richard Yao <ryao@gentoo.org>
This autotools check was never needed because we can check for the
existence of QUEUE_FLAG_NONROT in the kernel headers.

Also, the comment in config/kernel-blk-queue-nonrot.m4 is incorrect.
This was a Linux 2.6.28 API change, not a Linux 2.6.27 API change.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be
processed, such that we cannot do optimizations like block alignment.
Consequently, the discard semantics prior to 2.6.36 require us to always
process unaligned discards. Previously, we would do this optimization
regardless. This patch changes things to correctly restrict this
optimization to situations where REQ_SECURE exists, but is not included
in the flags.

Signed-off-by: Richard Yao <ryao@gentoo.org>
A contributor commented on the diff, at these lines in zvol_discard():

    }
    #ifdef
I think this is supposed to be #endif for REQ_SECURE, and GCC seems to agree that this is a problem. The strange thing is, the compile-time error only popped up during the DKMS build on one host (several others built fine, which is now slightly concerning). Everything runs the same kernel; I'm not sure why this one tripped it.

@behlendorf modified the milestones: 0.6.5, 0.6.4 on Feb 5, 2015
@behlendorf (Contributor)

Bumping to 0.6.5. If we can get it done for 0.6.4 that would be great, but it shouldn't hold up the release.

@sempervictus (Contributor)

Initial testing is quite encouraging, though I'm still baffled by what is now a consistent build failure on our patch stack. It seems this doesn't play well with the large block and ZIO optimization PRs (#2865 and #3063), which really means it should never have built in the first place on the systems where it does work.

iSCSI and local zvol-backed random IO load from multiple VMs (including running on-host for the local tests, with all the contention involved) shows a ~15% overall drop in iowait for the same workload. When running random greps through VM filesystems searching for strings which will never match, the reduction in wait is closer to 30%, which seems a bit odd, but I'm not complaining.

Thank you @ryao. I'll get this running on a few test hosts before I try to back any of our production SCST hosts with this, but I highly encourage anyone with resources to put it through its paces. Last time we ran this patch stack we ended up seeing issues pretty late in the game, like on resilvering and other unusual operations, so thorough testing is highly recommended.

@devonbessemer

I don't have much experience with git. Can someone guide me on how I can test the latest commit (2.6.32 fix)? I have a few CentOS 6.6 dev boxes.

@FransUrbo (Contributor)

@ryao Fix for commit f113911 (Support secure discard on zvols):

diff --git a/module/zfs/zvol.c b/module/zfs/zvol.c
index e182550..1e8ba38 100644
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ -641,11 +641,11 @@ zvol_discard(struct bio *bio)
         * 2.6.35) will not receive this optimization.
         */
 #ifdef REQ_SECURE
-       if (!(bio->bi_rw & REQ_SECURE))
+       if (!(bio->bi_rw & REQ_SECURE)) {
                start = P2ROUNDUP(start, zv->zv_volblocksize);
                end = P2ALIGN(end, zv->zv_volblocksize);
        }
-#ifdef
+#endif

        if (start >= end)
                return (0);

@kernelOfTruth (Contributor)

@ryao any news or changes on this ?

@behlendorf removed this from the 0.6.5 milestone on Apr 24, 2015
@behlendorf (Contributor)

Yes, this change is definitely going to need to be updated and have some additional review and discussion. @ryao I'm going to close this out for now; please open a new pull request when you have some time to revisit this.

@behlendorf closed this on Apr 24, 2015
@ryao (Contributor, Author) commented May 21, 2015

@kpande Are these issues on recent kernels or just 2.6.32?

@FransUrbo I just saw your comment now. Thanks for catching that. It was a last-ditch attempt to improve this before I put it on the back burner. I just rebased this on head and pushed it to my branch. I haven't had a chance to do testing though.

@predragmandic

@ryao in your patch at this part in zvol.c:

    +#ifdef REQ_SECURE
    +   if (!(bio->bi_rw & REQ_SECURE))
    +       start = P2ROUNDUP(start, zv->zv_volblocksize);
    +       end = P2ALIGN(end, zv->zv_volblocksize);
    +   }
    +#ifdef

I think you are missing '{' at the start of the block, and the '#ifdef' at the end should be '#endif'.

Labels: Type: Performance