
zvol rework #2484

Closed · wants to merge 11 commits

Conversation

@ryao (Contributor) commented Jul 10, 2014

The Illumos taskq import that I have been developing is incompatible with the current zvol code due to how Linux handles interrupts. This refactors the zvol code to stop processing IOs in interrupt context. See the commit message of ryao/zfs@1d22dbd for a full description of the benefits. To summarize, we see the following benefits from this change:

  1. No per zvol spinlocks.
  2. No redundant IO elevator processing.
  3. Interrupts are disabled only when actually necessary.
  4. No redispatching of IOs under heavy load.
  5. Linux's page out routines will properly block.
  6. Many autotools checks become obsolete.

In addition, this pull request includes ryao/zfs@4ed7d71, which is a fix for a potentially rare deadlock in ZIO processing that became readily reproducible in the presence of this change.
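
For readers following along, the shape of the change is roughly the sketch below. This is a minimal illustration, not the actual patch: it assumes a post-3.2 kernel (void-returning make_request_fn), and the struct layout plus the zvol_read()/zvol_write()/zvol_setup_queue() helpers are hypothetical stand-ins for the real zvol plumbing; zfs_request() is the entry point named in the commit message.

```c
/*
 * Minimal sketch of the approach (not the actual patch): register our
 * own ->make_request() handler so bios are serviced in the submitting
 * thread's context instead of being queued, elevator-sorted, and then
 * completed from an interrupt context.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>

typedef struct zvol_state {
	struct request_queue *zv_queue;	/* the zvol's request queue */
	/* DMU handles, volblocksize, etc. omitted */
} zvol_state_t;

/* Placeholder IO paths; the real code would issue DMU reads/writes. */
static int zvol_write(zvol_state_t *zv, struct bio *bio) { return (0); }
static int zvol_read(zvol_state_t *zv, struct bio *bio) { return (0); }

static void
zfs_request(struct request_queue *q, struct bio *bio)
{
	zvol_state_t *zv = q->queuedata;
	int error;

	/*
	 * Runs synchronously in the submitting thread's context: no
	 * per-zvol spinlock, no IO elevator pass, and page-out writes
	 * naturally block here instead of piling up asynchronously.
	 */
	if (bio_data_dir(bio) == WRITE)
		error = zvol_write(zv, bio);
	else
		error = zvol_read(zv, bio);

	bio_endio(bio, error);
}

static void
zvol_setup_queue(zvol_state_t *zv)
{
	/* Bypass blk_queue_bio(), the kernel's generic handler. */
	blk_queue_make_request(zv->zv_queue, zfs_request);
}
```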

@FransUrbo (Contributor)

So in essence, we're going to get [some,much,huge?] improvements on ZVOLs?

@ryao (Contributor, Author) commented Jul 10, 2014

@FransUrbo I expect zvol performance to improve, but I have not done performance tests to confirm/quantify it. The purpose of these changes was to refactor zvols into something that is compatible with the Illumos taskq import that I am developing. The potential for improved performance is a bonus.

Also, I just pushed a small revision to fix a typo in a commit message. The code in the pull request has not changed.

@kernelOfTruth (Contributor)

This sounds like it might also improve and cut down latency for desktop systems (e.g. during backup jobs which unavoidably need to run while you are working with the system). ZFSOnLinux already feels snappier than Btrfs, which even with the latest code still causes some huge latency spikes and lags.

nice !

@behlendorf (Contributor)

@ryao this looks promising. Let's get some random and sequential performance tests run for 4k/1m block sizes with the old and new code. Then we'll be able to say something about performance. We'll also want to run a stress test against the block device to check correctness. Using the zvol to back another pool or a btrfs filesystem would be a good sanity check.

@ryao (Contributor, Author) commented Jul 10, 2014

@behlendorf I used it to back a UFS2 filesystem, populated it with files, unmounted it and ran fsck. These changes did not cause any problems. I will take some time on the weekend to torture test ZFS on a zvol and then see if it passes a scrub. I will try to include some performance testing with the torture testing.

@ryao (Contributor, Author) commented Jul 10, 2014

@kernelOfTruth If you are not doing anything that involves zvols, this will not have any effect. If you are using zvols, this might improve responsiveness. The current zvol code disables kernel preemption by doing processing in an interrupt context. Disabling kernel preemption is considered to be bad for interactive response, so this change should be a good thing with respect to that.

I expect the biggest change to occur with swap on zvols under memory pressure. Right now, it is possible for swap on zvols to deadlock, and we have not had a clear picture of why. It occurred to me while refactoring the code that the asynchronous nature of the current zvol code is a key difference between ZFSOnLinux and ZFS on Illumos in terms of how swap/page-out on zvols works: on Illumos, blocking operations are used on zvols, while on Linux we have had asynchronous operations on zvols.

The consequence for page out is that the Linux kernel will race through its page out routines when swap is on a zvol by moving pages from userland into the zvol's request queue far faster than they can travel to disk. That has a feedback effect where the page out routines worsen pressure and then respond to the increase in pressure by trying even harder. That is contrary to how pageout is supposed to work.

An analogy would be to imagine a sinking ship where crews are trying to pump water off the ship. The way the current zvol code works, only one team is pumping water off the ship while all the others are pumping into that team's cabin. If the team pumping water out of the ship drowns because the others filled the cabin faster than it could be emptied, everyone drowns.

I believe that the asynchronous zvol operations have a similar effect on page out. This could explain how we manage to deadlock a system under prolonged pressure when swap is on zvols. Specifically, the page out routines gobble up so much of the system's emergency memory that ZIO processing can no longer operate and things deadlock. In addition, the asynchronous zvol operations under memory pressure should schedule far more for page out than necessary, which would worsen lags caused by page faults.

I am already using this code on my production workstation with no ill effect. My feeling is that these changes will make swap on zvols stable enough for production deployment, but I have yet to test this theory because my workstation has 64GB of RAM. I suspect desktop workloads under constrained memory would be a good way of gauging the effect, but it is hard to do that quantitatively and my time is limited. If others beat me to evaluating the effect of these patches on desktop workloads, I would be happy to hear their results.

@IronWolve

I was just going to open a bug about zvols and a runaway spl_kmem_cache thread. I created a zvol, partitioned it, then started to format it. During the format it gets a few inodes done and then the CPU spikes across all CPUs. CentOS 6.5 box, latest patches and ZFS release. I was going to use zvol+FC because my NFS speeds are horrible.

15:23:41 root@uszfs004: ~ # top -b -n1 | head -14
top - 15:24:29 up 42 min, 2 users, load average: 68.05, 33.87, 19.08
Tasks: 758 total, 2 running, 754 sleeping, 1 stopped, 1 zombie
Cpu(s): 0.1%us, 5.0%sy, 0.0%ni, 91.1%id, 3.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132142108k total, 64291620k used, 67850488k free, 20946480k buffers
Swap: 4194296k total, 0k used, 4194296k free, 84260k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3012 root 39 19 0 0 0 R 100.0 0.0 12:26.49 spl_kmem_cache/

@ryao (Contributor, Author) commented Jul 10, 2014

It is worth noting that there are opportunities for further improvement in this area. Specifically, we could implement zero copy by implementing the file_operations and inode_operations handlers. We would still need a copy when there is compression, but copying would be reduced to the absolute minimum necessary. I suspect that we would need to rewrite the underlying ZIO layer to use bvecs instead of slab-backed buffers to do that; otherwise, we would still have double buffering on mmap(). I believe that certain aspects of Linux will still want to use the request handler, so we would need to take great care to ensure that there are no interfaces that would allow them to access it through the buffer cache, in order to maintain cache coherence. That kind of improvement would be a variation on the sgbuf idea that I plan to revisit when I find time.
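
Purely as an illustration of that zero-copy idea (nothing like this exists in the patch; every handler name here is hypothetical), the zvol device node would get its own file_operations so reads and writes could bypass the block layer's buffering:

```c
/*
 * Hypothetical sketch only; not part of this pull request. A
 * zvol-specific file_operations table would intercept read/write
 * before the buffer cache is involved.
 */
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/module.h>

static ssize_t
zvol_fop_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
	/* Zero-copy path: map DMU/ARC buffers to user pages directly. */
	return (-EOPNOTSUPP);	/* placeholder */
}

static ssize_t
zvol_fop_write(struct file *filp, const char __user *buf, size_t len,
    loff_t *ppos)
{
	/* Compressed datasets would still require one copy here. */
	return (-EOPNOTSUPP);	/* placeholder */
}

static const struct file_operations zvol_fops = {
	.owner	= THIS_MODULE,
	.read	= zvol_fop_read,
	.write	= zvol_fop_write,
	/* .mmap would still double-buffer until ZIO uses bvecs. */
};
```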

@ryao (Contributor, Author) commented Jul 10, 2014

@IronWolve Were you using the code in this pull request? If not, then you should file a separate report. It would be helpful if you included a backtrace of the looping thread and instructions on how to reproduce the problem in your report.

@kernelOfTruth (Contributor)

@ryao if I understood http://zfsonlinux.org/example-zvol.html correctly, I'm indeed heavily using zvols (several "virtual block device[s]" on top of my mirrored /home pool, nested within each other; some of them hold subfolders with large files, some with several small files)

probably will give this a try within the next few days

Thanks !

@kernelOfTruth (Contributor)

Currently using this with #2129, #2291, #2408 and #2484.

I had to make some modifications to dmu.c and vdev_disk.c because the patch would otherwise fail to apply; I hope the changes are correct:

http://pastebin.com/yzW0LghN

The whole diff with the two modified files cleanly applied:

http://pastebin.com/ShFW0wze

@ryao (Contributor, Author) commented Jul 14, 2014

It turns out that this is incompatible with older kernels because a typedef was changed:

Here is the typedef before 3.2:

typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);

Here is the typedef in and after 3.2:

typedef void (make_request_fn) (struct request_queue *q, struct bio *bio);

I need to work out a sane way of handling this with autotools before it can be merged.
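
One conventional way to paper over the typedef change is a pair of compatibility macros driven by a configure check. A sketch follows; HAVE_MAKE_REQUEST_FN_RET_INT and the MAKE_REQUEST_FN_* macros are hypothetical names, not something this tree necessarily defines:

```c
/*
 * Sketch of a compatibility shim for the 3.2 typedef change.
 * HAVE_MAKE_REQUEST_FN_RET_INT is a hypothetical symbol that an
 * autotools compile test against the target kernel would define.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>

#ifdef HAVE_MAKE_REQUEST_FN_RET_INT
#define	MAKE_REQUEST_FN_RET		int	/* kernels before 3.2 */
#define	MAKE_REQUEST_FN_RETURN(x)	return (x)
#else
#define	MAKE_REQUEST_FN_RET		void	/* kernels 3.2 and later */
#define	MAKE_REQUEST_FN_RETURN(x)	return
#endif

static MAKE_REQUEST_FN_RET
zfs_request(struct request_queue *q, struct bio *bio)
{
	/* Service the bio as before, then return per the kernel's ABI. */
	bio_endio(bio, 0);
	MAKE_REQUEST_FN_RETURN(0);
}
```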

@ryao (Contributor, Author) commented Jul 15, 2014

@kpande There is something strange happening in your build environment. This pull request removes rq_for_each_segment4(), but your build failure indicates that the code is calling it.

@ryao (Contributor, Author) commented Jul 15, 2014

@kpande Pull from my git. This is in the zvol branch.

@FransUrbo (Contributor)

@ryao Have you noticed/heard of any possible problems with this?

I built latest master with this (and some other pull requests that I want/need):

After creating a new pool from scratch, I then did a send+receive from my dump pool on a FreeNAS9 box to this new pool. It had been running for a couple of days (it should have been finished this morning). When I checked the machine today, I saw a hard lock; I rebooted and have since tried to import+mount, unsuccessfully.

With

options zfs zvol_inhibit_dev=1
options zfs zvol_threads=64
options zfs zfs_autoimport_disable=1

I can import the pool (using -N), but any use of zfs (even something as simple as zfs list) gives that lock again. SysRq doesn't work, there is nothing in the logs, and the screen isn't big enough to show the full trace... I've included a pic of what I could see.

The only merge other than ZoL/master (very latest) that could have anything to do with this is your pull request...

I'm suspecting this PR (unless it's something with master, but then others should have reported problems?) only because I've seen references to zvol_*(); when NOT setting zvol_inhibit_dev, there's a lot of zvol_create_minors() etc.

After a while, I figured I'd remove this PR, but the problems persist. Just for completeness, I've even tried a completely clean ZoL build. It didn't help: a hard lock a few seconds after running zfs list | less.

I'm not completely ruling out a problem with the pool/data. It was sent from my (quite broken) pool to the FreeNAS9 box (which took a couple of weeks because the pool was so broken; FreeNAS9 doesn't seem to have any problems with it though, I can access everything there), and I'm now trying to send it back after recreating it. Unless you can come up with an idea (even a vague one would work :), I'll have to recreate the pool again (clean build) and try again. That will take a few days though, so I'm hoping you have an idea or two :)

[screenshot: partial stack trace]

@ryao (Contributor, Author) commented Aug 10, 2014

@FransUrbo That looks like a pre-existing problem in the send/recv code. I believe that #2338 might help, but whatever is happening is not fully understood right now. Are you certain that this problem consistently occurs with these patches included and never without them?

@FransUrbo (Contributor)

@ryao Ok, thanx.

However:

> Behavior of zfs send -R is not changed.

That's what I used (and I forgot to add -p when I dumped the pool to the FreeNAS9 machine). So what are my options now?

I see that #2338 might be missing part of the Illumos commit. Will that be a problem?

@FransUrbo (Contributor)

> whatever is happening is not fully understood right now.

Would access to my machines help?

> Are you certain that this problem consistently occurs with these patches included and never without them?

What?

@ryao (Contributor, Author) commented Feb 3, 2015

My curiosity got the better of me. I wrote a fix for the 2.6.32 issue and pushed it for buildbot testing. If all goes well, this should be safe to merge. I do not recommend trying this in production until the buildbot results are back, because it has not been subjected to my own regression tests.

@ryao force-pushed the zvol branch 3 times, most recently from 205e9a8 to 1b5ca81 on February 3, 2015
The main impetus for this patch is the Illumos taskq import. The ZoL
zvol code operated in an interrupt context to take advantage of the IO
elevator and the SPL taskq code was (ab)used to form the bottom half.
The SPL taskq code was designed to be callable in an interrupt context
while the Illumos taskq code was designed to expect the assistance of
the thread scheduler. Since Linux's thread scheduler does not
provide that assistance, we must trampoline through the system workqueue
to continue providing the bottom half when using the Illumos taskq code.
This seems unnecessarily complex when we consider that Illumos itself
does not perform this emulation and the recent multiqueue work in Linux
has identified interrupt processing as a major bottleneck on Linux:

http://kernel.dk/systor13-final18.pdf

The entry point for zvol operations is zfs_request(). That is called in
an interrupt context by blk_queue_bio(), which is the kernel's generic
->make_request() handler. Registering our own handler allows us to avoid
this. Comments in the code suggest that this was done to gain the
benefit of having Linux's IO elevator do merging. However, the DMU
already does write merging while arc_read() should implicitly merge read
IOs because only one thread is permitted to fetch the buffer. In addition,
the main consumers of zvols are VMs and iSCSI shares, which have their
own elevators to merge IOs. Moreover, commercial ZFSOnLinux
distributions report that regular files are more performant than zvols.
That strongly suggests that the original rationale for using Linux's
generic ->make_request() handler was invalid.

Some minor refactoring allows us to register zfs_request() as our
->make_request() handler. This eliminates an obstacle to importing the
Illumos taskq code while giving us several additional benefits:

1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs under heavy load.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.

It is worth noting that the previous zvol code would redispatch IOs when
all zio taskq threads were in use, with the consequence that IOs could
be redispatched and, potentially, reordered across barriers. This
explains an obscure IRC report about corruption on zvol-backed iSCSI
shares triggered by a kernel panic during heavy load. It was thought
that the initiator did something wrong, but now it is clear that the
zvol code itself could reorder IOs, which explains the incident.

Signed-off-by: Richard Yao <ryao@gentoo.org>
ryao added 10 commits February 3, 2015 15:08
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
This autotools check was never needed because we can check for the
existence of QUEUE_FLAG_DISCARD in the kernel headers.

Signed-off-by: Richard Yao <ryao@gentoo.org>
This autotools check was never needed because we can check for the
existence of QUEUE_FLAG_NONROT in the kernel headers.

Also, the comment in config/kernel-blk-queue-nonrot.m4 is incorrect.
This was a Linux 2.6.28 API change, not a Linux 2.6.27 API change.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be
processed, such that we cannot do optimizations like block alignment.
Consequently, the discard semantics prior to 2.6.36 require us to always
process unaligned discards. Previously, we would do this optimization
regardless. This patch changes things to correctly restrict this
optimization to situations where REQ_SECURE exists, but is not included
in the flags.

Signed-off-by: Richard Yao <ryao@gentoo.org>
A contributor commented on the diff, at these lines in zvol_discard():

    }
    #ifdef
I think this is supposed to be #endif for REQ_SECURE, and GCC seems to agree that this is a problem. The strange thing is, the compile-time error only popped up during the DKMS build on one host (several others built fine, which is now slightly concerning). Everything runs the same kernel; I'm not sure why this one tripped it.

@behlendorf modified the milestones: 0.6.5, 0.6.4 on Feb 5, 2015
@behlendorf (Contributor)

Bumping to 0.6.5. If we can get it done for 0.6.4 that would be great, but it shouldn't hold up the release.

@sempervictus (Contributor)

Initial testing is quite encouraging, though I'm still baffled by what is now a consistent build failure on our patch stack. It seems this doesn't play well with the large block and ZIO optimization PRs (#2865 and #3063), which really means it should never have built in the first place on the systems where it does work.

iSCSI and local zvol-backed random IO load from multiple VMs (including running on-host for the local tests, with all the contention involved) shows a ~15% overall drop in iowait for the same workload. When running random greps through VM filesystems searching for strings which will never match, the reduction in wait is closer to 30%, which seems a bit odd, but I'm not complaining.

Thank you @ryao. I'll get this running on a few test hosts before I try to back any of our production SCST hosts with this, but I highly encourage anyone with resources to put it through its paces. Last time we ran this patch stack we ended up seeing issues pretty late in the game, like on resilvering and other unusual operations, so thorough testing is highly recommended.

@devonbessemer

I don't have much experience with git. Can someone guide me on how I can test the latest commit (2.6.32 fix)? I have a few CentOS 6.6 dev boxes.

@FransUrbo (Contributor)

@ryao Fix for commit f113911 (Support secure discard on zvols):

diff --git a/module/zfs/zvol.c b/module/zfs/zvol.c
index e182550..1e8ba38 100644
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ -641,11 +641,11 @@ zvol_discard(struct bio *bio)
         * 2.6.35) will not receive this optimization.
         */
 #ifdef REQ_SECURE
-       if (!(bio->bi_rw & REQ_SECURE))
+       if (!(bio->bi_rw & REQ_SECURE)) {
                start = P2ROUNDUP(start, zv->zv_volblocksize);
                end = P2ALIGN(end, zv->zv_volblocksize);
        }
-#ifdef
+#endif

        if (start >= end)
                return (0);

@kernelOfTruth (Contributor)

@ryao any news or changes on this ?

@behlendorf removed this from the 0.6.5 milestone on Apr 24, 2015
@behlendorf (Contributor)

Yes, this change is definitely going to need to be updated and have some additional review and discussion. @ryao I'm going to close this out for now; please open a new pull request when you have some time to revisit this.

@behlendorf closed this on Apr 24, 2015
@ryao (Contributor, Author) commented May 21, 2015

@kpande Are these issues on recent kernels or just 2.6.32?

@FransUrbo I just saw your comment now. Thanks for catching that. It was a last-ditch attempt to improve this before I put it on the back burner. I just rebased this on head and pushed it to my branch. I haven't had a chance to do testing though.

@predragmandic

@ryao in your patch at this part in zvol.c:

    +#ifdef REQ_SECURE
    +   if (!(bio->bi_rw & REQ_SECURE))
    +       start = P2ROUNDUP(start, zv->zv_volblocksize);
    +       end = P2ALIGN(end, zv->zv_volblocksize);
    +   }
    +#ifdef

I think you are missing '{' at the start of the block, and the '#ifdef' at the end should be '#endif'.

Labels: Type: Performance