zvol rework #2484
Conversation
So in essence, we're going to get [some, much, huge?] improvements on ZVOLs?
@FransUrbo I expect zvol performance to improve, but I have not done performance tests to confirm/quantify it. The purpose of these changes was to refactor zvols into something that is compatible with the Illumos taskq import that I am developing. The potential for improved performance is a bonus. Also, I just pushed a small revision to fix a typo in a commit message. The code in the pull request has not changed.
This sounds like it might also improve and cut down latency for desktop systems (e.g. during backup jobs which unavoidably need to be done while working with the system - currently ZFSonLinux already feels snappier than Btrfs, which even with the latest code still causes some huge latency spikes and lags). Nice!
@ryao this looks promising. Let's get some random and sequential performance tests run for 4K/1M block sizes with the old and new code. Then we'll be able to say something about performance. We'll also want to run a stress test against the block device to check correctness. Using the zvol to back another pool or a btrfs filesystem would be a good sanity check.
@behlendorf I used it to back a UFS2 filesystem, populated it with files, unmounted it and ran fsck. These changes did not cause any problems. I will take some time on the weekend to torture test ZFS on a zvol and then see if it passes a scrub. I will try to include some performance testing with the torture testing.
@kernelOfTruth If you are not doing anything that involves zvols, this will not have any effect. If you are using zvols, this might improve responsiveness. The current zvol code disables kernel preemption by doing processing in an interrupt context. Disabling kernel preemption is considered to be bad for interactive response, so this change should be a good thing in that respect.

I expect the biggest change to occur with swap on zvols under memory pressure. Right now, it is possible for swap on zvols to deadlock, and we did not have a clear picture why. It occurred to me when I was refactoring the code that the asynchronous nature of the current zvol code is a key difference between ZFSOnLinux and ZFS on Illumos in terms of how swap/page-out on zvols works. On Illumos, blocking operations are used on zvols, while on Linux we had asynchronous operations on zvols. The consequence for page-out is that the Linux kernel will race through its page-out routines when swap is on a zvol, moving pages from userland into the zvol's request queue far faster than they can travel to disk. That has a feedback effect where the page-out routines worsen pressure and then respond to the increase in pressure by trying even harder, which is contrary to how page-out is supposed to work.

An analogy would be a sinking ship where crews are trying to pump water off the ship. The way the current zvol code works, only one team is pumping water off the ship while all the others are pumping into that team's cabin. If the team pumping water out of the ship suffocates because the others fill the cabin faster than it can be emptied, everyone drowns. I believe that the asynchronous zvol operations have a similar effect on page-out. This could explain how we manage to deadlock a system under prolonged pressure when swap is on zvols. Specifically, the page-out routines gobble up so much of the system's emergency memory that ZIO processing can no longer operate and things deadlock. In addition, the asynchronous zvol operations under memory pressure should schedule far more for page-out than necessary, which would worsen lags caused by page faults.

I am already using this code on my production workstation with no ill effect. My feeling is that these changes will make swap on zvols stable enough for production deployment, but I have yet to test this theory because my workstation has 64GB of RAM. I suspect desktop workloads under constrained memory would be a good way of gauging the effect, but it is hard to do that quantitatively and my time is limited. If others beat me to evaluating the effect of these patches on desktop workloads, I would be happy to hear their results.
I was just about to open a bug about zvol and spl_kmem_cache running away. I created a zvol, partitioned it, then started to format it. During the format it gets a few inodes done and then the CPU spikes across all CPUs. CentOS 6.5 box, latest patches and ZFS release. I was going to use zvol+FC because my NFS speeds are horrible.
15:23:41 root@uszfs004: ~ # top -b -n1 | head -14
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
It is worth noting that there are opportunities for further improvements in this area. Specifically, we could implement zero copy by implementing the file_operations and inode_operations handlers. We would still need a copy when there is compression, but copying would be reduced to the absolute minimum necessary. I suspect that we would need to rewrite the underlying ZIO layer to use bvecs instead of slab-backed buffers in order to do that; otherwise, we would still have double buffering on mmap(). I believe that certain aspects of Linux will still want to use the request handler, so we would need to take great care to ensure that there are no interfaces that would allow them to access it through the buffer cache, in order to maintain cache coherence. That kind of improvement would be a variation on the sgbuf idea that I plan to revisit when I find time.
@IronWolve Were you using the code in this pull request? If not, then you should file a separate report. It would be helpful if you included a backtrace of the looping thread and instructions on how to reproduce the problem in your report.
@ryao If I understood http://zfsonlinux.org/example-zvol.html correctly, I'm indeed heavily using zvols (several "virtual block device[s]" on top of my mirrored /home pool, nested within each other - some of them with subfolders containing large files, some with several small files). I'll probably give this a try within the next few days. Thanks!
Currently using this with 2129, 2291, 2408 and 2484. I had to make some modifications to dmu.c and vdev_disk.c as the patch would otherwise fail to apply - I hope the changes are correct. The whole diff, with the 2 modified files cleanly added:
It turns out that this is incompatible with older kernels because a typedef was changed in Linux 3.2.
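Assuming the typedef in question is the block layer's make_request_fn (which is what zfs_request() has to match), the two definitions look roughly like this:

```c
/* Before Linux 3.2: a ->make_request handler returns int. */
typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);

/* Linux 3.2 and later: the return value was dropped. */
typedef void (make_request_fn) (struct request_queue *q, struct bio *bio);
```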
I need to work out a sane way of handling this with autotools before it can be merged. |
@kpande There is something strange happening in your build environment. This pull request removes rq_for_each_segment4(), but your build failure indicates that the code is calling it.
@kpande Pull from my git. This is in the zvol branch.
@ryao Have you noticed/heard of any possible problems with this? I built latest master with this (and some other pull requests that I want/need):
After creating a new pool from scratch, I then did a send+receive from my dump pool on a FreeNAS9 to this new pool. It's been running for a couple of days (should have been finished this morning). With …

I can import the pool (using …). The only merge other than ZoL/master (very latest) that could have anything to do with this is your pull request... I'm suspecting this PR (unless it's something with master, but then others should have reported problems?) only because I've seen references to zvol_*() when NOT setting ….

After a while, I figured I'd remove this PR, but the problems persist. Just for completeness, I've even tried a completely clean ZoL build. Didn't help - hard lock a few seconds after running ….

I'm not completely ruling out a problem with the pool/data. It was sent from my (quite broken) pool to the FreeNAS9 (which took a couple of weeks because it was so broken - the FreeNAS9 doesn't seem to have any problems though, I can access everything there) and I'm now trying to send it back after recreating it. Unless you can come up with an idea (even a vague one would work :), I'll have to recreate it again (clean build) and try again. This will take a few days though, so I'm hoping you have an idea or two :)
@FransUrbo That looks like a pre-existing problem in the send/recv code. I believe that #2338 might help, but whatever is happening is not fully understood right now. Are you certain that this problem consistently occurs with these patches included and never without them?
My curiosity got the better of me. I wrote a fix for the 2.6.32 issue and pushed it for buildbot testing. If all goes well, this should be safe to merge. I do not recommend trying this in production until the buildbot results are back because it has not been subjected to my own regression tests.
Commits updated: 205e9a8 to 1b5ca81.
The main impetus for this patch is the Illumos taskq import. The ZoL zvol code operated in an interrupt context to take advantage of the IO elevator, and the SPL taskq code was (ab)used to form the bottom half. The SPL taskq code was designed to be callable in an interrupt context, while the Illumos taskq code was designed to expect the assistance of the thread scheduler. Since Linux's thread scheduler does not provide that assistance, we must trampoline through the system workqueue to continue providing the bottom half when using the Illumos taskq code. This seems unnecessarily complex when we consider that Illumos itself does not perform this emulation and the recent multiqueue work in Linux has identified interrupt processing as a major bottleneck on Linux: http://kernel.dk/systor13-final18.pdf

The entry point for zvol operations is zfs_request(), which is called in an interrupt context by blk_queue_bio(), the kernel's generic ->make_request() handler. Registering our own handler allows us to avoid this. Comments in the code suggest that this was done to gain the benefit of having Linux's IO elevator do merging. However, the DMU already does write merging, while arc_read() should implicitly merge read IOs because only one thread is permitted to fetch the buffer. In addition, the main consumers of zvols are VMs and iSCSI shares, which have their own elevators to merge IOs, and commercial ZFSOnLinux distributions report that regular files are more performant than zvols. That strongly suggests that the original rationale for using Linux's generic ->make_request() handler was invalid.

Some minor refactoring allows us to register zfs_request() as our ->make_request() handler. This eliminates an obstacle to importing the Illumos taskq code while giving us several additional benefits:

1. No per-zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs under heavy load.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.

It is worth noting that the previous zvol code would redispatch IOs when all zio taskq threads were in use, with the consequence that IOs could be redispatched and potentially reordered across barriers. This explains an obscure IRC report about corruption on zvol-backed iSCSI shares triggered by a kernel panic during heavy load. It was thought that the initiator did something wrong, but now it is clear that the zvol code itself could reorder, which explains the incident.

Signed-off-by: Richard Yao <ryao@gentoo.org>
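As a rough sketch only (using the post-3.2 void-returning signature; the actual patch has more plumbing around the zvol state), the core of the change is allocating a bare queue and installing zfs_request() as its ->make_request handler:

```c
#include <linux/blkdev.h>

/*
 * Sketch of the ->make_request handler: the real function hands the bio
 * to the DMU and completes it with bio_endio(), never building request
 * structs or touching the IO elevator.
 */
static void
zfs_request(struct request_queue *q, struct bio *bio)
{
	/* dispatch the bio to the DMU here */
}

static struct request_queue *
zvol_alloc_queue_sketch(void)
{
	struct request_queue *q;

	q = blk_alloc_queue(GFP_KERNEL);
	if (q == NULL)
		return (NULL);

	/* Bios submitted to this queue go straight to zfs_request(). */
	blk_queue_make_request(q, zfs_request);
	return (q);
}
```

With a handler registered this way, blk_queue_bio() is never involved for zvols, which is what makes the per-zvol spinlocks and the redundant elevator processing listed above unnecessary.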
This autotools check was never needed because we can check for the existence of QUEUE_FLAG_DISCARD in the kernel headers. Signed-off-by: Richard Yao <ryao@gentoo.org>
This autotools check was never needed because we can check for the existence of QUEUE_FLAG_NONROT in the kernel headers. Also, the comment in config/kernel-blk-queue-nonrot.m4 is incorrect. This was a Linux 2.6.28 API change, not a Linux 2.6.27 API change. Signed-off-by: Richard Yao <ryao@gentoo.org>
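Both of the checks above reduce to plain preprocessor tests, since QUEUE_FLAG_DISCARD and QUEUE_FLAG_NONROT are ordinary macros in linux/blkdev.h. A minimal sketch (the helper name is made up for illustration):

```c
#include <linux/blkdev.h>

/* Set the optional queue flags only when the kernel headers define them. */
static inline void
zvol_set_queue_flags_sketch(struct request_queue *q)
{
#ifdef QUEUE_FLAG_DISCARD
	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
#endif
#ifdef QUEUE_FLAG_NONROT
	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
#endif
}
```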
Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be processed, such that we cannot do optimizations like block alignment. Consequently, the discard semantics prior to 2.6.36 require us to always process unaligned discards. Previously, we would do this optimization regardless. This patch changes things to correctly restrict this optimization to situations where REQ_SECURE exists, but is not included in the flags. Signed-off-by: Richard Yao <ryao@gentoo.org>
}
#ifdef
I think this is supposed to be #endif for REQ_SECURE, and GCC seems to agree that this is a problem. Strange thing is, the compile-time error only popped up during the DKMS build on one host (several others built, which is now slightly concerning). Everything runs the same kernel, not sure why this one tripped it.
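For context, here is a sketch of what the hunk is presumably aiming for (identifiers are illustrative, not the exact patch): the partial-block alignment optimization is applied only when REQ_SECURE exists in the kernel headers and is not set on the bio, while kernels without REQ_SECURE (before 2.6.36) always process unaligned discards.

```c
#include <sys/types.h>
#include <sys/sysmacros.h>
#include <linux/bio.h>

/*
 * Trim the discard range to whole blocks only when it is safe to skip the
 * unaligned head and tail, i.e. when this is not a secure discard.
 */
static void
zvol_discard_align_sketch(struct bio *bio, uint64_t blocksize,
    uint64_t *start, uint64_t *end)
{
#ifdef REQ_SECURE
	if (!(bio->bi_rw & REQ_SECURE)) {
		*start = P2ROUNDUP(*start, blocksize);
		*end = P2ALIGN(*end, blocksize);
	}
#endif
}
```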
Bumping to 0.6.5. If we can get it done for 0.6.4 that would be great, but it shouldn't hold up the release.
Initial testing is quite encouraging, though I'm still baffled by what is now a consistent build failure on our patch stack - it seems this doesn't play well with the large block and ZIO optimization PRs (#2865 and #3063), which really means it should never have built in the first place on the systems where it does work. iSCSI and local zvol-backed random IO load from multiple VMs (including running on-host for the local tests, with all the contention involved) shows a ~15% overall drop in IO wait for the same workload. When running random greps through VM filesystems searching for strings which will never match, the reduction in wait is closer to 30%, which seems a bit odd, but I'm not complaining. Thank you @ryao. I'll get this running on a few test hosts before I try to back any of our production SCST hosts with this, but I highly encourage anyone with resources to put this through its paces. Last time we ran this patch stack we ended up seeing issues pretty late in the game - like on resilvering and other unusual operations - so thorough testing is highly recommended.
I don't have much experience with git. Can someone guide me on how I can test the latest commit (the 2.6.32 fix)? I have a few CentOS 6.6 dev boxes.
@ryao fix in commit f113911 (Support secure discard on zvols):
@ryao any news or changes on this?
Yes, this change is definitely going to need to be updated and have some additional review and discussion. @ryao I'm going to close this out for now, please open a new pull request when you have some time to revisit this.
@kpande Are these issues on recent kernels or just 2.6.32? @FransUrbo I just saw your comment now. Thanks for catching that. It was a last-ditch attempt to improve this before I put it on the back burner. I just rebased this on head and pushed it to my branch. I haven't had a chance to do testing though.
@ryao in your patch, at this part in zvol.c:
I think you are missing a '{' at the start of the block, and an '#endif' instead of the '#ifdef' at the end.
The Illumos taskq import that I have been developing is incompatible with the current zvol code due to how Linux handles interrupts. This refactors the zvol code to stop processing IOs in interrupts. See the commit message of ryao/zfs@1d22dbd for a full description of the benefits. To summarize, we see the following benefits from this change:

1. No per-zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs under heavy load.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.
In addition, this pull request includes ryao/zfs@4ed7d71, which is a fix for a potentially rare deadlock in ZIO processing that became readily reproducible in the presence of this change.