Lock contention on arcs mtx #3115
Conversation
I'll throw this on a testbed not running the updated ABD stack (that PR may limit the scope of testing here, since people who run strange things in the night tend to use #2129). Thank you very much for getting this ticking so quickly.
@dweeezil FYI, I just opened a review to fix arcstat in the upstream illumos code: https://reviews.csiden.org/r/164/
Branch updated: cce835a to 175f413.
Quick update: My initial benchmarking with fio (using the same test as @prakashsurya did under illumos) has not shown the expected performance increase as the number of cores increases. I was concerned I was running into memory bandwidth issues, so I did a round of pmbw tests to get a handle on the best-case random-read memory bandwidth. It looks like I'm likely not running into the absolute memory bandwidth limit yet, but it's difficult to tell because the memory bandwidth drops so dramatically once the various caches' sizes are exceeded (I'm looking at the pmbw graphs). While trying to get this running on a lockstat-enabled kernel (to see where there might still be a bottleneck), I realized that all the tracepoint stuff for the ARC was broken by the illumos 5408 patch, which means I've got to go back to that patch and overhaul the ARC tracepointing. I'll post more information as it becomes available.
Ugh, that's going to be a constant headache moving forward. I think we'll need to make a regression test to make sure the tracepoints remain intact as the code continues to evolve, but that's work for another issue.
In some sense we already have one. The CentOS 7 debug builder enables tracepoints, so it will fail if the tracepoint code isn't right. What we really need is to eradicate the remaining known test failures so that when a builder fails, people immediately look into why. http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-debug-builder/builds/802
@behlendorf What are your thoughts on handling tracepoints for the newly-structured
@dweeezil I think we should keep it simple and fast and assign the useful bits of both
I agree with @behlendorf,
I don't really have a preference, so I'm cool with either way; "keep it simple" is usually my guidance unless there's a compelling reason not to. And glad to hear we at least have something to try and catch issues introduced with the tracepoints. 👍
I'm posting this as a pull request now so that it gets some wider review and the buildbots get a chance to chew on it. I've given it some pretty intense testing, mainly with fio, over the past few days and there don't seem to be any obvious regressions. I still consider this a work in progress. The dbufs kstat still needs to be fixed, but I don't expect that to be terribly difficult.
Referencing/posting Illumos #5497 here for easier discovery (e.g. via Google):
I installed this on my system running Fedora 21 with kernel 3.17.8-200.fc20.x86_64 after trying the patch in 3181 to resolve an issue with the system becoming unresponsive or crashing after rsync'ing mail directories with lots of small files. The system hasn't frozen up like it did in 3181, but as soon as my ARC got full, ZFS operations slowed to a crawl. It takes several seconds to list each file in a directory. There's almost no CPU in use now.
I've still not gotten a chance to give this a good workout and to test whether it's actually operating as intended. Unfortunately, while I was working on setting up a test environment to evaluate this patch, I came across the issue I just posted in #3183, and that has been keeping me distracted for the past week or so. During my initial testing of the loads this patch was designed to help with, I did not see any better scalability in read bandwidth versus a stock system. While trying to arrange a test environment in which I could get better measurements, I came across the issue mentioned above. And to be clear, this patch is not something that was going to help metadata-intensive loads much. The benchmark used by its author and by myself involves highly parallel 100% ARC-cached reads. To be clear, however, when I say "the patch" above, I'm referring only to 175f413, which is the last patch in the stack (and the subject of this pull request). The other prerequisite patches could certainly help other issues.
FYI:
kernelOfTruth@c76ca73 & kernelOfTruth@129c618
Not sure how to handle that one.
If I had known it would succeed, I would have done it on a clean base (d'oh! sorry about that). Anyway, it seems a waste of work to throw it away and start anew (perhaps I'll do so, but for now the buildbots can chew on it). Here's a branch/tree with #2129 and this patchset (#3115) on top: https://github.com/kernelOfTruth/zfs/commits/zfs_kOT_15.03.2015. Midway through I realized that I had to revert #3181, otherwise it would have been even more complicated. It compiled cleanly, and besides the above-mentioned message there were no error messages. Hope it helps.
When I compiled it I got the same message, and then changed buf->b_state to hdr->b_state since it was the only thing in scope that seemed correct.
@angstymeat it's actually
if you also use "5408 managing ZFS cache devices requires lots of RAM"; the definition or name changed there once again 😉 (provided of course I understood it correctly, but otherwise it wouldn't have compiled for me)
As an update, my currently-interrupted work on this stack requires going back to the "5408 managing ZFS cache devices requires lots of RAM" patch and updating our tracing infrastructure to properly deal with the newly-nested ARC structures. Once that update is complete, the rest of the patch stack will be rebased atop that work. |
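For context, the "newly-nested ARC structures" come from the 5408 header split, which moves the L1-only and L2-only state into sub-headers. The following is a hedged, abridged sketch of that layout (not copied from the tree, so field names and order may differ); it shows why code that used to read hdr->b_state, like the build error discussed a few comments above, now has to go through hdr->b_l1hdr.

```c
/*
 * Abridged sketch of the post-5408 header layout; reconstructed, not
 * verbatim.  Assumes the usual ZFS kernel headers (sys/arc.h, sys/spa.h,
 * sys/refcount.h) for arc_state_t, arc_buf_t, refcount_t, dva_t, etc.
 */
typedef struct l1arc_buf_hdr {
	arc_state_t	*b_state;	/* formerly arc_buf_hdr_t.b_state */
	arc_buf_t	*b_buf;		/* in-core buffer list */
	refcount_t	b_refcnt;
	uint32_t	b_datacnt;
	/* ... remaining in-core-only bookkeeping ... */
} l1arc_buf_hdr_t;

typedef struct l2arc_buf_hdr {
	l2arc_dev_t	*b_dev;		/* cache device this header lives on */
	uint64_t	b_daddr;	/* disk address on that device */
	/* ... remaining L2-only bookkeeping ... */
} l2arc_buf_hdr_t;

typedef struct arc_buf_hdr {
	dva_t		b_dva;		/* fields every header needs */
	uint64_t	b_birth;
	uint64_t	b_spa;
	uint32_t	b_flags;	/* records which sub-headers are valid */
	l2arc_buf_hdr_t	b_l2hdr;
	l1arc_buf_hdr_t	b_l1hdr;	/* not allocated for L2-only headers */
} arc_buf_hdr_t;
```

The same indirection is what the ARC tracepoints now have to follow, which is why they need the overhaul described here.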
I've been able to get my test system into the following unfortunate state (output courtesy of a hacked arcstat.py):
The problem is that the large block support bumped the maximum block size to 16MiB and the logic in the adapt thread does this:
Oops, if
This may very well be unrelated to the issue at hand, but I do seem to be able to stress the system rather easily and cause a bad collapse of
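For anyone puzzling over the truncated condition above: the adapt logic grows the target only when arc_size is within two maximum-sized blocks of arc_c, and that subtraction is unsigned. A small self-contained demonstration of the wrap-around follows; the check shape is a reconstruction and the numbers are made up, not taken from the thread.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define SPA_MAXBLOCKSHIFT	24	/* 16 MiB max block with large_blocks */

/*
 * Reconstructed shape of the grow-window check being discussed (may not be
 * verbatim): grow the target if we're within two max-sized blocks of it.
 * The subtraction is unsigned, so once arc_c drops below 32 MiB it wraps to
 * a huge value and the check can never pass again.
 */
static bool
arc_should_grow(uint64_t arc_size, uint64_t arc_c)
{
	return (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT));
}

int
main(void)
{
	/* made-up numbers: a collapsed target, with demand pushing past it */
	uint64_t arc_c = 20ULL << 20;		/* 20 MiB target */
	uint64_t arc_size = 30ULL << 20;	/* 30 MiB in use */

	/* prints "grow? no", even though arc_size already exceeds arc_c */
	printf("grow? %s\n", arc_should_grow(arc_size, arc_c) ? "yes" : "no");
	return (0);
}
```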
I'm testing now with dweeezil/zfs@4288ef8 to fix the non-expanding arc_c problem and will likely add that patch to the stack.
@behlendorf Back to the use of the system taskq again. One of the arcstats I added while testing is this:
and after causing a bunch of pruning, I see:
and:
so it looks like there are 3 perpetual waiters. Might there not be other things on the system taskq, such as the deadman, the prefetcher thread, etc.?
@prakashsurya it can be for a metadata-heavy workload. I think the right way to think about this is that the Linux prune functionality is directly analogous to the
I think one reasonable way to do this portably would be just to agree on a common ZFS-specific interface. That could be called
As for Lustre, I think that's a slightly different kettle of fish. You're right, they probably shouldn't be using this interface any more. But having something like this available is generally useful if you're considering building something which isn't a filesystem on top of the DMU. And these days there's increasing interest in doing just that.

@dweeezil I've refreshed my lock-contention-on-arcs_mtx branch which is based on your lock-contention-on-arcs_mtx branch. It depends on the changes in behlendorf/spl#453 and includes all the changes we discussed. In the end I just caved and went down the rabbit hole of implementing the TASKQ_DYNAMIC flag from illumos. This way we can make all the multi-threaded taskqs dynamic, restore the illumos default values, and not have a crazy number of threads on an idle system. openzfs/spl#455

I also added a patch to my stack which addresses the arc_prune() issues I saw for older kernels without per-filesystem shrinkers. With these changes in place the only issue I'm seeing is at cleanup time, where it looks like an

As for your 3 perpetual waiters, I'm not sure what's going on there. But in my updated patches I did create a dedicated arc_prune() taskq, which may shed some light on things.
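For anyone following along, a rough sketch of what a dedicated prune taskq looks like with the SPL taskq API. The thread counts, flags, function names, and worker body here are illustrative assumptions, not taken from the actual patch.

```c
#include <sys/taskq.h>
#include <sys/sysmacros.h>

static taskq_t *arc_prune_taskq;

/* illustrative worker; the real patch walks the registered prune callbacks */
static void
arc_prune_task(void *arg)
{
	int64_t nr_to_scan = (intptr_t)arg;

	/* call into the zpl/Lustre-supplied prune callback here */
	(void) nr_to_scan;
}

void
arc_prune_taskq_init(void)
{
	/* illustrative sizing: up to one thread per CPU, spawned on demand */
	arc_prune_taskq = taskq_create("arc_prune", max_ncpus, minclsyspri,
	    max_ncpus, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC);
}

static void
arc_prune_async(int64_t nr_to_scan)
{
	/* no longer competes with the deadman, prefetch, etc. on system_taskq */
	(void) taskq_dispatch(arc_prune_taskq, arc_prune_task,
	    (void *)(intptr_t)nr_to_scan, TQ_SLEEP);
}
```

The TASKQ_DYNAMIC flag is also what the memory-pressure discussion below is about: spawning a worker thread on demand requires an allocation, which is harmless here but risky on the reclaim path.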
@behlendorf I'll be re-testing today with your version of the stack (https://github.com/behlendorf/zfs/tree/lock-contention-on-arcs_mtx) and modified SPL, as well as an updated version of my
@dweeezil I've refreshed my spl and zfs branches with the following small changes. openzfs/spl#455
It appears the umount issue I referenced above isn't a new issue introduced by these changes. I was able to reproduce it on the master branch with a bit of work, so I don't consider that a blocker, although it needs to be explained. My intention is to run these patch stacks over the weekend and see how they hold up. So far things are looking promising.
@behlendorf I'm using your behlendorf/spl@a0f5e41 and behlendorf/zfs@af184e8. If I run my
Access to the ZFS filesystem is likewise blocked, as you might imagine. I'm going to reboot and see what's at
The
Unfortunately,
Deadlock.
Testing my theory, the following gross hack (and a similar one in the SPL shrinker) does fix the deadlock with
EDIT: Never mind, the deadlock just takes longer to appear.
EDIT[2]: Because it's a different deadlock, involving
I'm going to try locking down kswapd as well (at least there's a proper API for detecting it). Clearly dynamic taskqs add a layer of memory allocations which wouldn't have happened in ZoL in the past.
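The "proper API" here is presumably current_is_kswapd(). A minimal sketch of the kind of guard being described, not the actual hack from the thread; the function name is made up for illustration.

```c
#include <linux/swap.h>		/* current_is_kswapd() */
#include <linux/sched.h>	/* current, PF_MEMALLOC */

/*
 * Sketch only: decide whether it is safe to block (e.g. wait for a dynamic
 * taskq worker to be spawned) on the shrinker path.  kswapd and threads
 * already inside direct reclaim must never block on ZFS memory reclaim, or
 * the allocation needed to spawn the worker can deadlock against them.
 */
static int
arc_reclaim_may_block(void)
{
	if (current_is_kswapd())
		return (0);

	if (current->flags & PF_MEMALLOC)
		return (0);

	return (1);
}
```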
@behlendorf I'm going to continue testing with
…to be written to l2arc device

If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit e14bb3258d05c1b1077e2db7cf77088924e56919.

Also, consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for possible size reduction because of compression, whereas the former accounts for possible size expansion because of alignment requirements.

The code still assumes that either the underlying storage subsystem or the hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB-sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size.

Modified to account for the changes of openzfs#2129 (ABD), openzfs#3115 (Illumos - 5408 managing ZFS cache devices requires lots of RAM) and Illumos - 5701 zpool list reports incorrect "alloc" value for cache devices.
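To make the asize/psize distinction above concrete, here is a hedged illustration (not the actual patch) of how the write hand needs to advance by the allocated size; the helper name is made up.

```c
#include <sys/sysmacros.h>	/* P2ROUNDUP() */

/*
 * Illustration only: psize is the (possibly compressed) payload size, while
 * asize is what the cache device actually allocates once the buffer is
 * rounded up to the device's ashift.  l2ad_hand must advance by asize,
 * because l2arc_evict() reasons purely in terms of disk addresses; advancing
 * by psize lets later writes land on top of buffers that were never evicted.
 */
static inline uint64_t
l2arc_asize(uint64_t psize, uint64_t ashift)
{
	return (P2ROUNDUP(psize, 1ULL << ashift));
}

/*
 * e.g. a 5000-byte compressed buffer on a 4 KiB-sector (ashift=12) cache
 * device: l2arc_asize(5000, 12) == 8192, so the hand moves 8192, not 5000.
 */
```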
With
@dweeezil my testing survived all weekend as well. I was running high-thread-count versions of xdd, dbench, and iozone concurrently in a loop for ~2 days. No stack traces or lockups. Not too bad! Interestingly, I didn't observe the TASKQ_DYNAMIC lockups you hit. This was almost certainly because I wasn't pushing as hard on the meta limit. But this is clearly an issue introduced by the patch which needs to be addressed. The creation of dynamic threads was moved from
@behlendorf My 40-process 512GiB instance of dbench just passed 6 hours and ran fine the whole time (as expected). I also had
As to the dynamic taskqs, I do like their effect with lots of cores, especially on pools with a large number of top-level vdevs. I made a pool with 40 top-level vdevs once and, with the non-dynamic scheme, wound up with over 1600 (!) metaslab_group_taskq threads. Low-memory systems, however, can be trivially deadlocked if dynamic taskqs are enabled. I'm thinking the dynamic taskq feature ought to be its own patch in its own pull request anyway.
@dweeezil I think we're in good shape as well. I've opened #3481 with a finalized version of this patch stack with the following modifications.
I'll queue this patch stack up for another evening of testing. If everything looks good and you don't have any objections, I'd like to merge these changes tomorrow. Please review it one final time.
@dweeezil after addressing a minor build issue in the proposed patch stack, the patches held up well to my overnight stress testing. At this point I think they're ready to be merged, so let me know when you're happy and I'll merge them.
@behlendorf I'll get the stack (your -final branch) reviewed and give it a bit more testing today.
@behlendorf So far, it's looking good to me. I'll continue running tests throughout the rest of the day, but at this point, I'd say it's ready to merge. More than anything, I'd like to move on to investigate some of the other scalability bottlenecks I've discovered while working on this.
@dweeezil sounds good. I'll leave my tests running overnight as well, and if everything still looks good in the morning I'll get this merged so we can pursue the other improvements.
This is still looking good after overnight testing with fio and dbench. I ran some final numbers with @prakashsurya's original test. For reference, it's fio with the following configuration:
on a filesystem with
With current master code, there are 873419 contentions on
Unfortunately, as mentioned in previous issue comments, the overall performance for 8K block reads does not improve at all with this patch. There are clearly other ZoL-specific issues at play here. With this patch stack, the top source of lock contention becomes the
In summary, I think this patch stack is ready to go as-is.
That sounds VERY interesting. I guess I'll give this a spin against vanilla (if it's not merged before I have the opportunity to update my ZFS drivers). @dweeezil, maximal latency during "high" I/O (one rsync backup job to a btrfs volume) measured with latencytop was around 3000 ms (3 seconds!). Using a different CPU scheduler (BFS instead of CFS) also helps, but so far there have been some noticeable stalls, latency spikes, etc. when using ZFS, so I'm looking forward to the merge. Thanks!
@dweeezil my testing held up as expected too. Since these changes are ready to go, I've merged them to master. That's a nice big step forward. Now to address some of the lingering performance issues you uncovered. Thanks for all your hard work on this!
06358ea Merge branch 'lock-contention-on-arcs_mtx-final'
Woo! 👍 Nice work!! It's too bad this doesn't help the bottom-line performance numbers, but at least it moves the contention out of the ARC, which was the point. Reducing the average latency from milliseconds to tens of microseconds is good to hear! :)
EDIT: Oh, and let's not forget, this will also effectively give people that use an L2ARC device more RAM, since
@prakashsurya speaking of further slimming down the ARC header memory usage, it would be great to get your feedback on #1971. It pulls the