
Lock contention on arcs mtx #3115

Closed
wants to merge 7 commits

Conversation

dweeezil
Contributor

I'm posting this as a pull request now so that it gets some wider review and the buildbots get a chance to chew on it.

I've given it some pretty intense testing, mainly with fio, over the past few days and there don't seem to be any obvious regressions.

I still consider this a work-in-progress. The dbufs kstat still needs to be fixed but I don't expect that to be terribly difficult.

@sempervictus
Contributor

I'll throw this on a testbed not running the updated ABD stack (that PR may limit the scope of testing here since people who run strange things in the night tend to use #2129).

Thank you very much for getting this ticking so quickly

@prakashsurya
Member

@dweeezil FYI, I just opened a review to fix arcstat in the upstream illumos code: https://reviews.csiden.org/r/164/

@dweeezil force-pushed the lock-contention-on-arcs_mtx branch from cce835a to 175f413 on February 22, 2015 14:28
@dweeezil
Contributor Author

dweeezil commented Mar 2, 2015

Quick update: My initial benchmarking with fio (using the same test as @prakashsurya did under illumos) has not shown the expected performance increase as the number of cores increases.

I was concerned I was running into memory bandwidth issues so I did a round of pmbw tests to get a handle on the best-case random-read memory bandwidth. It looks like I'm likely not running into the absolute memory bandwidth limit yet, but it's difficult to tell because the memory bandwidth drops so dramatically once the various caches' sizes are exceeded (I'm looking at the pmbw graphs).

While trying to get this running on a lockstat-enabled kernel (to see where there might still be a bottleneck), I realized that all the tracepoint stuff for the ARC was broken by the illumos 5408 patch which means I've got to go back to that patch and overhaul the ARC tracepointing. I'll post more information as it becomes available.

@prakashsurya
Member

While trying to get this running on a lockstat-enabled kernel (to see where there might still be a bottleneck), I realized that all the tracepoint stuff for the ARC was broken by the illumos 5408 patch which means I've got to go back to that patch and overhaul the ARC tracepointing.

Ugh, that's going to be a constant headache moving forward. I think we'll need to make a regression test to make sure the tracepoints remain intact as the code continues to evolve, but that's work for another issue.

@behlendorf
Contributor

Ugh, that's going to be a constant headache moving forward. I think we'll need to make a regression test to make sure the tracepoints remain intact as the code continues to evolve

In some sense we already have one. The CentOS 7 debug builder enables tracepoints, so it will fail if the tracepoint code isn't right. What we really need is to eradicate the remaining known test failures so that when a builder fails, people immediately look into why.

http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-debug-builder/builds/802

@dweeezil
Contributor Author

dweeezil commented Mar 2, 2015

@behlendorf What are your thoughts on handling tracepoints for the newly-structured arc_buf_hdr_t insofar as the b_l2hdr and b_l1hdr members are concerned? In particular, do you think it's OK to always copy both of them in TP_fast_assign regardless of whether they're valid? Regarding the TP_printk, I was thinking of trying to add some conditional logic to only print the contents of one or the other depending on which one is valid.

@behlendorf
Contributor

@dweeezil I think we should keep it simple and fast and assign the useful bits of both b_l2hdr and b_l1hdr in TP_fast_assign. A developer looking at the tracepoint log should be able to interpret which parts of that structure are meaningful. That said, I'm sure @nedbass and @prakashsurya have some thoughts about this as well.

@nedbass
Contributor

nedbass commented Mar 2, 2015

I agree with @behlendorf, TP_* should be kept dumb and simple. In fact you might want to see what's in the supposedly invalid parts of a structure in some cases.

@prakashsurya
Member

I don't really have a preference, so I'm cool with either way; "keep it simple" is usually my guidance unless there's a compelling reason not to.

And glad to hear we at least have something to try and catch issues introduced with the tracepoints. 👍
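For illustration, here is roughly what the "dumb and simple" approach looks like as a Linux TRACE_EVENT. This is a hedged sketch only; the event name, the fields copied, and the accessors are assumptions for illustration, not the tracepoints that ended up in the tree:

TRACE_EVENT(zfs_arc__hdr_example,
	TP_PROTO(arc_buf_hdr_t *hdr),
	TP_ARGS(hdr),
	TP_STRUCT__entry(
		__field(uint64_t,	hdr_size)
		__field(uint64_t,	l2_daddr)
		__field(int64_t,	l1_refcount)
	),
	TP_fast_assign(
		/* copy the useful bits of both sub-headers unconditionally */
		__entry->hdr_size	= hdr->b_size;
		__entry->l2_daddr	= hdr->b_l2hdr.b_daddr;
		__entry->l1_refcount	= refcount_count(&hdr->b_l1hdr.b_refcnt);
	),
	/* no conditional logic; the reader decides which half is meaningful */
	TP_printk("size %llu l2_daddr %llu l1_refcnt %lld",
	    (unsigned long long)__entry->hdr_size,
	    (unsigned long long)__entry->l2_daddr,
	    (long long)__entry->l1_refcount)
);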

kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Mar 13, 2015
@kernelOfTruth
Contributor

referencing/posting Illumos #5497 here for easier discovery (e.g. via Google):
https://www.illumos.org/issues/5497

@angstymeat

I installed this on my system running Fedora 21 with kernel 3.17.8-200.fc20.x86_64, after trying the patch in #3181 to resolve an issue with the system becoming unresponsive or crashing after rsync'ing mail directories with lots of small files.

It hasn't frozen up like it did with #3181, but as soon as my ARC got full, ZFS operations slowed to a crawl. It takes several seconds to list each file in a directory. There's almost no CPU in use now.

@dweeezil
Contributor Author

I've still not gotten a chance to give this a good workout and to test whether it's actually operating as intended. Unfortunately while I was working on setting up a test environment to evaluate this patch, I came across the issue I just posted in #3183 and that has been keeping me distracted for the past week or so.

During my initial testing of the loads for which this patch was designed to help, I did not see any better scalability in read bandwidth versus a stock system. While trying to arrange a test environment in which I could get better measurements, I came across the issue mentioned above.

And to be clear, this patch is not something that was going to help metadata-intensive loads much. The benchmark used by its author and by myself involves highly parallel 100% ARC-cached reads.

To be clear, however, when I say "the patch" above, I'm referring only to 175f413 which is the last patch in the stack (and the subject of this pull request). The other prerequisite patches could certainly help other issues.

@kernelOfTruth
Contributor

FYI:

module/zfs/arc.c:3458:47: error: ‘buf’ undeclared (first use in this function)
   cmn_err(CE_PANIC, "invalid arc state 0x%p", buf->b_state);

kernelOfTruth@c76ca73 & kernelOfTruth@129c618

module/zfs/multilist.c:30:1: warning: ‘multilist_d2l’ defined but not used [-Wunused-function]
 multilist_d2l(multilist_t *ml, void *obj)

not sure how to handle that one
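One way that warning could be quieted (my assumption, not necessarily how it was handled in the final stack) is a forward declaration carrying GCC's unused attribute, so the helper can stay in place until it gains a caller:

/* module/zfs/multilist.c: mark the helper as intentionally unused for now */
static void *multilist_d2l(multilist_t *ml, void *obj) __attribute__((__unused__));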

@kernelOfTruth
Contributor

Had I known that I would succeed, I would have done it on a clean base (d'oh! Sorry about that).

Anyway, it would be a waste of work to throw it away and start anew (perhaps I'll do so eventually, but for now the buildbots can chew on it). Here's a branch / tree with #2129 and this patchset (#3115) on top:

https://github.com/kernelOfTruth/zfs/commits/zfs_kOT_15.03.2015

Midway through I realized that I had to revert #3181, otherwise it would have been even more complicated.

It compiled cleanly, and besides the above-mentioned message

module/zfs/multilist.c:30:1: warning: ‘multilist_d2l’ defined but not used [-Wunused-function]
 multilist_d2l(multilist_t *ml, void *obj)

there were no error messages

Hope it helps

@angstymeat

When I compiled it I got the same message, and then changed buf->b_state to hdr->b_state since it was the only thing in scope that seemed correct.

@kernelOfTruth
Contributor

@angstymeat it's actually

hdr->b_l1hdr.b_state

if you also use "5408 managing ZFS cache devices requires lots of RAM"

there the definition or name changed once again 😉

(provided, of course, that I understood it correctly; otherwise it wouldn't have compiled for me)

@dweeezil
Contributor Author

As an update, my currently-interrupted work on this stack requires going back to the "5408 managing ZFS cache devices requires lots of RAM" patch and updating our tracing infrastructure to properly deal with the newly-nested ARC structures. Once that update is complete, the rest of the patch stack will be rebased atop that work.

@dweeezil
Contributor Author

dweeezil commented Jun 3, 2015

I've been able to get my test system into the following unfortunate state (output courtesy of a hacked arcstat.py):

dweeezil@zfs-dev:~$ ./arcstat.sh 
   p     c  meta  arcsz  metalimit  c_max  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  
 29M   31M  613M   613M  2.9G  3.9G  

The problem is that the large block support bumped the maximum block size to 16MiB and the logic in the adapt thread does this:

        /*
         * If we're within (2 * maxblocksize) bytes of the target
         * cache size, increment the target cache size
         */
        if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {

Oops: if arc_c drops below 2 * SPA_MAXBLOCKSIZE, the unsigned subtraction underflows and wraps around to a huge value, so the comparison never succeeds and we've got a system that's pretty well bound up because arc_c will never be allowed to grow.

This may very well be unrelated to the issue at hand, but I do seem to be able to stress the system rather easily and cause a bad collapse of arc_c.
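A standalone sketch of the wrap-around (not the ZFS code itself; the sizes are taken from the arcstat output above) together with a guarded form of the comparison that avoids it. Note that the fix referenced later in this thread raises arc_c_min instead; the guarded comparison here is just an illustration:

#include <stdio.h>
#include <stdint.h>

#define SPA_MAXBLOCKSHIFT	24	/* 16MiB large blocks */

int
main(void)
{
	uint64_t arc_c = 31ULL << 20;		/* collapsed target: 31M */
	uint64_t arc_size = 613ULL << 20;	/* current ARC size: 613M */

	/*
	 * Original test: arc_c is below 2 * maxblocksize (32MiB), so the
	 * unsigned subtraction wraps to ~2^64 and the test is never true.
	 */
	printf("original: %d\n", arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT));

	/* Guarded form: move the constant to the other side so nothing wraps. */
	printf("guarded:  %d\n", arc_size + (2ULL << SPA_MAXBLOCKSHIFT) > arc_c);

	return (0);
}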

@dweeezil
Contributor Author

dweeezil commented Jun 4, 2015

I'm testing now with dweeezil/zfs@4288ef8 to fix the non-expanding arc_c problem and will likely add that patch to the stack.

@dweeezil
Contributor Author

dweeezil commented Jun 4, 2015

@behlendorf Back to the use of the system taskq again. One of the arcstats I added while testing is this:

@@ -2488,7 +2494,9 @@ static void
 arc_prune(int64_t adjust)
 {
        arc_prune_async(adjust);
+       ARCSTAT_INCR(arcstat_prune_wait, 1);
        taskq_wait(system_taskq);
+       ARCSTAT_INCR(arcstat_prune_wait, -1);
 }

and after causing a bunch of pruning, I see:

dweeezil@zfs-dev:~/src/zfs-mtx$ grep prune /proc/spl/kstat/zfs/arcstats 
arc_prune                       4    382
arc_balanced_meta_prune         4    273
arc_prune_wait                  4    3

and:

root@zfs-dev:~# ps axwww | grep D
  PID TTY      STAT   TIME COMMAND
 2265 ?        Ss     0:00 /usr/sbin/sshd -D
58664 ?        D      0:16 /bin/bash ./create.sh
58673 ?        D      0:17 /bin/bash ./create.sh
58691 ?        D      0:16 /bin/bash ./create.sh
74797 pts/2    S+     0:00 grep --color=auto D

so it looks like there are 3 perpetual waiters. Might there not be other things on the system taskq such as the deadman, prefetcher thread, etc.?

@behlendorf
Contributor

I'd like to better understand the illumos side of things first. With that said, is the prune callback used often in a standard ZoL system?

@prakashsurya it can be for a meta-data heavy workload. I think the right way to think about this is that the Linux prune functionality is directly analogous to the dnlc_reduce_cache() functionality on illumos. You would want to call it under exactly the same circumstances. We've just needed to make that hook generic under Linux because the Linux VFS doesn't provide us the needed interface.

I think one reasonable way to do this portably would be just to agree on a common ZFS-specific interface. That could be called arc_prune() or something entirely different, but all it would need to be on illumos is a macro or inline function. Right now our arc_prune() takes a byte count and dnlc_reduce_cache() a percentage, but that's the only meaningful way they differ.

#define arc_prune(x) dnlc_reduce_cache(x)

As for Lustre, I think that's a slightly different kettle of fish. You're right, they probably shouldn't be using this interface any more. But having something like this available is generally useful if you're considering building something which isn't a filesystem on top of the DMU. And these days there's increasing interest in doing just that.

@dweeezil I've refreshed my lock-contention-on-arcs_mtx branch which is based on your lock-contention-on-arcs_mtx branch. It depends on the changes in behlendorf/spl#453 and includes all the changes we discussed. In the end I just caved and went down the rabbit hole of implementing the TASKQ_DYNAMIC flag from illumos. This way we can make all the multi-thread taskqs dynamic, restore the illumos default values, and not have a crazy number of threads on an idle system.

openzfs/spl#455
https://github.com/behlendorf/zfs/tree/lock-contention-on-arcs_mtx

I also added a patch to my stack which addresses the arc_prune() issues I saw for older kernels without per-filesystem shrinkers. With these changes in place, the only issue I'm seeing is at cleanup time, where it looks like an iput() is getting lost (or never issued). I still need to run that down.

As for your 3 perpetual waiters, I'm not sure what's going on there. But in my updated patches I did create a dedicated arc_prune() taskq which may shed some light on things.
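For reference, a minimal sketch of what a dedicated prune taskq buys; the naming, sizing, and priority here are assumptions, not the exact code in the refreshed branch:

static taskq_t *arc_prune_taskq;

void
arc_prune_init(void)
{
	/* a private taskq used only for registered prune callbacks */
	arc_prune_taskq = taskq_create("arc_prune", 1, minclsyspri,
	    1, INT_MAX, TASKQ_PREPOPULATE);
}

static void
arc_prune(int64_t adjust)
{
	arc_prune_async(adjust);	/* dispatches callbacks onto arc_prune_taskq */

	/*
	 * Waiting here only blocks on prune work; it can no longer be held
	 * up by unrelated items on system_taskq (deadman, prefetch, etc.),
	 * which should also make the arc_prune_wait kstat above easier to
	 * interpret.
	 */
	taskq_wait(arc_prune_taskq);
}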

@dweeezil
Contributor Author

dweeezil commented Jun 5, 2015

@behlendorf I'll be re-testing today with your version of the stack (https://github.com/behlendorf/zfs/tree/lock-contention-on-arcs_mtx) and modified SPL as well as an updated version of my arc_c_min fix from https://github.com/dweeezil/zfs/tree/arc-c-too-small. I'll be interested to observe the taskq threads on my current test rig (40 cores/80 threads).

@behlendorf
Contributor

@dweeezil I've refreshed my spl and zfs branches with the following small changes.

openzfs/spl#455
https://github.com/behlendorf/zfs/tree/lock-contention-on-arcs_mtx

  • Decreased the default spl_taskq_thread_sequential value to 1. At 8 it clearly wasn't ramping up the IO threads fast enough for certain workloads. This has helped considerably.
  • Added your arc_c_min patch with 2 small additions I mentioned in the comment.
  • Added a patch to change several existing taskq_wait() callers to taskq_wait_outstanding() to preserve the existing behavior where safe.
  • Fixed a cstyle issue.

It appears the umount issue I referenced above isn't a new issue introduced by these changes. I was able to reproduce it on the master branch with a bit of work. So I don't consider that a blocker, although it needs to be explained.

My intention is to run these patch stacks over the weekend and see how they hold up. So far things are looking promising.
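Regarding the taskq_wait_outstanding() change in the third bullet above, the semantic difference, as I understand it (a hedged paraphrase of the SPL behavior, not its documentation), is roughly:

taskq_wait(tq);			/* block until the taskq is completely idle:
				 * every task finishes, including tasks
				 * dispatched after this call was made */

taskq_wait_outstanding(tq, 0);	/* block only until the tasks that were
				 * already outstanding at the time of the
				 * call have completed; later dispatches do
				 * not extend the wait */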

@dweeezil
Contributor Author

dweeezil commented Jun 6, 2015

@behlendorf I'm using your behlendorf/spl@a0f5e41 and behlendorf/zfs@af184e8. If I run my create.sh stress test with 32-process concurrency, after a while the (initially single) z_wr_iss thread blows up into 34 threads and all of them are blocked:

16636 ?        DN     0:00 [z_wr_iss]
16638 ?        DN     0:00 [z_wr_iss]
16639 ?        DN     0:00 [z_wr_iss]
16640 ?        DN     0:00 [z_wr_iss]
16643 ?        DN     0:00 [z_wr_iss]
16647 ?        DN     0:00 [z_wr_iss]
16650 ?        DN     0:00 [z_wr_iss]
16651 ?        DN     0:00 [z_wr_iss]
etc...

root@zfs-dev:~# cat /proc/16640/stack
[<ffffffff81090cf2>] kthread_create_on_node+0xd2/0x180
[<ffffffffc0308774>] spl_kthread_create+0x94/0xe0 [spl]
[<ffffffffc0309c9f>] taskq_thread_create+0x6f/0x100 [spl]
[<ffffffffc030a19c>] taskq_thread+0x46c/0x540 [spl]
[<ffffffff81090e72>] kthread+0xd2/0xf0
[<ffffffff8179be98>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

Access to the ZFS filesystem is, likewise, blocked as you might imagine. I'm going to reboot and see what's at kthread_create_on_node+0xd2 in my kernel.

@dweeezil
Contributor Author

dweeezil commented Jun 7, 2015

The z_wr_iss threads (or likely any other thread on a dynamic taskq) stall in do_wait_common() from wait_for_completion_killable() when kthreadd enters reclaim as follows:

root@zfs-dev:/home/dweeezil# cat /proc/2/stack
[<ffffffffc036a2ed>] taskq_wait_outstanding+0x7d/0xe0 [spl]
[<ffffffffc076f386>] arc_kmem_reap_now+0xf6/0x100 [zfs] *** second line in arc_prune() ***
[<ffffffffc076f467>] __arc_shrinker_func.isra.23+0xd7/0x180 [zfs]
[<ffffffffc076f527>] arc_shrinker_func_scan_objects+0x17/0x30 [zfs]
[<ffffffff811812e6>] shrink_node_slabs+0x1d6/0x370
[<ffffffff81183f3a>] shrink_zone+0x20a/0x240
[<ffffffff811842d5>] do_try_to_free_pages+0x155/0x440
[<ffffffff8118467a>] try_to_free_pages+0xba/0x150
[<ffffffff81177b6b>] __alloc_pages_nodemask+0x61b/0xa60
[<ffffffff811bd1a1>] alloc_pages_current+0x91/0x100
[<ffffffff811c7ceb>] new_slab+0x34b/0x450
[<ffffffff81792ac9>] __slab_alloc+0x315/0x475
[<ffffffff811c854c>] kmem_cache_alloc_node+0x8c/0x250
[<ffffffff8106f853>] copy_process.part.26+0xf3/0x1c20
[<ffffffff81071535>] do_fork+0xd5/0x340
[<ffffffff810717c6>] kernel_thread+0x26/0x30
[<ffffffff8109185a>] kthreadd+0x15a/0x1c0
[<ffffffff8179be98>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

Unfortunately, arc_prune_taskq has the following stack:

root@zfs-dev:/home/dweeezil/src/zfs-mtx# cat /proc/3404/stack
[<ffffffff81090cf2>] kthread_create_on_node+0xd2/0x180
[<ffffffffc0368774>] spl_kthread_create+0x94/0xe0 [spl]
[<ffffffffc0369c9f>] taskq_thread_create+0x6f/0x100 [spl]
[<ffffffffc036a19c>] taskq_thread+0x46c/0x540 [spl]
[<ffffffff81090e72>] kthread+0xd2/0xf0
[<ffffffff8179be98>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

Deadlock.

@dweeezil
Contributor Author

dweeezil commented Jun 7, 2015

Testing my theory, the following gross hack (and a similar one in the SPL shrinker) does fix the deadlock with kthreadd:

diff --git a/module/zfs/arc.c b/module/zfs/arc.c
index 84676ff..5d98c87 100644
--- a/module/zfs/arc.c
+++ b/module/zfs/arc.c
@@ -3247,7 +3247,7 @@ __arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc)
                return (pages);

        /* Not allowed to perform filesystem reclaim */
-       if (!(sc->gfp_mask & __GFP_FS))
+       if (!(sc->gfp_mask & __GFP_FS) || current->pid == 2)
                return (SHRINK_STOP);

        /* Reclaim in progress */

EDIT: Never mind, the deadlock just takes longer to appear.

EDIT[2]: Because it's a different deadlock, involving kswapd (see next comment).

@dweeezil
Contributor Author

dweeezil commented Jun 7, 2015

I'm going to try locking down kswapd as well (at least there's a proper API for detecting it). Clearly dynamic taskqs add a layer of memory allocations which wouldn't have happened in ZoL in the past.
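For context, the API in question is current_is_kswapd() (the PF_KSWAPD flag). A hedged sketch of how the earlier kthreadd hack might be extended, not the final fix:

	/* Not allowed to perform filesystem reclaim */
	if (!(sc->gfp_mask & __GFP_FS))
		return (SHRINK_STOP);

	/*
	 * Also refuse shrinker-initiated reclaim from kthreadd (pid 2) or
	 * kswapd: with dynamic taskqs, servicing the reclaim may require
	 * creating a new taskq thread, which itself allocates memory and
	 * can re-enter reclaim, deadlocking.
	 */
	if (current->pid == 2 || current_is_kswapd())
		return (SHRINK_STOP);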

@dweeezil
Contributor Author

dweeezil commented Jun 7, 2015

@behlendorf I'm going to continue testing with spl_taskq_thread_dynamic=0 for the time being, which seems to work around the deadlocks caused by reclaim from kthreadd and kswapd. This appears to be a latent problem which caused no trouble in the past because ZoL never (or rarely) created new threads after module initialization.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Jun 8, 2015
…to be written to l2arc device

If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual increment
to l2ad_hand was introduced in
commit e14bb3258d05c1b1077e2db7cf77088924e56919

Also, consistently use asize / a_sz for the allocated size, psize / p_sz
for the physical size.  Where the latter accounts for possible size
reduction because of compression, whereas the former accounts for possible
size expansion because of alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical I/O
has a suitable size.

modified to account for the changes of openzfs#2129 (ABD), openzfs#3115 (Illumos - 5408 managing ZFS cache devices requires lots of RAM), and Illumos - 5701 zpool list reports incorrect "alloc" value for cache devices
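To make the asize / psize distinction in the commit message concrete, a hedged sketch; the identifiers and helper name are illustrative, not the patch itself:

/*
 * p_sz is the buffer's physical size (possibly reduced by compression);
 * a_sz is what it actually occupies on the cache device once rounded up
 * to the vdev's allocation granularity.  Advancing l2ad_hand by a_sz
 * keeps l2arc_evict()'s disk-address arithmetic consistent with what
 * was really written.
 */
static uint64_t
l2arc_account_write(l2arc_dev_t *dev, uint64_t p_sz)
{
	uint64_t a_sz = vdev_psize_to_asize(dev->l2ad_vdev, p_sz);

	dev->l2ad_hand += a_sz;
	return (a_sz);
}
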
@dweeezil
Contributor Author

dweeezil commented Jun 8, 2015

With spl_taskq_thread_dynamic=0 my memory-constrained (about 8GiB usable) test system survived an overnight run of dbench -x 32 with only a handful of "120 second block" messages. This is very good news. I'll be re-running the test today with 512GiB memory available. I'll also likely try it with xattr=sa and/or relatime=on.

@behlendorf
Contributor

@dweeezil my testing survived all weekend as well. I was running high thread versions of xdd, dbench, and iozone concurrently in a loop for ~2 days. No stack traces or lockups. Not too bad!

Interestingly, I didn't observe the TASKQ_DYNAMIC lockups you hit. This was almost certainly because I wasn't pushing as hard on the meta limit. But this is clearly an issue introduced by the patch which needs to be addressed.

The creation of dynamic threads was moved from taskq_create() into the taskq thread itself, which is what caused the issue. Let me give some thought to how to resolve this. Worst case, we can drop the TASKQ_DYNAMIC patches for now; they're not critical, just an optimization. We might want to keep them separate anyway (or disabled by default) to quantify any performance benefit or penalty.

@dweeezil
Contributor Author

dweeezil commented Jun 8, 2015

@behlendorf My 40-process 512GiB instance of dbench just passed 6 hours and ran fine the whole time (as expected). I also had xattr=sa and relatime=on set (dynamic taskqs were disabled but likely would have been just fine in this case). I'd have to say this patch stack looks pretty good now. [EDIT] Did you see my note about the comparison operator in arc_adapt()?

As to the dynamic taskqs, I do like their effect with lots of cores, especially on pools with a large number of top-level vdevs. I made a pool with 40 top-level vdevs once and with the non-dynamic scheme, wound up with over 1600 (!) metaslab_group_taskq threads. Low-memory systems, however, can be trivially deadlocked if dynamic taskqs are enabled. I'm thinking the dynamic taskq feature ought to be its own patch in its own pull request anyway.

@behlendorf
Contributor

@dweeezil I think we're in good shape as well. I've opened #3481 with a finalized version of this patch stack with the following modifications.

  • All TASKQ_DYNAMIC patches have been dropped. I'll open a new dedicated pull request for these improvements after resolving the issues you uncovered. I have new patches; they just need some testing before I post them.
  • The zfs_sb_prune_compat() patch has been dropped. I was able to expose a deadlock with the existing patch which still needs to be resolved. However, this same problem currently exists in master so it doesn't need to hold up these changes.
  • The taskq_wait_outstanding() patch has been merged into the SPL master source. There have been no observed issues with those changes, and it is required for the buildbot to test the patch stack.
  • I've made all the changes you suggested in previous comments.

I'll queue this patch stack up for another evening of testing. If everything looks good and you don't have any objections I'd like to merge these changes tomorrow. Please review it one final time.

@behlendorf
Contributor

@dweeezil after addressing a minor build issue in the proposed patch stack, the patches held up well to my overnight stress testing. At this point I think they're ready to be merged, so let me know when you're happy and I'll merge them.

@dweeezil
Contributor Author

@behlendorf I'll get the stack (your -final branch) reviewed and give it a bit more testing today.

@dweeezil
Contributor Author

@behlendorf So far, it's looking good to me. I'll continue running tests throughout the rest of the day, but at this point, I'd say it's ready to merge. More than anything, I'd like to move on to investigate some of the other scalability bottlenecks I've discovered while working on this.

@behlendorf
Contributor

@dweeezil sounds good. I'll leave my tests running overnight as well and if everything still looks good in the morning I'll get this merged so we can pursue the other improvements.

@dweeezil
Contributor Author

This is still looking good after overnight testing with fio and dbench. I ran some final numbers with @prakashsurya's original test. For reference, it's fio with the following configuration:

[global]
group_reporting=1
fallocate=none
ioengine=psync
numjobs=32
iodepth=1
bs=8k
filesize=512m
size=512m
randrepeat=0
use_os_rand=1

[test]
rw=randread
time_based
runtime=1m
directory=/tank/fish

on a filesystem with recordsize=8k.

With current master code, there are 873419 contentions on arcs_mtx and an average wait time of 1.85ms. With this patch stack, the ARC-related contentions are pushed down into the noise. There are only 7589 contentions on mls_lock, with an average wait time of 14.95us.

Unfortunately, as mentioned in previous comments, the overall performance for 8K block reads does not improve at all with this patch. There are clearly other ZoL-specific issues at play here. With this patch stack, the top sources of lock contention become wait_lock and rr_lock, which are part of the SPL's rwlock and rwsem implementations. These are separate problems and will get their own issues as soon as this stack is committed and I can start to seriously look into the other scalability issues specific to ZoL.

In summary, I think this patch stack is ready to go as-is.

@kernelOfTruth
Contributor

With current master code, there are 873419 contentions on arcs_mtx and an average wait time of 1.85ms. With this patch stack, the ARC-related contentions are pushed down into the noise. There are only 7589 contentions on mls_lock, with an average wait time of 14.95us.

That sounds VERY interesting

I guess I'll give this a spin against vanilla (if it's not merged before I have the opportunity to update my ZFS drivers).

@dweeezil
Occasionally I get sound stutters (currently running 4.1-rc7 with threadirqs, low-latency) - hopefully this shows a positive effect in that direction.

The maximal latency during "high" I/O (one rsync backup job to a btrfs volume), measured with latencytop, was around 3000 ms (3 seconds!).

Using a different CPU scheduler (BFS instead of CFS) also helps, but so far there have been some noticeable stalls, latency spikes, etc. when using ZFS - so I'm looking forward to the merge.

Thanks

@behlendorf
Contributor

@dweeezil my testing held up as expected too. Since these changes are ready to go, I've merged them to master. That's a nice big step forward. Now to address some of the lingering performance issues you uncovered. Thanks for all your hard work on this!

06358ea Merge branch 'lock-contention-on-arcs_mtx-final'
121b3ca Increase arc_c_min to allow safe operation of arc_adapt()
f604673 Make arc_prune() asynchronous
c5528b9 Use taskq_wait_outstanding() function
4f34bd9 Add taskq_wait_outstanding() function
ca0bf58 Illumos 5497 - lock contention on arcs_mtx
b9541d6 Illumos 5408 - managing ZFS cache devices requires lots of RAM
2a43241 Illumos 5369 - arc flags should be an enum
ad4af89 Partially revert "Add ddt, ddt_entry, and l2arc_hdr caches"
97639d0 Revert "Allow arc_evict_ghost() to only evict meta data"
f6b3b1f Revert "fix l2arc compression buffers leak"
7807028 Revert "arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc"

@prakashsurya
Member

Woo! 👍 Nice work!! It's too bad this doesn't help the bottom-line performance numbers, but at least it moves the contention out of the ARC, which was the point. Reducing the average latency from milliseconds to tens of microseconds is good to hear! :)

EDIT: Oh, and let's not forget, this will also effectively give people who use an L2ARC device more RAM, since "Illumos 5408 - managing ZFS cache devices requires lots of RAM" was also merged in the context of this work. Good stuff.

@behlendorf
Contributor

@prakashsurya speaking of further slimming down the ARC header memory usage, it would be great to get your feedback on #1971. It pulls the kmutex_t and kcondvar_t out of the l1arc_buf_hdr structure, which results in significant additional savings with minimal additional contention.

@behlendorf mentioned this pull request Jun 29, 2015