spl kmem cache atomic operations hurt performance on large-scale systems #463
Comments
@dweeezil wow, that's a huge difference. I wouldn't have thought the atomics would be that expensive. It sounds like the thing to do here is: we can manually set --enable-debug-kmem for the buildbot so things like memory leaks still get automatically flagged. Alternately, these counters could be turned into per-CPU values. (BTW, I think you pasted the performance data in backwards.)
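For illustration, here is a minimal sketch of the per-CPU alternative using the kernel's generic percpu_counter API (on newer kernels percpu_counter_init() takes a GFP argument, as shown). The spl_kmem_used counter and the wrapper functions are hypothetical names, not existing SPL code:

```c
#include <linux/gfp.h>
#include <linux/percpu_counter.h>

/* Hypothetical accounting counter.  A single shared atomic64_t makes
 * every CPU bounce the same cache line on each alloc/free, while a
 * percpu_counter only folds local deltas into the shared count once
 * they exceed the batch size. */
static struct percpu_counter spl_kmem_used;

static int spl_kmem_accounting_init(void)
{
	return percpu_counter_init(&spl_kmem_used, 0, GFP_KERNEL);
}

static inline void spl_kmem_account_alloc(size_t size)
{
	percpu_counter_add(&spl_kmem_used, size);
}

static inline void spl_kmem_account_free(size_t size)
{
	percpu_counter_sub(&spl_kmem_used, size);
}

/* Readers (e.g. a /proc handler) pay the cost instead by summing the
 * per-CPU deltas. */
static inline s64 spl_kmem_used_read(void)
{
	return percpu_counter_sum(&spl_kmem_used);
}
```

The trade-off is that reading the counter becomes more expensive than a single atomic load, which is usually acceptable for a debugging statistic.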
/me rubs his eyes. @dweeezil, are those stats inverted for each case (debug and non-debug)? Otherwise it would mean less throughput and thus worse performance, no? Besides that: that's a HUGE difference! I wonder how this change affects performance, especially with lz4 compression. Great find =)
Oops, yes, I pasted it in backwards. BTW, with this change, we're now up to 50-60% of the illumos performance on this test (highly-concurrent small-block cached reads). I've run a number of instances of the test with all processes pinned to a single socket and, as expected, things are a lot faster. This issue is one of the big causes of the slowdown when the processes are spread across sockets.
The default kmem debugging (--enable-debug-kmem) can severely impact performance on large-scale NUMA systems due to the atomic operations used in the memory accounting. A 32-thread fio test running on a 40-core 80-thread system and performing 100% cached reads with kmem debugging is:

Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#463
Specifically I'd suggest #464 as a reasonable fix (just compiled, not yet tested). I've shamelessly stolen @dweeezil's comment above for the commit, but I think this is the most straightforward way to address this. We don't lose anything from a user perspective by disabling this debugging. It's primarily there for developers and was only enabled by default as a convenience.
I've updated the title of this issue to reflect what I believe is the balance of the performance degradation we're seeing relative to the same benchmark run under illumos. Even with SPL kmem accounting debugging disabled, it appears a large amount of time is spent on atomic operations managing the SPL kmem caches.
The default kmem debugging (--enable-debug-kmem) can severely impact performance on large-scale NUMA systems due to the atomic operations used in the memory accounting. A 32-thread fio test running on a 40-core 80-thread system and performing 100% cached reads with kmem debugging is:

Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #463
@dweeezil that would be very good news. We're already well underway on shifting over to the Linux slabs entirely, which shouldn't suffer from this issue. The only thing blocking this is finalizing the ABD patches, which are the next major planned change. This looks like it would be pretty straightforward to address in the SPL layer too. I'd love to see some performance results!
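As a rough illustration of what leaning on the native Linux slab looks like, here is a sketch using the stock kmem_cache API; the abd_chunk structure and cache name are made-up examples, not the actual ABD code:

```c
#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>

struct abd_chunk {
	struct list_head	ac_list;
	void			*ac_data;
};

static struct kmem_cache *abd_chunk_cache;

static int abd_chunk_cache_create(void)
{
	/* The native slab keeps per-CPU caches of objects, so the common
	 * alloc/free paths do not touch a shared atomic counter. */
	abd_chunk_cache = kmem_cache_create("abd_chunk_cache",
	    sizeof (struct abd_chunk), 0, 0, NULL);

	return (abd_chunk_cache ? 0 : -ENOMEM);
}

static struct abd_chunk *abd_chunk_alloc(void)
{
	return (kmem_cache_alloc(abd_chunk_cache, GFP_KERNEL));
}

static void abd_chunk_free(struct abd_chunk *ac)
{
	kmem_cache_free(abd_chunk_cache, ac);
}

static void abd_chunk_cache_destroy(void)
{
	kmem_cache_destroy(abd_chunk_cache);
}
```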
@behlendorf I've been juggling a lot of things over the past few days and haven't been able to follow up on this as well as I'd like; it seems that whenever I remove a bottleneck another one appears. Here's where things are at the moment, IIRC:

I removed the

I zeroed in on the

After disabling

I just popped onto the system again and re-ran the benchmark and it looks like I'm getting pretty bad contention on

I'm normally running the benchmark with 32 processes. As a cross-check, I ran an instance with 10 processes and pinned them to a single socket (the system is 4-socket with 10 cores/20 threads each). The

Have you got any thoughts as to why not using

I'll continue to investigate this as I have time. In the meantime, both the existing SPL change (default to no kmem accounting) and your proposed #465 appear to be Good Things.
Referencing @ryao's comment that mentions CONFIG_PARAVIRT_SPINLOCKS: http://lwn.net/Articles/495597/ "Paravirtualized ticket spinlocks" (2012)
@dweeezil I'm not sure why this is the case. Certainly disabling CONFIG_PARAVIRT_SPINLOCKS will result in a different spinlock implementation being used that could disrupt the performance. But then setting CONFIG_DEBUG_SPINLOCK or enabling the kernel lock profiling would have a similar effect too. Any of these changes could perturb the lock contention at the higher ZFS layers if the time required to take/drop locks changes significantly. I'm glad you're taking the time to look at this so we can continue to chip away at it and make improvements. Speaking of which, if you can sign off on the #465 change I'll get it merged. There's no real downside to making that change, although I doubt it will help on single-socket systems.
As described in spl_kmem_cache_destroy() the ->skc_ref count was added to address the case of a cache reap or grow racing with a destroy. They are not strictly needed in the alloc/free paths because consumers of the cache are responsible for not using it while it's being destroyed. Removing this code is desirable because there is some evidence that contention on this atomic negatively impacts performance on large-scale NUMA systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #463
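In code terms the change amounts to dropping the reference bump from the hot paths. A simplified before/after sketch, not the exact SPL source; the cache structure is reduced to the relevant field and spl_cache_obj_alloc() is a hypothetical stand-in for the internal slab logic:

```c
#include <linux/atomic.h>

/* Minimal stand-in for the SPL cache structure; the real
 * spl_kmem_cache_t has many more fields. */
typedef struct spl_kmem_cache {
	atomic_t	skc_ref;
	/* ... */
} spl_kmem_cache_t;

/* Made-up helper standing in for the real object allocation path. */
extern void *spl_cache_obj_alloc(spl_kmem_cache_t *skc, int flags);

/* Before: every allocation bumped the cache-wide atomic, so all CPUs
 * allocating from the same cache bounce the skc_ref cache line. */
static void *
spl_kmem_cache_alloc_before(spl_kmem_cache_t *skc, int flags)
{
	void *obj;

	atomic_inc(&skc->skc_ref);
	obj = spl_cache_obj_alloc(skc, flags);
	atomic_dec(&skc->skc_ref);

	return (obj);
}

/* After: the alloc/free paths trust the caller not to race with
 * destroy, and skc_ref is only taken by reap/grow to guard against a
 * concurrent spl_kmem_cache_destroy(). */
static void *
spl_kmem_cache_alloc_after(spl_kmem_cache_t *skc, int flags)
{
	return (spl_cache_obj_alloc(skc, flags));
}
```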
I've finally gotten back to testing on the big NUMA system with current master code for both spl and zfs and the performance is looking pretty good. AFAICT, the spl-related issues are gone (skc_ref & whatever else). I'm not sure where things stand w.r.t.

In any case, I'm going to close this issue because I don't think there are any more spl-related things at the moment which are keeping us from matching illumos' performance. I will be opening a new issue in the zfs tracker with a similar title.
The default kmem debugging (--enable-debug-kmem) can severely impact performance on large-scale NUMA systems due to the atomic operations used in the memory accounting. A 32-thread fio test running on a 40-core 80-thread system and performing 100% cached reads with kmem debugging is:

READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

and when SPL/ZFS is built with --disable-debug-kmem is:

READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

This was discovered via flame graph profiling.
EDIT: Fixed the order of the stats.
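For context, the accounting that --enable-debug-kmem turns on boils down to a single counter shared by every CPU and bumped on each allocation and free. A minimal sketch of that pattern (the function names and the kmem_alloc_used counter here are illustrative, not the exact SPL identifiers):

```c
#include <linux/atomic.h>
#include <linux/slab.h>

/* One counter shared by every CPU: each update is an atomic
 * read-modify-write on the same cache line, so on a multi-socket box
 * concurrent allocations serialize on cross-socket cache-line
 * transfers; that is where the fio throughput gap above comes from. */
static atomic64_t kmem_alloc_used = ATOMIC64_INIT(0);

static inline void *
debug_kmem_alloc(size_t size, gfp_t flags)
{
	void *ptr = kmalloc(size, flags);

	if (ptr)
		atomic64_add(size, &kmem_alloc_used);

	return (ptr);
}

static inline void
debug_kmem_free(const void *ptr, size_t size)
{
	atomic64_sub(size, &kmem_alloc_used);
	kfree(ptr);
}
```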