[buildbot][pt2] zfs master 18.11.2017 (0.7.3) + openzfs 8585 + pr6568 + zfetch prefetch + ARC fixes (incl. commit callbacks from tail of list) #7020

Conversation

kernelOfTruth (Contributor) commented Jan 8, 2018

Testing the latest "working" (i.e. non-conflicting) branch that combines the zfetch / prefetch performance improvements with the new ARC hit rate fix (#6989), for bleeding-edge testing via the buildbots.

The 'Sequential scrub and resilvers' commit is left out.

The 'Fix ARC hit rate' commit was adjusted to drop the bits that depend on the reverted 'Sequential scrub and resilvers' commit.

This is a temporary branch (assuming it passes the buildbots) for those who focus on performance enhancements, don't want to miss out on the benefits of the zfetch and ZIL enhancement patches, and can live with the "old" scrub prefetch behavior.

Prakash Surya and others added 10 commits November 24, 2017 00:03
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

Problem
=======

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

 1. When there are outstanding ZIL blocks being written (i.e. there's
    already a "writer thread" in progress), then any new calls to
    zil_commit() will block waiting for the currently outstanding ZIL
    blocks to complete. The blocks written for each "writer thread" are
    coined a "batch", and there can only ever be a single "batch" being
    written at a time. When a batch is being written, any new ZIL
    transactions will have to wait for the next batch to be written,
    which won't occur until the current batch finishes.

    As a result, the underlying storage may not be used as efficiently
    as possible. While "new" threads enter zil_commit() and are blocked
    waiting for the next batch, it's possible that the underlying
    storage isn't fully utilized by the current batch of ZIL blocks. In
    that case, it'd be better to allow these new threads to generate
    (and issue) a new ZIL block, such that it could be serviced by the
    underlying storage concurrently with the other ZIL blocks that are
    being serviced.

 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
    to complete, prior to zil_commit() returning. The size of any given
    batch is proportional to the number of ZIL transactions in the queue
    at the time that the batch starts processing the queue, which
    doesn't occur until the previous batch completes. Thus, if there's a
    lot of transactions in the queue, the batch could be composed of
    many ZIL blocks, and each call to zil_commit() will have to wait for
    all of these writes to complete (even if the thread calling
    zil_commit() only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

 3. If a given batch takes longer to complete than normal, this results
    in larger batch sizes, which then take longer to complete and
    further drive up the latency of zil_commit(). This can occur for a
    number of reasons, including (but not limited to): transient changes
    in the workload, and storage latency irregularities.
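
To make the batching behavior concrete, below is a minimal user-space model of the pre-change flow (illustration only, with simplified locking and a hypothetical write_and_wait_for_whole_batch() stand-in; this is not the actual zil_commit() code). Every caller is funneled into a single batch, and no caller returns until the entire batch is on disk:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Toy model of the old zil_commit() batching; not the real ZFS code. */
    typedef struct zil_model {
        pthread_mutex_t lock;
        pthread_cond_t  cv;
        bool            writer_active;   /* only one "writer thread" at a time */
        uint64_t        next_batch;      /* batch currently being filled (starts at 1) */
        uint64_t        completed_batch; /* last batch fully on stable storage */
    } zil_model_t;

    /* Stand-in for issuing every ZIL block of a batch and waiting for all of them. */
    static void
    write_and_wait_for_whole_batch(uint64_t batch)
    {
        (void) batch;
    }

    static void
    zil_commit_old_model(zil_model_t *zl)
    {
        pthread_mutex_lock(&zl->lock);
        uint64_t my_batch = zl->next_batch;

        while (zl->completed_batch < my_batch) {
            if (!zl->writer_active) {
                /*
                 * Become the writer: issue the whole batch and wait for
                 * all of its blocks, even though this caller only cares
                 * about its own transaction(s).
                 */
                zl->writer_active = true;
                zl->next_batch++;
                pthread_mutex_unlock(&zl->lock);
                write_and_wait_for_whole_batch(my_batch);
                pthread_mutex_lock(&zl->lock);
                zl->completed_batch = my_batch;
                zl->writer_active = false;
                pthread_cond_broadcast(&zl->cv);
            } else {
                /* A batch is already in flight: block until it finishes. */
                pthread_cond_wait(&zl->cv, &zl->lock);
            }
        }
        pthread_mutex_unlock(&zl->lock);
    }

Note how a caller that arrives while a batch is in flight cannot even start issuing its own block until the whole previous batch completes, which is exactly the serialization described in (1) and (2) above.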

Solution
========

The solution attempted by this change has the following goals:

 1. no on-disk changes; maintain current on-disk format.
 2. modify the "batch size" to be equal to the "ZIL block size".
 3. allow new batches to be generated and issued to disk, while there's
    already batches being serviced by the disk.
 4. allow zil_commit() to wait for as few ZIL blocks as possible.
 5. use as few ZIL blocks as possible, for the same amount of ZIL
    transactions, without introducing significant latency to any
    individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new algorithm will allow the
following improvements:

 1. new ZIL blocks can be generated and issued while there are already
    outstanding ZIL blocks being serviced by the storage.
 2. the latency of zil_commit() should be proportional to the underlying
    storage latency, rather than the incoming synchronous workload.
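
As a rough user-space sketch of that direction (hypothetical names and simplified synchronization, not the actual ZFS code): each committer gets its own small waiter tied to the ZIL block holding its transactions, so it sleeps only until that block reaches stable storage, regardless of how much other ZIL traffic is in flight.

    #include <pthread.h>
    #include <stdbool.h>

    /* Toy model of a per-commit waiter; not the real ZFS structures. */
    typedef struct commit_waiter {
        pthread_mutex_t lock;
        pthread_cond_t  cv;
        bool            done; /* set once the waiter's ZIL block is on disk */
    } commit_waiter_t;

    /* Committing thread: its itx is already in an issued ZIL block. */
    static void
    commit_wait(commit_waiter_t *w)
    {
        pthread_mutex_lock(&w->lock);
        while (!w->done)
            pthread_cond_wait(&w->cv, &w->lock);
        pthread_mutex_unlock(&w->lock);
    }

    /* ZIL block write/flush completion path: wake only this block's waiters. */
    static void
    commit_signal(commit_waiter_t *w)
    {
        pthread_mutex_lock(&w->lock);
        w->done = true;
        pthread_cond_signal(&w->cv);
        pthread_mutex_unlock(&w->lock);
    }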

Porting Notes
=============

Due to the changes made in commit 119a394, the lifetime of an itx
structure differs from that in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).

To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:

  * A list of itxs was added to the lwb structure. This list contains
    all of the itxs that have been committed to the lwb, such that the
    callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
    after the data for the itxs is committed to disk.

  * A list of itxs was added on the stack of the zil_process_commit_list()
    function; the "nolwb_itxs" list. In some circumstances, an itx may
    not be committed to an lwb (e.g. if allocating the "next" ZIL block
    on disk fails), so this list is used to keep track of which itxs
    fall into this state, such that their callbacks can be called after
    the ZIL's writer pipeline is "stalled".

  * The logic to actually call the itx's callback was moved into the
    zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
    were effectively performing the same logic (i.e. if callback is
    non-null, call the callback), it seemed like useful code cleanup to
    consolidate this logic into a single function.
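
The last point amounts to a single, uniform destroy path; a minimal sketch of what that consolidation looks like (a stand-in type and a user-space free() instead of the kernel allocator):

    #include <stdlib.h>

    /* Illustration-only stand-in for the real itx_t. */
    typedef struct itx {
        void (*itx_callback)(void *);
        void  *itx_callback_data;
        /* ... log record payload elided ... */
    } itx_t;

    /*
     * All consumers destroy an itx through one function, so the callback,
     * if one was registered, is invoked exactly once before the free.
     */
    static void
    itx_destroy_sketch(itx_t *itx)
    {
        if (itx->itx_callback != NULL)
            itx->itx_callback(itx->itx_callback_data);
        free(itx);
    }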

Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:

  * The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
    removed from "trace_zil.h" as well.

  * Some of the zilog structure's fields were removed, which affected
    the tracepoint definitions of the structure.

  * New tracepoints had to be added for the following 3 new probes:
      * zil__process__commit__itx
      * zil__process__normal__itx
      * zil__commit__io__error

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: openzfs/openzfs@5d95a3a
This should address the following panic seen when running on a system
with 8 SSDs and 32 vCPUs:

    [ 1806.416279] VERIFY3(itx->itx_wr_state == WR_INDIRECT) failed (4294967295 == 0)
    [ 1806.416315] PANIC at zil.c:1506:zil_lwb_commit()
    [ 1806.416333] Showing stack for process 28365
    [ 1806.416336] CPU: 5 PID: 28365 Comm: fio Tainted: P           OE   4.4.0-93-generic openzfs#116-Ubuntu
    [ 1806.416337] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
    [ 1806.416338]  0000000000000286 16a1c125c5c4eec7 ffff887cd21678c0 ffffffff813f9f83
    [ 1806.416341]  ffffffffc136d454 00000000000005e2 ffff887cd21678d0 ffffffffc02d6fd2
    [ 1806.416342]  ffff887cd2167a58 ffffffffc02d70a5 ffff887b665d75d0 ffff887c00000030
    [ 1806.416344] Call Trace:
    [ 1806.416350]  [<ffffffff813f9f83>] dump_stack+0x63/0x90
    [ 1806.416362]  [<ffffffffc02d6fd2>] spl_dumpstack+0x42/0x50 [spl]
    [ 1806.416366]  [<ffffffffc02d70a5>] spl_panic+0xc5/0x100 [spl]
    [ 1806.416428]  [<ffffffffc127ab1e>] ? zio_rewrite+0x6e/0x70 [zfs]
    [ 1806.416469]  [<ffffffffc126bd90>] ? zil_lwb_write_open+0x520/0x520 [zfs]
    [ 1806.416509]  [<ffffffffc126b96f>] ? zil_lwb_write_open+0xff/0x520 [zfs]
    [ 1806.416551]  [<ffffffffc126bd90>] ? zil_lwb_write_open+0x520/0x520 [zfs]
    [ 1806.416591]  [<ffffffffc1271daa>] zil_commit+0x14ba/0x1d90 [zfs]
    [ 1806.416635]  [<ffffffffc1260d6d>] zfs_write+0xe0d/0x1020 [zfs]
    [ 1806.416677]  [<ffffffffc128b391>] zpl_write_common_iovec+0x91/0x130 [zfs]
    [ 1806.416717]  [<ffffffffc128b4e1>] zpl_iter_write+0xb1/0xf0 [zfs]
    [ 1806.416721]  [<ffffffff8120ec0b>] new_sync_write+0x9b/0xe0
    [ 1806.416723]  [<ffffffff8120ec76>] __vfs_write+0x26/0x40
    [ 1806.416724]  [<ffffffff8120f5f9>] vfs_write+0xa9/0x1a0
    [ 1806.416726]  [<ffffffff81210465>] SyS_pwrite64+0x95/0xb0
    [ 1806.416729]  [<ffffffff81039d03>] ? fpu__restore+0xf3/0x150
    [ 1806.416734]  [<ffffffff818431f2>] entry_SYSCALL_64_fastpath+0x16/0x71

The problem was that we'd add the itx to the incorrect lwb, causing the
itx to be free'd when it was still being used.

Specifically, we'd add the itx to the lwb that was passed into
zil_lwb_commit, but it's possible that this lwb will not be the one that
the itx is committed to. If there is insufficient space in that lwb for
this itx, that lwb will be issued to disk, and the itx will be committed
to the next lwb.

In this scenario, the itx would have been linked into the lwb's list for
the lwb that was just submitted. When this lwb completes, the itx will
be free'd. If the lwb happens to complete prior to the call to
zil_lwb_commit completing, the itx will be free'd while it's still being
actively used in zil_lwb_commit. I believe this issue becomes more
likely to occur when the underlying storage is "fast", and would help
explain why I've only managed to trigger this crash when using SSDs.
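
In other words, the ordering matters: the itx has to be linked onto whichever lwb it actually ends up in. A minimal sketch of the corrected flow, using hypothetical helper names rather than the real functions:

    /* Illustration-only stand-ins; not the real ZFS types or functions. */
    typedef struct itx itx_t;
    typedef struct lwb lwb_t;

    /*
     * Commits the itx into `lwb` if it fits; otherwise issues `lwb` to
     * disk, opens the next lwb, commits the itx there, and returns the
     * lwb that actually holds the itx.
     */
    lwb_t *lwb_commit_itx(lwb_t *lwb, itx_t *itx);

    /*
     * Appends the itx to an lwb's itx list, so its callback runs (and the
     * itx is freed) only after that lwb's data is on stable storage.
     */
    void lwb_add_itx(lwb_t *lwb, itx_t *itx);

    static lwb_t *
    commit_one_itx(lwb_t *open_lwb, itx_t *itx)
    {
        /*
         * Buggy order: calling lwb_add_itx(open_lwb, itx) first lets
         * `open_lwb` complete and free the itx while it may still be in
         * use, and `open_lwb` may not even be the lwb that holds it.
         */
        lwb_t *holder = lwb_commit_itx(open_lwb, itx);

        /* Fixed order: link the itx to the lwb that really holds it. */
        lwb_add_itx(holder, itx);
        return (holder);
    }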

Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
range_tree: access the space bounds
range_tree: walk in the range
range_tree: init/deinit
range_tree: range_tree_overlaps & range_tree_set

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
dmu_zfetch: initial range_tree based prefetcher
dmu_zfetch: range_tree based prefetcher with streams support
dmu_zfetch: more sensible initial prefetch ranges
dmu_zfetch: bail if the lock is taken
dmu_zfetch: avoid doing too much work on repeated access

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Our ZFS-backed Lustre MDT had soft lockups under heavy metadata
workloads while handling transaction callbacks from osd_zfs.

The problem is that ZFS is not taking advantage of the fast path in
Lustre's trans callback handling, where Lustre will skip the calls
to ptlrpc_commit_replies() when it already saw a higher transaction
number.

This patch corrects this; it also has a positive impact on metadata
performance on Lustre with osd_zfs, plus some cleanup in the headers.

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527
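
A toy illustration of that fast path (not the Lustre or ZFS code): once the consumer has seen a callback for a high transaction number, callbacks for anything lower return immediately, so delivering commit callbacks newest-first keeps the expensive notification work to a minimum.

    #include <stdint.h>

    /* Highest transaction number the consumer has been told is committed. */
    static uint64_t last_committed;

    static void
    on_commit_callback(uint64_t transno)
    {
        if (transno <= last_committed)
            return; /* fast path: a newer commit was already reported */

        last_committed = transno;
        /*
         * Expensive work happens only here; in Lustre this is roughly
         * where ptlrpc_commit_replies() gets called.
         */
    }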

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Closes openzfs#6986
When the compressed ARC feature was added in commit d3c2ae1,
the method of reference counting in the ARC was modified.  As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have been fine, but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf.  Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list.  This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.
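
A toy model of why that access notification matters (not the ZFS implementation): without it, a buffer that is hit repeatedly is never promoted off the recently-used list and is evicted as if it had been used only once.

    /* Toy two-list cache; the real ARC MRU/MFU logic is far more involved. */
    enum cache_list { LIST_MRU, LIST_MFU };

    struct cached_buf {
        enum cache_list list;
        unsigned        hits;
    };

    /*
     * Analogue of the reinstated access hook: called whenever a cached
     * dbuf takes a hold on its buffer.
     */
    static void
    buf_access(struct cached_buf *b)
    {
        b->hits++;
        if (b->list == LIST_MRU && b->hits > 1)
            b->list = LIST_MFU; /* promote frequently used buffers */
    }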

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6171

Removing the reference to HDR_PRESCIENT_PREFETCH since it was introduced
with the 'Sequential scrub and resilvers' commit, which was reverted for
this branch.

Signed-off-by: kernelOfTruth <kerneloftruth@gmail.com>
codecov bot commented Jan 9, 2018

Codecov Report

Merging #7020 into master will decrease coverage by 0.22%.
The diff coverage is 94.04%.


@@            Coverage Diff             @@
##           master    #7020      +/-   ##
==========================================
- Coverage   75.44%   75.21%   -0.23%     
==========================================
  Files         296      296              
  Lines       95503    94820     -683     
==========================================
- Hits        72051    71323     -728     
- Misses      23452    23497      +45
Flag      Coverage Δ
#kernel   74.36% <89.79%> (-0.51%) ⬇️
#user     67.45% <87.97%> (-0.21%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

kernelOfTruth (Contributor, Author) commented Jan 10, 2018

Nope, this branch uses an outdated OpenZFS 8585 and is missing the fix for the use-after-free kernel panic.

Closing for now.
