[buildbot][pt2] zfs master 18.11.2017 (0.7.3) + openzfs 8585 + pr6568 + zfetch prefetch + ARC fixes (incl. commit callbacks from tail of list) #7020

Conversation

kernelOfTruth (Contributor) commented Jan 8, 2018

Testing the latest "working" (i.e. non-conflicting) branch that combines the zfetch / prefetch performance improvements with the new ARC hit rate fix (#6989), for bleeding-edge testing via the buildbots.

The 'Sequential scrub and resilvers' commit is left out.

The 'Fix ARC hit rate' commit was adjusted to drop the bits that depend on the reverted 'Sequential scrub and resilvers' commit.

This is a temporary branch (assuming it passes the buildbots) for those who focus on performance enhancements, don't want to miss out on the benefits of the zfetch and ZIL enhancement patches, and can live with the "old" scrub prefetch behavior.

Prakash Surya and others added 10 commits November 24, 2017 00:03
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

Problem
=======

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

 1. When there are outstanding ZIL blocks being written (i.e. there's
    already a "writer thread" in progress), then any new calls to
    zil_commit() will block waiting for the currently outstanding ZIL
    blocks to complete. The blocks written for each "writer thread" are
    coined a "batch", and there can only ever be a single "batch" being
    written at a time. When a batch is being written, any new ZIL
    transactions will have to wait for the next batch to be written,
    which won't occur until the current batch finishes.

    As a result, the underlying storage may not be used as efficiently
    as possible. While "new" threads enter zil_commit() and are blocked
    waiting for the next batch, it's possible that the underlying
    storage isn't fully utilized by the current batch of ZIL blocks. In
    that case, it'd be better to allow these new threads to generate
    (and issue) a new ZIL block, such that it could be serviced by the
    underlying storage concurrently with the other ZIL blocks that are
    being serviced.

 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
    to complete, prior to zil_commit() returning. The size of any given
    batch is proportional to the number of ZIL transactions in the queue
    at the time that the batch starts processing the queue, which
    doesn't occur until the previous batch completes. Thus, if there's a
    lot of transactions in the queue, the batch could be composed of
    many ZIL blocks, and each call to zil_commit() will have to wait for
    all of these writes to complete (even if the thread calling
    zil_commit() only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

 3. If a given batch takes longer to complete than normal, this results
    in larger batch sizes, which then take longer to complete and
    further drive up the latency of zil_commit(). This can occur for a
    number of reasons, including (but not limited to): transient changes
    in the workload, and storage latency irregularities.
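
To make the batching behavior concrete, below is a minimal user-space model of the pre-change flow (illustration only, with simplified locking and a hypothetical write_and_wait_for_whole_batch() stand-in; this is not the actual zil_commit() code). Every caller is funneled into a single batch, and no caller returns until the entire batch is on disk:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Toy model of the old zil_commit() batching; not the real ZFS code. */
    typedef struct zil_model {
        pthread_mutex_t lock;
        pthread_cond_t  cv;
        bool            writer_active;   /* only one "writer thread" at a time */
        uint64_t        next_batch;      /* batch currently being filled (starts at 1) */
        uint64_t        completed_batch; /* last batch fully on stable storage */
    } zil_model_t;

    /* Stand-in for issuing every ZIL block of a batch and waiting for all of them. */
    static void
    write_and_wait_for_whole_batch(uint64_t batch)
    {
        (void) batch;
    }

    static void
    zil_commit_old_model(zil_model_t *zl)
    {
        pthread_mutex_lock(&zl->lock);
        uint64_t my_batch = zl->next_batch;

        while (zl->completed_batch < my_batch) {
            if (!zl->writer_active) {
                /*
                 * Become the writer: issue the whole batch and wait for
                 * all of its blocks, even though this caller only cares
                 * about its own transaction(s).
                 */
                zl->writer_active = true;
                zl->next_batch++;
                pthread_mutex_unlock(&zl->lock);
                write_and_wait_for_whole_batch(my_batch);
                pthread_mutex_lock(&zl->lock);
                zl->completed_batch = my_batch;
                zl->writer_active = false;
                pthread_cond_broadcast(&zl->cv);
            } else {
                /* A batch is already in flight: block until it finishes. */
                pthread_cond_wait(&zl->cv, &zl->lock);
            }
        }
        pthread_mutex_unlock(&zl->lock);
    }

Note how a caller that arrives while a batch is in flight cannot even start issuing its own block until the whole previous batch completes, which is exactly the serialization described in (1) and (2) above.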

Solution
========

The solution attempted by this change has the following goals:

 1. no on-disk changes; maintain current on-disk format.
 2. modify the "batch size" to be equal to the "ZIL block size".
 3. allow new batches to be generated and issued to disk, while there's
    already batches being serviced by the disk.
 4. allow zil_commit() to wait for as few ZIL blocks as possible.
 5. use as few ZIL blocks as possible, for the same amount of ZIL
    transactions, without introducing significant latency to any
    individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new algorithm will allow the
following improvements:

 1. new ZIL blocks can be generated and issued while there are already
    outstanding ZIL blocks being serviced by the storage.
 2. the latency of zil_commit() should be proportional to the underlying
    storage latency, rather than the incoming synchronous workload.
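
As a rough user-space sketch of that direction (hypothetical names and simplified synchronization, not the actual ZFS code): each committer gets its own small waiter tied to the ZIL block holding its transactions, so it sleeps only until that block reaches stable storage, regardless of how much other ZIL traffic is in flight.

    #include <pthread.h>
    #include <stdbool.h>

    /* Toy model of a per-commit waiter; not the real ZFS structures. */
    typedef struct commit_waiter {
        pthread_mutex_t lock;
        pthread_cond_t  cv;
        bool            done; /* set once the waiter's ZIL block is on disk */
    } commit_waiter_t;

    /* Committing thread: its itx is already in an issued ZIL block. */
    static void
    commit_wait(commit_waiter_t *w)
    {
        pthread_mutex_lock(&w->lock);
        while (!w->done)
            pthread_cond_wait(&w->cv, &w->lock);
        pthread_mutex_unlock(&w->lock);
    }

    /* ZIL block write/flush completion path: wake only this block's waiters. */
    static void
    commit_signal(commit_waiter_t *w)
    {
        pthread_mutex_lock(&w->lock);
        w->done = true;
        pthread_cond_signal(&w->cv);
        pthread_mutex_unlock(&w->lock);
    }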

Porting Notes
=============

Due to the changes made in commit 119a394, the lifetime of an itx
structure differs from that in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).

To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:

  * A list of itxs was added to the lwb structure. This list contains
    all of the itxs that have been committed to the lwb, such that the
    callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
    after the data for the itxs is committed to disk.

  * A list of itxs was added on the stack of the zil_process_commit_list()
    function; the "nolwb_itxs" list. In some circumstances, an itx may
    not be committed to an lwb (e.g. if allocating the "next" ZIL block
    on disk fails), so this list is used to keep track of which itxs
    fall into this state, such that their callbacks can be called after
    the ZIL's writer pipeline is "stalled".

  * The logic to actually call the itx's callback was moved into the
    zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
    were effectively performing the same logic (i.e. if callback is
    non-null, call the callback), it seemed like useful code cleanup to
    consolidate this logic into a single function.
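
The last point amounts to a single, uniform destroy path; a minimal sketch of what that consolidation looks like (a stand-in type and a user-space free() instead of the kernel allocator):

    #include <stdlib.h>

    /* Illustration-only stand-in for the real itx_t. */
    typedef struct itx {
        void (*itx_callback)(void *);
        void  *itx_callback_data;
        /* ... log record payload elided ... */
    } itx_t;

    /*
     * All consumers destroy an itx through one function, so the callback,
     * if one was registered, is invoked exactly once before the free.
     */
    static void
    itx_destroy_sketch(itx_t *itx)
    {
        if (itx->itx_callback != NULL)
            itx->itx_callback(itx->itx_callback_data);
        free(itx);
    }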

Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:

  * The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
    removed from "trace_zil.h" as well.

  * Some of the zilog structure's fields were removed, which affected
    the tracepoint definitions of the structure.

  * New tracepoints had to be added for the following 3 new probes:
      * zil__process__commit__itx
      * zil__process__normal__itx
      * zil__commit__io__error

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: openzfs/openzfs@5d95a3a
This should address the following panic seen when running on a system
with 8 SSDs and 32 vCPUs:

    [ 1806.416279] VERIFY3(itx->itx_wr_state == WR_INDIRECT) failed (4294967295 == 0)
    [ 1806.416315] PANIC at zil.c:1506:zil_lwb_commit()
    [ 1806.416333] Showing stack for process 28365
    [ 1806.416336] CPU: 5 PID: 28365 Comm: fio Tainted: P           OE   4.4.0-93-generic openzfs#116-Ubuntu
    [ 1806.416337] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
    [ 1806.416338]  0000000000000286 16a1c125c5c4eec7 ffff887cd21678c0 ffffffff813f9f83
    [ 1806.416341]  ffffffffc136d454 00000000000005e2 ffff887cd21678d0 ffffffffc02d6fd2
    [ 1806.416342]  ffff887cd2167a58 ffffffffc02d70a5 ffff887b665d75d0 ffff887c00000030
    [ 1806.416344] Call Trace:
    [ 1806.416350]  [<ffffffff813f9f83>] dump_stack+0x63/0x90
    [ 1806.416362]  [<ffffffffc02d6fd2>] spl_dumpstack+0x42/0x50 [spl]
    [ 1806.416366]  [<ffffffffc02d70a5>] spl_panic+0xc5/0x100 [spl]
    [ 1806.416428]  [<ffffffffc127ab1e>] ? zio_rewrite+0x6e/0x70 [zfs]
    [ 1806.416469]  [<ffffffffc126bd90>] ? zil_lwb_write_open+0x520/0x520 [zfs]
    [ 1806.416509]  [<ffffffffc126b96f>] ? zil_lwb_write_open+0xff/0x520 [zfs]
    [ 1806.416551]  [<ffffffffc126bd90>] ? zil_lwb_write_open+0x520/0x520 [zfs]
    [ 1806.416591]  [<ffffffffc1271daa>] zil_commit+0x14ba/0x1d90 [zfs]
    [ 1806.416635]  [<ffffffffc1260d6d>] zfs_write+0xe0d/0x1020 [zfs]
    [ 1806.416677]  [<ffffffffc128b391>] zpl_write_common_iovec+0x91/0x130 [zfs]
    [ 1806.416717]  [<ffffffffc128b4e1>] zpl_iter_write+0xb1/0xf0 [zfs]
    [ 1806.416721]  [<ffffffff8120ec0b>] new_sync_write+0x9b/0xe0
    [ 1806.416723]  [<ffffffff8120ec76>] __vfs_write+0x26/0x40
    [ 1806.416724]  [<ffffffff8120f5f9>] vfs_write+0xa9/0x1a0
    [ 1806.416726]  [<ffffffff81210465>] SyS_pwrite64+0x95/0xb0
    [ 1806.416729]  [<ffffffff81039d03>] ? fpu__restore+0xf3/0x150
    [ 1806.416734]  [<ffffffff818431f2>] entry_SYSCALL_64_fastpath+0x16/0x71

The problem was that we'd add the itx to the incorrect lwb, causing the
itx to be free'd when it was still being used.

Specifically, we'd add the itx to the lwb that was passed into
zil_lwb_commit, but it's possible that this lwb will not be the one that
the itx is committed to. If there is insufficient space in that lwb for
this itx, that lwb will be issued to disk, and the itx will be committed
to the next lwb.

In this scenario, the itx would have been linked into the lwb's list for
the lwb that was just submitted. When this lwb completes, the itx will
be free'd. If the lwb happens to complete prior to the call to
zil_lwb_commit completing, the itx will be free'd while it's still being
actively used in zil_lwb_commit. I believe this issue becomes more
likely to occur when the underlying storage is "fast", and would help
explain why I've only managed to trigger this crash when using SSDs.
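
In other words, the ordering matters: the itx has to be linked onto whichever lwb it actually ends up in. A minimal sketch of the corrected flow, using hypothetical helper names rather than the real functions:

    /* Illustration-only stand-ins; not the real ZFS types or functions. */
    typedef struct itx itx_t;
    typedef struct lwb lwb_t;

    /*
     * Commits the itx into `lwb` if it fits; otherwise issues `lwb` to
     * disk, opens the next lwb, commits the itx there, and returns the
     * lwb that actually holds the itx.
     */
    lwb_t *lwb_commit_itx(lwb_t *lwb, itx_t *itx);

    /*
     * Appends the itx to an lwb's itx list, so its callback runs (and the
     * itx is freed) only after that lwb's data is on stable storage.
     */
    void lwb_add_itx(lwb_t *lwb, itx_t *itx);

    static lwb_t *
    commit_one_itx(lwb_t *open_lwb, itx_t *itx)
    {
        /*
         * Buggy order: calling lwb_add_itx(open_lwb, itx) first lets
         * `open_lwb` complete and free the itx while it may still be in
         * use, and `open_lwb` may not even be the lwb that holds it.
         */
        lwb_t *holder = lwb_commit_itx(open_lwb, itx);

        /* Fixed order: link the itx to the lwb that really holds it. */
        lwb_add_itx(holder, itx);
        return (holder);
    }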

Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
range_tree: access the space bounds
range_tree: walk in the range
range_tree: init/deinit
range_tree: range_tree_overlaps & range_tree_set

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
dmu_zfetch: initial range_tree based prefetcher
dmu_zfetch: range_tree based prefetcher with streams support
dmu_zfetch: more sensible initial prefetch ranges
dmu_zfetch: bail if the lock is taken
dmu_zfetch: avoid doing too much work on repeated access

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Our ZFS-backed Lustre MDT had soft lockups under heavy metadata
workloads while handling transaction callbacks from osd_zfs.

The problem is that ZFS is not taking advantage of the fast path in
Lustre's trans callback handling, where Lustre will skip the calls
to ptlrpc_commit_replies() when it already saw a higher transaction
number.

This patch corrects this; it also has a positive impact on metadata
performance on Lustre with osd_zfs, plus some cleanup in the headers.

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527
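
A toy illustration of that fast path (not the Lustre or ZFS code): once the consumer has seen a callback for a high transaction number, callbacks for anything lower return immediately, so delivering commit callbacks newest-first keeps the expensive notification work to a minimum.

    #include <stdint.h>

    /* Highest transaction number the consumer has been told is committed. */
    static uint64_t last_committed;

    static void
    on_commit_callback(uint64_t transno)
    {
        if (transno <= last_committed)
            return; /* fast path: a newer commit was already reported */

        last_committed = transno;
        /*
         * Expensive work happens only here; in Lustre this is roughly
         * where ptlrpc_commit_replies() gets called.
         */
    }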

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Closes openzfs#6986
When the compressed ARC feature was added in commit d3c2ae1,
the method of reference counting in the ARC was modified.  As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have been fine, but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf.  Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list.  This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.
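
A toy model of why that access notification matters (not the ZFS implementation): without it, a buffer that is hit repeatedly is never promoted off the recently-used list and is evicted as if it had been used only once.

    /* Toy two-list cache; the real ARC MRU/MFU logic is far more involved. */
    enum cache_list { LIST_MRU, LIST_MFU };

    struct cached_buf {
        enum cache_list list;
        unsigned        hits;
    };

    /*
     * Analogue of the reinstated access hook: called whenever a cached
     * dbuf takes a hold on its buffer.
     */
    static void
    buf_access(struct cached_buf *b)
    {
        b->hits++;
        if (b->list == LIST_MRU && b->hits > 1)
            b->list = LIST_MFU; /* promote frequently used buffers */
    }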

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6171

Removing the reference to HDR_PRESCIENT_PREFETCH since it was introduced
with the 'Sequential scrub and resilvers' commit, which was reverted for
this branch.

Signed-off-by: kernelOfTruth <kerneloftruth@gmail.com>
codecov bot commented Jan 9, 2018

Codecov Report

Merging #7020 into master will decrease coverage by 0.22%.
The diff coverage is 94.04%.


@@            Coverage Diff             @@
##           master    #7020      +/-   ##
==========================================
- Coverage   75.44%   75.21%   -0.23%     
==========================================
  Files         296      296              
  Lines       95503    94820     -683     
==========================================
- Hits        72051    71323     -728     
- Misses      23452    23497      +45
Flag      Coverage Δ
#kernel   74.36% <89.79%> (-0.51%) ⬇️
#user     67.45% <87.97%> (-0.21%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

kernelOfTruth (Contributor, Author) commented Jan 10, 2018

Nope, this branch uses an outdated OpenZFS 8585 and is missing the fix for the use-after-free kernel panic.

Closing for now.
