Skip to content

Commit

Permalink
fix prefetching of indirect blocks while destroying
Browse files Browse the repository at this point in the history
When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`.  In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, does not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched.  In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential.  But if i/o latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow.  This can be
observed as `zpool get freeing` decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point.  It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.

This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point.  To do so, this function is made non-mutating
(the caller now zeros `td_resume`).

Note, this bug likely predates (was not introduced by) openzfs#11803.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
  • Loading branch information
ahrens authored and andrewc12 committed Mar 19, 2023
1 parent 1428f38 commit 4789ef9
Showing 1 changed file with 14 additions and 14 deletions.
28 changes: 14 additions & 14 deletions module/zfs/dmu_traverse.c
Original file line number Diff line number Diff line change
Expand Up @@ -154,7 +154,7 @@ typedef enum resume_skip {
* Otherwise returns RESUME_SKIP_NONE.
*/
static resume_skip_t
resume_skip_check(traverse_data_t *td, const dnode_phys_t *dnp,
resume_skip_check(const traverse_data_t *td, const dnode_phys_t *dnp,
const zbookmark_phys_t *zb)
{
if (td->td_resume != NULL && !ZB_IS_ZERO(td->td_resume)) {
Expand All @@ -165,12 +165,7 @@ resume_skip_check(traverse_data_t *td, const dnode_phys_t *dnp,
if (zbookmark_subtree_completed(dnp, zb, td->td_resume))
return (RESUME_SKIP_ALL);

/*
* If we found the block we're trying to resume from, zero
* the bookmark out to indicate that we have resumed.
*/
if (memcmp(zb, td->td_resume, sizeof (*zb)) == 0) {
memset(td->td_resume, 0, sizeof (*zb));
if (td->td_flags & TRAVERSE_POST)
return (RESUME_SKIP_CHILDREN);
}
Expand All @@ -182,7 +177,7 @@ resume_skip_check(traverse_data_t *td, const dnode_phys_t *dnp,
* Returns B_TRUE, if prefetch read is issued, otherwise B_FALSE.
*/
static boolean_t
traverse_prefetch_metadata(traverse_data_t *td,
traverse_prefetch_metadata(traverse_data_t *td, const dnode_phys_t *dnp,
const blkptr_t *bp, const zbookmark_phys_t *zb)
{
arc_flags_t flags = ARC_FLAG_NOWAIT | ARC_FLAG_PREFETCH |
Expand All @@ -192,11 +187,10 @@ traverse_prefetch_metadata(traverse_data_t *td,
if (!(td->td_flags & TRAVERSE_PREFETCH_METADATA))
return (B_FALSE);
/*
* If we are in the process of resuming, don't prefetch, because
* some children will not be needed (and in fact may have already
* been freed).
* If this bp is before the resume point, it may have already been
* freed.
*/
if (td->td_resume != NULL && !ZB_IS_ZERO(td->td_resume))
if (resume_skip_check(td, dnp, zb) != RESUME_SKIP_NONE)
return (B_FALSE);
if (BP_IS_HOLE(bp) || bp->blk_birth <= td->td_min_txg)
return (B_FALSE);
Expand Down Expand Up @@ -241,6 +235,12 @@ traverse_visitbp(traverse_data_t *td, const dnode_phys_t *dnp,
ASSERT(0);
}

if (td->td_resume != NULL &&
memcmp(zb, td->td_resume, sizeof (*zb)) == 0) {
/* found the resume point; no longer resuming */
memset(td->td_resume, 0, sizeof (*zb));
}

if (bp->blk_birth == 0) {
/*
* Since this block has a birth time of 0 it must be one of
Expand Down Expand Up @@ -344,7 +344,7 @@ traverse_visitbp(traverse_data_t *td, const dnode_phys_t *dnp,
SET_BOOKMARK(czb, zb->zb_objset,
zb->zb_object, zb->zb_level - 1,
zb->zb_blkid * epb + pidx);
if (traverse_prefetch_metadata(td,
if (traverse_prefetch_metadata(td, dnp,
&((blkptr_t *)buf->b_data)[pidx],
czb) == B_TRUE) {
prefetched++;
Expand Down Expand Up @@ -506,12 +506,12 @@ prefetch_dnode_metadata(traverse_data_t *td, const dnode_phys_t *dnp,

for (j = 0; j < dnp->dn_nblkptr; j++) {
SET_BOOKMARK(&czb, objset, object, dnp->dn_nlevels - 1, j);
traverse_prefetch_metadata(td, &dnp->dn_blkptr[j], &czb);
traverse_prefetch_metadata(td, dnp, &dnp->dn_blkptr[j], &czb);
}

if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) {
SET_BOOKMARK(&czb, objset, object, 0, DMU_SPILL_BLKID);
traverse_prefetch_metadata(td, DN_SPILL_BLKPTR(dnp), &czb);
traverse_prefetch_metadata(td, dnp, DN_SPILL_BLKPTR(dnp), &czb);
}
}

Expand Down

0 comments on commit 4789ef9

Please sign in to comment.