panic: metaslab_sync() running after spa_final_dirty_txg() #9186

Closed
sdimitro opened this issue Aug 20, 2019 · 0 comments

Paul Dagnelie hit the following assertion in metaslab_sync() while running zpool_upgrade_002 from the test suite:

void
metaslab_sync(metaslab_t *msp, uint64_t txg)
{
<... cropped ..>
	/*
	 * Normally, we don't want to process a metaslab if there are no
	 * allocations or frees to perform. However, if the metaslab is being
	 * forced to condense and it's loaded, we need to let it through.
	 */
	if (range_tree_is_empty(alloctree) &&
	    range_tree_is_empty(msp->ms_freeing) &&
	    range_tree_is_empty(msp->ms_checkpointing) &&
	    !(msp->ms_loaded && msp->ms_condense_wanted))
		return;

	VERIFY(txg <= spa_final_dirty_txg(spa)); // <---- This one
<... cropped ..>

The current TXG was 60 and spa_final_dirty_txg() had returned 59.
Looking at the zfs_dbgmsg output, we see the following relevant data about the pool in question:

<... cropped ...>
spa_history.c:319:spa_history_log_sync(): txg 32 set feature@userobj_accounting=enabled
spa_history.c:319:spa_history_log_sync(): txg 33 set feature@encryption=enabled
spa_history.c:319:spa_history_log_sync(): txg 34 set feature@project_quota=enabled
spa_history.c:319:spa_history_log_sync(): txg 35 set feature@device_removal=enabled
spa_history.c:319:spa_history_log_sync(): txg 36 set feature@obsolete_counts=enabled
spa_history.c:319:spa_history_log_sync(): txg 37 set feature@zpool_checkpoint=enabled
spa_history.c:319:spa_history_log_sync(): txg 38 set feature@spacemap_v2=enabled
spa_history.c:319:spa_history_log_sync(): txg 39 set feature@allocation_classes=enabled
spa_history.c:319:spa_history_log_sync(): txg 40 set feature@resilver_defer=enabled
spa_history.c:319:spa_history_log_sync(): txg 41 set feature@bookmark_v2=enabled
spa_history.c:319:spa_history_log_sync(): txg 42 set feature@redaction_bookmarks=enabled
spa_history.c:319:spa_history_log_sync(): txg 43 set feature@redacted_datasets=enabled
spa_history.c:319:spa_history_log_sync(): txg 44 set feature@bookmark_written=enabled
spa_history.c:319:spa_history_log_sync(): txg 45 set feature@log_spacemap=enabled
spa_history.c:306:spa_history_log_sync(): command: zpool upgrade pool-v3stripe
metaslab.c:3814:metaslab_flush(): flushing: txg 46, spa pool-v3stripe, vdev_id 0, ms_id 1, unflushed_allocs 7680, unflushed_frees 9216, appended 64 bytes
spa_history.c:306:spa_history_log_sync(): command: zpool destroy -f pool-v3stripe
metaslab.c:3814:metaslab_flush(): flushing: txg 50, spa pool-v3stripe, vdev_id 0, ms_id 2, unflushed_allocs 80896, unflushed_frees 18432, appended 80 bytes
metaslab.c:2850:metaslab_set_fragmentation(): txg 57, requesting force condense: ms_id 0, vdev_id 1
metaslab.c:2850:metaslab_set_fragmentation(): txg 58, requesting force condense: ms_id 0, vdev_id 1
spa_misc.c:409:spa_load_note(): spa_load(pool-v3stripe, config trusted): UNLOADING
metaslab.c:3814:metaslab_flush(): flushing: txg 59, spa pool-v3stripe, vdev_id 0, ms_id 3, unflushed_allocs 12288, unflushed_frees 23040, appended 96 bytes
metaslab.c:2422:metaslab_load_impl(): loading: txg 59, spa pool-v3stripe, vdev_id 1, ms_id 0, smp_length 136, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, loading_time 0 ms, ms_max_size 524288, max size error 0
metaslab.c:2422:metaslab_load_impl(): loading: txg 59, spa pool-v3stripe, vdev_id 2, ms_id 8, smp_length 0, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, loading_time 0 ms, ms_max_size 524288, max size error 524288
metaslab.c:2422:metaslab_load_impl(): loading: txg 59, spa pool-v3stripe, vdev_id 0, ms_id 8, smp_length 0, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, loading_time 0 ms, ms_max_size 524288, max size error 524288

The metaslab on which metaslab_sync() panicked was {ms_id: 0, vdev_id: 1}, and there are a few interesting things about it. First, the metaslab was loaded at some point between TXG 59 and 60, as we can see from the messages above. In addition, it went through metaslab_set_fragmentation() twice right before that, in TXG 57 and 58. Finally, and this may just be a coincidence, a total of 3 metaslabs were loaded at the same time in TXG 59 (including the one in question).

Also of interest is that txg_sync_thread()/spa_sync() had not yet been told to stop by the pool destroy (spa_export_common()), as we can see from the relevant stack:

502      UNINTERRUPTABLE                    1
         context_switch+0xeb
         __schedule+0x291
         schedule+0x2c
         io_schedule+0x16
         cv_wait_common+0x105 [spl]
         __cv_wait_io+0x18 [spl]
         txg_wait_synced_impl+0x146 [zfs]
         txg_wait_synced+0x10 [zfs]
         vdev_initialize_stop_all+0x80 [zfs]
         spa_unload+0x8f [zfs]
         spa_export_common+0x1fa [zfs]
         spa_destroy+0x1a [zfs]
         zfs_ioc_pool_destroy+0x1a [zfs]
         zfsdev_ioctl+0x619 [zfs]
         vfs_ioctl+0x1c
         do_vfs_ioctl+0xa8
         SYSC_ioctl+0x5a
         SyS_ioctl+0x79
         do_syscall_64+0x73
         entry_SYSCALL_64

Relevant code snippet from spa_unload():

	if (spa->spa_root_vdev) {
		vdev_t *root_vdev = spa->spa_root_vdev;
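		/*
		 * This is where the stack above is blocked:
		 * vdev_initialize_stop_all() waits in txg_wait_synced(),
		 * i.e. we have not yet reached txg_sync_stop() below, so
		 * spa_sync() keeps running while the export/destroy is in
		 * progress.
		 */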
		vdev_initialize_stop_all(root_vdev, VDEV_INITIALIZE_ACTIVE);
		vdev_trim_stop_all(root_vdev, VDEV_TRIM_ACTIVE);
		vdev_autotrim_stop_all(spa);
	}

	/*
	 * Stop syncing.
	 */
	if (spa->spa_sync_on) {
		txg_sync_stop(spa->spa_dsl_pool);
		spa->spa_sync_on = B_FALSE;
	}

My theory is that the following happened:
[1] At TXG 57, for whatever reason, our metaslab went through metaslab_weight()->metaslab_set_fragmentation() and hit the special-case code in that function which sets ms_condense_wanted and dirties the metaslab because its space map had not been upgraded since the SPACEMAP_HISTOGRAM feature was enabled:

	/*
	 * If this metaslab's space map has not been upgraded, flag it
	 * so that we upgrade next time we encounter it.
	 */
	if (msp->ms_sm->sm_dbuf->db_size != sizeof (space_map_phys_t)) {
		uint64_t txg = spa_syncing_txg(spa);
		vdev_t *vd = msp->ms_group->mg_vd;

		/*
		 * If we've reached the final dirty txg, then we must
		 * be shutting down the pool. We don't want to dirty
		 * any data past this point so skip setting the condense
		 * flag. We can retry this action the next time the pool
		 * is imported.
		 */
		if (spa_writeable(spa) && txg < spa_final_dirty_txg(spa)) {
			msp->ms_condense_wanted = B_TRUE;
			vdev_dirty(vd, VDD_METASLAB, msp, txg + 1);
			zfs_dbgmsg("txg %llu, requesting force condense: "
			    "ms_id %llu, vdev_id %llu", txg, msp->ms_id,
			    vd->vdev_id);
		}
		msp->ms_fragmentation = ZFS_FRAG_INVALID;
		return;
	}

[2] Unfortunately, the metaslab was not loaded, so it did not get condensed in the next metaslab_sync() at TXG 58 (see the first metaslab_sync() code snippet above). Yet it went through metaslab_weight()/metaslab_set_fragmentation() again, as we can see in the zfs_dbgmsg output, which means it was dirty for TXG 59 too.
[3] Then at TXG 59 we went through vdev_sync_done()->metaslab_sync_reassess()->metaslab_group_preload() and ended up preloading that metaslab because ms_condense_wanted was set:

static void
metaslab_group_preload(metaslab_group_t *mg)
{
	avl_tree_t *t = &mg->mg_metaslab_tree;
<... cropped ...>
	/*
	 * Load the next potential metaslabs
	 */
	for (msp = avl_first(t); msp != NULL; msp = AVL_NEXT(t, msp)) {
		ASSERT3P(msp->ms_group, ==, mg);

		/*
		 * We preload only the maximum number of metaslabs specified
		 * by metaslab_preload_limit. If a metaslab is being forced
		 * to condense then we preload it too. This will ensure
		 * that force condensing happens in the next txg.
		 */
		if (++m > metaslab_preload_limit && !msp->ms_condense_wanted) {
			continue;
		}

		VERIFY(taskq_dispatch(mg->mg_taskq, metaslab_preload,
		    msp, TQ_SLEEP) != TASKQID_INVALID);
	}
	mutex_exit(&mg->mg_lock);
}

Thus the metaslab was loaded at TXG 59 and most probably dirtied for TXG 60, which caused it to go through metaslab_sync() and trigger the panic.

There are multiple things in this bug:
[1] In general the spa_final_txg mechanism is hacky and probably needs to be rethought together with its relevant assertions.
[2] Looking at the potential code paths that can call metaslab_set_fragmentation(), I saw that a common one is the verification function metaslab_verify_weight_and_frag(). This function has helped us find bugs in the past, but I now realize it can also have unexpected side effects, like dirtying a metaslab if we just upgraded the pool. I filed #9185 for this.
[3] Depending on how we deal with (1), and assuming that my theory above is correct, we may also want to rethink the situation where preloading is triggered from spa_sync(). I was a bit surprised that we don't have any explicit checks or assertions against preloading while the pool is being destroyed (e.g. the pool's state had literally been set to POOL_STATE_DESTROYED); a hypothetical sketch of such a guard follows below.
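
To illustrate, here is a purely hypothetical sketch of what such a guard could look like at the top of metaslab_group_preload(); nothing like it exists in the code today, and the exact predicate (pool state vs. spa_final_txg vs. spa_final_dirty_txg) would need more thought:

static void
metaslab_group_preload(metaslab_group_t *mg)
{
	spa_t *spa = mg->mg_vd->vdev_spa;

	/*
	 * Hypothetical guard: once the pool has reached its final dirty
	 * txg (export/destroy in progress), preloading metaslabs can only
	 * end up dirtying them past spa_final_dirty_txg(), as happened in
	 * this issue, so bail out early.
	 */
	if (spa_syncing_txg(spa) >= spa_final_dirty_txg(spa))
		return;

	/* ... existing preload loop as shown above ... */
}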

sdimitro added a commit to sdimitro/zfs that referenced this issue Aug 23, 2019
`metaslab_verify_weight_and_frag()` is a verification function and
by the end of it there shouldn't be any side-effects.

The function calls `metaslab_weight()` which in turn calls
`metaslab_set_fragmentation()`. The latter can dirty an otherwise
not-dirty metaslab for the next TXG and set `metaslab_condense_wanted`
if the spacemaps were just upgraded (meaning we just enabled the
SPACEMAP_HISTOGRAM feature through upgrade).

This patch ensures that metaslabs like these are skipped, thus
avoiding that problem. We could also get rid of that function
completely, but I hesitated because it has caught issues during
the development of other features in the past.

Fixing this issue should also help with most of the failures that
issue openzfs#9186 has been causing to the test-bots recently.

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
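
The patch itself is not quoted in this timeline, so here is a minimal sketch of the approach described above, assuming it reuses the same db_size check that metaslab_set_fragmentation() uses to detect a not-yet-upgraded space map (the merged diff may differ in detail):

static void
metaslab_verify_weight_and_frag(metaslab_t *msp)
{
	/* ... existing early returns of the verification function ... */

	/*
	 * If this metaslab's space map still has the pre-SPACEMAP_HISTOGRAM
	 * layout, calling metaslab_weight() below would go through
	 * metaslab_set_fragmentation(), request a force condense, and dirty
	 * the metaslab -- side effects a verification function must not
	 * have. Skip such metaslabs.
	 */
	if (msp->ms_sm != NULL &&
	    msp->ms_sm->sm_dbuf->db_size != sizeof (space_map_phys_t))
		return;

	/*
	 * ... recompute the weight and fragmentation and compare them
	 * against the metaslab's current values ...
	 */
}
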
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Aug 27, 2019
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 28, 2019
behlendorf added a commit that referenced this issue Aug 28, 2019
Until issues #9185 and #9186 have been resolved the following zpool
upgrade tests are being disabled to prevent CI failures.

  zpool_upgrade_002_pos,
  zpool_upgrade_003_pos,
  zpool_upgrade_004_pos,
  zpool_upgrade_007_pos,
  zpool_upgrade_008_pos

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #9185
Issue #9186
Closes #9225
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Aug 29, 2019
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Aug 29, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Dec 24, 2019
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#9185
Closes openzfs#9186
Closes openzfs#9231
Closes openzfs#9253
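
Given the metaslab_sync() snippet at the top of this issue, the approach described above amounts to extending the early-return condition so that a pending condense alone cannot push a metaslab past the final dirty txg; a sketch of that check (the merged change may differ in wording):

	/*
	 * Normally, we don't want to process a metaslab if there are no
	 * allocations or frees to perform. However, if the metaslab is being
	 * forced to condense and it's loaded, we need to let it through --
	 * but only while we may still dirty data, i.e. at or before
	 * spa_final_dirty_txg(). Past that point the condense is simply
	 * retried on the next import.
	 */
	if (range_tree_is_empty(alloctree) &&
	    range_tree_is_empty(msp->ms_freeing) &&
	    range_tree_is_empty(msp->ms_checkpointing) &&
	    !(msp->ms_loaded && msp->ms_condense_wanted &&
	    txg <= spa_final_dirty_txg(spa)))
		return;

	VERIFY(txg <= spa_final_dirty_txg(spa));
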
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Dec 27, 2019
tonyhutter pushed a commit that referenced this issue Jan 23, 2020
allanjude pushed a commit to KlaraSystems/zfs that referenced this issue Apr 28, 2020