panic: metaslab_sync() running after spa_final_dirty_txg() #9186

Closed
sdimitro opened this issue Aug 20, 2019 · 0 comments

Paul Dagnelie hit the following assertion in metaslab_sync() while running zpool_upgrade_002 from the test suite:

void
metaslab_sync(metaslab_t *msp, uint64_t txg)
{
<... cropped ..>
	/*
	 * Normally, we don't want to process a metaslab if there are no
	 * allocations or frees to perform. However, if the metaslab is being
	 * forced to condense and it's loaded, we need to let it through.
	 */
	if (range_tree_is_empty(alloctree) &&
	    range_tree_is_empty(msp->ms_freeing) &&
	    range_tree_is_empty(msp->ms_checkpointing) &&
	    !(msp->ms_loaded && msp->ms_condense_wanted))
		return;

	VERIFY(txg <= spa_final_dirty_txg(spa)); // <---- This one
<... cropped ..>

The current TXG was 60 and spa_final_dirty_txg() had returned 59.
Looking at the zfs_dbgmsg output, we see the following relevant data about the pool in question:

<... cropped ...>
spa_history.c:319:spa_history_log_sync(): txg 32 set feature@userobj_accounting=enabled
spa_history.c:319:spa_history_log_sync(): txg 33 set feature@encryption=enabled
spa_history.c:319:spa_history_log_sync(): txg 34 set feature@project_quota=enabled
spa_history.c:319:spa_history_log_sync(): txg 35 set feature@device_removal=enabled
spa_history.c:319:spa_history_log_sync(): txg 36 set feature@obsolete_counts=enabled
spa_history.c:319:spa_history_log_sync(): txg 37 set feature@zpool_checkpoint=enabled
spa_history.c:319:spa_history_log_sync(): txg 38 set feature@spacemap_v2=enabled
spa_history.c:319:spa_history_log_sync(): txg 39 set feature@allocation_classes=enabled
spa_history.c:319:spa_history_log_sync(): txg 40 set feature@resilver_defer=enabled
spa_history.c:319:spa_history_log_sync(): txg 41 set feature@bookmark_v2=enabled
spa_history.c:319:spa_history_log_sync(): txg 42 set feature@redaction_bookmarks=enabled
spa_history.c:319:spa_history_log_sync(): txg 43 set feature@redacted_datasets=enabled
spa_history.c:319:spa_history_log_sync(): txg 44 set feature@bookmark_written=enabled
spa_history.c:319:spa_history_log_sync(): txg 45 set feature@log_spacemap=enabled
spa_history.c:306:spa_history_log_sync(): command: zpool upgrade pool-v3stripe
metaslab.c:3814:metaslab_flush(): flushing: txg 46, spa pool-v3stripe, vdev_id 0, ms_id 1, unflushed_allocs 7680, unflushed_frees 9216, appended 64 bytes
spa_history.c:306:spa_history_log_sync(): command: zpool destroy -f pool-v3stripe
metaslab.c:3814:metaslab_flush(): flushing: txg 50, spa pool-v3stripe, vdev_id 0, ms_id 2, unflushed_allocs 80896, unflushed_frees 18432, appended 80 bytes
metaslab.c:2850:metaslab_set_fragmentation(): txg 57, requesting force condense: ms_id 0, vdev_id 1
metaslab.c:2850:metaslab_set_fragmentation(): txg 58, requesting force condense: ms_id 0, vdev_id 1
spa_misc.c:409:spa_load_note(): spa_load(pool-v3stripe, config trusted): UNLOADING
metaslab.c:3814:metaslab_flush(): flushing: txg 59, spa pool-v3stripe, vdev_id 0, ms_id 3, unflushed_allocs 12288, unflushed_frees 23040, appended 96 bytes
metaslab.c:2422:metaslab_load_impl(): loading: txg 59, spa pool-v3stripe, vdev_id 1, ms_id 0, smp_length 136, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, loading_time 0 ms, ms_max_size 524288, max size error 0
metaslab.c:2422:metaslab_load_impl(): loading: txg 59, spa pool-v3stripe, vdev_id 2, ms_id 8, smp_length 0, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, loading_time 0 ms, ms_max_size 524288, max size error 524288
metaslab.c:2422:metaslab_load_impl(): loading: txg 59, spa pool-v3stripe, vdev_id 0, ms_id 8, smp_length 0, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, loading_time 0 ms, ms_max_size 524288, max size error 524288

The metaslab on which metaslab_sync() panicked was {ms_id: 0, vdev_id: 1}, and there are a few interesting things about it. First, the metaslab was loaded at some point between TXG 59 and 60, as we can see from the messages above. In addition, it went through metaslab_set_fragmentation() twice right before that, in TXG 57 and 58. Finally, and this may just be a coincidence, a total of 3 metaslabs were loaded at the same time in TXG 59 (including the one in question).

Also of interest is that txg_sync_thread()/spa_sync() had not yet been told to stop by the pool destroy (spa_export_common()), as we can see from the relevant stack:

502      UNINTERRUPTABLE                    1
         context_switch+0xeb
         __schedule+0x291
         schedule+0x2c
         io_schedule+0x16
         cv_wait_common+0x105 [spl]
         __cv_wait_io+0x18 [spl]
         txg_wait_synced_impl+0x146 [zfs]
         txg_wait_synced+0x10 [zfs]
         vdev_initialize_stop_all+0x80 [zfs]
         spa_unload+0x8f [zfs]
         spa_export_common+0x1fa [zfs]
         spa_destroy+0x1a [zfs]
         zfs_ioc_pool_destroy+0x1a [zfs]
         zfsdev_ioctl+0x619 [zfs]
         vfs_ioctl+0x1c
         do_vfs_ioctl+0xa8
         SYSC_ioctl+0x5a
         SyS_ioctl+0x79
         do_syscall_64+0x73
         entry_SYSCALL_64

Relevant code snippet from spa_unload():

	if (spa->spa_root_vdev) {
		vdev_t *root_vdev = spa->spa_root_vdev;
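		/*
		 * This is where the stack above is blocked:
		 * vdev_initialize_stop_all() waits in txg_wait_synced(),
		 * i.e. we have not yet reached txg_sync_stop() below, so
		 * spa_sync() keeps running while the export/destroy is in
		 * progress.
		 */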
		vdev_initialize_stop_all(root_vdev, VDEV_INITIALIZE_ACTIVE);
		vdev_trim_stop_all(root_vdev, VDEV_TRIM_ACTIVE);
		vdev_autotrim_stop_all(spa);
	}

	/*
	 * Stop syncing.
	 */
	if (spa->spa_sync_on) {
		txg_sync_stop(spa->spa_dsl_pool);
		spa->spa_sync_on = B_FALSE;
	}

My theory is that the following happened:
[1] At TXG 57, for whatever reason, our metaslab went through metaslab_weight()->metaslab_set_fragmentation() and hit the special-case code in that function which sets ms_condense_wanted and dirties the metaslab because its space map had not been upgraded since the SPACEMAP_HISTOGRAM feature was enabled:

	/*
	 * If this metaslab's space map has not been upgraded, flag it
	 * so that we upgrade next time we encounter it.
	 */
	if (msp->ms_sm->sm_dbuf->db_size != sizeof (space_map_phys_t)) {
		uint64_t txg = spa_syncing_txg(spa);
		vdev_t *vd = msp->ms_group->mg_vd;

		/*
		 * If we've reached the final dirty txg, then we must
		 * be shutting down the pool. We don't want to dirty
		 * any data past this point so skip setting the condense
		 * flag. We can retry this action the next time the pool
		 * is imported.
		 */
		if (spa_writeable(spa) && txg < spa_final_dirty_txg(spa)) {
			msp->ms_condense_wanted = B_TRUE;
			vdev_dirty(vd, VDD_METASLAB, msp, txg + 1);
			zfs_dbgmsg("txg %llu, requesting force condense: "
			    "ms_id %llu, vdev_id %llu", txg, msp->ms_id,
			    vd->vdev_id);
		}
		msp->ms_fragmentation = ZFS_FRAG_INVALID;
		return;
	}

[2] Unfortunately, the metaslab was not loaded, so it did not get condensed in the next metaslab_sync() at TXG 58 (see the first metaslab_sync() code snippet above). Yet it went through metaslab_weight()/metaslab_set_fragmentation() again, as we can see in the zfs_dbgmsg output, which means it was dirty for TXG 59 too.
[3] Then at TXG 59 we went through vdev_sync_done()->metaslab_sync_reassess()->metaslab_group_preload() and ended up preloading that metaslab because ms_condense_wanted was set:

static void
metaslab_group_preload(metaslab_group_t *mg)
{
	avl_tree_t *t = &mg->mg_metaslab_tree;
<... cropped ...>
	/*
	 * Load the next potential metaslabs
	 */
	for (msp = avl_first(t); msp != NULL; msp = AVL_NEXT(t, msp)) {
		ASSERT3P(msp->ms_group, ==, mg);

		/*
		 * We preload only the maximum number of metaslabs specified
		 * by metaslab_preload_limit. If a metaslab is being forced
		 * to condense then we preload it too. This will ensure
		 * that force condensing happens in the next txg.
		 */
		if (++m > metaslab_preload_limit && !msp->ms_condense_wanted) {
			continue;
		}

		VERIFY(taskq_dispatch(mg->mg_taskq, metaslab_preload,
		    msp, TQ_SLEEP) != TASKQID_INVALID);
	}
	mutex_exit(&mg->mg_lock);
}

Thus the metaslab was loaded at TXG 59 and most probably dirtied for TXG 60, which caused it to go through metaslab_sync() and trigger the panic.

There are multiple things in this bug:
[1] In general the spa_final_txg mechanism is hacky and probably needs to be rethought together with its relevant assertions.
[2] Looking at the potential code paths that can call metaslab_set_fragmentation(), I saw that a common one is the verification function metaslab_verify_weight_and_frag(). This function has helped us find bugs in the past, but I now realize it can also have unexpected side effects, like dirtying a metaslab if we just upgraded the pool. I filed #9185 for this.
[3] Depending on how we deal with (1), and assuming that my theory above is correct, we may also want to rethink the situation where preloading is triggered from spa_sync(). I was a bit surprised that we don't have any explicit checks or assertions against preloading while the pool is being destroyed (e.g. the pool's state had literally been set to POOL_STATE_DESTROYED); a hypothetical sketch of such a guard follows below.
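
To illustrate, here is a purely hypothetical sketch of what such a guard could look like at the top of metaslab_group_preload(); nothing like it exists in the code today, and the exact predicate (pool state vs. spa_final_txg vs. spa_final_dirty_txg) would need more thought:

static void
metaslab_group_preload(metaslab_group_t *mg)
{
	spa_t *spa = mg->mg_vd->vdev_spa;

	/*
	 * Hypothetical guard: once the pool has reached its final dirty
	 * txg (export/destroy in progress), preloading metaslabs can only
	 * end up dirtying them past spa_final_dirty_txg(), as happened in
	 * this issue, so bail out early.
	 */
	if (spa_syncing_txg(spa) >= spa_final_dirty_txg(spa))
		return;

	/* ... existing preload loop as shown above ... */
}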

sdimitro added a commit to sdimitro/zfs that referenced this issue Aug 23, 2019
`metaslab_verify_weight_and_frag()` is a verification function and
by the end of it there shouldn't be any side-effects.

The function calls `metaslab_weight()` which in turn calls
`metaslab_set_fragmentation()`. The latter can dirty an otherwise
not-dirty metaslab for the next TXG and set `metaslab_condense_wanted`
if the spacemaps were just upgraded (meaning we just enabled the
SPACEMAP_HISTOGRAM feature through upgrade).

This patch ensures that metaslabs like these are skipped, thus
avoiding that problem. We could also get rid of that function
completely, but I hesitated because it has caught issues during
the development of other features in the past.

Fixing this issue should also help with most of the failures that
issue openzfs#9186 has been causing to the test-bots recently.

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
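
The patch itself is not quoted in this timeline, so here is a minimal sketch of the approach described above, assuming it reuses the same db_size check that metaslab_set_fragmentation() uses to detect a not-yet-upgraded space map (the merged diff may differ in detail):

static void
metaslab_verify_weight_and_frag(metaslab_t *msp)
{
	/* ... existing early returns of the verification function ... */

	/*
	 * If this metaslab's space map still has the pre-SPACEMAP_HISTOGRAM
	 * layout, calling metaslab_weight() below would go through
	 * metaslab_set_fragmentation(), request a force condense, and dirty
	 * the metaslab -- side effects a verification function must not
	 * have. Skip such metaslabs.
	 */
	if (msp->ms_sm != NULL &&
	    msp->ms_sm->sm_dbuf->db_size != sizeof (space_map_phys_t))
		return;

	/*
	 * ... recompute the weight and fragmentation and compare them
	 * against the metaslab's current values ...
	 */
}
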
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Aug 27, 2019
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 28, 2019
behlendorf added a commit that referenced this issue Aug 28, 2019
Until issues #9185 and #9186 have been resolved the following zpool
upgrade tests are being disabled to prevent CI failures.

  zpool_upgrade_002_pos,
  zpool_upgrade_003_pos,
  zpool_upgrade_004_pos,
  zpool_upgrade_007_pos,
  zpool_upgrade_008_pos

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #9185
Issue #9186
Closes #9225
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Aug 29, 2019
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Aug 29, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Dec 24, 2019
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#9185
Closes openzfs#9186
Closes openzfs#9231
Closes openzfs#9253
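
Given the metaslab_sync() snippet at the top of this issue, the approach described above amounts to extending the early-return condition so that a pending condense alone cannot push a metaslab past the final dirty txg; a sketch of that check (the merged change may differ in wording):

	/*
	 * Normally, we don't want to process a metaslab if there are no
	 * allocations or frees to perform. However, if the metaslab is being
	 * forced to condense and it's loaded, we need to let it through --
	 * but only while we may still dirty data, i.e. at or before
	 * spa_final_dirty_txg(). Past that point the condense is simply
	 * retried on the next import.
	 */
	if (range_tree_is_empty(alloctree) &&
	    range_tree_is_empty(msp->ms_freeing) &&
	    range_tree_is_empty(msp->ms_checkpointing) &&
	    !(msp->ms_loaded && msp->ms_condense_wanted &&
	    txg <= spa_final_dirty_txg(spa)))
		return;

	VERIFY(txg <= spa_final_dirty_txg(spa));
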
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Dec 27, 2019
tonyhutter pushed a commit that referenced this issue Jan 23, 2020
allanjude pushed a commit to KlaraSystems/zfs that referenced this issue Apr 28, 2020