panic: metaslab_sync() running after spa_final_dirty_txg() #9186
sdimitro added a commit to sdimitro/zfs that referenced this issue on Aug 23, 2019
`metaslab_verify_weight_and_frag()` is a verification function and by the end of it there shouldn't be any side-effects. The function calls `metaslab_weight()` which in turn calls `metaslab_set_fragmentation()`. The latter can dirty an otherwise not-dirty metaslab for the next TXG and set `metaslab_condense_wanted` if the spacemaps were just upgraded (meaning we just enabled the SPACEMAP_HISTOGRAM feature through upgrade). This patch ensures that metaslabs like these are skipped, thus avoiding that problem. We could also get rid of that function completely, but I hesitated because it has caught issues during development of other features in the past. Fixing this issue should also help with most failures that issue openzfs#9186 has been causing for the test-bots recently.
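For illustration, a toy model of the guard this commit describes (the struct and function names below are invented stand-ins, not the upstream diff):

```c
#include <stdbool.h>

/* Toy stand-in for the metaslab state the verification cares about. */
typedef struct {
	bool sm_has_histogram;	/* space map already carries a SPACEMAP_HISTOGRAM */
} ms_model_t;

/*
 * Model of the early return: skip metaslabs whose space map was just
 * upgraded, because recomputing their weight would have side effects
 * (dirtying the metaslab and setting the condense-wanted flag).
 */
static void
verify_weight_and_frag_model(const ms_model_t *ms)
{
	if (!ms->sm_has_histogram)
		return;
	/* ... otherwise recompute weight/fragmentation and compare ... */
}

int
main(void)
{
	ms_model_t just_upgraded = { .sm_has_histogram = false };

	verify_weight_and_frag_model(&just_upgraded);	/* skipped: no side effects */
	return (0);
}
```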
sdimitro added a commit to sdimitro/zfs that referenced this issue on Aug 23, 2019
`metaslab_verify_weight_and_frag()` is a verification function and by the end of it there shouldn't be any side-effects. The function calls `metaslab_weight()` which in turn calls `metaslab_set_fragmentation()`. The latter can dirty an otherwise not-dirty metaslab for the next TXG and set `metaslab_condense_wanted` if the spacemaps were just upgraded (meaning we just enabled the SPACEMAP_HISTOGRAM feature through upgrade). This patch ensures that metaslabs like these are skipped, thus avoiding that problem. We could also get rid of that function completely, but I hesitated because it has caught issues during development of other features in the past. Fixing this issue should also help with most failures that issue openzfs#9186 has been causing for the test-bots recently. Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue on Aug 27, 2019
`metaslab_verify_weight_and_frag()` is a verification function and by the end of it there shouldn't be any side-effects. The function calls `metaslab_weight()` which in turn calls `metaslab_set_fragmentation()`. The latter can dirty an otherwise not-dirty metaslab for the next TXG and set `metaslab_condense_wanted` if the spacemaps were just upgraded (meaning we just enabled the SPACEMAP_HISTOGRAM feature through upgrade). This patch ensures that metaslabs like these are skipped, thus avoiding that problem. We could also get rid of that function completely, but I hesitated because it has caught issues during development of other features in the past. Fixing this issue should also help with most failures that issue openzfs#9186 has been causing for the test-bots recently. Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
behlendorf added a commit to behlendorf/zfs that referenced this issue on Aug 28, 2019
Until issues openzfs#9185 and openzfs#9186 have been resolved the following zpool upgrade tests are being disabled to prevent CI failures. zpool_upgrade_002_pos, zpool_upgrade_003_pos, zpool_upgrade_004_pos, zpool_upgrade_007_pos, zpool_upgrade_008_pos Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
behlendorf added a commit that referenced this issue on Aug 28, 2019
Until issues #9185 and #9186 have been resolved the following zpool upgrade tests are being disabled to prevent CI failures. zpool_upgrade_002_pos, zpool_upgrade_003_pos, zpool_upgrade_004_pos, zpool_upgrade_007_pos, zpool_upgrade_008_pos Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #9185 Issue #9186 Closes #9225
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue on Aug 29, 2019
`metaslab_verify_weight_and_frag()` is a verification function and by the end of it there shouldn't be any side-effects. The function calls `metaslab_weight()` which in turn calls `metaslab_set_fragmentation()`. The latter can dirty an otherwise not-dirty metaslab for the next TXG and set `metaslab_condense_wanted` if the spacemaps were just upgraded (meaning we just enabled the SPACEMAP_HISTOGRAM feature through upgrade). This patch ensures that metaslabs like these are skipped, thus avoiding that problem. We could also get rid of that function completely, but I hesitated because it has caught issues during development of other features in the past. Fixing this issue should also help with most failures that issue openzfs#9186 has been causing for the test-bots recently. Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue on Aug 29, 2019
`metaslab_verify_weight_and_frag()` is a verification function and by the end of it there shouldn't be any side-effects. The function calls `metaslab_weight()` which in turn calls `metaslab_set_fragmentation()`. The latter can dirty an otherwise not-dirty metaslab for the next TXG and set `metaslab_condense_wanted` if the spacemaps were just upgraded (meaning we just enabled the SPACEMAP_HISTOGRAM feature through upgrade). This patch ensures that metaslabs like these are skipped, thus avoiding that problem. We could also get rid of that function completely, but I hesitated because it has caught issues during development of other features in the past. Fixing this issue should also help with most failures that issue openzfs#9186 has been causing for the test-bots recently. Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue on Dec 24, 2019
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being exported, we can enter a situation that causes a kernel panic. Any metaslabs that are loaded during the final dirty txg and haven't already been condensed will cause metaslab_sync to proceed after the final dirty txg so that the condense can be performed, which there are assertions to prevent. Because of the nature of this issue, there are a number of ways we can enter this state. Rather than try to prevent each of them one by one, potentially missing some edge cases, we instead cut it off at the point of intersection; by preventing metaslab_sync from proceeding if it would only do so to perform a condense and we're past the final dirty txg, we preserve the utility of the existing asserts while preventing this particular issue. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes openzfs#9185 Closes openzfs#9186 Closes openzfs#9231 Closes openzfs#9253
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue on Dec 27, 2019
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being exported, we can enter a situation that causes a kernel panic. Any metaslabs that are loaded during the final dirty txg and haven't already been condensed will cause metaslab_sync to proceed after the final dirty txg so that the condense can be performed, which there are assertions to prevent. Because of the nature of this issue, there are a number of ways we can enter this state. Rather than try to prevent each of them one by one, potentially missing some edge cases, we instead cut it off at the point of intersection; by preventing metaslab_sync from proceeding if it would only do so to perform a condense and we're past the final dirty txg, we preserve the utility of the existing asserts while preventing this particular issue. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes openzfs#9185 Closes openzfs#9186 Closes openzfs#9231 Closes openzfs#9253
tonyhutter pushed a commit that referenced this issue on Jan 23, 2020
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being exported, we can enter a situation that causes a kernel panic. Any metaslabs that are loaded during the final dirty txg and haven't already been condensed will cause metaslab_sync to proceed after the final dirty txg so that the condense can be performed, which there are assertions to prevent. Because of the nature of this issue, there are a number of ways we can enter this state. Rather than try to prevent each of them one by one, potentially missing some edge cases, we instead cut it off at the point of intersection; by preventing metaslab_sync from proceeding if it would only do so to perform a condense and we're past the final dirty txg, we preserve the utility of the existing asserts while preventing this particular issue. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9185 Closes #9186 Closes #9231 Closes #9253
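To make the shape of the fix concrete, here is a small self-contained model of the decision it describes (invented names; it models only the guard logic, not the real `metaslab_sync()`):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for the metaslab state that drives the decision. */
typedef struct {
	bool loaded;			/* metaslab is loaded in memory */
	bool condense_wanted;		/* a condense has been requested */
	bool has_allocs_or_frees;	/* real alloc/free work exists for this txg */
} ms_model_t;

/*
 * Should metaslab_sync-style work proceed for this txg?  Before the fix
 * a loaded metaslab with condense_wanted was let through unconditionally;
 * the fix additionally requires that we have not passed the pool's final
 * dirty txg, so a pure condense can no longer run "too late".
 */
static bool
should_sync(const ms_model_t *ms, uint64_t txg, uint64_t final_dirty_txg)
{
	if (ms->has_allocs_or_frees)
		return (true);
	return (ms->loaded && ms->condense_wanted && txg <= final_dirty_txg);
}

int
main(void)
{
	ms_model_t ms = { .loaded = true, .condense_wanted = true,
	    .has_allocs_or_frees = false };

	/* The scenario from this issue: txg 60 with a final dirty txg of 59. */
	printf("txg 60, final dirty 59: %s\n",
	    should_sync(&ms, 60, 59) ? "proceed" : "skip");
	printf("txg 59, final dirty 59: %s\n",
	    should_sync(&ms, 59, 59) ? "proceed" : "skip");
	return (0);
}
```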
allanjude pushed a commit to KlaraSystems/zfs that referenced this issue on Apr 28, 2020
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being exported, we can enter a situation that causes a kernel panic. Any metaslabs that are loaded during the final dirty txg and haven't already been condensed will cause metaslab_sync to proceed after the final dirty txg so that the condense can be performed, which there are assertions to prevent. Because of the nature of this issue, there are a number of ways we can enter this state. Rather than try to prevent each of them one by one, potentially missing some edge cases, we instead cut it off at the point of intersection; by preventing metaslab_sync from proceeding if it would only do so to perform a condense and we're past the final dirty txg, we preserve the utility of the existing asserts while preventing this particular issue. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes openzfs#9185 Closes openzfs#9186 Closes openzfs#9231 Closes openzfs#9253
Paul Dagnelie hit the following assertion in `metaslab_sync()` while running `zpool_upgrade_002` from the test suite:

The current TXG was 60 and `spa_final_dirty_txg()` had returned 59.
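The assertion output itself is not reproduced in this report; as a toy model of the invariant involved (the real check in `metaslab_sync()` is a VERIFY-style macro and may read differently):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model: syncing a metaslab for a txg past the pool's final dirty
 * txg violates the invariant.  In this report txg was 60 while the
 * final dirty txg was 59, so the check fired.
 */
static void
metaslab_sync_model(uint64_t txg, uint64_t final_dirty_txg)
{
	assert(txg <= final_dirty_txg);
	/* ... space map updates for this txg would follow ... */
}

int
main(void)
{
	metaslab_sync_model(59, 59);	/* allowed: at the final dirty txg */
	metaslab_sync_model(60, 59);	/* the reported condition: aborts */
	return (0);
}
```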
Looking at the `zfs_dbgmsg` output we'd see the following relevant data about the pool in question:

The metaslab that `metaslab_sync()` panicked on was `{ms_id: 0, vdev_id 1}` and there are a few interesting things about it. First of all, the metaslab was loaded at some point between TXG 59 and 60, as we can see from the above message. In addition, we can also see that the metaslab went through `metaslab_set_fragmentation()` twice right before, in TXG 57 and 58. Finally, what is also of interest, but could just as well be a coincidence, is that a total of 3 metaslabs were all loaded at the same time, TXG 59 (including the one in question).

Other interesting information is that the `txg_sync_thread()`/`spa_sync()` had not been notified to stop by the pool destroy/`spa_export_common()`, as we can see from the relevant stack:

Relevant code snippet from `spa_unload()`:

My theory is that the following happened:
[1] At TXG 57, for whatever reason, our metaslab went through `metaslab_weight()->metaslab_set_fragmentation()` and went into the special-case code in that function that sets `metaslab_condense_wanted` and dirties the metaslab, because it had not been used before the `SPACEMAP_HISTOGRAM` feature was enabled:
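The referenced snippet is not reproduced here; the following toy model (invented names) sketches the side effect being described, namely requesting a condense and dirtying the metaslab when its space map predates the histogram feature:

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAG_INVALID UINT64_MAX		/* model of "no usable histogram yet" */

/* Toy stand-in for the metaslab state touched by the special case. */
typedef struct {
	bool     sm_has_histogram;	/* space map already has a histogram? */
	bool     condense_wanted;
	bool     dirty_next_txg;	/* models dirtying the metaslab for txg + 1 */
	uint64_t fragmentation;
} ms_model_t;

/*
 * Model of the upgrade path in metaslab_set_fragmentation(): if the space
 * map has no histogram yet, request a condense and dirty the metaslab for
 * the next txg so its space map gets rewritten with one.  This is the side
 * effect that ends up dirtying the metaslab at TXG 57 and 58.
 */
static void
set_fragmentation_model(ms_model_t *ms, bool pool_writeable)
{
	if (!ms->sm_has_histogram) {
		if (pool_writeable) {
			ms->condense_wanted = true;
			ms->dirty_next_txg = true;
		}
		ms->fragmentation = FRAG_INVALID;
		return;
	}
	/* ... otherwise derive fragmentation from the histogram ... */
}

int
main(void)
{
	ms_model_t ms = { .sm_has_histogram = false };

	set_fragmentation_model(&ms, true);	/* pool is writeable */
	return (ms.condense_wanted && ms.dirty_next_txg ? 0 : 1);
}
```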
[2] Unfortunately, the metaslab was not loaded, so it didn't get condensed in the next `metaslab_sync()` at TXG 58 (see the first code snippet of `metaslab_sync()` above). Yet it went through `metaslab_weight()`/`metaslab_set_fragmentation()` again, as we can see in the `zfs_dbgmsg` output, which means that it was dirty for TXG 59 too.
[3] Then at TXG 59 we went through `vdev_sync_done()->metaslab_sync_reassess()->metaslab_group_preload()` and we ended up preloading that metaslab because `metaslab_condense_wanted` was set:
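The preload snippet is likewise not reproduced; a toy model (invented names) of why a condense-wanted metaslab is preloaded even past the preload limit:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of metaslab_group_preload()'s selection: normally only the
 * first `preload_limit` metaslabs get preloaded, but a metaslab with a
 * pending condense request is preloaded regardless of the limit, which
 * is how the metaslab in question got loaded at TXG 59.
 */
static bool
should_preload(uint64_t already_queued, uint64_t preload_limit,
    bool condense_wanted)
{
	return (already_queued < preload_limit || condense_wanted);
}

int
main(void)
{
	/* Over the limit, but condense is wanted: still preloaded. */
	return (should_preload(10, 10, true) ? 0 : 1);
}
```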
Thus the metaslab was loaded at TXG 59 and most probably dirtied for TXG 60, which caused it to go through `metaslab_sync()` and trigger the panic.

There are multiple things in this bug:

[1] In general, the `spa_final_txg` mechanism is hacky and probably needs to be rethought together with its relevant assertions.
[2] Looking at the potential codepaths that `metaslab_set_fragmentation()` can be called from, I saw that a common one can be the verify function `metaslab_verify_weight_and_frag()`. This function has helped us find bugs in the past, but now I realize that it can also have unexpected side-effects, like dirtying a metaslab if we just upgraded the pool. I issued #9185 for this.

[3] Depending on how we deal with (1), and assuming that my above theory is correct, we may also want to rethink the situation where preloading is triggered from `spa_sync()`.
I was a bit surprised that we don't have any explicit checks or assertions about preloading happening when the pool is being destroyed (e.g. the spa_pool_state was literally set to `POOL_DESTROYED`).
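As a purely illustrative sketch of the kind of explicit check that last point is asking for (hypothetical; nothing here is claimed to exist upstream):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { POOL_ACTIVE_MODEL, POOL_DESTROYED_MODEL } pool_state_model_t;

/*
 * Hypothetical guard before dispatching metaslab preloads: refuse to
 * queue preload work once the pool is being destroyed/exported or we
 * are past the final dirty txg.
 */
static void
preload_dispatch_model(pool_state_model_t state, bool past_final_dirty_txg)
{
	assert(state != POOL_DESTROYED_MODEL);
	assert(!past_final_dirty_txg);
	/* ... dispatch the preload task here ... */
}

int
main(void)
{
	preload_dispatch_model(POOL_ACTIVE_MODEL, false);	/* fine */
	return (0);
}
```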