
ZTS: Fix add_nested_replacing_spare test case #7342

Merged: 1 commit into openzfs:master from the issue-7247 branch, Apr 4, 2018

Conversation

@loli10K (Contributor) commented Mar 25, 2018

Description

It seems 'zpool scrub' is not a good candidate to kick in a hot spare; use 'zpool reopen' instead.

Spurious failures are caused by a race condition: ZED replaces a faulted device with a spare after zfs_ioc_pool_scan() -> dsl_scan() has reopened the root vdev (which finds the leaf "faulted" by zinject) but before the scrub has actually started. dsl_scan_setup_check() then returns EBUSY, which results in the following failure:

21:52:46.94 NOTE: Starting ZED
21:52:46.95 SUCCESS: truncate -s 0 /var/tmp/zed/zed.debug.log
21:52:46.95 SUCCESS: eval zed -vF -d /var/tmp/zed -p /var/tmp/zed/zed.pid -P /var/tmp/constrained_path.T0sJ -s /var/tmp/zed/state 2>/var/tmp/zed/zed.log &
21:52:47.12 SUCCESS: zpool create testpool mirror /mnt/fault-dev /mnt/safe-dev1 /mnt/safe-dev2 /mnt/safe-dev3
21:52:47.20 SUCCESS: zpool add testpool spare /mnt/spare-dev1
21:52:47.21 Added handler 273 with the following properties:
21:52:47.21   pool: testpool
21:52:47.21   vdev: 3e41072b2fd21d1a
21:52:47.21 SUCCESS: zinject -d /mnt/fault-dev -e nxio -T all -f 100 testpool
21:52:47.32 cannot scrub testpool: currently resilvering
21:52:47.32 ERROR: zpool scrub testpool exited 1
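
For reference, the change amounts to activating the spare via a device reopen rather than a scrub. Below is a minimal sketch of the kind of change, assuming the standard ZTS log_must helper and hypothetical $TESTPOOL/$FAULT_DEV variables (the names used by the actual test may differ):

# Inject non-retryable I/O errors on the device we want to fault
log_must zinject -d $FAULT_DEV -e nxio -T all -f 100 $TESTPOOL

# Racy: dsl_scan_setup_check() can return EBUSY when ZED attaches the
# spare and starts a resilver before the scrub is set up
#log_must zpool scrub $TESTPOOL

# Instead, reopen the pool's vdevs so the faulted leaf is noticed and
# ZED can kick in the spare without competing with a scan setup
log_must zpool reopen $TESTPOOL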

Output of SystemTap tracing:

[execname  :pid   ] caller <-> probefunc
------------------------------------------------------------------
[vdev_open :20176 ] vdev_open -> vdev_set_state 7
[vdev_open :20177 ] vdev_open -> vdev_set_state 7
[vdev_open :20178 ] vdev_open -> vdev_set_state 7
[vdev_open :20179 ] vdev_open -> vdev_set_state 7
[vdev_open :20175 ] vdev_open -> vdev_set_state 7
[zpool     :20135 ] vdev_open -> vdev_set_state 7
[zpool     :20135 ] vdev_open -> vdev_set_state 7
[zpool     :20135 ] vdev_open -> vdev_set_state 7
[zpool     :20386 ] zfs_ioc_pool_scan -> dsl_scan
[zpool     :20386 ] dsl_scan -> vdev_reopen
[vdev_open :20388 ] vdev_open -> vdev_set_state 4 (VDEV_STATE_CANT_OPEN)
[vdev_open :20389 ] vdev_open -> vdev_set_state 7
[vdev_open :20390 ] vdev_open -> vdev_set_state 7
[vdev_open :20388 ] vdev_open -> vdev_set_state 7
[vdev_open :20387 ] vdev_open -> vdev_set_state 7
[vdev_open :20387 ] vdev_open -> vdev_set_state 6
[zpool     :20386 ] vdev_open -> vdev_set_state 7
[zpool     :20386 ] vdev_open -> vdev_set_state 6
[zpool     :20386 ] vdev_propagate_state -> vdev_set_state 6
[zpool     :20386 ] 0xffffffff8104f605 <- vdev_reopen 
[zed       :17565 ] zfs_ioc_vdev_attach -> spa_vdev_attach * ZED kicks in the spare *
[vdev_open :20397 ] vdev_open -> vdev_set_state 7
[zed       :17565 ] vdev_open -> vdev_set_state 7
[zed       :17565 ] vdev_propagate_state -> vdev_set_state 6
[zed       :17565 ] vdev_propagate_state -> vdev_set_state 6
[zed       :17565 ] vdev_propagate_state -> vdev_set_state 6
[zed       :17565 ] vdev_propagate_state -> vdev_set_state 6
[zed       :17565 ] vdev_propagate_state -> vdev_set_state 6
[zed       :17565 ] spa_vdev_attach -> dsl_resilver_restart * resilver starts after the spare is attached *
[txg_sync  :20195 ] dsl_scan_sync -> dsl_scan_setup_sync
[txg_sync  :20195 ] spa_sync <- dsl_scan_setup_sync 
[zpool     :20386 ] dsl_sync_task -> dsl_scan_setup_check
[zpool     :20386 ] dsl_scan <- dsl_scan_setup_check return=16 (EBUSY)
[zpool     :20386 ] zfsdev_ioctl <- dsl_scan return=16
[txg_sync  :20195 ] dsl_scan_done ->spa_async_request 8
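
For context, a trace like the one above can be collected with SystemTap probes on the zfs kernel module. The command below is only an illustrative sketch of that approach, not the exact script used for this PR; it assumes the module carries debuginfo so that the $state argument and the $return value are accessible:

stap -e '
probe module("zfs").function("vdev_set_state") {
        # log who is setting a vdev state, from which caller, and to what value
        printf("[%-10s:%-6d] %s -> vdev_set_state %d\n",
            execname(), pid(), symname(caller_addr()), $state)
}
probe module("zfs").function("dsl_scan_setup_check").return {
        # a return value of 16 (EBUSY) here is the race described above
        printf("[%-10s:%-6d] dsl_scan_setup_check return=%d\n",
            execname(), pid(), $return)
}'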

Motivation and Context

Fix #7247

How Has This Been Tested?

Tested locally on a Debian builder; traced the code with SystemTap to find the race condition.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@loli10K added the "Status: Work in Progress" label on Mar 25, 2018
@loli10K force-pushed the issue-7247 branch 9 times, most recently from 395be68 to c7e2506, on March 31, 2018 at 10:44
@loli10K force-pushed the issue-7247 branch 2 times, most recently from 2939286 to 2f9f138, on April 2, 2018 at 15:58
@loli10K changed the title from "[WIP] ZTS: Fix add_nested_replacing_spare test case" to "ZTS: Fix add_nested_replacing_spare test case" on Apr 2, 2018
@loli10K removed the "Status: Work in Progress" label on Apr 2, 2018
Use 'zpool reopen' instead of 'zpool scrub' to kick in the spare device:
this is required to avoid spurious failures caused by a race condition
in events processing by the ZFS Event Daemon:

P1 (zpool scrub)                            P2 (zed)
----------------------------------------------------
zfs_ioc_pool_scan()
 -> dsl_scan()
  -> vdev_reopen()
   -> vdev_set_state(VDEV_STATE_CANT_OPEN)
                                            zfs_ioc_vdev_attach()
                                             -> spa_vdev_attach()
                                              -> dsl_resilver_restart()
  -> dsl_sync_task()
   -> dsl_scan_setup_check()
   <- dsl_scan_setup_check(): EBUSY

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
@codecov bot commented Apr 2, 2018

Codecov Report

Merging #7342 into master will decrease coverage by 0.15%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7342      +/-   ##
==========================================
- Coverage   76.38%   76.23%   -0.16%     
==========================================
  Files         329      329              
  Lines      104214   104214              
==========================================
- Hits        79606    79446     -160     
- Misses      24608    24768     +160
Flag      Coverage Δ
#kernel   76.01% <ø> (-0.42%) ⬇️
#user     65.43% <ø> (-0.29%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 10adee2...19fe3ce.

@behlendorf (Contributor) left a comment:

Looks good, a nice simple fix.

@behlendorf merged commit f119d00 into openzfs:master on Apr 4, 2018