
ZTS: Standardize use of destroy_dataset in cleanup #12663

Merged: 7 commits merged into openzfs:master from zts-cleanup on Oct 25, 2021

Conversation

behlendorf
Contributor

Motivation and Context

After the Ubuntu 18.04 and 20.04 CI builder VMs were last updated, we're
reliably seeing instances where ZFS volumes are still active (open) when
zfs destroy is run. This isn't unexpected, since processes like blkid will
open the device when it's first created. The fix is to retry on busy in
the ZTS on Linux. This had been done previously, but not exhaustively in
all places. This change addresses those remaining cases by systematically
updating the cleanup functions to use destroy_dataset, which does retry.

Description

When cleaning up a test case standardize on using the convention:

    datasetexists $ds && destroy_dataset $ds <flags>

By using 'destroy_dataset' instead of 'log_must zfs destroy' we ensure
that the destroy is retried in the event that a ZFS volume is busy.
This helps ensure tests are fully cleaned up and prevents false
positive test failures on Linux.

Note that all of the tests which used 'zfs destroy' in cleanup have
been updated even if they don't use volumes. This was done to
clearly establish the expected convention.
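
For illustration, the retry the helper provides amounts to something like
the following sketch (hypothetical; the real destroy_dataset lives in the
ZTS common library and may differ in its details):

    # Illustrative retry-on-busy destroy, not the actual ZTS helper.
    # A freshly created zvol may still be held open (e.g. by blkid via
    # udev), so retry a bounded number of times before giving up.
    function destroy_dataset_retry # dataset [flags]
    {
            typeset ds=$1
            typeset flags=$2
            typeset -i tries=10

            while (( tries > 0 )); do
                    zfs destroy $flags "$ds" && return 0
                    sleep 1
                    (( tries -= 1 ))
            done
            log_fail "unable to destroy dataset: $ds"
    }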

How Has This Been Tested?

Locally I ran the majority of the test suite on Ubuntu 20.04, where I
was able to reproduce this issue. With the change applied, the tests
which previously failed are now passing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Component: Test Suite and Status: Code Review Needed labels Oct 20, 2021
@behlendorf behlendorf force-pushed the zts-cleanup branch 2 times, most recently from cf7eee0 to e305710 on October 20, 2021 at 22:54
When cleaning up a test case standardize on using the convention:

    datasetexists $ds && destroy_dataset $ds <flags>

By using 'destroy_dataset' instead of 'log_must zfs destroy' we ensure
that the destroy is retried in the event that a ZFS volume is busy.
This helps ensure tests are fully cleaned up and prevents false
positive test failures on Linux.

Note that all of the tests which used 'zfs destroy' in cleanup have
been updated even if they don't use volumes.  This was done to
clearly establish the expected convention.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@rincebrain
Contributor

This patch seems to have missed one of the snapused tests, and the
copy-paste jobs in the other snapused tests, which explains those
continued failures.

history_002_pos is failing on the destroys in:

run_and_verify "zfs destroy $fssnap2"
run_and_verify "zfs destroy $volsnap2"
run_and_verify "zfs receive $fs < $tmpfile"
run_and_verify "zfs receive $vol < $tmpfile2"
run_and_verify "zfs rollback -r $fssnap"
run_and_verify "zfs rollback -r $volsnap"
run_and_verify "zfs clone $fssnap $fsclone"
run_and_verify "zfs clone $volsnap $volclone"
run_and_verify "zfs rename $fs $newfs"
run_and_verify "zfs rename $vol $newvol"
run_and_verify "zfs promote $fsclone"
run_and_verify "zfs promote $volclone"
run_and_verify "zfs destroy $newfs"
run_and_verify "zfs destroy $newvol"
run_and_verify "zfs destroy -rf $fsclone"
run_and_verify "zfs destroy -rf $volclone"

So I'd probably just dataset_exists && destroy_dataset $FOO them too.

iostat/setup failed because zfs_list/cleanup failed, and that failed because

function depth_fs_cleanup
{
        log_must zfs destroy -rR $DEPTH_FS
}

needs the same love as everything else in this PR.
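
Applying the convention from this PR, that would presumably become
something along these lines:

    function depth_fs_cleanup
    {
            # Retry if the dataset is busy instead of failing the
            # cleanup outright.
            datasetexists $DEPTH_FS && destroy_dataset $DEPTH_FS "-rR"
    }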

zfs_unload-key_all is still failing on trying to unload-key -a while the zvol is still open. If "retry a couple times" is the order of the day, perhaps a s/log_must/log_must_busy/ is the fix in kind here?
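
In the test that would presumably be a one-line change along these lines
(hypothetical; the exact invocation in zfs_unload-key_all.ksh may differ):

    # Retry while the zvol is still open rather than failing on the
    # first EBUSY.
    log_must_busy zfs unload-key -a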

@jwk404 jwk404 self-assigned this Oct 21, 2021
@behlendorf behlendorf added the Status: Work in Progress label and removed the Status: Code Review Needed label Oct 21, 2021
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Contributor Author

Thanks for taking a look. I've added a commit to the PR to address the latest failures we saw in the CI. That should improve things but I'd still like to run this PR through the Ubuntu builders a few times to make sure it reliably passes. It wouldn't shock me if I missed some test cases since I only updated those which did a "datasetexists && zfs destroy". It looks like that's why I missed the "snapused" tests.

@rincebrain
Contributor

Of course.

I'm really curious to know how you reproduced it locally, because I tried Ubuntu 18.04/20.04 under Hyper-V, VirtualBox, and KVM, and they all were perfectly happy with life backed by SSD or spinning disks.

@behlendorf
Contributor Author

behlendorf commented Oct 21, 2021

Somewhat to my surprise, I was able to easily reproduce the issue using an ec2 t2.xlarge instance and Ubuntu 18.04.

@rincebrain
Contributor

...that just raises further questions, since the other testbots on AWS seem content with their lives. Huh. Maybe the intersection of recent kernel and VM setup? But Fedora 33 should be new enough...hm.

I also worry about whether busywaiting like this is the wrong fix, if this worked consistently everywhere before and is breaking consistently in only some places now - like the need for an explicit block_device_wait in the reservation tests now seems like something is wrong, if zfs destroy returns and the zvol isn't actually gone. Or are my expectations just wrong?

@behlendorf
Contributor Author

At least on Linux it's always been the case that some zfs / zpool commands will fail with EBUSY if a volume is open. It's a consequence of how the Linux kernel handles mount points and block devices, and we can't really avoid it. The situation is different on FreeBSD and Illumos, where the kernel allows tearing down in-use devices (in which case the accessing process gets the error).

What is odd is that we're suddenly seeing this all the time in some environments. Specifically, we're racing with blkid, which has the device open and is trying to detect any filesystems on it. I'm not sure why this timing has changed, but it's driven by udev detecting the new block devices. An alternative we kicked around, but never implemented, was to add some EBUSY retry logic into the CLI tools instead.

Well, the additional block_device_wait in the reservation tests is a bit misleading here. It isn't required to get those tests to pass; I merely added it to head off any future problems if those tests are updated in such a way that they need to read/write from the zvol. We could drop that change if you like.
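
For context, the preemptive wait being described is roughly this pattern
(illustrative only; the variable names are the usual ZTS defaults and are
assumptions here):

    # Give udev time to finish creating and probing the new zvol
    # device node before the test later reads or writes it.
    log_must zfs create -V $VOLSIZE $TESTPOOL/$TESTVOL
    block_device_wait $ZVOL_DEVDIR/$TESTPOOL/$TESTVOL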

@rincebrain
Contributor

I have no burning desire to drop it, I just picked on it because I recalled needing changes in that test and it wasn't a simple matter of zfs destroy not apparently succeeding. (Though I suppose if it's passing now then the problem was presumably lingering destroys from the prior test? Huh.)

I think, more than anything, it just bothers me a lot that it's suddenly failing consistently on some platforms and acting like nothing changed on others, and it's not particularly evident what's changed, because it might break other expectations.

@behlendorf
Contributor Author

Yes, I completely agree. I'm not happy about needing this change, but making the test suite a little more resilient to this kind of known behavior seemed like the least terrible option.

@jwk404 jwk404 left a comment
Contributor

Thanks for taking this on.

This was missed in the first pass of changes but caught by the CI.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf added the Status: Code Review Needed label and removed the Status: Work in Progress label Oct 22, 2021
@gmelikov
Member

zfs_load-key/zfs_load-key_all:

ERROR: zfs unload-key testpool/testfs2 exited 255
Key unload error: 'testpool/testfs2' is busy.

Same cause, different command? I think it's good to tackle it in a separate PR.

This issue may also occur when unloading keys.  We made this same
fix to zfs_unload-key_all.ksh so do it here as well.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Contributor Author

behlendorf commented Oct 22, 2021

Right, same cause, different command. I went ahead and updated this PR to handle it since we'd already made the same change to zfs_unload-key_all.ksh. Might as well get both.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label Oct 23, 2021
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@jwk404 jwk404 merged commit 90b77a0 into openzfs:master Oct 25, 2021
rincebrain added a commit to rincebrain/zfs that referenced this pull request Oct 26, 2021
Edit the workaround in zfs-tests-*.yml to print if it successfully
edited the rules file, and add explicit cleanup calls in a couple
tests that have occasionally failed in ways that look like more
fun from openzfs#12663 even with all this.

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request Oct 28, 2021
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request Oct 28, 2021
tonyhutter pushed a commit that referenced this pull request Nov 1, 2021
tonyhutter pushed a commit that referenced this pull request Nov 2, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request Nov 3, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Nov 13, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Nov 13, 2021