
ZTS: Standardize use of destroy_dataset in cleanup #12663

Merged: 7 commits merged into openzfs:master from zts-cleanup on Oct 25, 2021

Conversation

behlendorf
Contributor

Motivation and Context

After the Ubuntu 18.04 and 20.04 CI builder VMs were last updated, we're
reliably seeing instances where ZFS volumes are still active (open) when
zfs destroy is run. This isn't unexpected, since processes like blkid will
open the device when it's first created. The fix is to retry on busy in
the ZTS on Linux. This had been done previously, but not exhaustively in
all places. This change addresses those remaining cases by systematically
updating the cleanup functions to use destroy_dataset, which does retry.

Description

When cleaning up a test case standardize on using the convention:

    datasetexists $ds && destroy_dataset $ds <flags>

By using 'destroy_dataset' instead of 'log_must zfs destroy' we ensure
that the destroy is retried in the event that a ZFS volume is busy.
This helps ensure tests are fully cleaned up and prevents false
positive test failures on Linux.

Note that all of the tests which used 'zfs destroy' in cleanup have
been updated even if they don't use volumes. This was done to
clearly establish the expected convention.
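
For illustration, the retry the helper provides amounts to something like
the following sketch (hypothetical; the real destroy_dataset lives in the
ZTS common library and may differ in its details):

    # Illustrative retry-on-busy destroy, not the actual ZTS helper.
    # A freshly created zvol may still be held open (e.g. by blkid via
    # udev), so retry a bounded number of times before giving up.
    function destroy_dataset_retry # dataset [flags]
    {
            typeset ds=$1
            typeset flags=$2
            typeset -i tries=10

            while (( tries > 0 )); do
                    zfs destroy $flags "$ds" && return 0
                    sleep 1
                    (( tries -= 1 ))
            done
            log_fail "unable to destroy dataset: $ds"
    }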

How Has This Been Tested?

Locally I ran the majority of the test suite on Ubuntu 20.04, where I
was able to reproduce this issue. With the change applied, the tests
which previously failed are now passing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Component: Test Suite and Status: Code Review Needed labels Oct 20, 2021
@behlendorf behlendorf force-pushed the zts-cleanup branch 2 times, most recently from cf7eee0 to e305710 on October 20, 2021 at 22:54
When cleaning up a test case standardize on using the convention:

    datasetexists $ds && destroy_dataset $ds <flags>

By using 'destroy_dataset' instead of 'log_must zfs destroy' we ensure
that the destroy is retried in the event that a ZFS volume is busy.
This helps ensure tests are fully cleaned up and prevents false
positive test failures on Linux.

Note that all of the tests which used 'zfs destroy' in cleanup have
been updated even if they don't use volumes.  This was done to
clearly establish the expected convention.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@rincebrain
Contributor

This patch seems to have missed one of the snapused tests, and the
copy-paste jobs in the other snapused tests, which explains those
continued failures.

history_002_pos is failing on the destroys in:

run_and_verify "zfs destroy $fssnap2"
run_and_verify "zfs destroy $volsnap2"
run_and_verify "zfs receive $fs < $tmpfile"
run_and_verify "zfs receive $vol < $tmpfile2"
run_and_verify "zfs rollback -r $fssnap"
run_and_verify "zfs rollback -r $volsnap"
run_and_verify "zfs clone $fssnap $fsclone"
run_and_verify "zfs clone $volsnap $volclone"
run_and_verify "zfs rename $fs $newfs"
run_and_verify "zfs rename $vol $newvol"
run_and_verify "zfs promote $fsclone"
run_and_verify "zfs promote $volclone"
run_and_verify "zfs destroy $newfs"
run_and_verify "zfs destroy $newvol"
run_and_verify "zfs destroy -rf $fsclone"
run_and_verify "zfs destroy -rf $volclone"

So I'd probably just dataset_exists && destroy_dataset $FOO them too.

iostat/setup failed because zfs_list/cleanup failed, and that failed because

function depth_fs_cleanup
{
        log_must zfs destroy -rR $DEPTH_FS
}

needs the same love as everything else in this PR.
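
Applying the convention from this PR, that would presumably become
something along these lines:

    function depth_fs_cleanup
    {
            # Retry if the dataset is busy instead of failing the
            # cleanup outright.
            datasetexists $DEPTH_FS && destroy_dataset $DEPTH_FS "-rR"
    }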

zfs_unload-key_all is still failing on trying to unload-key -a while the zvol is still open. If "retry a couple times" is the order of the day, perhaps a s/log_must/log_must_busy/ is the fix in kind here?
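
In the test that would presumably be a one-line change along these lines
(hypothetical; the exact invocation in zfs_unload-key_all.ksh may differ):

    # Retry while the zvol is still open rather than failing on the
    # first EBUSY.
    log_must_busy zfs unload-key -a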

@jwk404 jwk404 self-assigned this Oct 21, 2021
@behlendorf behlendorf added the Status: Work in Progress label and removed the Status: Code Review Needed label Oct 21, 2021
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Contributor Author

Thanks for taking a look. I've added a commit to the PR to address the latest failures we saw in the CI. That should improve things but I'd still like to run this PR through the Ubuntu builders a few times to make sure it reliably passes. It wouldn't shock me if I missed some test cases since I only updated those which did a "datasetexists && zfs destroy". It looks like that's why I missed the "snapused" tests.

@rincebrain
Contributor

Of course.

I'm really curious to know how you reproduced it locally, because I tried Ubuntu 18.04/20.04 under Hyper-V, VirtualBox, and KVM, and they all were perfectly happy with life backed by SSD or spinning disks.

@behlendorf
Contributor Author

behlendorf commented Oct 21, 2021

Somewhat to my surprise, I was able to easily reproduce the issue using an ec2 t2.xlarge instance and Ubuntu 18.04.

@rincebrain
Contributor

...that just raises further questions, since the other testbots on AWS seem content with their lives. Huh. Maybe the intersection of recent kernel and VM setup? But Fedora 33 should be new enough...hm.

I also worry about whether busywaiting like this is the wrong fix, if this worked consistently everywhere before and is breaking consistently in only some places now - like the need for an explicit block_device_wait in the reservation tests now seems like something is wrong, if zfs destroy returns and the zvol isn't actually gone. Or are my expectations just wrong?

@behlendorf
Contributor Author

At least on Linux it's always been the case that some zfs / zpool commands will fail with EBUSY if a volume is open. It's a consequence of how the Linux kernel handles mount points and block devices, and we can't really avoid it. The situation is different on FreeBSD and Illumos, where the kernel allows tearing down in-use devices (in which case the accessing process gets the error).

What is odd is that we're suddenly seeing this all the time in some environments. Specifically, we're racing with blkid, which has the device open and is trying to detect any filesystems on it. I'm not sure why this timing has changed, but it's driven by udev detecting the new block devices. An alternative we kicked around, but never implemented, was to add some EBUSY retry logic into the CLI tools instead.

Well, the additional block_device_wait in the reservation tests is a bit misleading here. It isn't required to get those tests to pass; I merely added it to head off any future problems if those tests are updated in such a way that they need to read/write from the zvol. We could drop that change if you like.
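
For context, the preemptive wait being described is roughly this pattern
(illustrative only; the variable names are the usual ZTS defaults and are
assumptions here):

    # Give udev time to finish creating and probing the new zvol
    # device node before the test later reads or writes it.
    log_must zfs create -V $VOLSIZE $TESTPOOL/$TESTVOL
    block_device_wait $ZVOL_DEVDIR/$TESTPOOL/$TESTVOL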

@rincebrain
Contributor

I have no burning desire to drop it, I just picked on it because I recalled needing changes in that test and it wasn't a simple matter of zfs destroy not apparently succeeding. (Though I suppose if it's passing now then the problem was presumably lingering destroys from the prior test? Huh.)

I think, more than anything, it just bothers me a lot that it's suddenly failing consistently on some platforms and acting like nothing changed on others, and it's not particularly evident what's changed, because it might break other expectations.

@behlendorf
Contributor Author

Yes, I completely agree. I'm not happy about needing this change, but making the test suite a little more resilient to this kind of known behavior seemed like the least terrible option.

@jwk404 jwk404 left a comment
Contributor

Thanks for taking this on.

This was missed in the first pass of changes but caught by the CI.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf added the Status: Code Review Needed label and removed the Status: Work in Progress label Oct 22, 2021
@gmelikov
Member

zfs_load-key/zfs_load-key_all:

ERROR: zfs unload-key testpool/testfs2 exited 255
Key unload error: 'testpool/testfs2' is busy.

Same cause, different command? I think it's good to tackle it in a separate PR.

This issue may also occur when unloading keys.  We made this same
fix to zfs_unload-key_all.ksh so do it here as well.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Contributor Author

behlendorf commented Oct 22, 2021

Right, same cause, different command. I went ahead and updated this PR to handle it since we'd already made the same change to zfs_unload-key_all.ksh. Might as well get both.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label Oct 23, 2021
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@jwk404 jwk404 merged commit 90b77a0 into openzfs:master Oct 25, 2021
rincebrain added a commit to rincebrain/zfs that referenced this pull request Oct 26, 2021
Edit the workaround in zfs-tests-*.yml to print if it successfully
edited the rules file, and add explicit cleanup calls in a couple
tests that have occasionally failed in ways that look like more
fun from openzfs#12663 even with all this.

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request Oct 28, 2021
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request Oct 28, 2021
tonyhutter pushed a commit that referenced this pull request Nov 1, 2021
tonyhutter pushed a commit that referenced this pull request Nov 2, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request Nov 3, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Nov 13, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Nov 13, 2021