Fix some ZFS Test Suite issues #6632
Conversation
This looks promising. Feel free to drop the "Requires-builders: none" whenever you'd like. It may help you identify some test cases which still occasionally fail.
I'm currently running an updated version on a local machine with the following pools imported (...).
Yes, there's a ... Another thing I've been wanting to do is update every test script that still uses hardcoded temporary paths like "/tmp" and "/var/tmp" in favor of $TESTDIR. I'm trying to avoid spinning up too many buildslaves (wasted resources and money), but I hope to push an updated (and hopefully working) version later tonight.
Maybe not every script, since I'd end up rewriting a lot of minor stuff and making it harder to port OpenZFS patches, but at least where /tmp is used to put file VDEVs. Totally unrelated question: would it be possible to update the STYLE buildslave to Ubuntu Zesty? AFAIK it's the easiest way to get mandoc.
Getting any large files out of /tmp would be a big step forward, as would making sure we properly clean up any temporary files in /tmp and /var/tmp.
Sure, I can look at updating it today. The only hitch is that because the builder name will change we'll lose the details of the previous build results. But I don't think that's a big deal.
> would it be possible to update the STYLE buildslave to Ubuntu Zesty?

Done. Style builder updated to Zesty and mandoc installed.
@loli10K it'd be great to get a version of this we could merge which tackles the most common failures first.
@behlendorf unfortunately the only common failure I've been able to fix so far is ... I may have just fixed ...
@loli10K FWIW I see ...
@behlendorf I'll try to reproduce that locally while I fix zpool_create_002. If we have a little more time I would like to investigate the possibility of having ...

EDIT: I don't think we have enough context to know $TESTDIR in ...
Codecov Report
@@ Coverage Diff @@
## master #6632 +/- ##
==========================================
- Coverage 72.24% 72.22% -0.03%
==========================================
Files 294 294
Lines 93862 93863 +1
==========================================
- Hits 67815 67788 -27
- Misses 26047 26075 +28
Continue to review full report at Codecov.
This is looking good. I'm happy to see this cleanup address a lot of long-standing issues with the way the test suite is run.
As an additional bit of cleanup, I noticed the following two calls in the test suite to /dev/random, which can needlessly deplete the entropy in the VMs, resulting in rngd consuming a lot of CPU. Can you switch them to use /dev/urandom in this PR?
diff --git a/tests/zfs-tests/tests/functional/cache/cache.kshlib b/tests/zfs-tests/tests/functional/cache/cache.kshlib
index 26b56f68e..69a661139 100644
--- a/tests/zfs-tests/tests/functional/cache/cache.kshlib
+++ b/tests/zfs-tests/tests/functional/cache/cache.kshlib
@@ -57,7 +57,7 @@ function display_status
((ret |= $?))
typeset mntpnt=$(get_prop mountpoint $pool)
- dd if=/dev/random of=$mntpnt/testfile.$$ &
+ dd if=/dev/urandom of=$mntpnt/testfile.$$ &
typeset pid=$!
zpool iostat -v 1 3 > /dev/null
diff --git a/tests/zfs-tests/tests/functional/slog/slog.kshlib b/tests/zfs-tests/tests/functional/slog/slog.kshlib
index ca1b5ed21..d9459151c 100644
--- a/tests/zfs-tests/tests/functional/slog/slog.kshlib
+++ b/tests/zfs-tests/tests/functional/slog/slog.kshlib
@@ -59,7 +59,7 @@ function display_status
((ret |= $?))
typeset mntpnt=$(get_prop mountpoint $pool)
- dd if=/dev/random of=$mntpnt/testfile.$$ &
+ dd if=/dev/urandom of=$mntpnt/testfile.$$ &
typeset pid=$!
zpool iostat -v 1 3 > /dev/null
As an aside, the codecov.io integration is in progress but it may take a few more days before we have it fully sorted out.
@@ -61,7 +61,8 @@ export NO_POOLS="no pools available"
 # pattern to ignore from 'zfs list'.
 export NO_DATASETS="no datasets available"

-export TEST_BASE_DIR="/var/tmp"
+# Default directory used for test files
+export TEST_BASE_DIR="${FILEDIR:-/var/tmp}"
Good idea.
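As a usage sketch (not from this PR), with that ${FILEDIR:-/var/tmp} default in place a scratch directory can be selected without editing the runfiles; the path below is illustrative, and this assumes FILEDIR reaches default.cfg from the caller's environment (or via whatever option zfs-tests.sh uses to set it):

# point file vdevs and other temporary test files at a scratch directory
export FILEDIR=/mnt/scratch        # illustrative path
./scripts/zfs-tests.sh -v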
@@ -70,6 +70,9 @@ post =
 [tests/functional/cli_root/zfs]
 tests = ['zfs_001_neg', 'zfs_002_pos', 'zfs_003_neg']

+[tests/functional/cli_root/zfs_bookmark]
+tests = ['zfs_bookmark_cliargs']
Along similar lines, it turns out there's no coverage for `zfs diff`. That's something we'll want to get addressed in a potential follow-up PR.
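A rough sketch of the kind of check such a follow-up test could make (dataset and file names here are illustrative, not from this PR):

# modify the filesystem between two snapshots and make sure
# 'zfs diff' reports the new file
log_must zfs snapshot $TESTPOOL/$TESTFS@pre
log_must touch /$TESTPOOL/$TESTFS/newfile
log_must zfs snapshot $TESTPOOL/$TESTFS@post
log_must eval "zfs diff $TESTPOOL/$TESTFS@pre $TESTPOOL/$TESTFS@post | grep -q newfile"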
# See issue: https://github.com/zfsonlinux/zfs/issues/6145
if is_linux; then
	log_unsupported "Test case occasionally fails"
fi
I only see cleanup fixes for this test case. Are we sure that it's actually fixed? Also, I see that the wrong issue is referenced here, which is strange.
I tried enabling some SKIPped tests and this wasn't failing; I just discovered the cleanup issue with a custom runfile.
The reason this was disabled is in 95401cb: "Skip until layering pools on zvols is solid". Here $TESTDIR is a ZFS mountpoint; I could move this to $TEST_BASE_DIR like I did with the other file VDEVs (I'll also update the commit message with the next push).
That would explain it. Layering pools on zvols is much better than it used to be but still doesn't seem to be 100% solid and could deadlock. Since we don't need to layer a pool on a zvol to perform this test I think moving it as you suggest would be best. I'd feel much better about re-enabling it then.
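A minimal sketch of that relocation, assuming the usual ZTS helpers and variables (the file name is made up):

# back the pool with a plain file under $TEST_BASE_DIR (/var/tmp by
# default) instead of a file living on the ZFS-mounted $TESTDIR, so we
# no longer layer a pool on top of another pool or zvol
log_must truncate -s $MINVDEVSIZE $TEST_BASE_DIR/vdev_file.$$
log_must zpool create $TESTPOOL1 $TEST_BASE_DIR/vdev_file.$$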
sync
# check if we started reusing objects
object=$(ls -i $mntpnt | sort -n | awk -v object=$object \
    '{if ($1 <= object) {exit 1}} END {print $1}')
For my clarification, the speed-up here is the result of checking for any object `<= object` rather than trying to match exactly the first object?
Yes, from my (limited) understanding the issue is caused by `large_dnodes` + multi-threaded `dmu_object_alloc()`: ever since `large_dnodes` was merged the "SA master node" seems to always be object 32; when we create the first test file (`$first_object` in the ksh script) we get the very first free object, which is 2:
root@linux:~# touch /testpool/file
root@linux:~# zdb -dd testpool/
Dataset testpool [ZPL], ID 54, cr_txg 1, 24.0K, 7 objects
Object lvl iblk dblk dsize dnsize lsize %full type
0 6 128K 16K 13.0K 512 32K 10.94 DMU dnode
-1 1 128K 512 0 512 512 100.00 ZFS user/group used
-2 1 128K 512 0 512 512 100.00 ZFS user/group used
1 1 128K 512 1K 512 512 100.00 ZFS master node
2 1 128K 512 0 512 512 0.00 ZFS plain file
32 1 128K 512 0 512 512 100.00 SA master node
33 1 128K 512 0 512 512 100.00 ZFS delete queue
34 1 128K 512 0 512 512 100.00 ZFS directory
35 1 128K 1.50K 1K 512 1.50K 100.00 SA attr registration
36 1 128K 16K 8K 512 32K 100.00 SA attr layouts
But with the multi-threaded object allocator it doesn't seem to be possible (or maybe it's just more difficult) to re-allocate objects <64, so the test case is basically a while(true) loop.
Given that the objective is to "verify that 'zfs send' drills appropriate holes", we just need to verify we reallocated an object, which is why I've changed the condition to `<= object`: we could play it safe and also verify we correctly generate FREEOBJECTS records between OBJECT records (`zstreamdump` + `awk`?).
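If we wanted that extra check, something along these lines could do it (a sketch only; the snapshot name is made up):

# make sure the stream actually carries FREEOBJECTS records, i.e. that
# 'zfs send' punched holes for the freed object numbers
records=$(zfs send $TESTPOOL/$TESTFS@final | zstreamdump | \
    awk '/FREEOBJECTS/ { count++ } END { print count + 0 }')
(( records > 0 )) || log_fail "no FREEOBJECTS records in the send stream"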
Got it, that makes sense. The change in object id of the "SA master node" was a known side effect of the multi-threaded allocator which wasn't expected to cause any problems. It just means we may create a few holes, but this is the normal state for any active filesystem, so that's fine. Your fix for the test case looks entirely reasonable to me. We could add the zstreamdump check too, but I don't think it's necessary.
# export unintentionally imported pools
for poolname in $(get_all_pools); do
	if [[ -z ${testpools[$poolname]} ]]; then
		log_must zpool export $poolname
Maybe use `log_must_busy` here.
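For example (a one-line sketch; as I understand it, log_must_busy retries the command when it fails only because the target is busy):

# tolerate a transiently busy pool during cleanup instead of failing outright
log_must_busy zpool export $poolname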
log_must zpool export ${TESTPOOL}-$id
for poolname in ${testpools[@]}; do
	if poolexists $poolname ; then
		log_must zpool export $poolname
Maybe use `log_must_busy` here.
@@ -36,7 +36,7 @@
 verify_runnable "both"

 # See issue: https://github.com/zfsonlinux/zfs/issues/6086
-if is_linux; then
+if is_32bit; then
There have been reports that this test, and other rsend tests, are occasionally failing due to the low default `zfs_arc_min` value. We may want to consider increasing the minimum value slightly, although the fixes in 787acae may make that unnecessary.
I need this test enabled to push #6623 forward: I wonder if dumping the send streams in filesystems like `/$sendpool` and `/$streamfs` is really needed, or is just stressing ZFS more than necessary for rsend tests.
I'm all for enabling this test case as long as we can run it reliably.
I've opened PR #6659 which increases the default `arc_c_min` on systems which have more than 1G of total system memory. It will be set to 1/32 of total system memory, with the existing 32M floor preserved. This brings the code a little more in line with OpenZFS, although we still permit a lower floor in order to support low-memory systems.
It also means that for the buildbot TESTERS `arc_c_min` will now default to 128M (1/32 of 4G), which based on open issues is expected to be large enough to avoid the reported performance problems with the rsend tests.
As for dumping the send streams in `/$sendpool` and `/$streamfs`, that may put further stress on the system, but that's exactly the kind of thing we want to test. Let's leave it as-is unless we determine we have to change it.
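For the record, a quick way to sanity-check that arithmetic on a given tester (not part of this PR; the sysfs path assumes the Linux zfs_arc_min module parameter):

# 1/32 of total RAM in bytes, e.g. ~128M on a 4G builder
awk '/MemTotal/ { print ($2 * 1024) / 32 }' /proc/meminfo
# current zfs_arc_min override (0 means the built-in default is used)
cat /sys/module/zfs/parameters/zfs_arc_min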
Force-pushed from cdcb34a to 9e0441c.
As it turns out, #6143 seems to be reproducible but very specific to some kernels; when we ...
This is also problematic when we have different filesystems with the same mountpoint set:
The proposed fix umounts using the dataset name instead of the mountpoint path: this seems to fix #6143. We could add a new test case to verify the second case (multiple filesystems with the same mountpoint).
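To illustrate that second case (a sketch, not the actual new test; dataset names are made up, and the second create assumes the shared directory is still empty so no overlay mount is needed):

# two datasets with the same mountpoint property: unmounting by path is
# ambiguous, unmounting by dataset name is not
log_must zfs create -o mountpoint=/sharedmnt $TESTPOOL/fs1
log_must zfs create -o mountpoint=/sharedmnt $TESTPOOL/fs2
# umount /sharedmnt         <- which of the two does this detach?
log_must zfs unmount $TESTPOOL/fs2
log_must zfs unmount $TESTPOOL/fs1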
@@ -57,7 +57,7 @@ function display_status
 ((ret |= $?))

 typeset mntpnt=$(get_prop mountpoint $pool)
-dd if=/dev/random of=$mntpnt/testfile.$$ &
+dd if=/dev/urandom of=$mntpnt/testfile.$$ &
This change is triggering "busy" failures in some `cache` test scripts:
Test: /usr/share/zfs/zfs-tests/tests/functional/cache/cache_006_pos (run as root) [00:04] [FAIL]
08:41:00.92 ASSERTION: Exporting and importing pool with cache devices passes.
08:41:01.70 SUCCESS: zpool create testpool /mnt/testdir/disk.cache/a /mnt/testdir/disk.cache/b /mnt/testdir/disk.cache/c cache loop0 loop1
08:41:01.71 SUCCESS: verify_cache_device testpool loop0 ONLINE
08:41:02.36 SUCCESS: zpool export testpool
08:41:03.76 SUCCESS: zpool import -d /mnt/testdir/disk.cache testpool
08:41:05.80 SUCCESS: display_status testpool
08:41:05.81 SUCCESS: verify_cache_device testpool loop1 ONLINE
08:41:05.82 ERROR: zpool destroy testpool exited 1
08:41:05.82 umount: /testpool: target is busy. (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1)) cannot unmount '/testpool': umount failed could not destroy 'testpool': could not unmount datasets
We either s/log_must/log_must_busy/ the `zpool destroy`, or run `sync` just after `kill -9 $dd_pid`: I think the latter is preferred; we shouldn't burden `verify_cache_device()` callers with dealing explicitly with an optionally busy fs, it's cleaner to deal with this issue inside the function.
EDIT: the function I was referring to is `display_status()`, not `verify_cache_device()`.
Or how about `wait` until everything terminates?
@behlendorf I don't know why it's working, but it is indeed working; I'll have to RTFM on `wait` (which I think is more lightweight than `sync`, which is good). Thanks.
It should work because the problem isn't that there's unwritten dirty data. This would get written out as part of the unmount regardless. The problem is that there are still open file handles which are preventing the unmount. The test need only wait until those child processes exit and the file handles are closed.
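In code, the approach being settled on looks roughly like this inside display_status() (abbreviated sketch):

dd if=/dev/urandom of=$mntpnt/testfile.$$ &
typeset pid=$!

zpool iostat -v 1 3 > /dev/null
((ret |= $?))

kill -9 $pid
# block until the killed dd has exited and closed its file handle, so a
# later 'zpool destroy' doesn't find the mountpoint busy
wait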
@loli10K do you think you're close to having a version of this cleanup ready to merge? Selfishly I'd love to stop seeing some of these known test failures!
This should be almost ready; the only known failure is `inherit_001_pos`, which is reproducible and was introduced with the fix for #6143. I'm working on it right now.
Force-pushed from fa7b799 to a5f1766.
I may have found the issue (...). Here we are attached to a "soon-to-be failed" ...
We ... Unfortunately, with a fresh libshare_hdl the share cannot be found: "cannot unshare 'TESTPOOL/TESTCTR/TESTFS1': not found: unshare(1M) failed"

	for (curr_proto = proto; *curr_proto != PROTO_END;
	    curr_proto++) {

		if (is_shared(hdl, mntpt, *curr_proto) &&
		    unshare_one(hdl, zhp->zfs_name,
		    mntpt, *curr_proto) != 0) {
			if (mntpt != NULL)
				free(mntpt);
			return (-1);

This is the issue; now we just need a fix.
EDIT: the fix is to ...
Force-pushed from ac02a38 to 5258aff.
For some reason the Coverage builder is not picking up my new test case:
I should have fixed the STYLE warning locally; I just need to tackle this coverage issue before the final push. I have already rebased on master and also added a new test to cover OpenZFS 8166 ("zpool scrub thinks it repaired offline device").
@loli10K I believe your new test script is missing the execute bit.
@dinatale2 that's it, thanks; though I wonder why the other builders are not (rightfully) complaining.
* Add 'zfs bookmark' coverage (zfs_bookmark_cliargs)
* Add OpenZFS 8166 coverage (zpool_scrub_offline_device)
* Fix "busy" zfs_mount_remount failures
* Fix bootfs_003_pos, bootfs_004_neg, zdb_005_pos local cleanup
* Update usage of $KEEP variable, add get_all_pools() function
* Enable history_008_pos and rsend_019_pos (non-32bit builders)
* Enable zfs_copies_005_neg, update local cleanup
* Fix zfs_send_007_pos (large_dnode + OpenZFS 8199)
* Fix rollback_003_pos (use dataset name, not mountpoint, to unmount)
* Update default_raidz_setup() to work properly with more than 3 disks
* Use $TEST_BASE_DIR instead of hardcoded (/var)/tmp for file VDEVs
* Update usage of /dev/random to /dev/urandom

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
@loli10K my guess is you only saw this on the coverage builder because it's the only one which runs in-tree so we can collect the gcov data. The other scripts may have had their permissions fixed by the packaging.
EDIT: I've resubmitted the test builders to get another test run.
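As an aside, one way to catch this locally before pushing (just a suggestion, not something the builders run as far as this thread shows):

# list any test scripts in the tree that are missing the execute bit
find tests/zfs-tests -name '*.ksh' ! -perm -u+x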
@behlendorf nice, thanks. I really like this coverage thing, it's very easy to spot code paths that are not being exercised by the ZTS: https://codecov.io/gh/zfsonlinux/zfs/src/master/cmd/zfs/zfs_main.c#L6874. I started working on ...
@loli10K I'm a big fan of the code coverage too. You can thank @prakashsurya for it, he's done all the heavy lifting to get it working! As for this PR, I'm going to give it one last review before merging it. Thank you for addressing all of these issues. I'm curious to see which ZTS issues percolate up after these are resolved.
* Add 'zfs bookmark' coverage (zfs_bookmark_cliargs)
* Add OpenZFS 8166 coverage (zpool_scrub_offline_device)
* Fix "busy" zfs_mount_remount failures
* Fix bootfs_003_pos, bootfs_004_neg, zdb_005_pos local cleanup
* Update usage of $KEEP variable, add get_all_pools() function
* Enable history_008_pos and rsend_019_pos (non-32bit builders)
* Enable zfs_copies_005_neg, update local cleanup
* Fix zfs_send_007_pos (large_dnode + OpenZFS 8199)
* Fix rollback_003_pos (use dataset name, not mountpoint, to unmount)
* Update default_raidz_setup() to work properly with more than 3 disks
* Use $TEST_BASE_DIR instead of hardcoded (/var)/tmp for file VDEVs
* Update usage of /dev/random to /dev/urandom

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Issue openzfs#6086
Closes openzfs#5658
Closes openzfs#6143
Closes openzfs#6421
Closes openzfs#6627
Closes openzfs#6632
Description
See below.
Motivation and Context
In progress
Idea
How Has This Been Tested?
Untested, for now: I will drop "Requires-builders: none" from the commit message once this is squashed and ready.
Types of changes
Checklist:
Signed-off-by.