
Fix some ZFS Test Suite issues #6632

Merged: 1 commit merged into openzfs:master on Sep 25, 2017

Conversation

loli10K
Contributor

@loli10K loli10K commented Sep 12, 2017

Description

See below.

Motivation and Context

In progress

  • Investigate zfs_rollback_001/zfs_rollback_002

Idea

  • Verify correct cleanup after every testgroup in test-runner.py (DROPPED)

How Has This Been Tested?

Untested for now: I will drop "Requires-builders: none" from the commit message once this is squashed and ready.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@loli10K loli10K added the "Status: Work in Progress" (not yet ready for general review) label Sep 12, 2017
@behlendorf
Contributor

I will drop "Requires-builders: none" from the commit message once this is squashed and ready.

This looks promising. Feel free to drop the "Requires-builders: none" whenever you'd like. It may help you identify some test cases which still occasionally fail.

@loli10K
Contributor Author

loli10K commented Sep 12, 2017

I'm currently running an updated version on a local machine with the following pools imported (Keep pool(s)):

root@linux:/usr/src/zfs# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G  4.5G   15G  24% /
udev             10M     0   10M   0% /dev
tmpfs           1.6G  8.5M  1.6G   1% /run
tmpfs           4.0G     0  4.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           4.0G     0  4.0G   0% /sys/fs/cgroup
tmpfs           4.0G  768M  3.2G  20% /var/tmp
rpool            56M     0   56M   0% /rpool
test pool        56M     0   56M   0% /test pool
test:pool        56M     0   56M   0% /test:pool
test-pool        56M     0   56M   0% /test-pool
test_pool        56M     0   56M   0% /test_pool
test.pool        56M     0   56M   0% /test.pool

Yes, there's a test[space]pool imported and mounted; I'm testing this too.
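For reference, here is a minimal sketch of how one of these oddly named pools can be created from a file vdev; the backing file path and size below are arbitrary, not what the test suite actually uses:

# Create a backing file comfortably above the minimum vdev size, then build a
# pool whose name contains a space and confirm it is mounted.
truncate -s 128M /var/tmp/space-vdev
zpool create "test pool" /var/tmp/space-vdev
df -h "/test pool"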

Another thing I've been wanting to do is update every test script that still uses hardcoded temporary paths like "/tmp" and "/var/tmp" in favor of $TESTDIR.

I'm trying to avoid spinning up too many buildslaves (wasted resources and money), but I hope to push an updated (and hopefully working) version later tonight.

@loli10K
Contributor Author

loli10K commented Sep 13, 2017

Another thing I've been wanting to do is update every test script that still uses hardcoded temporary paths like "/tmp" and "/var/tmp" in favor of $TESTDIR.

Maybe not every script, since I'd end up rewriting a lot of minor stuff and making it harder to port OpenZFS patches, but at least where /tmp is used to put file VDEVs.
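As an illustration (not the actual diff), this is the kind of change that implies for a setup script which creates file VDEVs; variable names other than $TEST_BASE_DIR are made up here:

# Before: file VDEV hardcoded under /tmp
# VDEV_FILE=/tmp/file_vdev.$$
# After: honor the configurable base directory, which defaults to /var/tmp
VDEV_FILE=$TEST_BASE_DIR/file_vdev.$$
log_must truncate -s 128M $VDEV_FILE
log_must zpool create $TESTPOOL $VDEV_FILE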

Totally unrelated question: would it be possible to update the STYLE buildslave to Ubuntu Zesty? AFAIK it's the easiest way to get mandoc, which we could use to do some checking on manpages:

root@linux:/usr/src/zfs# mandoc -Tlint man/man8/zpool.8 
mandoc: man/man8/zpool.8:29:2: WARNING: cannot parse date, using it verbatim: June 28, 2017
mandoc: man/man8/zpool.8:1503:2: WARNING: skipping empty macro: No
mandoc: man/man8/zpool.8:1514:2: WARNING: skipping paragraph macro: Pp before Bl
mandoc: man/man8/zpool.8:2292:2: ERROR: skipping unknown macro: .Sy::findleaks .
mandoc: man/man8/zpool.8:2383:2: ERROR: inserting missing end of block: Sh breaks Bl
mandoc: man/man8/zpool.8:2388:2: WARNING: unusual Xr order: zfs-events(5) after zfs(8)
root@linux:/usr/src/zfs# mandoc -Tlint man/man8/zfs.8 
mandoc: man/man8/zfs.8:32:2: WARNING: cannot parse date, using it verbatim: June 28, 2017
mandoc: man/man8/zfs.8:2177:2: WARNING: skipping empty macro: Sy
mandoc: man/man8/zfs.8:3446:2: WARNING: skipping empty macro: Sy
mandoc: man/man8/zfs.8:3448:2: WARNING: skipping empty macro: Sy
mandoc: man/man8/zfs.8:3461:2: WARNING: skipping empty macro: Sy
mandoc: man/man8/zfs.8:4543:2: WARNING: unusual Xr order: selinux after zpool
mandoc: man/man8/zfs.8:4544:2: WARNING: unusual Xr order: chmod(2) after selinux(8)
mandoc: man/man8/zfs.8:4547:2: WARNING: unusual Xr order: fsync after write
mandoc: man/man8/zfs.8:4548:2: WARNING: unusual Xr order: attr(1) after fsync(2)
mandoc: man/man8/zfs.8:4553:2: WARNING: unusual Xr order: attributes(5) after net(8)

@behlendorf
Contributor

at least where /tmp is used to put file VDEVs.

Getting any large files out of /tmp would be a big step forward, as would making sure we properly clean up any temporary files in /tmp and /var/tmp.

would it be possible to update the STYLE buildslave to Ubuntu Zesty?

Sure, I can look at updating it today. The only hitch is that because the builder name will change we'll lose the details of the previous build results, but I don't think that's a big deal.

@behlendorf
Contributor

| would it be possible to update the STYLE buildslave to Ubuntu Zesty?

Done. Style builder updated to Zesty and mandoc installed.

@behlendorf
Contributor

@loli10K it'd be great to get a version of this we could merge which tackles the most common failures first.

@loli10K
Contributor Author

loli10K commented Sep 15, 2017

@behlendorf unfortunately the only common failure I've been able to fix so far is zfs_mount_remount; I'm also seeing some mmp_active_import failures but I would need to do some MMP studying first. Did you have any other case in mind?

I may have just fixed the zpool_set setup/cleanup scripts issue locally; I can work on these last failures now before the next push.

@behlendorf
Contributor

mmp_active_import is understood and @ofaaland was going to look into making a fix. You're right, zfs_mount_remount is the only major one, so never mind. Take your time.

@behlendorf
Contributor

@loli10K FWIW I see zfs_rollback_001_pos and zfs_rollback_002_pos occasionally fail too on the Amazon Linux TESTER.

@loli10K
Contributor Author

loli10K commented Sep 16, 2017

@behlendorf I'll try to reproduce that locally while I fix zpool_create_002.

If we have a little more time I would like to investigate the possibility of having test-runner.py verify the cleanup after every TestGroup (no files left in /tmp and /var/tmp, no pools imported) and avoid running all other tests when something goes wrong, like when a pool gets suspended and we can't export/destroy it: http://build.zfsonlinux.org/builders/CentOS%207%20x86_64%20Mainline%20%28TEST%29/builds/4027/steps/shell_8/logs/stdio.

EDIT: I don't think we have enough context to know $TESTDIR in test-runner.py; it seems this info is only available in include/default.cfg and zfs-tests.sh. Also, to list imported pools we would need to run the zfs binaries, link libzfs(_core), or use the libzfs Python bindings: that is probably a lot of work for not very much gain, but I'm still interested to hear other opinions.

@codecov

codecov bot commented Sep 17, 2017

Codecov Report

Merging #6632 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6632      +/-   ##
==========================================
- Coverage   72.24%   72.22%   -0.03%     
==========================================
  Files         294      294              
  Lines       93862    93863       +1     
==========================================
- Hits        67815    67788      -27     
- Misses      26047    26075      +28
Flag Coverage Δ
#kernel 73.94% <ø> (+0.57%) ⬆️
#user 60.96% <100%> (-0.15%) ⬇️
Impacted Files Coverage Δ
lib/libzfs/libzfs_mount.c 80.7% <100%> (+0.04%) ⬆️
cmd/zfs/zfs_main.c 76.93% <100%> (+0.06%) ⬆️
module/zfs/ddt.c 87.18% <0%> (-8.67%) ⬇️
cmd/zvol_id/zvol_id_main.c 72.72% <0%> (-6.07%) ⬇️
module/zfs/ddt_zap.c 88.09% <0%> (-4.77%) ⬇️
cmd/zed/agents/zfs_mod.c 65.61% <0%> (-4.22%) ⬇️
cmd/ztest/ztest.c 78.66% <0%> (-3%) ⬇️
module/zfs/vdev_disk.c 68.68% <0%> (-2.85%) ⬇️
module/zfs/zio_compress.c 94.44% <0%> (-2.78%) ⬇️
module/zfs/vdev_raidz.c 94.75% <0%> (-2.74%) ⬇️
... and 56 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b33d668...0f13c94. Read the comment docs.

Contributor

@behlendorf behlendorf left a comment

This is looking good. I'm happy to see this cleanup address a lot of long standing issues with the way the test suite is run.

As an additional bit of cleanup I noticed the following two calls in the test suite to /dev/random, which can needlessly deplete the entropy in the VMs, resulting in rngd consuming a lot of CPU. Can you switch them to use /dev/urandom in this PR?

diff --git a/tests/zfs-tests/tests/functional/cache/cache.kshlib b/tests/zfs-tests/tests/functional/cache/cache.kshlib
index 26b56f68e..69a661139 100644
--- a/tests/zfs-tests/tests/functional/cache/cache.kshlib
+++ b/tests/zfs-tests/tests/functional/cache/cache.kshlib
@@ -57,7 +57,7 @@ function display_status
        ((ret |= $?))
 
        typeset mntpnt=$(get_prop mountpoint $pool)
-       dd if=/dev/random of=$mntpnt/testfile.$$ &
+       dd if=/dev/urandom of=$mntpnt/testfile.$$ &
        typeset pid=$!
 
        zpool iostat -v 1 3 > /dev/null
diff --git a/tests/zfs-tests/tests/functional/slog/slog.kshlib b/tests/zfs-tests/tests/functional/slog/slog.kshlib
index ca1b5ed21..d9459151c 100644
--- a/tests/zfs-tests/tests/functional/slog/slog.kshlib
+++ b/tests/zfs-tests/tests/functional/slog/slog.kshlib
@@ -59,7 +59,7 @@ function display_status
        ((ret |= $?))
 
        typeset mntpnt=$(get_prop mountpoint $pool)
-       dd if=/dev/random of=$mntpnt/testfile.$$ &
+       dd if=/dev/urandom of=$mntpnt/testfile.$$ &
        typeset pid=$!
 
        zpool iostat -v 1 3 > /dev/null

As an aside, the codecov.io integration is in progress but it may take a few more days before we have it fully sorted out.

@@ -61,7 +61,8 @@ export NO_POOLS="no pools available"
# pattern to ignore from 'zfs list'.
export NO_DATASETS="no datasets available"

export TEST_BASE_DIR="/var/tmp"
# Default directory used for test files
export TEST_BASE_DIR="${FILEDIR:-/var/tmp}"
Contributor

Good idea.

@@ -70,6 +70,9 @@ post =
[tests/functional/cli_root/zfs]
tests = ['zfs_001_neg', 'zfs_002_pos', 'zfs_003_neg']

[tests/functional/cli_root/zfs_bookmark]
tests = ['zfs_bookmark_cliargs']
Contributor

Along similar lines, it turns out there's no coverage for zfs diff. That's something we'll want to get addressed in a potential follow-up PR.
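A rough sketch of the kind of check such a follow-up test could make; the dataset, snapshot, and file names below are made up:

# Snapshot, create a file, snapshot again, then verify 'zfs diff' reports the
# new file as an added ('+') entry between the two snapshots.
log_must zfs snapshot $TESTPOOL/$TESTFS@pre
log_must touch $TESTDIR/diff_file
log_must zfs snapshot $TESTPOOL/$TESTFS@post
log_must eval "zfs diff $TESTPOOL/$TESTFS@pre $TESTPOOL/$TESTFS@post | grep -q diff_file"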

# See issue: https://github.com/zfsonlinux/zfs/issues/6145
if is_linux; then
log_unsupported "Test case occasionally fails"
fi
Contributor

I only see cleanup fixes for this test case. Are we sure that it's actually fixed? Also, it seems that the wrong issue is referenced here, which is strange.

Contributor Author

I tried enabling some SKIPped tests and this wasn't failing; I just discovered the cleanup issue with a custom runfile.
The reason this was disabled is in 95401cb: "Skip until layering pools on zvols is solid". Here $TESTDIR is a ZFS mountpoint; I could move this to $TEST_BASE_DIR like I did with the other file VDEVs (I'll also update the commit message with the next push).

Contributor

That would explain it. Layering pools on zvols is much better than it used to be but still doesn't seem to be 100% solid and could deadlock. Since we don't need to layer a pool on a zvol to perform this test I think moving it as you suggest would be best. I'd feel much better about re-enabling it then.

sync
# check if we started reusing objects
object=$(ls -i $mntpnt | sort -n | awk -v object=$object \
'{if ($1 <= object) {exit 1}} END {print $1}')
Contributor

For my clarification, the speed-up here is the result of checking for any object <= object rather than trying to match the first object exactly?

Contributor Author

Yes, from my (limited) understanding the issue is caused by large_dnodes + multi-threaded dmu_object_alloc(): ever since large_dnodes was merged the "SA master node" seems to always be object 32; when we create the first test file ($first_object in the ksh script) we get the very first free object which is 2:

root@linux:~# touch /testpool/file
root@linux:~# zdb -dd testpool/
Dataset testpool [ZPL], ID 54, cr_txg 1, 24.0K, 7 objects

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K  13.0K     512    32K   10.94  DMU dnode
        -1    1   128K    512      0     512    512  100.00  ZFS user/group used
        -2    1   128K    512      0     512    512  100.00  ZFS user/group used
         1    1   128K    512     1K     512    512  100.00  ZFS master node
         2    1   128K    512      0     512    512    0.00  ZFS plain file
        32    1   128K    512      0     512    512  100.00  SA master node
        33    1   128K    512      0     512    512  100.00  ZFS delete queue
        34    1   128K    512      0     512    512  100.00  ZFS directory
        35    1   128K  1.50K     1K     512  1.50K  100.00  SA attr registration
        36    1   128K    16K     8K     512    32K  100.00  SA attr layouts

But with the multi-threaded object allocator it doesn't seem to be possible (or maybe it's just more difficult) to re-allocate objects < 64, so the test case is basically a while(true) loop.

Given that the objective is to "verify that 'zfs send' drills appropriate holes", we just need to verify we reallocated an object, which is why I've changed the condition to <= object. We could play it safe and also verify we correctly generate FREEOBJECTS records between OBJECT records (zstreamdump + awk?); see the sketch below.
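A hedged sketch of that optional extra check, assuming zstreamdump's verbose output names the record types; the dataset and snapshot names here are placeholders:

# Verify the incremental stream contains at least one FREEOBJECTS record
# in addition to the OBJECT records.
log_must eval "zfs send -i $sendfs@snap1 $sendfs@snap2 | zstreamdump -v | grep -q FREEOBJECTS"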

Contributor

Got it, that makes sense. The change in object id of the "SA master node" was a known side effect of the multi-threaded allocator which wasn't expected to cause any problems. It just means we may create a few holes, but this is the normal state for any active filesystem, so that's fine. Your fix for the test case looks entirely reasonable to me. We could add the zstreamdump check too but I don't think it's necessary.

# export unintentionally imported pools
for poolname in $(get_all_pools); do
if [[ -z ${testpools[$poolname]} ]]; then
log_must zpool export $poolname
Contributor

Maybe use log_must_busy here.

log_must zpool export ${TESTPOOL}-$id
for poolname in ${testpools[@]}; do
if poolexists $poolname ; then
log_must zpool export $poolname
Contributor

Maybe use log_must_busy here.

@@ -36,7 +36,7 @@
verify_runnable "both"

# See issue: https://github.com/zfsonlinux/zfs/issues/6086
if is_linux; then
if is_32bit; then
Contributor

There have been reports that this test, and other rsend tests, are occasionally failing due to the low default zfs_arc_min value. We may want to consider increasing the minimum value slightly, although the fixes in 787acae may make that unnecessary.
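For anyone chasing this locally, the ARC floor can be raised at runtime through the module parameter; the 128M value below is just an example, not a recommendation:

# Raise zfs_arc_min for the running system; to persist it across reboots, use
# a modprobe option such as 'options zfs zfs_arc_min=134217728'.
echo 134217728 > /sys/module/zfs/parameters/zfs_arc_min
cat /sys/module/zfs/parameters/zfs_arc_min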

Contributor Author

I need this test enabled to push #6623 forward: I wonder if dumping the send streams in filesystems like /$sendpool and /$streamfs is really needed, or if it's just stressing ZFS more than necessary for the rsend tests.

Contributor

I'm all for enabling this test case as long as we can run it reliably.

I've opened PR #6659 which increases the default arc_c_min on systems which have more than 1G of total system memory. It will be set to 1/32 of total system memory with the existing 32M floor preserved. This brings the code a little more in line with OpenZFS, although we still permit a lower floor in order to support low memory systems.

It also means that for the buildbot TESTERS arc_c_min will now default to 128M (1/32 of 4G), which based on open issues is expected to be large enough to avoid the reported performance problems with the rsend tests.

As for dumping the send streams in /$sendpool and /$streamfs that may put further stress on the system but that's exactly the kind of thing we want to test. Let's leave it as is unless we determine we have to change it.

@loli10K loli10K force-pushed the issue-6627 branch 2 times, most recently from cdcb34a to 9e0441c on September 19, 2017 18:21
@loli10K
Contributor Author

loli10K commented Sep 19, 2017

As it turns out, #6143 seems to be reproducible but very specific to some kernels; when we zfs rollback the parent dataset we also remove its children's mountpoints, so issuing zfs umount $subfs fails when /bin/umount cannot find the directory (the umount syscall returns ENOENT):

root@linux:~# zfs create $POOLNAME/fs
root@linux:~# zfs snap $POOLNAME/fs@snap
root@linux:~# zfs create $POOLNAME/fs/subfs
root@linux:~# zfs rollback $POOLNAME/fs@snap
root@linux:~# # subfs seems to be mounted
root@linux:~# grep $POOLNAME /proc/mounts
testpool /testpool zfs rw,xattr,noacl 0 0
testpool/fs /testpool/fs zfs rw,xattr,noacl 0 0
testpool/fs/subfs /testpool/fs/subfs zfs rw,xattr,noacl 0 0
root@linux:~# # first time: fail
root@linux:~# strace -qf -e trace=execve,umount zfs umount -a
execve("/usr/sbin/zfs", ["zfs", "umount", "-a"], [/* 16 vars */]) = 0
[pid   305] execve("/bin/umount", ["/bin/umount", "-t", "zfs", "/testpool/fs/subfs"], [/* 16 vars */]) = 0
[pid   305] umount("/testpool/fs/subfs", 0) = -1 ENOENT (No such file or directory)
umount: /testpool/fs/subfs: mountpoint not found
[pid   305] +++ exited with 32 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=305, si_uid=0, si_status=32, si_utime=0, si_stime=0} ---
cannot unmount '/testpool/fs/subfs': umount failed
[pid   308] execve("/bin/umount", ["/bin/umount", "-t", "zfs", "/testpool/fs"], [/* 16 vars */]) = 0
[pid   308] umount("/testpool/fs", 0)   = 0
[pid   308] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=308, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
[pid   311] execve("/bin/umount", ["/bin/umount", "-t", "zfs", "/testpool"], [/* 16 vars */]) = 0
[pid   311] umount("/testpool", 0)      = 0
[pid   311] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=311, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
+++ exited with 1 +++
root@linux:~# # second time: ok
root@linux:~# zfs umount -a
root@linux:~#

This is also problematic when we have different filesystems with the same mountpoint set: zfs umount $fs could umount the wrong filesystem.

root@linux:~# zfs create $POOLNAME/over1 -o mountpoint=/mnt/over
root@linux:~# zfs create $POOLNAME/over2 -o mountpoint=/mnt/over
root@linux:~# zfs create $POOLNAME/over3 -o mountpoint=/mnt/over
root@linux:~# mount -t zfs
testpool/over1 on /mnt/over type zfs (rw,xattr,noacl)
testpool/over2 on /mnt/over type zfs (rw,xattr,noacl)
testpool/over3 on /mnt/over type zfs (rw,xattr,noacl)
root@linux:~# # could umount wrong fs
root@linux:~# zfs umount $POOLNAME/over2
root@linux:~# mount -t zfs
testpool/over1 on /mnt/over type zfs (rw,xattr,noacl)
testpool/over2 on /mnt/over type zfs (rw,xattr,noacl)
root@linux:~# 

The proposed fix umounts using the dataset name instead of the mountpoint path: this seems to fix #6143. We could add a new test case to verify the second case (multiple fs with the same mountpoint).
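One way to illustrate why the dataset name is the safer handle, assuming a util-linux umount (which accepts the mount source as well as the mountpoint):

# Unmounting by path is ambiguous: the kernel detaches the topmost mount at
# /mnt/over, which is not necessarily the dataset we asked about.
umount /mnt/over
# Unmounting by source is unique, since each dataset name appears exactly once
# in /proc/mounts.
umount testpool/over2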

@behlendorf
Contributor

@loli10K nice find regarding #6143. Using the dataset name instead should be a good solution for both cases since it's unique.

@behlendorf
Contributor

@loli10K as a minor aside, you'll want to rebase this on master to pick up 8e2ddda for this to pass on the coverage builder.

@@ -57,7 +57,7 @@ function display_status
((ret |= $?))

typeset mntpnt=$(get_prop mountpoint $pool)
dd if=/dev/random of=$mntpnt/testfile.$$ &
dd if=/dev/urandom of=$mntpnt/testfile.$$ &
Contributor Author

@loli10K loli10K Sep 21, 2017

This change is triggering "busy" failures in some cache test scripts:

Test: /usr/share/zfs/zfs-tests/tests/functional/cache/cache_006_pos (run as root) [00:04] [FAIL]
08:41:00.92 ASSERTION: Exporting and importing pool with cache devices passes.
08:41:01.70 SUCCESS: zpool create testpool /mnt/testdir/disk.cache/a /mnt/testdir/disk.cache/b /mnt/testdir/disk.cache/c cache loop0 loop1
08:41:01.71 SUCCESS: verify_cache_device testpool loop0 ONLINE
08:41:02.36 SUCCESS: zpool export testpool
08:41:03.76 SUCCESS: zpool import -d /mnt/testdir/disk.cache testpool
08:41:05.80 SUCCESS: display_status testpool
08:41:05.81 SUCCESS: verify_cache_device testpool loop1 ONLINE
08:41:05.82 ERROR: zpool destroy testpool exited 1
08:41:05.82 umount: /testpool: target is busy. (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1)) cannot unmount '/testpool': umount failed could not destroy 'testpool': could not unmount datasets

We can either s/log_must/log_must_busy/ the zpool destroy or run sync just after kill -9 $dd_pid: I think the latter is preferred; we shouldn't burden verify_cache_device() callers with dealing explicitly with an optionally busy fs, it's cleaner to deal with this issue inside the function.

EDIT: the function I was referring to is display_status(), not verify_cache_device().

Contributor

@behlendorf behlendorf Sep 21, 2017

Or how about using wait until everything terminates?

Contributor Author

@behlendorf I don't know why it's working, but it is indeed working; I'll have to RTFM on wait (which I think is more lightweight than sync, which is good). Thanks.

Contributor

It should work because the problem isn't that there's unwritten dirty data. This would get written out as part of the unmount regardless. The problem is that there are still open file handles which are preventing the unmount. The test need only wait until those child processes exit and the file handles are closed.
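A minimal sketch of that approach inside display_status(), assuming the background dd pid is still in $pid and that reaping it is all the caller needs:

# Stop the background dd and wait for it to exit, so its open handle on
# testfile.$$ is released before the caller unmounts or destroys the pool.
kill -9 $pid
wait $pid 2>/dev/null
rm -f $mntpnt/testfile.$$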

@loli10K do you think you're close to have a version of this cleanup ready to merge? Selfishly I'd love to stop seeing some of these known test failures!

Contributor Author

This should be almost ready; the only known failure is inherit_001_pos, which is reproducible and was introduced with the fix for #6143. I'm working on it right now.

@loli10K loli10K force-pushed the issue-6627 branch 2 times, most recently from fa7b799 to a5f1766 on September 22, 2017 05:43
@loli10K
Contributor Author

loli10K commented Sep 23, 2017

I may have found the issue (inherit_001_pos): zfs inherit -r sharenfs tries to unshare the dataset twice; this should not happen because we carefully call zfs_uninit_libshare() in changelist_postfix() to avoid cached values when checking if a dataset is_mounted()/is_shared().

Here we are attached to a "soon-to-fail" zfs inherit -r sharenfs process which is trying to check if the dataset is already shared before trying to exportfs -u: libshare was already uninitialized (cn->cn_handle->zfs_hdl->libzfs_sharehdl == 0x0) but we still have an open fd on the old (deleted) sharetab (cn->cn_handle->zfs_hdl->libzfs_sharetab == (FILE *) 0x10e2c60), so is_shared() is being called with a semi-uninitialized libshare.

(gdb) bt
#0  is_shared (hdl=0x10e2090, mountpoint=0x10eaa70 "/TESTPOOL/TESTCTR/TESTFS1", proto=PROTO_NFS) at libzfs_mount.c:136
#1  0x00007f8222870a50 in zfs_unshare_proto (zhp=0x10ea7d0, mountpoint=0x0, proto=0x7f8222aff808 <nfs_only>) at libzfs_mount.c:1026
#2  0x00007f8222870ae4 in zfs_unshare_nfs (zhp=0x10ea7d0, mountpoint=0x0) at libzfs_mount.c:1044
#3  0x00007f82228570ee in changelist_postfix (clp=0x10e6cf0) at libzfs_changelist.c:257
#4  0x00007f8222860165 in zfs_prop_inherit (zhp=0x10e5070, propname=0x7f82228ea774 "sharenfs", received=B_FALSE) at libzfs_dataset.c:2017
#5  0x000000000040963f in inherit_recurse_cb (zhp=0x10e5070, data=0x7ffd5a8d1060) at zfs_main.c:1914
#6  0x0000000000405f0c in zfs_for_each (argc=1, argv=0x7ffd5a8d11f8, flags=1, types=(ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT | ZFS_TYPE_VOLUME), sortcol=0x0, proplist=0x0, limit=0, callback=0x4095c8 <inherit_recurse_cb>, data=0x7ffd5a8d1060) at zfs_iter.c:477
#7  0x0000000000409a1d in zfs_do_inherit (argc=1, argv=0x7ffd5a8d11f8) at zfs_main.c:2027
#8  0x0000000000414aa0 in main (argc=5, argv=0x7ffd5a8d11d8) at zfs_main.c:7381
(gdb) frame 3
#3  0x00007f82228570ee in changelist_postfix (clp=0x10e6cf0) at libzfs_changelist.c:257
257				errors += zfs_unshare_nfs(cn->cn_handle, NULL);
(gdb) p cn->cn_handle->zfs_hdl->libzfs_sharetab 
$33 = (FILE *) 0x10e2c60
(gdb) p cn->cn_handle->zfs_hdl->libzfs_sharehdl 
$34 = (void *) 0x0
(gdb) call getpid()
$35 = 8246
(gdb) !lsof -p 8246
COMMAND  PID USER   FD   TYPE             DEVICE SIZE/OFF   NODE NAME
zfs     8246 root  cwd    DIR              253,0     4096 528419 /var/tmp/test_results/20170923T085947
zfs     8246 root  rtd    DIR              253,0     4096      2 /
[...]
zfs     8246 root    0u  unix 0xffff8800ace04400      0t0  93804 socket
zfs     8246 root    1w  FIFO                0,8      0t0  96487 pipe
zfs     8246 root    2w  FIFO                0,8      0t0  96488 pipe
zfs     8246 root    3u   CHR              10,57      0t0  81832 /dev/zfs
zfs     8246 root    4r   REG                0,3        0  95529 /proc/8246/mounts
zfs     8246 root    5r   REG              253,0       44 131789 /etc/dfs/sharetab.yQlNQh (deleted)
zfs     8246 root    6u   CHR              10,57      0t0  81832 /dev/zfs
zfs     8246 root    7w   REG              253,0        0 798178 /tmp/zfs.lock
zfs     8246 root   12u   REG              253,0      138 799936 /tmp/exportfs.uKUxIP (deleted)
(gdb) !exportfs -v
(gdb) !cat /proc/8246/fd/5
/TESTPOOL/TESTCTR/TESTFS1	-	nfs	rw,crossmnt
(gdb) !cat /proc/8246/fd/12
/TESTPOOL/TESTCTR/TESTFS1
		<world>(rw,sync,wdelay,hide,crossmnt,no_subtree_check,mountpoint,sec=sys,secure,no_root_squash,no_all_squash)
(gdb) 

We call zfs_init_libshare() again in unshare_one(), but unfortunately it's too late: we already decided, a couple of frames up in zfs_unshare_proto(), that "/TESTPOOL/TESTCTR/TESTFS1" is still shared, based on (wrong) cached values, and we expect to be able to unshare it successfully.

Unfortunately with a fresh libshare_hdl the share cannot be found: "cannot unshare 'TESTPOOL/TESTCTR/TESTFS1': not found: unshare(1M) failed"

1023			for (curr_proto = proto; *curr_proto != PROTO_END;
1024			    curr_proto++) {
1025	
1026				if (is_shared(hdl, mntpt, *curr_proto) &&
1027				    unshare_one(hdl, zhp->zfs_name,
1028				    mntpt, *curr_proto) != 0) {
1029					if (mntpt != NULL)
1030						free(mntpt);
1031					return (-1);
(gdb) bt
#0  zfs_init_libshare (zhandle=0x19c8090, service=1) at libzfs_mount.c:788
#1  0x00007fbe4e3227d2 in unshare_one (hdl=0x19c8090, name=0x19d07e0 "TESTPOOL/TESTCTR/TESTFS1", mountpoint=0x19cd920 "/TESTPOOL/TESTCTR/TESTFS1", proto=PROTO_NFS) at libzfs_mount.c:975
#2  0x00007fbe4e322a72 in zfs_unshare_proto (zhp=0x19d07d0, mountpoint=0x0, proto=0x7fbe4e5b1808 <nfs_only>) at libzfs_mount.c:1027
#3  0x00007fbe4e322ae4 in zfs_unshare_nfs (zhp=0x19d07d0, mountpoint=0x0) at libzfs_mount.c:1044
#4  0x00007fbe4e3090ee in changelist_postfix (clp=0x19cccf0) at libzfs_changelist.c:257
#5  0x00007fbe4e312165 in zfs_prop_inherit (zhp=0x19cb070, propname=0x7fbe4e39c774 "sharenfs", received=B_FALSE) at libzfs_dataset.c:2017
#6  0x000000000040963f in inherit_recurse_cb (zhp=0x19cb070, data=0x7ffef6c4a140) at zfs_main.c:1914
#7  0x0000000000405f0c in zfs_for_each (argc=1, argv=0x7ffef6c4a2d8, flags=1, types=(ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT | ZFS_TYPE_VOLUME), sortcol=0x0, proplist=0x0, limit=0, callback=0x4095c8 <inherit_recurse_cb>, data=0x7ffef6c4a140) at zfs_iter.c:477
#8  0x0000000000409a1d in zfs_do_inherit (argc=1, argv=0x7ffef6c4a2d8) at zfs_main.c:2027
#9  0x0000000000414aa0 in main (argc=5, argv=0x7ffef6c4a2b8) at zfs_main.c:7381
(gdb) list libzfs_mount.c:975
970		 * value.
971		 */
972		mntpt = zfs_strdup(hdl, mountpoint);
973	
974		/* make sure libshare initialized */
975		if ((err = zfs_init_libshare(hdl, SA_INIT_SHARE_API)) != SA_OK) {
976			free(mntpt);	/* don't need the copy anymore */
977			return (zfs_error_fmt(hdl, proto_table[proto].p_unshare_err,
978			    dgettext(TEXT_DOMAIN, "cannot unshare '%s': %s"),
979			    name, sa_errorstr(err)));
(gdb) 
980		}
981	
982		share = sa_find_share(hdl->libzfs_sharehdl, mntpt);
983		free(mntpt);	/* don't need the copy anymore */
984	
985		if (share != NULL) {
986			err = sa_disable_share(share, proto_table[proto].p_name);
987			if (err != SA_OK) {
988				return (zfs_error_fmt(hdl,
989				    proto_table[proto].p_unshare_err,
(gdb) 
990				    dgettext(TEXT_DOMAIN, "cannot unshare '%s': %s"),
991				    name, sa_errorstr(err)));
992			}
993		} else {
994			return (zfs_error_fmt(hdl, proto_table[proto].p_unshare_err,
995			    dgettext(TEXT_DOMAIN, "cannot unshare '%s': not found"),
996			    name));
997		}
998		return (0);
999	}
(gdb) up
#1  0x00007fbe4e3227d2 in unshare_one (hdl=0x19c8090, name=0x19d07e0 "TESTPOOL/TESTCTR/TESTFS1", mountpoint=0x19cd920 "/TESTPOOL/TESTCTR/TESTFS1", proto=PROTO_NFS) at libzfs_mount.c:975
975		if ((err = zfs_init_libshare(hdl, SA_INIT_SHARE_API)) != SA_OK) {
(gdb) up
#2  0x00007fbe4e322a72 in zfs_unshare_proto (zhp=0x19d07d0, mountpoint=0x0, proto=0x7fbe4e5b1808 <nfs_only>) at libzfs_mount.c:1027
1027				    unshare_one(hdl, zhp->zfs_name,
(gdb) list
1022	
1023			for (curr_proto = proto; *curr_proto != PROTO_END;
1024			    curr_proto++) {
1025	
1026				if (is_shared(hdl, mntpt, *curr_proto) &&
1027				    unshare_one(hdl, zhp->zfs_name,
1028				    mntpt, *curr_proto) != 0) {
1029					if (mntpt != NULL)
1030						free(mntpt);
1031					return (-1);
(gdb) 

This is the issue; now we just need a fix.

EDIT: the fix is to freopen(ZFS_SHARETAB) in is_shared(), like is_mounted() does (via libzfs_mnttab_find()).

@loli10K loli10K force-pushed the issue-6627 branch 3 times, most recently from ac02a38 to 5258aff on September 24, 2017 06:43
@loli10K
Contributor Author

loli10K commented Sep 24, 2017

For some reason the Coverage builder is not picking up my new test case:

/var/lib/buildbot/slaves/zfs/Ubuntu_17_04_x86_64_Coverage__TEST_/build/zfs/tests/test-runner/bin/test-runner.py  -c /var/lib/buildbot/slaves/zfs/Ubuntu_17_04_x86_64_Coverage__TEST_/build/zfs/tests/runfiles/linux.run -i /var/lib/buildbot/slaves/zfs/Ubuntu_17_04_x86_64_Coverage__TEST_/build/zfs/tests/zfs-tests
Warning: TestGroup '/var/lib/buildbot/slaves/zfs/Ubuntu_17_04_x86_64_Coverage__TEST_/build/zfs/tests/zfs-tests/tests/functional/cli_root/zfs_bookmark' not added to this run. Auxiliary script '/var/lib/buildbot/slaves/zfs/Ubuntu_17_04_x86_64_Coverage__TEST_/build/zfs/tests/zfs-tests/tests/functional/cli_root/zfs_bookmark/setup' failed verification.

I should have fixed the STYLE warning locally; I just need to tackle this coverage issue before the final push. I have already rebased on master and also added a new test to cover OpenZFS 8166 ("zpool scrub thinks it repaired offline device").

@dinatale2
Contributor

@loli10K I believe your new cleanup.ksh and setup.ksh files also need to be marked as executable. That may be why you're seeing that message.
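For reference, one way to fix that in the working tree; the paths below are the in-tree locations implied by the warning above:

# Mark the auxiliary scripts executable and stage the mode change in git.
chmod +x tests/zfs-tests/tests/functional/cli_root/zfs_bookmark/setup.ksh \
         tests/zfs-tests/tests/functional/cli_root/zfs_bookmark/cleanup.ksh
git add tests/zfs-tests/tests/functional/cli_root/zfs_bookmark/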

@loli10K
Contributor Author

loli10K commented Sep 24, 2017

@dinatale2 that's it, thanks; though I wonder why the other builders are not (rightfully) complaining.

 * Add 'zfs bookmark' coverage (zfs_bookmark_cliargs)

 * Add OpenZFS 8166 coverage (zpool_scrub_offline_device)

 * Fix "busy" zfs_mount_remount failures

 * Fix bootfs_003_pos, bootfs_004_neg, zdb_005_pos local cleanup

 * Update usage of $KEEP variable, add get_all_pools() function

 * Enable history_008_pos and rsend_019_pos (non-32bit builders)

 * Enable zfs_copies_005_neg, update local cleanup

 * Fix zfs_send_007_pos (large_dnode + OpenZFS 8199)

 * Fix rollback_003_pos (use dataset name, not mountpoint, to unmount)

 * Update default_raidz_setup() to work properly with more than 3 disks

 * Use $TEST_BASE_DIR instead of hardcoded (/var)/tmp for file VDEVs

 * Update usage of /dev/random to /dev/urandom

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
@loli10K loli10K changed the title from "[WIP] Fix some ZFS Test Suite issues" to "Fix some ZFS Test Suite issues" Sep 24, 2017
@loli10K loli10K removed the "Status: Work in Progress" (not yet ready for general review) label Sep 24, 2017
@behlendorf
Contributor

behlendorf commented Sep 24, 2017

@loli10K my guess is you only saw this on the coverage builder because it's the only one which runs in-tree so we can collect the gcov data. On the other builders the scripts may have had their permissions fixed by the packaging.

EDIT: I've resubmitted the test builders to get another test run.

@loli10K
Contributor Author

loli10K commented Sep 25, 2017

@behlendorf nice, thanks. I really like this coverage thing, it's very easy to spot code paths that are not being exercised by the ZTS: https://codecov.io/gh/zfsonlinux/zfs/src/master/cmd/zfs/zfs_main.c#L6874.

I started working on zfs diff coverage last night; I hope to push something in a couple of days (or less).

@behlendorf
Contributor

@loli10K I'm a big fan of the code coverage too. You can thank @prakashsurya for it, he's done all the heavy lifting to get it working!

As for this PR, I'm going to give it one last review before merging it. Thank you for addressing all of these issues. I'm curious to see which ZTS issues percolate up after these are resolved.

@behlendorf behlendorf merged commit 3fd3e56 into openzfs:master Sep 25, 2017
@loli10K loli10K deleted the issue-6627 branch October 15, 2017 06:22
@loli10K loli10K restored the issue-6627 branch September 22, 2018 09:44
FransUrbo pushed a commit to FransUrbo/zfs that referenced this pull request Apr 28, 2019
* Add 'zfs bookmark' coverage (zfs_bookmark_cliargs)

 * Add OpenZFS 8166 coverage (zpool_scrub_offline_device)

 * Fix "busy" zfs_mount_remount failures

 * Fix bootfs_003_pos, bootfs_004_neg, zdb_005_pos local cleanup

 * Update usage of $KEEP variable, add get_all_pools() function

 * Enable history_008_pos and rsend_019_pos (non-32bit builders)

 * Enable zfs_copies_005_neg, update local cleanup

 * Fix zfs_send_007_pos (large_dnode + OpenZFS 8199)

 * Fix rollback_003_pos (use dataset name, not mountpoint, to unmount)

 * Update default_raidz_setup() to work properly with more than 3 disks

 * Use $TEST_BASE_DIR instead of hardcoded (/var)/tmp for file VDEVs

 * Update usage of /dev/random to /dev/urandom

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Issue openzfs#6086 
Closes openzfs#5658 
Closes openzfs#6143 
Closes openzfs#6421 
Closes openzfs#6627 
Closes openzfs#6632