Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zfs_rename: support RENAME_* flags #8667

Closed
wants to merge 3 commits into from

Conversation

cyphar
Copy link
Contributor

@cyphar cyphar commented Apr 24, 2019

Motivation and Context

In order to allow overlayfs-on-ZFS (#8648), we need to support the renameat(2)
flags (most importantly, RENAME_WHITEOUT and RENAME_EXCHANGE). This
requires quite a few changes to the ordering of link operations during
rename (and the related error paths).

Signed-off-by: Aleksa Sarai cyphar@cyphar.com

How Has This Been Tested?

I conducted a variety of smoke-tests to check that all of the renameat(2) flags (and ordinary rename) all operated correctly. After all the runs, I checked that there were no leaks with zdb -bbb. However, I have not tested the error path at all (and I have a feeling this is almost certainly where I've made mistakes).

  • TODO: Run the rename-related xfstests.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@cyphar cyphar force-pushed the zfs-renameat2-flags branch 5 times, most recently from 7e9ef73 to 3a73e39 Compare April 25, 2019 17:46
@cyphar cyphar force-pushed the zfs-renameat2-flags branch 4 times, most recently from dab899c to d1b7f05 Compare April 26, 2019 15:04
@cyphar cyphar changed the title zfs: support renameat(2) flags zfs_rename: support RENAME_* flags Apr 26, 2019
@cyphar cyphar force-pushed the zfs-renameat2-flags branch 5 times, most recently from eddcef0 to e5c6926 Compare April 27, 2019 01:14
@cyphar cyphar marked this pull request as ready for review April 27, 2019 07:21
@cyphar
Copy link
Contributor Author

cyphar commented Apr 27, 2019

I think this is ready for review (this is my first ZFS patch, so I'm sure I've made quite a few bad mistakes but I've smoke-tested that the functionality works).

/cc @behlendorf

@cyphar cyphar force-pushed the zfs-renameat2-flags branch from e5c6926 to e53ab06 Compare April 27, 2019 07:24
@rlaager
Copy link
Member

rlaager commented Apr 30, 2019

I'm not well-versed in the ZFS kernel code, so perhaps this is obvious, but... what does the whiteout inode look like "on disk"? More specifically, is there a backwards compatibility issue here? In other words, what happens if I create a whiteout inode and then import my pool and mount the filesystem on a version of ZFS that predates your commits?

@cyphar
Copy link
Contributor Author

cyphar commented Apr 30, 2019

@rlaager The whiteout inode in overlayfs is just a character device with major and minor numbers 0 (this is distinct from BSD's own whiteout concept which is a whole different kettle of fish). There is no special treatment in ZFS of such character devices, the only reason we have to treat it specially in zfs_rename is because the creation of the whiteout inode during the rename has to be done atomically.

So importing old pools or new pools on old machines won't change anything since the inode is just a bog-standard character device. You could make one with mknod foo c 0 0.

@rlaager
Copy link
Member

rlaager commented Apr 30, 2019

@rlaager The whiteout inode in overlayfs is just a character device with major and minor numbers 0 (this is distinct from BSD's own whiteout concept which is a whole different kettle of fish). There is no special treatment in ZFS of such character devices, the only reason we have to treat it specially in zfs_rename is because the creation of the whiteout inode during the rename has to be done atomically.

Thanks. For other filesystems, do whiteout nodes work when the filesystem is mounted nodev? With your changes, is ZFS following that behavior with nodev/devices=off?

If the answer to the first question is "no", then that's an argument against devices=off in the Root-on-ZFS HOWTOs that I maintain, and I'd want to remove it.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label May 2, 2019
@cyphar
Copy link
Contributor Author

cyphar commented May 2, 2019

@rlaager MS_NODEV only disallows access to device inodes, it doesn't affect their creation. So RENAME_WHITEOUT does work (as does mknod) on a nodev-mounted filesystem. Since the purpose of a whiteout is to represent a tombstone for a file (which then overlayfs or other filesystems can use for internal representation), there's no need to ever access them.

It is my understanding that devices=off is supposed to operate in the same way, and from my testing it does.

@rlaager
Copy link
Member

rlaager commented May 2, 2019

@cyphar Thanks!

@behlendorf
Copy link
Contributor

@cyphar sorry about the delay in getting this reviewed. I should be able to take a proper look in the next week or so. Thank you in advance for implementing this needed functionality.

@cyphar
Copy link
Contributor Author

cyphar commented May 9, 2019

No problem, totally understandable. :D

I haven't worked on ZFS code before so hopefully I didn't make too many mistakes with regards to the error path and how txgs actually operate (as far as I can tell, there's isn't a rollback-on-failure mechanism?). It's also possible that zfs_drop_nlink is completely un-necessary because re-linking should always succeed.

@cyphar
Copy link
Contributor Author

cyphar commented Jun 15, 2019

/ping now that 0.8.1 is out.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar sorry for the delay, thanks for your patience. The broad stokes of the patch look reasonable to me. From my perspective, what's needed is to resolve the accidental on-disk format change and to add some test coverage. Additional comments inline.

@cyphar cyphar force-pushed the zfs-renameat2-flags branch 8 times, most recently from 4e5e969 to eb8a822 Compare June 23, 2019 07:46
cyphar added 3 commits June 23, 2019 17:52
This is in preparation for RENAME_EXCHANGE and RENAME_WHITEOUT support
for ZoL, but the changes here allow for far nicer fallbacks than the
previous implementation (the source and target are re-linked in case of
the final link failing).

In addition, a small cleanup was done for the "target exists but is a
different type" codepath so that it's more understandable.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Implement support for Linux's RENAME_* flags (for renameat2). Aside from
being quite useful for userspace (providing race-free ways to exchange
paths and implement mv --no-clobber), they are used by overlayfs and are
thus required in order to use overlayfs-on-ZFS.

In order for us to represent the new renameat2(2) flags in the ZIL, we
need to create a new transaction type (to be backwards-compatible).
Since RENAME_EXCHANGE and RENAME_WHITEOUT are mutually exclusive they
deserve separate types. We just re-use the logic of
zfs_{log,replay}_rename() with the only change being the transaction
types and the associate vfsflags passed to zfs_rename().

RENAME_NOREPLACE doesn't need an entry because if the renameat2(2) fails
because of RENAME_NOREPLACE there won't be a ZIL entry for the operation
(and if it succeeds then it should also succeed on-replay).

Unfortunately, more work is required in order use overlayfs-on-ZFS
(namely we have to remove our .d_revalidate hook, since overlayfs
refuses to use a filesystem with d_revalidate as an upperdir).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Since mv(1) doesn't expose the RENAME_* flags we need to have our own
variation in the tests/ tree. The tests are fairly obvious functional
tests, though in the future (once we solve the d_revalidate issue) we
might also add a full-stack overlayfs integration test.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@cyphar cyphar force-pushed the zfs-renameat2-flags branch from eb8a822 to 4ea6f30 Compare June 23, 2019 07:52
@cyphar
Copy link
Contributor Author

cyphar commented Jun 23, 2019

Alright @behlendorf, I've made the discussed changes. The only one that isn't yet possible to do is:

Since the target use case is overlayfs-on-ZFS let's make sure to add a test case for it.

Because we still need to solve the d_revalidate issue in #8774.

commit_link_szp:
if (zfs_link_create(sdl, szp, tx, ZRENAMING))
VERIFY3U(zfs_drop_nlink(szp, tx, NULL), ==, 0);
goto commit;
Copy link
Contributor Author

@cyphar cyphar Jun 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@behlendorf When you take a look at this next can you please let me know whether zfs_drop_nlink() is actually needed and/or safe? My understanding of the link-related errors is that you shouldn't get one from zfs_link_create given that we just removed the link in this function (so zfs_drop_nlink() might be completely unnecessary).

I also don't like that this will basically BUG_ON if we happen to hit this error with a non-empty directory...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use zfs_dirent_lock to lock the specific names to protect them. Then you wouldn't need to call zfs_drop_nlink here, since it shouldn't be possible for these operations to fail.

@codecov
Copy link

codecov bot commented Jun 23, 2019

Codecov Report

Merging #8667 into master will decrease coverage by 1.24%.
The diff coverage is 42.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8667      +/-   ##
==========================================
- Coverage   78.71%   77.46%   -1.25%     
==========================================
  Files         388      377      -11     
  Lines      120071   119677     -394     
==========================================
- Hits        94512    92706    -1806     
- Misses      25559    26971    +1412
Flag Coverage Δ
#kernel 77.54% <42.85%> (-1.97%) ⬇️
#user 65.86% <ø> (-0.65%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 186898b...4ea6f30. Read the comment docs.

@@ -625,15 +625,14 @@ zfs_replay_link(void *arg1, void *arg2, boolean_t byteswap)
}

static int
zfs_replay_rename(void *arg1, void *arg2, boolean_t byteswap)
_zfs_replay_renameat2(void *arg1, void *arg2, boolean_t byteswap, int vflg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you rename this zfs_replay_rename_impl which is the convention used by the rest of the code base.

sys_renameat2(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, unsigned int flags)
{
int ret = syscall(SYS_renameat2, olddirfd, oldpath, newdirfd, newpath,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until now I didn't realize that glibc doesn't provide a wrapper for renameat2. I'd mention that here, or in the comment above, so it clear why this is needed.

cleanup.ksh \
renameat2_001_noreplace.ksh \
renameat2_002_exchange.ksh \
renameat2_003_whiteout.ksh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: For new tests we've been moving away from the 00x prefix so let's remove it, and simply go with renameat2_noreplace.ksh, renameat2_exchange.ksh, renameat2_whiteout.ksh.

You're also going to need to add the following block to the tests/runfiles/linux.run file. This file controls which test cases get run, and defines some groups to make it easy to select tests.

index b73528e43..fec8cff5c 100644
--- a/tests/runfiles/linux.run
+++ b/tests/runfiles/linux.run
@@ -778,6 +778,10 @@ tests = ['removal_all_vdev', 'removal_check_space',
     'remove_mirror', 'remove_mirror_sanity', 'remove_raidz']
 tags = ['functional', 'removal']
 
+[tests/functional/renameat2]
+tests = ['renameat2_noreplace', 'renameat2_exchange', 'renameat2_whiteout']
+tags = ['functional', 'renameat2']
+
 [tests/functional/rename_dirs]
 tests = ['rename_dirs_001_pos']
 tags = ['functional', 'rename_dirs']
$ ./scripts/zfs-tests.sh -T renameat2
Test: /home/fedora/src/git/zfs/tests/zfs-tests/tests/functional/renameat2/setup (run as root) [00:00] [PASS]
...

@@ -22,6 +22,7 @@ SUBDIRS = \
randfree_file \
randwritecomp \
readmmap \
renameat2 \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, this new test command needs to be added to the command.cfg file. This allows the test suite to locate the command when running in the source tree.

index f8ad02246..e69c4c216 100644
--- a/tests/zfs-tests/include/commands.cfg
+++ b/tests/zfs-tests/include/commands.cfg
@@ -178,6 +178,7 @@ export ZFSTEST_FILES='chg_usr_exec
     randfree_file
     randwritecomp
     readmmap
+    renameat2
     rename_dir
     rm_lnkcnt_zero_file
     threadsappend

log_unsupported "renameat2 is linux-only"
elif ! renameat2 -C ; then
log_unsupported "renameat2 not supported on this (pre-3.15) linux kernel"
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests/test-runner/bin/zts-report.py script is used by the test suite to try and summarize the overall test results. When a test might be legitimately skipped, it should be added to the maybe section of the script along with a brief reason. This prevents it from being marked as an unexpected failure.

I'd suggest modeling this after the functional/user_namespace tests which may also be skipped. By moving this check in to the setup.sh the entire test group can be skipped.

uint64_t wo_projid;
boolean_t fuid_dirtied;
zfs_acl_ids_t acl_ids;
boolean_t have_acl = B_FALSE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've probably noticed when looking at some of the other code that we do allow declaring the variable closer to where they're first used. I'd suggest doing that with some of these to make their scope as clear as possible since this is a large function.

if (!tzp && txtype == TX_EXCHANGE) {
error = SET_ERROR(ENOENT);
goto out;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what do you think about moving this check in to an else block to remove the redundant tzp check?

        /*
         * Does target exist?
         */
        if (tzp) {
                ...
        } else {
                /* Target must exist for RENAME_EXCHANGE. */
                if (txtype == TX_EXCHANGE) {
                        error = SET_ERROR(ENOENT);
                        goto out;
                }
        }

@@ -3656,8 +3656,7 @@ int
zfs_rename(struct inode *sdip, char *snm, struct inode *tdip, char *tnm,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than modify this existing complicated function so extensively, what if instead you added a new zfs_renameat2 which is solely responsible for handling RENAME_EXCHANGE and TX_WHITEOUT. That would potentially allow it to be simpler and remove the risk of accidentally breaking zfs_rename in some subtle way. (Or even a zfs_renameat2_exchange() and zfs_renameat2_whiteout if that falls out more cleanly).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could give this a shot, though there is a fair amount of shared code -- the main difference is right at the end in zfs_rename.

* best to correct the state. In particular, all of the nlinks are
* wrong because we were destroying and creating links with ZRENAMING.
*
* In some form, all of thee operations have to resolve the state:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thee

commit_link_szp:
if (zfs_link_create(sdl, szp, tx, ZRENAMING))
VERIFY3U(zfs_drop_nlink(szp, tx, NULL), ==, 0);
goto commit;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use zfs_dirent_lock to lock the specific names to protect them. Then you wouldn't need to call zfs_drop_nlink here, since it shouldn't be possible for these operations to fail.

@snajpa
Copy link
Contributor

snajpa commented Oct 5, 2019

Well I gave it a shot and we can't rely on the dirent locks, the zfs_link_create is still allowed fail in one "edge" case. This test lays it out nicely (that's pretty much the only case this is allowed to fail and has to still return a nice error and not panic) ->

https://github.com/zfsonlinux/zfs/blob/master/tests/zfs-tests/tests/functional/casenorm/mixed_create_failure.ksh#L19

I've also considered separate functions for whiteout/exchange and still @cyphar's solution clearly wins to me (the functions would come out really similar, so it'd even make it harder to understand what the subtle differences are).

So I'll revert to using these patches in full and then just tend to the intent log part - at least if nobody has no better ideas...

@cyphar
Copy link
Contributor Author

cyphar commented Oct 5, 2019

@snajpa I can't think of any other workarounds -- my only concern is whether zfs_drop_nlinks is subtly broken somehow (and the fact that it completely borks if you are doing a RENAME_EXCHANGE of a non-empty directory). Personally I find the way errors are handled within ZFS to be really strange (given ZFS is transactional in nature, I was surprised that error handling couldn't be done with "simple" rollbacks to the state before the transaction started).

@cyphar
Copy link
Contributor Author

cyphar commented Nov 11, 2019

This is being carried by #9414.

@cyphar cyphar closed this Nov 11, 2019
@cyphar cyphar deleted the zfs-renameat2-flags branch November 11, 2019 18:45
@cyphar cyphar mentioned this pull request Jun 9, 2021
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants