-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zfs_rename: support RENAME_* flags #8667
Conversation
7e9ef73
to
3a73e39
Compare
dab899c
to
d1b7f05
Compare
eddcef0
to
e5c6926
Compare
I think this is ready for review (this is my first ZFS patch, so I'm sure I've made quite a few bad mistakes but I've smoke-tested that the functionality works). /cc @behlendorf |
e5c6926
to
e53ab06
Compare
I'm not well-versed in the ZFS kernel code, so perhaps this is obvious, but... what does the whiteout inode look like "on disk"? More specifically, is there a backwards compatibility issue here? In other words, what happens if I create a whiteout inode and then import my pool and mount the filesystem on a version of ZFS that predates your commits? |
@rlaager The whiteout inode in overlayfs is just a character device with major and minor numbers So importing old pools or new pools on old machines won't change anything since the inode is just a bog-standard character device. You could make one with |
Thanks. For other filesystems, do whiteout nodes work when the filesystem is mounted If the answer to the first question is "no", then that's an argument against |
@rlaager It is my understanding that |
@cyphar Thanks! |
e53ab06
to
b93ba92
Compare
@cyphar sorry about the delay in getting this reviewed. I should be able to take a proper look in the next week or so. Thank you in advance for implementing this needed functionality. |
No problem, totally understandable. :D I haven't worked on ZFS code before so hopefully I didn't make too many mistakes with regards to the error path and how txgs actually operate (as far as I can tell, there's isn't a rollback-on-failure mechanism?). It's also possible that |
/ping now that 0.8.1 is out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyphar sorry for the delay, thanks for your patience. The broad stokes of the patch look reasonable to me. From my perspective, what's needed is to resolve the accidental on-disk format change and to add some test coverage. Additional comments inline.
4e5e969
to
eb8a822
Compare
This is in preparation for RENAME_EXCHANGE and RENAME_WHITEOUT support for ZoL, but the changes here allow for far nicer fallbacks than the previous implementation (the source and target are re-linked in case of the final link failing). In addition, a small cleanup was done for the "target exists but is a different type" codepath so that it's more understandable. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Implement support for Linux's RENAME_* flags (for renameat2). Aside from being quite useful for userspace (providing race-free ways to exchange paths and implement mv --no-clobber), they are used by overlayfs and are thus required in order to use overlayfs-on-ZFS. In order for us to represent the new renameat2(2) flags in the ZIL, we need to create a new transaction type (to be backwards-compatible). Since RENAME_EXCHANGE and RENAME_WHITEOUT are mutually exclusive they deserve separate types. We just re-use the logic of zfs_{log,replay}_rename() with the only change being the transaction types and the associate vfsflags passed to zfs_rename(). RENAME_NOREPLACE doesn't need an entry because if the renameat2(2) fails because of RENAME_NOREPLACE there won't be a ZIL entry for the operation (and if it succeeds then it should also succeed on-replay). Unfortunately, more work is required in order use overlayfs-on-ZFS (namely we have to remove our .d_revalidate hook, since overlayfs refuses to use a filesystem with d_revalidate as an upperdir). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Since mv(1) doesn't expose the RENAME_* flags we need to have our own variation in the tests/ tree. The tests are fairly obvious functional tests, though in the future (once we solve the d_revalidate issue) we might also add a full-stack overlayfs integration test. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
eb8a822
to
4ea6f30
Compare
Alright @behlendorf, I've made the discussed changes. The only one that isn't yet possible to do is:
Because we still need to solve the d_revalidate issue in #8774. |
commit_link_szp: | ||
if (zfs_link_create(sdl, szp, tx, ZRENAMING)) | ||
VERIFY3U(zfs_drop_nlink(szp, tx, NULL), ==, 0); | ||
goto commit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@behlendorf When you take a look at this next can you please let me know whether zfs_drop_nlink()
is actually needed and/or safe? My understanding of the link-related errors is that you shouldn't get one from zfs_link_create
given that we just removed the link in this function (so zfs_drop_nlink()
might be completely unnecessary).
I also don't like that this will basically BUG_ON
if we happen to hit this error with a non-empty directory...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to use zfs_dirent_lock
to lock the specific names to protect them. Then you wouldn't need to call zfs_drop_nlink
here, since it shouldn't be possible for these operations to fail.
Codecov Report
@@ Coverage Diff @@
## master #8667 +/- ##
==========================================
- Coverage 78.71% 77.46% -1.25%
==========================================
Files 388 377 -11
Lines 120071 119677 -394
==========================================
- Hits 94512 92706 -1806
- Misses 25559 26971 +1412
Continue to review full report at Codecov.
|
@@ -625,15 +625,14 @@ zfs_replay_link(void *arg1, void *arg2, boolean_t byteswap) | |||
} | |||
|
|||
static int | |||
zfs_replay_rename(void *arg1, void *arg2, boolean_t byteswap) | |||
_zfs_replay_renameat2(void *arg1, void *arg2, boolean_t byteswap, int vflg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you rename this zfs_replay_rename_impl
which is the convention used by the rest of the code base.
sys_renameat2(int olddirfd, const char *oldpath, | ||
int newdirfd, const char *newpath, unsigned int flags) | ||
{ | ||
int ret = syscall(SYS_renameat2, olddirfd, oldpath, newdirfd, newpath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Until now I didn't realize that glibc doesn't provide a wrapper for renameat2
. I'd mention that here, or in the comment above, so it clear why this is needed.
cleanup.ksh \ | ||
renameat2_001_noreplace.ksh \ | ||
renameat2_002_exchange.ksh \ | ||
renameat2_003_whiteout.ksh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: For new tests we've been moving away from the 00x
prefix so let's remove it, and simply go with renameat2_noreplace.ksh
, renameat2_exchange.ksh
, renameat2_whiteout.ksh
.
You're also going to need to add the following block to the tests/runfiles/linux.run
file. This file controls which test cases get run, and defines some groups to make it easy to select tests.
index b73528e43..fec8cff5c 100644
--- a/tests/runfiles/linux.run
+++ b/tests/runfiles/linux.run
@@ -778,6 +778,10 @@ tests = ['removal_all_vdev', 'removal_check_space',
'remove_mirror', 'remove_mirror_sanity', 'remove_raidz']
tags = ['functional', 'removal']
+[tests/functional/renameat2]
+tests = ['renameat2_noreplace', 'renameat2_exchange', 'renameat2_whiteout']
+tags = ['functional', 'renameat2']
+
[tests/functional/rename_dirs]
tests = ['rename_dirs_001_pos']
tags = ['functional', 'rename_dirs']
$ ./scripts/zfs-tests.sh -T renameat2
Test: /home/fedora/src/git/zfs/tests/zfs-tests/tests/functional/renameat2/setup (run as root) [00:00] [PASS]
...
@@ -22,6 +22,7 @@ SUBDIRS = \ | |||
randfree_file \ | |||
randwritecomp \ | |||
readmmap \ | |||
renameat2 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, this new test command needs to be added to the command.cfg
file. This allows the test suite to locate the command when running in the source tree.
index f8ad02246..e69c4c216 100644
--- a/tests/zfs-tests/include/commands.cfg
+++ b/tests/zfs-tests/include/commands.cfg
@@ -178,6 +178,7 @@ export ZFSTEST_FILES='chg_usr_exec
randfree_file
randwritecomp
readmmap
+ renameat2
rename_dir
rm_lnkcnt_zero_file
threadsappend
log_unsupported "renameat2 is linux-only" | ||
elif ! renameat2 -C ; then | ||
log_unsupported "renameat2 not supported on this (pre-3.15) linux kernel" | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests/test-runner/bin/zts-report.py
script is used by the test suite to try and summarize the overall test results. When a test might be legitimately skipped, it should be added to the maybe
section of the script along with a brief reason. This prevents it from being marked as an unexpected failure.
I'd suggest modeling this after the functional/user_namespace
tests which may also be skipped. By moving this check in to the setup.sh
the entire test group can be skipped.
uint64_t wo_projid; | ||
boolean_t fuid_dirtied; | ||
zfs_acl_ids_t acl_ids; | ||
boolean_t have_acl = B_FALSE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've probably noticed when looking at some of the other code that we do allow declaring the variable closer to where they're first used. I'd suggest doing that with some of these to make their scope as clear as possible since this is a large function.
if (!tzp && txtype == TX_EXCHANGE) { | ||
error = SET_ERROR(ENOENT); | ||
goto out; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: what do you think about moving this check in to an else
block to remove the redundant tzp
check?
/*
* Does target exist?
*/
if (tzp) {
...
} else {
/* Target must exist for RENAME_EXCHANGE. */
if (txtype == TX_EXCHANGE) {
error = SET_ERROR(ENOENT);
goto out;
}
}
@@ -3656,8 +3656,7 @@ int | |||
zfs_rename(struct inode *sdip, char *snm, struct inode *tdip, char *tnm, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than modify this existing complicated function so extensively, what if instead you added a new zfs_renameat2
which is solely responsible for handling RENAME_EXCHANGE
and TX_WHITEOUT
. That would potentially allow it to be simpler and remove the risk of accidentally breaking zfs_rename
in some subtle way. (Or even a zfs_renameat2_exchange()
and zfs_renameat2_whiteout
if that falls out more cleanly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could give this a shot, though there is a fair amount of shared code -- the main difference is right at the end in zfs_rename
.
* best to correct the state. In particular, all of the nlinks are | ||
* wrong because we were destroying and creating links with ZRENAMING. | ||
* | ||
* In some form, all of thee operations have to resolve the state: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thee
commit_link_szp: | ||
if (zfs_link_create(sdl, szp, tx, ZRENAMING)) | ||
VERIFY3U(zfs_drop_nlink(szp, tx, NULL), ==, 0); | ||
goto commit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to use zfs_dirent_lock
to lock the specific names to protect them. Then you wouldn't need to call zfs_drop_nlink
here, since it shouldn't be possible for these operations to fail.
Well I gave it a shot and we can't rely on the dirent locks, the zfs_link_create is still allowed fail in one "edge" case. This test lays it out nicely (that's pretty much the only case this is allowed to fail and has to still return a nice error and not panic) -> I've also considered separate functions for whiteout/exchange and still @cyphar's solution clearly wins to me (the functions would come out really similar, so it'd even make it harder to understand what the subtle differences are). So I'll revert to using these patches in full and then just tend to the intent log part - at least if nobody has no better ideas... |
@snajpa I can't think of any other workarounds -- my only concern is whether |
This is being carried by #9414. |
Motivation and Context
In order to allow overlayfs-on-ZFS (#8648), we need to support the renameat(2)
flags (most importantly, RENAME_WHITEOUT and RENAME_EXCHANGE). This
requires quite a few changes to the ordering of link operations during
rename (and the related error paths).
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
How Has This Been Tested?
I conducted a variety of smoke-tests to check that all of the
renameat(2)
flags (and ordinaryrename
) all operated correctly. After all the runs, I checked that there were no leaks withzdb -bbb
. However, I have not tested the error path at all (and I have a feeling this is almost certainly where I've made mistakes).Types of changes
Checklist:
Signed-off-by
.