Skip to content

Commit 88d2bee

Browse files
fdmananakdave
authored andcommitted
btrfs: avoid logging all directory changes during renames
When doing a rename of a file, if the file or its old parent directory were logged before, we log the new name of the file and then make sure we log the old parent directory, to ensure that after a log replay the old name of the file is deleted and the new name added. The logging of the old parent directory can take some time, because it will scan all leaves modified in the current transaction, check which directory entries were already logged, copy the ones that were not logged before, etc. In this rename context all we need to do is make sure that the old name of the file is deleted on log replay, so instead of triggering a directory log operation, we can just delete the old directory entry from the log if it's there, or in case it isn't there, just log a range item to signal log replay that the old name must be deleted. So change btrfs_log_new_name() to do that. This scenario is actually not uncommon to trigger, and recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package installations and upgrades, with the zypper tool, were often taking a long time to complete, much more than usual. With strace it could be observed that zypper was spending over 99% of its time on rename operations, and then with further analysis we checked that directory logging was happening too frequently and causing high latencies for the rename operations. Taking into account that installation/upgrade of some of these packages needed about a few thousand file renames, the slowdown was very noticeable for the user. The issue was caused indirectly due to an excessive number of inode evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure, in an efficient way, if an inode was previously logged in the current transaction, so we are pessimistic and assume it was, because in case it was we need to update the logged inode. More details on that in one of the patches in the same series (subject "btrfs: avoid inode logging during rename and link when possible"). Either way, in case the parent directory was logged before, we currently do more work then necessary during a rename, and this change minimizes that amount of work. The following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before this patch: 27399 ms NUM_FILES=10000, after this patch: 9093 ms (-66.8%) NUM_FILES=5000, before this patch: 9241 ms NUM_FILES=5000, after this patch: 4642 ms (-49.8%) NUM_FILES=2000, before this patch: 2550 ms NUM_FILES=2000, after this patch: 1788 ms (-29.9%) NUM_FILES=1000, before this patch: 1088 ms NUM_FILES=1000, after this patch: 905 ms (-16.9%) Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
1 parent d5f5bd5 commit 88d2bee

File tree

3 files changed

+78
-24
lines changed

3 files changed

+78
-24
lines changed

fs/btrfs/inode.c

Lines changed: 24 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -66,6 +66,11 @@ struct btrfs_dio_data {
6666
struct extent_changeset *data_reserved;
6767
};
6868

69+
struct btrfs_rename_ctx {
70+
/* Output field. Stores the index number of the old directory entry. */
71+
u64 index;
72+
};
73+
6974
static const struct inode_operations btrfs_dir_inode_operations;
7075
static const struct inode_operations btrfs_symlink_inode_operations;
7176
static const struct inode_operations btrfs_special_inode_operations;
@@ -4062,7 +4067,8 @@ int btrfs_update_inode_fallback(struct btrfs_trans_handle *trans,
40624067
static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
40634068
struct btrfs_inode *dir,
40644069
struct btrfs_inode *inode,
4065-
const char *name, int name_len)
4070+
const char *name, int name_len,
4071+
struct btrfs_rename_ctx *rename_ctx)
40664072
{
40674073
struct btrfs_root *root = dir->root;
40684074
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -4118,6 +4124,9 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
41184124
goto err;
41194125
}
41204126
skip_backref:
4127+
if (rename_ctx)
4128+
rename_ctx->index = index;
4129+
41214130
ret = btrfs_delete_delayed_dir_index(trans, dir, index);
41224131
if (ret) {
41234132
btrfs_abort_transaction(trans, ret);
@@ -4158,7 +4167,7 @@ int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
41584167
const char *name, int name_len)
41594168
{
41604169
int ret;
4161-
ret = __btrfs_unlink_inode(trans, dir, inode, name, name_len);
4170+
ret = __btrfs_unlink_inode(trans, dir, inode, name, name_len, NULL);
41624171
if (!ret) {
41634172
drop_nlink(&inode->vfs_inode);
41644173
ret = btrfs_update_inode(trans, inode->root, inode);
@@ -6531,7 +6540,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
65316540
goto fail;
65326541
}
65336542
d_instantiate(dentry, inode);
6534-
btrfs_log_new_name(trans, old_dentry, NULL, parent);
6543+
btrfs_log_new_name(trans, old_dentry, NULL, 0, parent);
65356544
}
65366545

65376546
fail:
@@ -9024,6 +9033,8 @@ static int btrfs_rename_exchange(struct inode *old_dir,
90249033
struct inode *new_inode = new_dentry->d_inode;
90259034
struct inode *old_inode = old_dentry->d_inode;
90269035
struct timespec64 ctime = current_time(old_inode);
9036+
struct btrfs_rename_ctx old_rename_ctx;
9037+
struct btrfs_rename_ctx new_rename_ctx;
90279038
u64 old_ino = btrfs_ino(BTRFS_I(old_inode));
90289039
u64 new_ino = btrfs_ino(BTRFS_I(new_inode));
90299040
u64 old_idx = 0;
@@ -9164,7 +9175,8 @@ static int btrfs_rename_exchange(struct inode *old_dir,
91649175
ret = __btrfs_unlink_inode(trans, BTRFS_I(old_dir),
91659176
BTRFS_I(old_dentry->d_inode),
91669177
old_dentry->d_name.name,
9167-
old_dentry->d_name.len);
9178+
old_dentry->d_name.len,
9179+
&old_rename_ctx);
91689180
if (!ret)
91699181
ret = btrfs_update_inode(trans, root, BTRFS_I(old_inode));
91709182
}
@@ -9180,7 +9192,8 @@ static int btrfs_rename_exchange(struct inode *old_dir,
91809192
ret = __btrfs_unlink_inode(trans, BTRFS_I(new_dir),
91819193
BTRFS_I(new_dentry->d_inode),
91829194
new_dentry->d_name.name,
9183-
new_dentry->d_name.len);
9195+
new_dentry->d_name.len,
9196+
&new_rename_ctx);
91849197
if (!ret)
91859198
ret = btrfs_update_inode(trans, dest, BTRFS_I(new_inode));
91869199
}
@@ -9212,13 +9225,13 @@ static int btrfs_rename_exchange(struct inode *old_dir,
92129225

92139226
if (root_log_pinned) {
92149227
btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
9215-
new_dentry->d_parent);
9228+
old_rename_ctx.index, new_dentry->d_parent);
92169229
btrfs_end_log_trans(root);
92179230
root_log_pinned = false;
92189231
}
92199232
if (dest_log_pinned) {
92209233
btrfs_log_new_name(trans, new_dentry, BTRFS_I(new_dir),
9221-
old_dentry->d_parent);
9234+
new_rename_ctx.index, old_dentry->d_parent);
92229235
btrfs_end_log_trans(dest);
92239236
dest_log_pinned = false;
92249237
}
@@ -9324,6 +9337,7 @@ static int btrfs_rename(struct user_namespace *mnt_userns,
93249337
struct btrfs_root *dest = BTRFS_I(new_dir)->root;
93259338
struct inode *new_inode = d_inode(new_dentry);
93269339
struct inode *old_inode = d_inode(old_dentry);
9340+
struct btrfs_rename_ctx rename_ctx;
93279341
u64 index = 0;
93289342
int ret;
93299343
int ret2;
@@ -9455,7 +9469,8 @@ static int btrfs_rename(struct user_namespace *mnt_userns,
94559469
ret = __btrfs_unlink_inode(trans, BTRFS_I(old_dir),
94569470
BTRFS_I(d_inode(old_dentry)),
94579471
old_dentry->d_name.name,
9458-
old_dentry->d_name.len);
9472+
old_dentry->d_name.len,
9473+
&rename_ctx);
94599474
if (!ret)
94609475
ret = btrfs_update_inode(trans, root, BTRFS_I(old_inode));
94619476
}
@@ -9499,7 +9514,7 @@ static int btrfs_rename(struct user_namespace *mnt_userns,
94999514

95009515
if (log_pinned) {
95019516
btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
9502-
new_dentry->d_parent);
9517+
rename_ctx.index, new_dentry->d_parent);
95039518
btrfs_end_log_trans(root);
95049519
log_pinned = false;
95059520
}

fs/btrfs/tree-log.c

Lines changed: 53 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -6796,6 +6796,9 @@ void btrfs_record_snapshot_destroy(struct btrfs_trans_handle *trans,
67966796
* parent directory.
67976797
* @old_dir: The inode of the previous parent directory for the case
67986798
* of a rename. For a link operation, it must be NULL.
6799+
* @old_dir_index: The index number associated with the old name, meaningful
6800+
* only for rename operations (when @old_dir is not NULL).
6801+
* Ignored for link operations.
67996802
* @parent: The dentry associated with the directory under which the
68006803
* new name is located.
68016804
*
@@ -6804,7 +6807,7 @@ void btrfs_record_snapshot_destroy(struct btrfs_trans_handle *trans,
68046807
*/
68056808
void btrfs_log_new_name(struct btrfs_trans_handle *trans,
68066809
struct dentry *old_dentry, struct btrfs_inode *old_dir,
6807-
struct dentry *parent)
6810+
u64 old_dir_index, struct dentry *parent)
68086811
{
68096812
struct btrfs_inode *inode = BTRFS_I(d_inode(old_dentry));
68106813
struct btrfs_log_ctx ctx;
@@ -6826,20 +6829,56 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
68266829

68276830
/*
68286831
* If we are doing a rename (old_dir is not NULL) from a directory that
6829-
* was previously logged, make sure the next log attempt on the directory
6830-
* is not skipped and logs the inode again. This is because the log may
6831-
* not currently be authoritative for a range including the old
6832-
* BTRFS_DIR_INDEX_KEY key, so we want to make sure after a log replay we
6833-
* do not end up with both the new and old dentries around (in case the
6834-
* inode is a directory we would have a directory with two hard links and
6835-
* 2 inode references for different parents). The next log attempt of
6836-
* old_dir will happen at btrfs_log_all_parents(), called through
6837-
* btrfs_log_inode_parent() below, because we have previously set
6838-
* inode->last_unlink_trans to the current transaction ID, either here or
6839-
* at btrfs_record_unlink_dir() in case the inode is a directory.
6832+
* was previously logged, make sure that on log replay we get the old
6833+
* dir entry deleted. This is needed because we will also log the new
6834+
* name of the renamed inode, so we need to make sure that after log
6835+
* replay we don't end up with both the new and old dir entries existing.
68406836
*/
6841-
if (old_dir)
6842-
old_dir->logged_trans = 0;
6837+
if (old_dir && old_dir->logged_trans == trans->transid) {
6838+
struct btrfs_root *log = old_dir->root->log_root;
6839+
struct btrfs_path *path;
6840+
int ret;
6841+
6842+
ASSERT(old_dir_index >= BTRFS_DIR_START_INDEX);
6843+
6844+
path = btrfs_alloc_path();
6845+
if (!path) {
6846+
btrfs_set_log_full_commit(trans);
6847+
return;
6848+
}
6849+
6850+
/*
6851+
* Other concurrent task might be logging the old directory,
6852+
* as it can be triggered when logging other inode that had or
6853+
* still has a dentry in the old directory. So take the old
6854+
* directory's log_mutex to prevent getting an -EEXIST when
6855+
* logging a key to record the deletion, or having that other
6856+
* task logging the old directory get an -EEXIST if it attempts
6857+
* to log the same key after we just did it. In both cases that
6858+
* would result in falling back to a transaction commit.
6859+
*/
6860+
mutex_lock(&old_dir->log_mutex);
6861+
ret = del_logged_dentry(trans, log, path, btrfs_ino(old_dir),
6862+
old_dentry->d_name.name,
6863+
old_dentry->d_name.len, old_dir_index);
6864+
if (ret > 0) {
6865+
/*
6866+
* The dentry does not exist in the log, so record its
6867+
* deletion.
6868+
*/
6869+
btrfs_release_path(path);
6870+
ret = insert_dir_log_key(trans, log, path,
6871+
btrfs_ino(old_dir),
6872+
old_dir_index, old_dir_index);
6873+
}
6874+
mutex_unlock(&old_dir->log_mutex);
6875+
6876+
btrfs_free_path(path);
6877+
if (ret < 0) {
6878+
btrfs_set_log_full_commit(trans);
6879+
return;
6880+
}
6881+
}
68436882

68446883
btrfs_init_log_ctx(&ctx, &inode->vfs_inode);
68456884
ctx.logging_new_name = true;

fs/btrfs/tree-log.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -87,6 +87,6 @@ void btrfs_record_snapshot_destroy(struct btrfs_trans_handle *trans,
8787
struct btrfs_inode *dir);
8888
void btrfs_log_new_name(struct btrfs_trans_handle *trans,
8989
struct dentry *old_dentry, struct btrfs_inode *old_dir,
90-
struct dentry *parent);
90+
u64 old_dir_index, struct dentry *parent);
9191

9292
#endif

0 commit comments

Comments
 (0)