Skip to content

Commit eaa36c1

Browse files
fdmananasmb49
authored andcommitted
btrfs: fix a race between renames and directory logging
BugLink: https://bugs.launchpad.net/bugs/2121266 commit 3ca864d upstream. We have a race between a rename and directory inode logging that if it happens and we crash/power fail before the rename completes, the next time the filesystem is mounted, the log replay code will end up deleting the file that was being renamed. This is best explained following a step by step analysis of an interleaving of steps that lead into this situation. Consider the initial conditions: 1) We are at transaction N; 2) We have directories A and B created in a past transaction (< N); 3) We have inode X corresponding to a file that has 2 hardlinks, one in directory A and the other in directory B, so we'll name them as "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a past transaction (< N); 4) We have inode Y corresponding to a file that as a single hard link and is located in directory A, we'll name it as "A/bar". This file was also persisted in a past transaction (< N). The steps leading to a file loss are the following and for all of them we are under transaction N: 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir(); 2) Task A starts a rename for inode Y, with the goal of renaming from "A/bar" to "A/baz", so we enter btrfs_rename(); 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling btrfs_insert_inode_ref(); 4) Because the rename happens in the same directory, we don't set the last_unlink_trans field of directoty A's inode to the current transaction id, that is, we don't cal btrfs_record_unlink_dir(); 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode() (actually the dir index item is added as a delayed item, but the effect is the same); 6) Now before task A adds the new entry "A/baz" to directory A by calling btrfs_add_link(), another task, task B is logging inode X; 7) Task B starts a fsync of inode X and after logging inode X, at btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since inode X has a last_unlink_trans value of N, set at in step 1; 8) At btrfs_log_all_parents() we search for all parent directories of inode X using the commit root, so we find directories A and B and log them. Bu when logging direct A, we don't have a dir index item for inode Y anymore, neither the old name "A/bar" nor for the new name "A/baz" since the rename has deleted the old name but has not yet inserted the new name - task A hasn't called yet btrfs_add_link() to do that. Note that logging directory A doesn't fallback to a transaction commit because its last_unlink_trans has a lower value than the current transaction's id (see step 4); 9) Task B finishes logging directories A and B and gets back to btrfs_sync_file() where it calls btrfs_sync_log() to persist the log tree; 10) Task B successfully persisted the log tree, btrfs_sync_log() completed with success, and a power failure happened. We have a log tree without any directory entry for inode Y, so the log replay code deletes the entry for inode Y, name "A/bar", from the subvolume tree since it doesn't exist in the log tree and the log tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY item that covers the index range for the dentry that corresponds to "A/bar"). Since there's no other hard link for inode Y and the log replay code deletes the name "A/bar", the file is lost. The issue wouldn't happen if task B synced the log only after task A called btrfs_log_new_name(), which would update the log with the new name for inode Y ("A/bar"). Fix this by pinning the log root during renames before removing the old directory entry, and unpinning after btrfs_log_new_name() is called. Fixes: 259c4b9 ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Noah Wager <noah.wager@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
1 parent 6cea3ed commit eaa36c1

File tree

1 file changed

+64
-17
lines changed

1 file changed

+64
-17
lines changed

fs/btrfs/inode.c

Lines changed: 64 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -7947,6 +7947,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
79477947
int ret;
79487948
int ret2;
79497949
bool need_abort = false;
7950+
bool logs_pinned = false;
79507951
struct fscrypt_name old_fname, new_fname;
79517952
struct fscrypt_str *old_name, *new_name;
79527953

@@ -8070,6 +8071,31 @@ static int btrfs_rename_exchange(struct inode *old_dir,
80708071
inode_inc_iversion(new_inode);
80718072
simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
80728073

8074+
if (old_ino != BTRFS_FIRST_FREE_OBJECTID &&
8075+
new_ino != BTRFS_FIRST_FREE_OBJECTID) {
8076+
/*
8077+
* If we are renaming in the same directory (and it's not for
8078+
* root entries) pin the log early to prevent any concurrent
8079+
* task from logging the directory after we removed the old
8080+
* entries and before we add the new entries, otherwise that
8081+
* task can sync a log without any entry for the inodes we are
8082+
* renaming and therefore replaying that log, if a power failure
8083+
* happens after syncing the log, would result in deleting the
8084+
* inodes.
8085+
*
8086+
* If the rename affects two different directories, we want to
8087+
* make sure the that there's no log commit that contains
8088+
* updates for only one of the directories but not for the
8089+
* other.
8090+
*
8091+
* If we are renaming an entry for a root, we don't care about
8092+
* log updates since we called btrfs_set_log_full_commit().
8093+
*/
8094+
btrfs_pin_log_trans(root);
8095+
btrfs_pin_log_trans(dest);
8096+
logs_pinned = true;
8097+
}
8098+
80738099
if (old_dentry->d_parent != new_dentry->d_parent) {
80748100
btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
80758101
BTRFS_I(old_inode), true);
@@ -8141,30 +8167,23 @@ static int btrfs_rename_exchange(struct inode *old_dir,
81418167
BTRFS_I(new_inode)->dir_index = new_idx;
81428168

81438169
/*
8144-
* Now pin the logs of the roots. We do it to ensure that no other task
8145-
* can sync the logs while we are in progress with the rename, because
8146-
* that could result in an inconsistency in case any of the inodes that
8147-
* are part of this rename operation were logged before.
8170+
* Do the log updates for all inodes.
8171+
*
8172+
* If either entry is for a root we don't need to update the logs since
8173+
* we've called btrfs_set_log_full_commit() before.
81488174
*/
8149-
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
8150-
btrfs_pin_log_trans(root);
8151-
if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
8152-
btrfs_pin_log_trans(dest);
8153-
8154-
/* Do the log updates for all inodes. */
8155-
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
8175+
if (logs_pinned) {
81568176
btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
81578177
old_rename_ctx.index, new_dentry->d_parent);
8158-
if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
81598178
btrfs_log_new_name(trans, new_dentry, BTRFS_I(new_dir),
81608179
new_rename_ctx.index, old_dentry->d_parent);
8180+
}
81618181

8162-
/* Now unpin the logs. */
8163-
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
8182+
out_fail:
8183+
if (logs_pinned) {
81648184
btrfs_end_log_trans(root);
8165-
if (new_ino != BTRFS_FIRST_FREE_OBJECTID)
81668185
btrfs_end_log_trans(dest);
8167-
out_fail:
8186+
}
81688187
ret2 = btrfs_end_transaction(trans);
81698188
ret = ret ? ret : ret2;
81708189
out_notrans:
@@ -8214,6 +8233,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
82148233
int ret2;
82158234
u64 old_ino = btrfs_ino(BTRFS_I(old_inode));
82168235
struct fscrypt_name old_fname, new_fname;
8236+
bool logs_pinned = false;
82178237

82188238
if (btrfs_ino(BTRFS_I(new_dir)) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID)
82198239
return -EPERM;
@@ -8348,6 +8368,29 @@ static int btrfs_rename(struct mnt_idmap *idmap,
83488368
inode_inc_iversion(old_inode);
83498369
simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
83508370

8371+
if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
8372+
/*
8373+
* If we are renaming in the same directory (and it's not a
8374+
* root entry) pin the log to prevent any concurrent task from
8375+
* logging the directory after we removed the old entry and
8376+
* before we add the new entry, otherwise that task can sync
8377+
* a log without any entry for the inode we are renaming and
8378+
* therefore replaying that log, if a power failure happens
8379+
* after syncing the log, would result in deleting the inode.
8380+
*
8381+
* If the rename affects two different directories, we want to
8382+
* make sure the that there's no log commit that contains
8383+
* updates for only one of the directories but not for the
8384+
* other.
8385+
*
8386+
* If we are renaming an entry for a root, we don't care about
8387+
* log updates since we called btrfs_set_log_full_commit().
8388+
*/
8389+
btrfs_pin_log_trans(root);
8390+
btrfs_pin_log_trans(dest);
8391+
logs_pinned = true;
8392+
}
8393+
83518394
if (old_dentry->d_parent != new_dentry->d_parent)
83528395
btrfs_record_unlink_dir(trans, BTRFS_I(old_dir),
83538396
BTRFS_I(old_inode), true);
@@ -8412,7 +8455,7 @@ static int btrfs_rename(struct mnt_idmap *idmap,
84128455
if (old_inode->i_nlink == 1)
84138456
BTRFS_I(old_inode)->dir_index = index;
84148457

8415-
if (old_ino != BTRFS_FIRST_FREE_OBJECTID)
8458+
if (logs_pinned)
84168459
btrfs_log_new_name(trans, old_dentry, BTRFS_I(old_dir),
84178460
rename_ctx.index, new_dentry->d_parent);
84188461

@@ -8428,6 +8471,10 @@ static int btrfs_rename(struct mnt_idmap *idmap,
84288471
}
84298472
}
84308473
out_fail:
8474+
if (logs_pinned) {
8475+
btrfs_end_log_trans(root);
8476+
btrfs_end_log_trans(dest);
8477+
}
84318478
ret2 = btrfs_end_transaction(trans);
84328479
ret = ret ? ret : ret2;
84338480
out_notrans:

0 commit comments

Comments
 (0)