Commit 19bfe57

fdmanana authored and gregkh committed
btrfs: fix stale page cache after race between readahead and direct IO write
[ Upstream commit acc18e1 ]

After commit ac325fc ("btrfs: do not hold the extent lock for entire read")
we can now trigger a race between a task doing a direct IO write and
readahead. When this race is triggered it results in tasks getting stale
data when they attempt to do a buffered read (including the task that did
the direct IO write).

This race can be sporadically triggered with test case generic/418,
failing like this:

   $ ./check generic/418
   FSTYP         -- btrfs
   PLATFORM      -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
   MKFS_OPTIONS  -- /dev/sdc
   MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

   generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad)
   # --- tests/generic/418.out	2020-06-10 19:29:03.850519863 +0100
   # +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad	2025-02-03 15:42:36.974609476 +0000
   @@ -1,2 +1,5 @@
    QA output created by 418
   +cmpbuf: offset 0: Expected: 0x1, got 0x0
   +[6:0] FAIL - comparison failed, offset 24576
   +diotest -wp -b 4096 -n 8 -i 4 failed at loop 3
    Silence is golden
   ...
   (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff)
   Ran: generic/418
   Failures: generic/418
   Failed 1 of 1 tests

The race happens like this:

1) A file has a prealloc extent for the range [16K, 28K);

2) Task A starts a direct IO write against file range [24K, 28K).
   At the start of the direct IO write it invalidates the page cache at
   __iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file
   offset 24K;

3) Task A enters btrfs_dio_iomap_begin() and locks the extent range
   [24K, 28K);

4) Task B starts a readahead for file range [16K, 28K), entering
   btrfs_readahead().

   First it attempts to read the page at offset 16K by entering
   btrfs_do_readpage(), where it calls get_extent_map(), locks the range
   [16K, 20K) and gets the extent map for the range [16K, 28K), caching
   it into the 'em_cached' variable declared in the local stack of
   btrfs_readahead(), and then unlocks the range [16K, 20K).

   Since the extent map has the prealloc flag, at btrfs_do_readpage() we
   zero out the page's content and don't submit any bio to read the page
   from the extent.

   Then it attempts to read the page at offset 20K, entering
   btrfs_do_readpage() where we reuse the previously cached extent map
   (decided by get_extent_map()) since it spans the page's range and it's
   still in the inode's extent map tree.

   Just like for the previous page, we zero out the page's content since
   the extent map has the prealloc flag set.

   Then it attempts to read the page at offset 24K, entering
   btrfs_do_readpage() where we reuse the previously cached extent map
   (decided by get_extent_map()) since it spans the page's range and it's
   still in the inode's extent map tree.

   Just like for the previous pages, we zero out the page's content since
   the extent map has the prealloc flag set.

   Note that we didn't lock the extent range [24K, 28K), so we didn't
   synchronize with the ongoing direct IO write being performed by task A;

5) Task A enters btrfs_create_dio_extent() and creates an ordered extent
   for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and
   BTRFS_ORDERED_PREALLOC set;

6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin();

7) The ordered extent enters btrfs_finish_one_ordered() and locks the
   range [24K, 28K);

8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries
   to invalidate the page at offset 24K by calling
   kiocb_invalidate_post_direct_write(), resulting in a call chain that
   ends up at btrfs_release_folio().

   The btrfs_release_folio() call ends up returning false because the
   range for the page at file offset 24K is currently locked by the task
   doing the ordered extent completion in the previous step (7), so we
   have:

     btrfs_release_folio() ->
       __btrfs_release_folio() ->
         try_release_extent_mapping() ->
           try_release_extent_state()

   This last function checks that the range is locked and returns false,
   which is propagated up to btrfs_release_folio(). So this results in a
   failure to invalidate the page and kiocb_invalidate_post_direct_write()
   triggers this message logged in dmesg:

     Page cache invalidation failure on direct I/O. Possible data
     corruption due to collision with buffered I/O!

   After this we leave the page cache with stale data for the file range
   [24K, 28K), filled with zeroes instead of the data written by the
   direct IO write (all bytes with a 0x01 value), so any task attempting
   to read with buffered IO, including the task that did the direct IO
   write, will get all bytes in the range with a 0x00 value instead of
   the written data.

Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(),
at the two callers of btrfs_do_readpage() instead of doing it at
get_extent_map(), just like we did before commit ac325fc ("btrfs: do not
hold the extent lock for entire read"), and unlocking the range after all
the calls to btrfs_do_readpage(). This way we never reuse a cached extent
map without flushing any pending ordered extents from a concurrent direct
IO write.

Fixes: ac325fc ("btrfs: do not hold the extent lock for entire read")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
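For context, the collision described above can be exercised from user space along the lines of the sketch below. This is not the actual generic/418 reproducer: the file path, the block size and the use of posix_fadvise() to kick off readahead are illustrative assumptions, and since the race is timing dependent a real reproducer has to loop over the write/readahead/read cycle many times.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

int main(void)
{
        /* Assumed path on a btrfs mount; not part of the original report. */
        const char *path = "/mnt/scratch/testfile";
        void *buf, *chk;
        int dfd, bfd;

        /* One O_DIRECT descriptor for the write, one plain descriptor for buffered reads. */
        dfd = open(path, O_CREAT | O_RDWR | O_DIRECT, 0644);
        bfd = open(path, O_RDWR);
        if (dfd < 0 || bfd < 0) {
                perror("open");
                return 1;
        }

        /* Preallocate [16K, 28K), mirroring the prealloc extent in the report. */
        if (posix_fallocate(dfd, 16 * 1024, 12 * 1024) != 0) {
                perror("posix_fallocate");
                return 1;
        }

        /* Direct IO needs aligned buffers; fill the write buffer with 0x01 bytes. */
        if (posix_memalign(&buf, BLK, BLK) || posix_memalign(&chk, BLK, BLK))
                return 1;
        memset(buf, 0x01, BLK);

        /*
         * Kick off readahead over [16K, 28K) while writing [24K, 28K) with
         * direct IO.  Whether the two actually race depends on timing.
         */
        posix_fadvise(bfd, 16 * 1024, 12 * 1024, POSIX_FADV_WILLNEED);
        if (pwrite(dfd, buf, BLK, 24 * 1024) != BLK) {
                perror("pwrite");
                return 1;
        }

        /* A buffered read of [24K, 28K) must see the 0x01 bytes, not zeroes. */
        if (pread(bfd, chk, BLK, 24 * 1024) != BLK) {
                perror("pread");
                return 1;
        }
        if (memcmp(buf, chk, BLK) != 0)
                fprintf(stderr, "stale page cache: expected 0x01, got 0x%02x\n",
                        ((unsigned char *)chk)[0]);

        free(buf);
        free(chk);
        close(dfd);
        close(bfd);
        return 0;
}

On a fixed kernel the buffered read always observes the 0x01 data; on an affected kernel the comparison can sporadically fail in the same way generic/418 reports above.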
1 parent 53ba116 commit 19bfe57

File tree

1 file changed: fs/btrfs/extent_io.c (+15, -3)


fs/btrfs/extent_io.c  (+15, -3)

--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -906,7 +906,6 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
                                          u64 len, struct extent_map **em_cached)
 {
        struct extent_map *em;
-       struct extent_state *cached_state = NULL;

        ASSERT(em_cached);

@@ -922,14 +921,12 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
                *em_cached = NULL;
        }

-       btrfs_lock_and_flush_ordered_range(inode, start, start + len - 1, &cached_state);
        em = btrfs_get_extent(inode, folio, start, len);
        if (!IS_ERR(em)) {
                BUG_ON(*em_cached);
                refcount_inc(&em->refs);
                *em_cached = em;
        }
-       unlock_extent(&inode->io_tree, start, start + len - 1, &cached_state);

        return em;
 }
@@ -1086,11 +1083,18 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,

 int btrfs_read_folio(struct file *file, struct folio *folio)
 {
+       struct btrfs_inode *inode = folio_to_inode(folio);
+       const u64 start = folio_pos(folio);
+       const u64 end = start + folio_size(folio) - 1;
+       struct extent_state *cached_state = NULL;
        struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
        struct extent_map *em_cached = NULL;
        int ret;

+       btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
        ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
+       unlock_extent(&inode->io_tree, start, end, &cached_state);
+
        free_extent_map(em_cached);

        /*
@@ -2267,12 +2271,20 @@ void btrfs_readahead(struct readahead_control *rac)
 {
        struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
        struct folio *folio;
+       struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
+       const u64 start = readahead_pos(rac);
+       const u64 end = start + readahead_length(rac) - 1;
+       struct extent_state *cached_state = NULL;
        struct extent_map *em_cached = NULL;
        u64 prev_em_start = (u64)-1;

+       btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
+
        while ((folio = readahead_folio(rac)) != NULL)
                btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);

+       unlock_extent(&inode->io_tree, start, end, &cached_state);
+
        if (em_cached)
                free_extent_map(em_cached);
        submit_one_bio(&bio_ctrl);
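For readability, this is roughly how btrfs_readahead() ends up looking with the last hunk applied, reconstructed from the diff above with no other changes: the ordered range is locked and flushed once for the whole readahead window, so the cached extent map can no longer be reused while a concurrent direct IO write still has a pending ordered extent in that range.

void btrfs_readahead(struct readahead_control *rac)
{
        struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
        struct folio *folio;
        struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
        const u64 start = readahead_pos(rac);
        const u64 end = start + readahead_length(rac) - 1;
        struct extent_state *cached_state = NULL;
        struct extent_map *em_cached = NULL;
        u64 prev_em_start = (u64)-1;

        /* Lock and flush ordered extents for the whole readahead window. */
        btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);

        while ((folio = readahead_folio(rac)) != NULL)
                btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);

        /* Only drop the extent lock after every folio has been processed. */
        unlock_extent(&inode->io_tree, start, end, &cached_state);

        if (em_cached)
                free_extent_map(em_cached);
        submit_one_bio(&bio_ctrl);
}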
