Skip to content

Commit acc18e1

Browse files
fdmananakdave
authored andcommitted
btrfs: fix stale page cache after race between readahead and direct IO write
After commit ac325fc ("btrfs: do not hold the extent lock for entire read") we can now trigger a race between a task doing a direct IO write and readahead. When this race is triggered it results in tasks getting stale data when they attempt do a buffered read (including the task that did the direct IO write). This race can be sporadically triggered with test case generic/418, failing like this: $ ./check generic/418 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ torvalds#17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad) --- tests/generic/418.out 2020-06-10 19:29:03.850519863 +0100 +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad 2025-02-03 15:42:36.974609476 +0000 @@ -1,2 +1,5 @@ QA output created by 418 +cmpbuf: offset 0: Expected: 0x1, got 0x0 +[6:0] FAIL - comparison failed, offset 24576 +diotest -wp -b 4096 -n 8 -i 4 failed at loop 3 Silence is golden ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff) Ran: generic/418 Failures: generic/418 Failed 1 of 1 tests The race happens like this: 1) A file has a prealloc extent for the range [16K, 28K); 2) Task A starts a direct IO write against file range [24K, 28K). At the start of the direct IO write it invalidates the page cache at __iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file offset 24K; 3) Task A enters btrfs_dio_iomap_begin() and locks the extent range [24K, 28K); 4) Task B starts a readahead for file range [16K, 28K), entering btrfs_readahead(). First it attempts to read the page at offset 16K by entering btrfs_do_readpage(), where it calls get_extent_map(), locks the range [16K, 20K) and gets the extent map for the range [16K, 28K), caching it into the 'em_cached' variable declared in the local stack of btrfs_readahead(), and then unlocks the range [16K, 20K). Since the extent map has the prealloc flag, at btrfs_do_readpage() we zero out the page's content and don't submit any bio to read the page from the extent. Then it attempts to read the page at offset 20K entering btrfs_do_readpage() where we reuse the previously cached extent map (decided by get_extent_map()) since it spans the page's range and it's still in the inode's extent map tree. Just like for the previous page, we zero out the page's content since the extent map has the prealloc flag set. Then it attempts to read the page at offset 24K entering btrfs_do_readpage() where we reuse the previously cached extent map (decided by get_extent_map()) since it spans the page's range and it's still in the inode's extent map tree. Just like for the previous pages, we zero out the page's content since the extent map has the prealloc flag set. Note that we didn't lock the extent range [24K, 28K), so we didn't synchronize with the ongoing direct IO write being performed by task A; 5) Task A enters btrfs_create_dio_extent() and creates an ordered extent for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and BTRFS_ORDERED_PREALLOC set; 6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin(); 7) The ordered extent enters btrfs_finish_one_ordered() and locks the range [24K, 28K); 8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries to invalidate the page at offset 24K by calling kiocb_invalidate_post_direct_write(), resulting in a call chain that ends up at btrfs_release_folio(). The btrfs_release_folio() call ends up returning false because the range for the page at file offset 24K is currently locked by the task doing the ordered extent completion in the previous step (7), so we have: btrfs_release_folio() -> __btrfs_release_folio() -> try_release_extent_mapping() -> try_release_extent_state() This last function checking that the range is locked and returning false and propagating it up to btrfs_release_folio(). So this results in a failure to invalidate the page and kiocb_invalidate_post_direct_write() triggers this message logged in dmesg: Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! After this we leave the page cache with stale data for the file range [24K, 28K), filled with zeroes instead of the data written by direct IO write (all bytes with a 0x01 value), so any task attempting to read with buffered IO, including the task that did the direct IO write, will get all bytes in the range with a 0x00 value instead of the written data. Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(), at the two callers of btrfs_do_readpage() instead of doing it at get_extent_map(), just like we did before commit ac325fc ("btrfs: do not hold the extent lock for entire read"), and unlocking the range after all the calls to btrfs_do_readpage(). This way we never reuse a cached extent map without flushing any pending ordered extents from a concurrent direct IO write. Fixes: ac325fc ("btrfs: do not hold the extent lock for entire read") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
1 parent 01af106 commit acc18e1

File tree

1 file changed

+15
-3
lines changed

1 file changed

+15
-3
lines changed

fs/btrfs/extent_io.c

+15-3
Original file line numberDiff line numberDiff line change
@@ -898,7 +898,6 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
898898
u64 len, struct extent_map **em_cached)
899899
{
900900
struct extent_map *em;
901-
struct extent_state *cached_state = NULL;
902901

903902
ASSERT(em_cached);
904903

@@ -914,14 +913,12 @@ static struct extent_map *get_extent_map(struct btrfs_inode *inode,
914913
*em_cached = NULL;
915914
}
916915

917-
btrfs_lock_and_flush_ordered_range(inode, start, start + len - 1, &cached_state);
918916
em = btrfs_get_extent(inode, folio, start, len);
919917
if (!IS_ERR(em)) {
920918
BUG_ON(*em_cached);
921919
refcount_inc(&em->refs);
922920
*em_cached = em;
923921
}
924-
unlock_extent(&inode->io_tree, start, start + len - 1, &cached_state);
925922

926923
return em;
927924
}
@@ -1078,11 +1075,18 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
10781075

10791076
int btrfs_read_folio(struct file *file, struct folio *folio)
10801077
{
1078+
struct btrfs_inode *inode = folio_to_inode(folio);
1079+
const u64 start = folio_pos(folio);
1080+
const u64 end = start + folio_size(folio) - 1;
1081+
struct extent_state *cached_state = NULL;
10811082
struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
10821083
struct extent_map *em_cached = NULL;
10831084
int ret;
10841085

1086+
btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
10851087
ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
1088+
unlock_extent(&inode->io_tree, start, end, &cached_state);
1089+
10861090
free_extent_map(em_cached);
10871091

10881092
/*
@@ -2379,12 +2383,20 @@ void btrfs_readahead(struct readahead_control *rac)
23792383
{
23802384
struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
23812385
struct folio *folio;
2386+
struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
2387+
const u64 start = readahead_pos(rac);
2388+
const u64 end = start + readahead_length(rac) - 1;
2389+
struct extent_state *cached_state = NULL;
23822390
struct extent_map *em_cached = NULL;
23832391
u64 prev_em_start = (u64)-1;
23842392

2393+
btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
2394+
23852395
while ((folio = readahead_folio(rac)) != NULL)
23862396
btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
23872397

2398+
unlock_extent(&inode->io_tree, start, end, &cached_state);
2399+
23882400
if (em_cached)
23892401
free_extent_map(em_cached);
23902402
submit_one_bio(&bio_ctrl);

0 commit comments

Comments
 (0)