Commit 0b90191
Btrfs: fix race between fsync and direct IO writes for prealloc extents
Btrfs: fix race between fsync and direct IO writes for prealloc extents

When we do a direct IO write against a preallocated extent (fallocate) that does not go beyond the i_size of the inode, we do the write operation without holding the inode's i_mutex (an optimization that landed in commit 38851cc ("Btrfs: implement unlocked dio write")). This allows for a very tiny time window where a race can happen with a concurrent fsync using the fast code path: the direct IO write path first creates a new extent map (no longer flagged as a prealloc extent) and then creates the ordered extent, while the fast fsync path first collects ordered extents and then collects extent maps. The fast fsync path can therefore collect the new extent map without collecting the new ordered extent, and log an extent item based on the extent map without waiting for the ordered extent to be created and complete.

This can result in a situation where, after a log replay, we end up with an extent that is no longer marked as prealloc but was only partially written (or not written at all), exposing random, stale or garbage data for the unwritten pages, and without any checksums in the csum tree covering the extent's range.

This is an extension of what was done in commit de0ee0e ("Btrfs: fix race between fsync and lockless direct IO writes").

So fix this by creating the ordered extent first and the extent map second, so that if the fast fsync path collects the new extent map it also collects the corresponding ordered extent.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
1 parent 5062af3 commit 0b90191

File tree

1 file changed: +37 -6 lines changed
fs/btrfs/inode.c

Lines changed: 37 additions & 6 deletions
@@ -7658,6 +7658,25 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 
 		if (can_nocow_extent(inode, start, &len, &orig_start,
 				     &orig_block_len, &ram_bytes) == 1) {
+
+			/*
+			 * Create the ordered extent before the extent map. This
+			 * is to avoid races with the fast fsync path because it
+			 * collects ordered extents into a local list and then
+			 * collects all the new extent maps, so we must create
+			 * the ordered extent first and make sure the fast fsync
+			 * path collects any new ordered extents after
+			 * collecting new extent maps as well. The fsync path
+			 * simply can not rely on inode_dio_wait() because it
+			 * causes deadlock with AIO.
+			 */
+			ret = btrfs_add_ordered_extent_dio(inode, start,
+					block_start, len, len, type);
+			if (ret) {
+				free_extent_map(em);
+				goto unlock_err;
+			}
+
 			if (type == BTRFS_ORDERED_PREALLOC) {
 				free_extent_map(em);
 				em = create_pinned_em(inode, start, len,
@@ -7666,17 +7685,29 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 					orig_block_len,
 					ram_bytes, type);
 				if (IS_ERR(em)) {
+					struct btrfs_ordered_extent *oe;
+
 					ret = PTR_ERR(em);
+					oe = btrfs_lookup_ordered_extent(inode,
+									 start);
+					ASSERT(oe);
+					if (WARN_ON(!oe))
+						goto unlock_err;
+					set_bit(BTRFS_ORDERED_IOERR,
+						&oe->flags);
+					set_bit(BTRFS_ORDERED_IO_DONE,
+						&oe->flags);
+					btrfs_remove_ordered_extent(inode, oe);
+					/*
+					 * Once for our lookup and once for the
+					 * ordered extents tree.
+					 */
+					btrfs_put_ordered_extent(oe);
+					btrfs_put_ordered_extent(oe);
 					goto unlock_err;
 				}
 			}
 
-			ret = btrfs_add_ordered_extent_dio(inode, start,
-					block_start, len, len, type);
-			if (ret) {
-				free_extent_map(em);
-				goto unlock_err;
-			}
 			goto unlock;
 		}
 	}
