Skip to content

Commit 5f9a8a5

Browse files
committed
Btrfs: add semaphore to synchronize direct IO writes with fsync
Due to the optimization of lockless direct IO writes (the inode's i_mutex is not held) introduced in commit 38851cc ("Btrfs: implement unlocked dio write"), we started having races between such writes with concurrent fsync operations that use the fast fsync path. These races were addressed in the patches titled "Btrfs: fix race between fsync and lockless direct IO writes" and "Btrfs: fix race between fsync and direct IO writes for prealloc extents". The races happened because the direct IO path, like every other write path, does create extent maps followed by the corresponding ordered extents while the fast fsync path collected first ordered extents and then it collected extent maps. This made it possible to log file extent items (based on the collected extent maps) without waiting for the corresponding ordered extents to complete (get their IO done). The two fixes mentioned before added a solution that consists of making the direct IO path create first the ordered extents and then the extent maps, while the fsync path attempts to collect any new ordered extents once it collects the extent maps. This was simple and did not require adding any synchonization primitive to any data structure (struct btrfs_inode for example) but it makes things more fragile for future development endeavours and adds an exceptional approach compared to the other write paths. This change adds a read-write semaphore to the btrfs inode structure and makes the direct IO path create the extent maps and the ordered extents while holding read access on that semaphore, while the fast fsync path collects extent maps and ordered extents while holding write access on that semaphore. The logic for direct IO write path is encapsulated in a new helper function that is used both for cow and nocow direct IO writes. Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: Josef Bacik <[email protected]>
1 parent f78c436 commit 5f9a8a5

File tree

3 files changed

+77
-118
lines changed

3 files changed

+77
-118
lines changed

fs/btrfs/btrfs_inode.h

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -196,6 +196,16 @@ struct btrfs_inode {
196196
struct list_head delayed_iput;
197197
long delayed_iput_count;
198198

199+
/*
200+
* To avoid races between lockless (i_mutex not held) direct IO writes
201+
* and concurrent fsync requests. Direct IO writes must acquire read
202+
* access on this semaphore for creating an extent map and its
203+
* corresponding ordered extent. The fast fsync path must acquire write
204+
* access on this semaphore before it collects ordered extents and
205+
* extent maps.
206+
*/
207+
struct rw_semaphore dio_sem;
208+
199209
struct inode vfs_inode;
200210
};
201211

fs/btrfs/inode.c

Lines changed: 53 additions & 81 deletions
Original file line numberDiff line numberDiff line change
@@ -7145,6 +7145,43 @@ struct extent_map *btrfs_get_extent_fiemap(struct inode *inode, struct page *pag
71457145
return em;
71467146
}
71477147

7148+
static struct extent_map *btrfs_create_dio_extent(struct inode *inode,
7149+
const u64 start,
7150+
const u64 len,
7151+
const u64 orig_start,
7152+
const u64 block_start,
7153+
const u64 block_len,
7154+
const u64 orig_block_len,
7155+
const u64 ram_bytes,
7156+
const int type)
7157+
{
7158+
struct extent_map *em = NULL;
7159+
int ret;
7160+
7161+
down_read(&BTRFS_I(inode)->dio_sem);
7162+
if (type != BTRFS_ORDERED_NOCOW) {
7163+
em = create_pinned_em(inode, start, len, orig_start,
7164+
block_start, block_len, orig_block_len,
7165+
ram_bytes, type);
7166+
if (IS_ERR(em))
7167+
goto out;
7168+
}
7169+
ret = btrfs_add_ordered_extent_dio(inode, start, block_start,
7170+
len, block_len, type);
7171+
if (ret) {
7172+
if (em) {
7173+
free_extent_map(em);
7174+
btrfs_drop_extent_cache(inode, start,
7175+
start + len - 1, 0);
7176+
}
7177+
em = ERR_PTR(ret);
7178+
}
7179+
out:
7180+
up_read(&BTRFS_I(inode)->dio_sem);
7181+
7182+
return em;
7183+
}
7184+
71487185
static struct extent_map *btrfs_new_extent_direct(struct inode *inode,
71497186
u64 start, u64 len)
71507187
{
@@ -7160,43 +7197,13 @@ static struct extent_map *btrfs_new_extent_direct(struct inode *inode,
71607197
if (ret)
71617198
return ERR_PTR(ret);
71627199

7163-
/*
7164-
* Create the ordered extent before the extent map. This is to avoid
7165-
* races with the fast fsync path that would lead to it logging file
7166-
* extent items that point to disk extents that were not yet written to.
7167-
* The fast fsync path collects ordered extents into a local list and
7168-
* then collects all the new extent maps, so we must create the ordered
7169-
* extent first and make sure the fast fsync path collects any new
7170-
* ordered extents after collecting new extent maps as well.
7171-
* The fsync path simply can not rely on inode_dio_wait() because it
7172-
* causes deadlock with AIO.
7173-
*/
7174-
ret = btrfs_add_ordered_extent_dio(inode, start, ins.objectid,
7175-
ins.offset, ins.offset, 0);
7176-
if (ret) {
7177-
btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
7178-
return ERR_PTR(ret);
7179-
}
7180-
7200+
em = btrfs_create_dio_extent(inode, start, ins.offset, start,
7201+
ins.objectid, ins.offset, ins.offset,
7202+
ins.offset, 0);
71817203
btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
7182-
7183-
em = create_pinned_em(inode, start, ins.offset, start, ins.objectid,
7184-
ins.offset, ins.offset, ins.offset, 0);
7185-
if (IS_ERR(em)) {
7186-
struct btrfs_ordered_extent *oe;
7187-
7204+
if (IS_ERR(em))
71887205
btrfs_free_reserved_extent(root, ins.objectid, ins.offset, 1);
7189-
oe = btrfs_lookup_ordered_extent(inode, start);
7190-
ASSERT(oe);
7191-
if (WARN_ON(!oe))
7192-
return em;
7193-
set_bit(BTRFS_ORDERED_IOERR, &oe->flags);
7194-
set_bit(BTRFS_ORDERED_IO_DONE, &oe->flags);
7195-
btrfs_remove_ordered_extent(inode, oe);
7196-
/* Once for our lookup and once for the ordered extents tree. */
7197-
btrfs_put_ordered_extent(oe);
7198-
btrfs_put_ordered_extent(oe);
7199-
}
7206+
72007207
return em;
72017208
}
72027209

@@ -7670,57 +7677,21 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
76707677
if (can_nocow_extent(inode, start, &len, &orig_start,
76717678
&orig_block_len, &ram_bytes) == 1 &&
76727679
btrfs_inc_nocow_writers(root->fs_info, block_start)) {
7680+
struct extent_map *em2;
76737681

7674-
/*
7675-
* Create the ordered extent before the extent map. This
7676-
* is to avoid races with the fast fsync path because it
7677-
* collects ordered extents into a local list and then
7678-
* collects all the new extent maps, so we must create
7679-
* the ordered extent first and make sure the fast fsync
7680-
* path collects any new ordered extents after
7681-
* collecting new extent maps as well. The fsync path
7682-
* simply can not rely on inode_dio_wait() because it
7683-
* causes deadlock with AIO.
7684-
*/
7685-
ret = btrfs_add_ordered_extent_dio(inode, start,
7686-
block_start, len, len, type);
7682+
em2 = btrfs_create_dio_extent(inode, start, len,
7683+
orig_start, block_start,
7684+
len, orig_block_len,
7685+
ram_bytes, type);
76877686
btrfs_dec_nocow_writers(root->fs_info, block_start);
7688-
if (ret) {
7689-
free_extent_map(em);
7690-
goto unlock_err;
7691-
}
7692-
76937687
if (type == BTRFS_ORDERED_PREALLOC) {
76947688
free_extent_map(em);
7695-
em = create_pinned_em(inode, start, len,
7696-
orig_start,
7697-
block_start, len,
7698-
orig_block_len,
7699-
ram_bytes, type);
7700-
if (IS_ERR(em)) {
7701-
struct btrfs_ordered_extent *oe;
7702-
7703-
ret = PTR_ERR(em);
7704-
oe = btrfs_lookup_ordered_extent(inode,
7705-
start);
7706-
ASSERT(oe);
7707-
if (WARN_ON(!oe))
7708-
goto unlock_err;
7709-
set_bit(BTRFS_ORDERED_IOERR,
7710-
&oe->flags);
7711-
set_bit(BTRFS_ORDERED_IO_DONE,
7712-
&oe->flags);
7713-
btrfs_remove_ordered_extent(inode, oe);
7714-
/*
7715-
* Once for our lookup and once for the
7716-
* ordered extents tree.
7717-
*/
7718-
btrfs_put_ordered_extent(oe);
7719-
btrfs_put_ordered_extent(oe);
7720-
goto unlock_err;
7721-
}
7689+
em = em2;
7690+
}
7691+
if (em2 && IS_ERR(em2)) {
7692+
ret = PTR_ERR(em2);
7693+
goto unlock_err;
77227694
}
7723-
77247695
goto unlock;
77257696
}
77267697
}
@@ -9281,6 +9252,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
92819252
INIT_LIST_HEAD(&ei->delalloc_inodes);
92829253
INIT_LIST_HEAD(&ei->delayed_iput);
92839254
RB_CLEAR_NODE(&ei->rb_node);
9255+
init_rwsem(&ei->dio_sem);
92849256

92859257
return inode;
92869258
}

fs/btrfs/tree-log.c

Lines changed: 14 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -4141,6 +4141,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
41414141

41424142
INIT_LIST_HEAD(&extents);
41434143

4144+
down_write(&BTRFS_I(inode)->dio_sem);
41444145
write_lock(&tree->lock);
41454146
test_gen = root->fs_info->last_trans_committed;
41464147

@@ -4169,13 +4170,20 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
41694170
}
41704171

41714172
list_sort(NULL, &extents, extent_cmp);
4173+
btrfs_get_logged_extents(inode, logged_list, start, end);
41724174
/*
4173-
* Collect any new ordered extents within the range. This is to
4174-
* prevent logging file extent items without waiting for the disk
4175-
* location they point to being written. We do this only to deal
4176-
* with races against concurrent lockless direct IO writes.
4175+
* Some ordered extents started by fsync might have completed
4176+
* before we could collect them into the list logged_list, which
4177+
* means they're gone, not in our logged_list nor in the inode's
4178+
* ordered tree. We want the application/user space to know an
4179+
* error happened while attempting to persist file data so that
4180+
* it can take proper action. If such error happened, we leave
4181+
* without writing to the log tree and the fsync must report the
4182+
* file data write error and not commit the current transaction.
41774183
*/
4178-
btrfs_get_logged_extents(inode, logged_list, start, end);
4184+
ret = btrfs_inode_check_errors(inode);
4185+
if (ret)
4186+
ctx->io_err = ret;
41794187
process:
41804188
while (!list_empty(&extents)) {
41814189
em = list_entry(extents.next, struct extent_map, list);
@@ -4202,6 +4210,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
42024210
}
42034211
WARN_ON(!list_empty(&extents));
42044212
write_unlock(&tree->lock);
4213+
up_write(&BTRFS_I(inode)->dio_sem);
42054214

42064215
btrfs_release_path(path);
42074216
return ret;
@@ -4622,23 +4631,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
46224631

46234632
mutex_lock(&BTRFS_I(inode)->log_mutex);
46244633

4625-
/*
4626-
* Collect ordered extents only if we are logging data. This is to
4627-
* ensure a subsequent request to log this inode in LOG_INODE_ALL mode
4628-
* will process the ordered extents if they still exists at the time,
4629-
* because when we collect them we test and set for the flag
4630-
* BTRFS_ORDERED_LOGGED to prevent multiple log requests to process the
4631-
* same ordered extents. The consequence for the LOG_INODE_ALL log mode
4632-
* not processing the ordered extents is that we end up logging the
4633-
* corresponding file extent items, based on the extent maps in the
4634-
* inode's extent_map_tree's modified_list, without logging the
4635-
* respective checksums (since the may still be only attached to the
4636-
* ordered extents and have not been inserted in the csum tree by
4637-
* btrfs_finish_ordered_io() yet).
4638-
*/
4639-
if (inode_only == LOG_INODE_ALL)
4640-
btrfs_get_logged_extents(inode, &logged_list, start, end);
4641-
46424634
/*
46434635
* a brute force approach to making sure we get the most uptodate
46444636
* copies of everything.
@@ -4846,21 +4838,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
48464838
goto out_unlock;
48474839
}
48484840
if (fast_search) {
4849-
/*
4850-
* Some ordered extents started by fsync might have completed
4851-
* before we collected the ordered extents in logged_list, which
4852-
* means they're gone, not in our logged_list nor in the inode's
4853-
* ordered tree. We want the application/user space to know an
4854-
* error happened while attempting to persist file data so that
4855-
* it can take proper action. If such error happened, we leave
4856-
* without writing to the log tree and the fsync must report the
4857-
* file data write error and not commit the current transaction.
4858-
*/
4859-
err = btrfs_inode_check_errors(inode);
4860-
if (err) {
4861-
ctx->io_err = err;
4862-
goto out_unlock;
4863-
}
48644841
ret = btrfs_log_changed_extents(trans, root, inode, dst_path,
48654842
&logged_list, ctx, start, end);
48664843
if (ret) {

0 commit comments

Comments
 (0)