Skip to content

Commit a26bf33

Browse files
fdmananakdave
authored andcommitted
btrfs: fix race between async reclaim worker and close_ctree()
Syzbot reported an assertion failure due to an attempt to add a delayed iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state: WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Modules linked in: CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Workqueue: btrfs-endio-write btrfs_work_helper RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Code: 4e ad 5d (...) RSP: 0018:ffffc9000213f780 EFLAGS: 00010293 RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00 RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000 RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001 R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635 btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312 btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> This can happen due to a race with the async reclaim worker like this: 1) The async metadata reclaim worker enters shrink_delalloc(), which calls btrfs_start_delalloc_roots() with an nr_pages argument that has a value less than LONG_MAX, and that in turn enters start_delalloc_inodes(), which sets the local variable 'full_flush' to false because wbc->nr_to_write is less than LONG_MAX; 2) There it finds inode X in a root's delalloc list, grabs a reference for inode X (with igrab()), and triggers writeback for it with filemap_fdatawrite_wbc(), which creates an ordered extent for inode X; 3) The unmount sequence starts from another task, we enter close_ctree() and we flush the workqueue fs_info->endio_write_workers, which waits for the ordered extent for inode X to complete and when dropping the last reference of the ordered extent, with btrfs_put_ordered_extent(), when we call btrfs_add_delayed_iput() we don't add the inode to the list of delayed iputs because it has a refcount of 2, so we decrement it to 1 and return; 4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state; 5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now calls btrfs_add_delayed_iput() for inode X and there we trigger an assertion failure since the fs_info state has the flag BTRFS_FS_STATE_NO_DELAYED_IPUT set. Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for the async reclaim workers to finish, after we call cancel_work_sync() for them at close_ctree(), and by running delayed iputs after wait for the reclaim workers to finish and before setting the bit. This race was recently introduced by commit 19e60b2 ("btrfs: add extra warning if delayed iput is added when it's not allowed"). Without the new validation at btrfs_add_delayed_iput(), this described scenario was safe because close_ctree() later calls btrfs_commit_super(). That will run any final delayed iputs added by reclaim workers in the window between the btrfs_run_delayed_iputs() and the the reclaim workers being shut down. Reported-by: [email protected] Link: https://lore.kernel.org/linux-btrfs/[email protected]/T/#u Fixes: 19e60b2 ("btrfs: add extra warning if delayed iput is added when it's not allowed") Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
1 parent 1961d20 commit a26bf33

File tree

1 file changed

+18
-4
lines changed

1 file changed

+18
-4
lines changed

fs/btrfs/disk-io.c

Lines changed: 18 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -4312,8 +4312,8 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
43124312
*
43134313
* So wait for all ongoing ordered extents to complete and then run
43144314
* delayed iputs. This works because once we reach this point no one
4315-
* can either create new ordered extents nor create delayed iputs
4316-
* through some other means.
4315+
* can create new ordered extents, but delayed iputs can still be added
4316+
* by a reclaim worker (see comments further below).
43174317
*
43184318
* Also note that btrfs_wait_ordered_roots() is not safe here, because
43194319
* it waits for BTRFS_ORDERED_COMPLETE to be set on an ordered extent,
@@ -4324,15 +4324,29 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
43244324
btrfs_flush_workqueue(fs_info->endio_write_workers);
43254325
/* Ordered extents for free space inodes. */
43264326
btrfs_flush_workqueue(fs_info->endio_freespace_worker);
4327+
/*
4328+
* Run delayed iputs in case an async reclaim worker is waiting for them
4329+
* to be run as mentioned above.
4330+
*/
43274331
btrfs_run_delayed_iputs(fs_info);
4328-
/* There should be no more workload to generate new delayed iputs. */
4329-
set_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state);
43304332

43314333
cancel_work_sync(&fs_info->async_reclaim_work);
43324334
cancel_work_sync(&fs_info->async_data_reclaim_work);
43334335
cancel_work_sync(&fs_info->preempt_reclaim_work);
43344336
cancel_work_sync(&fs_info->em_shrinker_work);
43354337

4338+
/*
4339+
* Run delayed iputs again because an async reclaim worker may have
4340+
* added new ones if it was flushing delalloc:
4341+
*
4342+
* shrink_delalloc() -> btrfs_start_delalloc_roots() ->
4343+
* start_delalloc_inodes() -> btrfs_add_delayed_iput()
4344+
*/
4345+
btrfs_run_delayed_iputs(fs_info);
4346+
4347+
/* There should be no more workload to generate new delayed iputs. */
4348+
set_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state);
4349+
43364350
/* Cancel or finish ongoing discard work */
43374351
btrfs_discard_cleanup(fs_info);
43384352

0 commit comments

Comments
 (0)