
Commit 00d80e3

fdmanana authored and masoncl committed
Btrfs: fix quick exhaustion of the system array in the superblock
Omar reported that after commit 4fbcdf6 ("Btrfs: fix -ENOSPC when finishing block group creation"), introduced in 4.2-rc1, the following test was failing due to exhaustion of the system array in the superblock:

  #!/bin/bash

  truncate -s 100T big.img
  mkfs.btrfs big.img
  mount -o loop big.img /mnt/loop

  num=5
  sz=10T

  for ((i = 0; i < $num; i++)); do
      echo fallocate $i $sz
      fallocate -l $sz /mnt/loop/testfile$i
  done
  btrfs filesystem sync /mnt/loop

  for ((i = 0; i < $num; i++)); do
      echo rm $i
      rm /mnt/loop/testfile$i
      btrfs filesystem sync /mnt/loop
  done

  umount /mnt/loop

This made btrfs_add_system_chunk() fail with -EFBIG due to excessive allocation of system block groups. This happened because the test creates a large number of data block groups per transaction, and when committing the transaction we start the writeout of the block group caches for all the new (dirty) block groups, which results in pre-allocating space for each block group's free space cache using the same transaction handle. That in turn often leads to creation of more block groups, and all get attached to the new_bgs list of the same transaction handle to the point of getting a list with over 1500 elements; creation of new block groups leads to the need of reserving space in the chunk block reserve and often creating a new system block group too.

So that made us quickly exhaust the chunk block reserve/system space info, because as of the commit mentioned before, we reserve space for each new block group in the chunk block reserve, unlike before where we would not and would at most allocate one new system block group, and therefore would only ensure that there was enough space in the system space info to allocate 1 new block group even if we ended up allocating thousands of new block groups using the same transaction handle. That worked most of the time because the required space computed at check_system_chunk() is very pessimistic (it assumes a chunk tree height of BTRFS_MAX_LEVEL (8) and that all nodes/leafs in a path will be COWed and split), and since the updates to the chunk tree all happen at btrfs_create_pending_block_groups(), it is unlikely that a path needs to be COWed more than once (unless writepages() for the btree inode is called by mm in between), which compensated for the need of creating any new nodes/leafs in the chunk tree.

So fix this by ensuring we don't accumulate an overly large list of new block groups in a transaction handle's new_bgs list, inserting/updating the chunk tree for all accumulated new block groups and releasing the unused space from the chunk block reserve whenever the list becomes sufficiently large. This is a generic solution even though the problem currently can only happen when starting the writeout of the free space caches for all dirty block groups (btrfs_start_dirty_block_groups()).

Reported-by: Omar Sandoval <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Tested-by: Omar Sandoval <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
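For a sense of scale, below is a rough back-of-the-envelope sketch (a userspace C program, not the kernel's exact accounting) of why these per-block-group reservations add up so quickly. BTRFS_MAX_LEVEL (8), the 1500-element new_bgs list and the 2 MiB flush threshold come from the commit message and diff; the 16 KiB nodesize, the "COW + split at every level" cost factor and the "2 items per new block group" figure are illustrative assumptions of mine, not values stated in the commit.

/*
 * Back-of-the-envelope sketch of the chunk block reserve pressure
 * described in this commit. Assumptions (illustrative only):
 *   - 16 KiB metadata nodesize,
 *   - a pessimistic per-item cost of "COW + split a full-height path",
 *     i.e. 2 * BTRFS_MAX_LEVEL * nodesize,
 *   - 2 chunk tree items touched per new block group (chunk item plus
 *     one device item on a single-device filesystem).
 */
#include <stdio.h>

#define BTRFS_MAX_LEVEL  8                        /* maximum b-tree height */
#define NODESIZE         (16 * 1024ull)           /* assumed nodesize */
#define FLUSH_THRESHOLD  (2 * 1024 * 1024ull)     /* threshold used by the patch */

int main(void)
{
        /* pessimistic cost of updating one item: COW + split a full path */
        unsigned long long per_item = 2ull * BTRFS_MAX_LEVEL * NODESIZE;

        /* assumed items per new block group: chunk item + device item */
        unsigned long long per_block_group = 2 * per_item;

        unsigned long long pending = 1500;        /* list size from the report */

        printf("reserve per new block group : %llu KiB\n", per_block_group >> 10);
        printf("reserve for %llu pending     : %llu MiB\n",
               pending, (pending * per_block_group) >> 20);
        printf("pending block groups to hit the %llu MiB flush threshold: %llu\n",
               FLUSH_THRESHOLD >> 20, FLUSH_THRESHOLD / per_block_group);
        return 0;
}

Under these assumptions each new block group pins about 512 KiB of reservation, so the flush added by the patch kicks in after only a handful of pending block groups, instead of letting 1500+ of them accumulate on one transaction handle.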
1 parent 3e303ea commit 00d80e3

File tree

1 file changed: +18 -0 lines changed

fs/btrfs/extent-tree.c

Lines changed: 18 additions & 0 deletions
@@ -4227,6 +4227,24 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
         space_info->chunk_alloc = 0;
         spin_unlock(&space_info->lock);
         mutex_unlock(&fs_info->chunk_mutex);
+        /*
+         * When we allocate a new chunk we reserve space in the chunk block
+         * reserve to make sure we can COW nodes/leafs in the chunk tree or
+         * add new nodes/leafs to it if we end up needing to do it when
+         * inserting the chunk item and updating device items as part of the
+         * second phase of chunk allocation, performed by
+         * btrfs_finish_chunk_alloc(). So make sure we don't accumulate a
+         * large number of new block groups to create in our transaction
+         * handle's new_bgs list to avoid exhausting the chunk block reserve
+         * in extreme cases - like having a single transaction create many new
+         * block groups when starting to write out the free space caches of all
+         * the block groups that were made dirty during the lifetime of the
+         * transaction.
+         */
+        if (trans->chunk_bytes_reserved >= (2 * 1024 * 1024ull)) {
+                btrfs_create_pending_block_groups(trans, trans->root);
+                btrfs_trans_release_chunk_metadata(trans);
+        }
         return ret;
 }
 
