
Commit 28270e2

fdmanana authored and kdave committed
btrfs: always reserve space for delayed refs when starting transaction
When starting a transaction (or joining an existing one with
btrfs_start_transaction()), we reserve space for the number of items we
want to insert in a btree, but we don't do it for the delayed refs we
will generate while using the transaction to modify (COW) extent buffers
in a btree or allocate new extent buffers.

Basically how it works:

1) When we start a transaction we reserve space for the number of items
   the caller wants to be inserted/modified/deleted in a btree. This
   space goes to the transaction block reserve;

2) If the delayed refs block reserve is not full, its size is greater
   than the amount of its reserved space, and the flush method is
   BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for it
   corresponding to the number of items the caller wants to
   insert/modify/delete in a btree;

3) The size of the delayed refs block reserve is increased when a task
   creates delayed refs after COWing an extent buffer, allocating a new
   one or deleting (freeing) an extent buffer. This happens after the
   task started or joined a transaction, whenever it calls
   btrfs_update_delayed_refs_rsv();

4) The delayed refs block reserve is then refilled by anyone calling
   btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
   operations or when someone else calls btrfs_start_transaction() with
   a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;

5) As a task COWs or allocates extent buffers, it consumes space from
   the transaction block reserve. When the task releases its transaction
   handle (btrfs_end_transaction()) or it attempts to commit the
   transaction, it releases any remaining space in the transaction block
   reserve that it did not use, as not all space may have been used (due
   to pessimistic space calculation) by calling btrfs_block_rsv_release(),
   which will try to add that unused space to the delayed refs block
   reserve (if its current size is greater than its reserved space).

   That transferred space may not be enough to completely fulfill the
   delayed refs block reserve. Plus we have some tasks that will attempt
   to modify as many leaves as they can before getting -ENOSPC (and then
   reserving more space and retrying), such as hole punching and extent
   cloning which call btrfs_replace_file_extents(). Such tasks can
   therefore generate a high number of delayed refs, for both metadata
   and data (we can't know in advance how many file extent items we will
   find in a range and therefore how many delayed refs for dropping
   references on data extents we will generate);

6) If a transaction starts its commit before the delayed refs block
   reserve is refilled, for example by the transaction kthread or by
   someone who called btrfs_join_transaction() before starting the
   commit, then when running delayed references, if we don't have enough
   reserved space in the delayed refs block reserve, we will consume
   space from the global block reserve.

Now this doesn't make a lot of sense because:

1) We should reserve space for delayed references when starting the
   transaction, since we have no guarantees the delayed refs block
   reserve will be refilled;

2) If no refill happens then we will consume from the global block
   reserve when running delayed refs during the transaction commit;

3) If we have a bunch of tasks calling btrfs_start_transaction() with a
   number of items greater than zero and at the time the delayed refs
   reserve is full, then we don't reserve any space at
   btrfs_start_transaction() for the delayed refs that will be generated
   by a task, and we can therefore end up using a lot of space from the
   global reserve when running the delayed refs during a transaction
   commit;

4) There are also other operations that result in bumping the size of
   the delayed refs reserve, such as creating and deleting block groups,
   as well as the need to update a block group item because we allocated
   or freed an extent from the respective block group;

5) If we have a significant gap between the delayed refs reserve's size
   and its reserved space, two very bad things may happen:

   1) The reserved space of the global reserve may not be enough and we
      fail the transaction commit with -ENOSPC when running delayed refs;

   2) If the available space in the global reserve is enough, it may
      result in nearly exhausting it. If the fs has no more unallocated
      device space for allocating a new block group and all the
      available space in existing metadata block groups is not far from
      the global reserve's size before we started the transaction
      commit, we may end up in a situation where after the transaction
      commit we have too little available metadata space, and any future
      transaction commit will fail with -ENOSPC, because although we
      were able to reserve space to start the transaction, we were not
      able to commit it, as running delayed refs generates some more
      delayed refs (to update the extent tree for example) - this
      includes not even being able to commit a transaction that was
      started with the goal of unlinking a file, removing an empty data
      block group or doing reclaim/balance, so there's no way to release
      metadata space.

      In the worst case the next time we mount the filesystem we may
      also fail with -ENOSPC due to failure to commit a transaction to
      cleanup orphan inodes. This latter case was reported and hit by
      someone running a SLE (SUSE Linux Enterprise) distribution for
      example - where the fs had no more unallocated space that could be
      used to allocate a new metadata block group, and the available
      metadata space was about 1.5M, not enough to commit a transaction
      to cleanup an orphan inode (or do relocation of data block groups
      that were far from being full).

So improve on this situation by always reserving space for delayed refs
when calling start_transaction(), and if the flush method is
BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
reserve if it's not full.

The space reserved for the delayed refs is added to a local block
reserve that is part of the transaction handle, and when a task updates
the delayed refs block reserve size, after creating a delayed ref, the
space is transferred from that local reserve to the global delayed refs
reserve (fs_info->delayed_refs_rsv). In case the local reserve does not
have enough space, which may happen for tasks that generate a variable
and potentially large number of delayed refs (such as the hole punching
and extent cloning cases mentioned before), we transfer any available
space and then rely on the current behaviour of hoping some other task
refills the delayed refs reserve or fallback to the global block
reserve.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
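The end-of-transaction transfer described in step 5 can be sketched in plain userspace C. This is a hedged model, not kernel code: the struct fields and the helper name are simplified stand-ins for the kernel's btrfs_block_rsv machinery (which also takes spinlocks and updates space_info counters).

```c
#include <assert.h>

/* simplified stand-in for struct btrfs_block_rsv */
struct block_rsv {
	unsigned long long size;     /* bytes the reserve wants to have */
	unsigned long long reserved; /* bytes actually set aside for it */
};

/*
 * Sketch of step 5: when a transaction handle is released, unused space
 * from the transaction block reserve is moved into the delayed refs
 * reserve, but only up to the gap between its size and reserved space.
 * Returns the bytes that go back to the free space pool instead.
 */
static unsigned long long release_unused(struct block_rsv *delayed,
					 unsigned long long unused)
{
	unsigned long long gap = 0;
	unsigned long long moved;

	if (delayed->size > delayed->reserved)
		gap = delayed->size - delayed->reserved;
	moved = unused < gap ? unused : gap;
	delayed->reserved += moved;
	return unused - moved;
}
```

Note how a full delayed refs reserve (no gap) means none of the unused space is transferred, which is exactly the case where a later burst of delayed refs can leave the reserve underfunded.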
1 parent adb86db commit 28270e2

File tree

4 files changed: +132 -31 lines changed

fs/btrfs/block-rsv.c

Lines changed: 3 additions & 3 deletions
```diff
@@ -281,10 +281,10 @@ u64 btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_rsv *target = NULL;
 
 	/*
-	 * If we are the delayed_rsv then push to the global rsv, otherwise dump
-	 * into the delayed rsv if it is not full.
+	 * If we are a delayed block reserve then push to the global rsv,
+	 * otherwise dump into the global delayed reserve if it is not full.
	 */
-	if (block_rsv == delayed_rsv)
+	if (block_rsv->type == BTRFS_BLOCK_RSV_DELOPS)
 		target = global_rsv;
 	else if (block_rsv != global_rsv && !btrfs_block_rsv_full(delayed_rsv))
 		target = delayed_rsv;
```
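The new target selection in btrfs_block_rsv_release() can be modeled as a small pure function. The enum names below are illustrative stand-ins (only RSV_DELOPS mirrors the real BTRFS_BLOCK_RSV_DELOPS constant), and -1 stands for "return the bytes to the space pool":

```c
#include <assert.h>

/* illustrative reserve kinds; RSV_DELOPS mirrors BTRFS_BLOCK_RSV_DELOPS */
enum rsv_kind { RSV_GLOBAL, RSV_DELAYED_REFS, RSV_TRANS, RSV_DELOPS };

/*
 * Where should excess bytes released from reserve `kind` be dumped?
 * A per-transaction DELOPS reserve pushes to the global reserve; any
 * other non-global reserve tops up the delayed refs reserve while it
 * is not full; otherwise the bytes return to the space pool (-1).
 */
static int release_target(enum rsv_kind kind, int delayed_refs_full)
{
	if (kind == RSV_DELOPS)
		return RSV_GLOBAL;
	if (kind != RSV_GLOBAL && !delayed_refs_full)
		return RSV_DELAYED_REFS;
	return -1;
}
```

The switch from a pointer comparison to a type check is what lets the new per-transaction local reserves (which all share the DELOPS type but are distinct objects) take the "push to global" path.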

fs/btrfs/delayed-ref.c

Lines changed: 20 additions & 1 deletion
```diff
@@ -89,7 +89,9 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_block_rsv *local_rsv = &trans->delayed_rsv;
 	u64 num_bytes;
+	u64 reserved_bytes;
 
 	num_bytes = btrfs_calc_delayed_ref_bytes(fs_info, trans->delayed_ref_updates);
 	num_bytes += btrfs_calc_delayed_ref_csum_bytes(fs_info,
@@ -98,9 +100,26 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
 	if (num_bytes == 0)
 		return;
 
+	/*
+	 * Try to take num_bytes from the transaction's local delayed reserve.
+	 * If not possible, try to take as much as it's available. If the local
+	 * reserve doesn't have enough reserved space, the delayed refs reserve
+	 * will be refilled next time btrfs_delayed_refs_rsv_refill() is called
+	 * by someone or if a transaction commit is triggered before that, the
+	 * global block reserve will be used. We want to minimize using the
+	 * global block reserve for cases we can account for in advance, to
+	 * avoid exhausting it and reach -ENOSPC during a transaction commit.
+	 */
+	spin_lock(&local_rsv->lock);
+	reserved_bytes = min(num_bytes, local_rsv->reserved);
+	local_rsv->reserved -= reserved_bytes;
+	local_rsv->full = (local_rsv->reserved >= local_rsv->size);
+	spin_unlock(&local_rsv->lock);
+
 	spin_lock(&delayed_rsv->lock);
 	delayed_rsv->size += num_bytes;
-	delayed_rsv->full = false;
+	delayed_rsv->reserved += reserved_bytes;
+	delayed_rsv->full = (delayed_rsv->reserved >= delayed_rsv->size);
 	spin_unlock(&delayed_rsv->lock);
 	trans->delayed_ref_updates = 0;
 	trans->delayed_ref_csum_deletions = 0;
```
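Stripped of spinlocks and kernel types, the accounting this hunk performs looks like the following userspace model. It is a sketch of the idea, not the kernel function: struct fields are simplified and the return value (the uncovered shortfall) is made explicit for illustration.

```c
#include <assert.h>

/* simplified stand-in for struct btrfs_block_rsv */
struct block_rsv {
	unsigned long long size;
	unsigned long long reserved;
	int full;
};

/*
 * Model of the updated btrfs_update_delayed_refs_rsv(): grow the global
 * delayed refs reserve by num_bytes and cover as much of that as
 * possible with reserved space taken from the transaction's local
 * reserve. Returns the shortfall that must come from a later refill
 * (or, at commit time, from the global block reserve).
 */
static unsigned long long update_delayed_refs_rsv(struct block_rsv *local,
						  struct block_rsv *global,
						  unsigned long long num_bytes)
{
	unsigned long long reserved_bytes =
		num_bytes < local->reserved ? num_bytes : local->reserved;

	local->reserved -= reserved_bytes;
	local->full = (local->reserved >= local->size);

	global->size += num_bytes;
	global->reserved += reserved_bytes;
	global->full = (global->reserved >= global->size);

	return num_bytes - reserved_bytes;
}
```

When the local reserve covers everything, the global reserve's size and reserved space grow in lockstep and no gap opens; a nonzero return value is exactly the gap the old code always created.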

fs/btrfs/transaction.c

Lines changed: 107 additions & 27 deletions
```diff
@@ -561,17 +561,82 @@ static inline bool need_reserve_reloc_root(struct btrfs_root *root)
 	return true;
 }
 
+static int btrfs_reserve_trans_metadata(struct btrfs_fs_info *fs_info,
+					enum btrfs_reserve_flush_enum flush,
+					u64 num_bytes,
+					u64 *delayed_refs_bytes)
+{
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_space_info *si = fs_info->trans_block_rsv.space_info;
+	u64 extra_delayed_refs_bytes = 0;
+	u64 bytes;
+	int ret;
+
+	/*
+	 * If there's a gap between the size of the delayed refs reserve and
+	 * its reserved space, than some tasks have added delayed refs or bumped
+	 * its size otherwise (due to block group creation or removal, or block
+	 * group item update). Also try to allocate that gap in order to prevent
+	 * using (and possibly abusing) the global reserve when committing the
+	 * transaction.
+	 */
+	if (flush == BTRFS_RESERVE_FLUSH_ALL &&
+	    !btrfs_block_rsv_full(delayed_refs_rsv)) {
+		spin_lock(&delayed_refs_rsv->lock);
+		if (delayed_refs_rsv->size > delayed_refs_rsv->reserved)
+			extra_delayed_refs_bytes = delayed_refs_rsv->size -
+				delayed_refs_rsv->reserved;
+		spin_unlock(&delayed_refs_rsv->lock);
+	}
+
+	bytes = num_bytes + *delayed_refs_bytes + extra_delayed_refs_bytes;
+
+	/*
+	 * We want to reserve all the bytes we may need all at once, so we only
+	 * do 1 enospc flushing cycle per transaction start.
+	 */
+	ret = btrfs_reserve_metadata_bytes(fs_info, si, bytes, flush);
+	if (ret == 0) {
+		if (extra_delayed_refs_bytes > 0)
+			btrfs_migrate_to_delayed_refs_rsv(fs_info,
+							  extra_delayed_refs_bytes);
+		return 0;
+	}
+
+	if (extra_delayed_refs_bytes > 0) {
+		bytes -= extra_delayed_refs_bytes;
+		ret = btrfs_reserve_metadata_bytes(fs_info, si, bytes, flush);
+		if (ret == 0)
+			return 0;
+	}
+
+	/*
+	 * If we are an emergency flush, which can steal from the global block
+	 * reserve, then attempt to not reserve space for the delayed refs, as
+	 * we will consume space for them from the global block reserve.
+	 */
+	if (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL) {
+		bytes -= *delayed_refs_bytes;
+		*delayed_refs_bytes = 0;
+		ret = btrfs_reserve_metadata_bytes(fs_info, si, bytes, flush);
+	}
+
+	return ret;
+}
+
 static struct btrfs_trans_handle *
 start_transaction(struct btrfs_root *root, unsigned int num_items,
 		  unsigned int type, enum btrfs_reserve_flush_enum flush,
 		  bool enforce_qgroups)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_block_rsv *trans_rsv = &fs_info->trans_block_rsv;
 	struct btrfs_trans_handle *h;
 	struct btrfs_transaction *cur_trans;
 	u64 num_bytes = 0;
 	u64 qgroup_reserved = 0;
+	u64 delayed_refs_bytes = 0;
 	bool reloc_reserved = false;
 	bool do_chunk_alloc = false;
 	int ret;
@@ -594,9 +659,6 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 	 * the appropriate flushing if need be.
 	 */
 	if (num_items && root != fs_info->chunk_root) {
-		struct btrfs_block_rsv *rsv = &fs_info->trans_block_rsv;
-		u64 delayed_refs_bytes = 0;
-
 		qgroup_reserved = num_items * fs_info->nodesize;
 		/*
 		 * Use prealloc for now, as there might be a currently running
@@ -608,20 +670,16 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 		if (ret)
 			return ERR_PTR(ret);
 
+		num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_items);
 		/*
-		 * We want to reserve all the bytes we may need all at once, so
-		 * we only do 1 enospc flushing cycle per transaction start. We
-		 * accomplish this by simply assuming we'll do num_items worth
-		 * of delayed refs updates in this trans handle, and refill that
-		 * amount for whatever is missing in the reserve.
+		 * If we plan to insert/update/delete "num_items" from a btree,
+		 * we will also generate delayed refs for extent buffers in the
+		 * respective btree paths, so reserve space for the delayed refs
+		 * that will be generated by the caller as it modifies btrees.
+		 * Try to reserve them to avoid excessive use of the global
+		 * block reserve.
		 */
-		num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_items);
-		if (flush == BTRFS_RESERVE_FLUSH_ALL &&
-		    !btrfs_block_rsv_full(delayed_refs_rsv)) {
-			delayed_refs_bytes = btrfs_calc_delayed_ref_bytes(fs_info,
-									  num_items);
-			num_bytes += delayed_refs_bytes;
-		}
+		delayed_refs_bytes = btrfs_calc_delayed_ref_bytes(fs_info, num_items);
 
 		/*
 		 * Do the reservation for the relocation root creation
@@ -631,17 +689,14 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 			reloc_reserved = true;
 		}
 
-		ret = btrfs_reserve_metadata_bytes(fs_info, rsv->space_info,
-						   num_bytes, flush);
+		ret = btrfs_reserve_trans_metadata(fs_info, flush, num_bytes,
+						   &delayed_refs_bytes);
 		if (ret)
 			goto reserve_fail;
-		if (delayed_refs_bytes) {
-			btrfs_migrate_to_delayed_refs_rsv(fs_info, delayed_refs_bytes);
-			num_bytes -= delayed_refs_bytes;
-		}
-		btrfs_block_rsv_add_bytes(rsv, num_bytes, true);
 
-		if (rsv->space_info->force_alloc)
+		btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
+
+		if (trans_rsv->space_info->force_alloc)
 			do_chunk_alloc = true;
 	} else if (num_items == 0 && flush == BTRFS_RESERVE_FLUSH_ALL &&
 		   !btrfs_block_rsv_full(delayed_refs_rsv)) {
@@ -701,6 +756,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 
 	h->type = type;
 	INIT_LIST_HEAD(&h->new_bgs);
+	btrfs_init_metadata_block_rsv(fs_info, &h->delayed_rsv, BTRFS_BLOCK_RSV_DELOPS);
 
 	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START &&
@@ -713,8 +769,17 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 	if (num_bytes) {
 		trace_btrfs_space_reservation(fs_info, "transaction",
 					      h->transid, num_bytes, 1);
-		h->block_rsv = &fs_info->trans_block_rsv;
+		h->block_rsv = trans_rsv;
 		h->bytes_reserved = num_bytes;
+		if (delayed_refs_bytes > 0) {
+			trace_btrfs_space_reservation(fs_info,
+						      "local_delayed_refs_rsv",
+						      h->transid,
+						      delayed_refs_bytes, 1);
+			h->delayed_refs_bytes_reserved = delayed_refs_bytes;
+			btrfs_block_rsv_add_bytes(&h->delayed_rsv, delayed_refs_bytes, true);
+			delayed_refs_bytes = 0;
+		}
 		h->reloc_reserved = reloc_reserved;
 	}
 
@@ -770,8 +835,10 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 	kmem_cache_free(btrfs_trans_handle_cachep, h);
 alloc_fail:
 	if (num_bytes)
-		btrfs_block_rsv_release(fs_info, &fs_info->trans_block_rsv,
-					num_bytes, NULL);
+		btrfs_block_rsv_release(fs_info, trans_rsv, num_bytes, NULL);
+	if (delayed_refs_bytes)
+		btrfs_space_info_free_bytes_may_use(fs_info, trans_rsv->space_info,
+						    delayed_refs_bytes);
 reserve_fail:
 	btrfs_qgroup_free_meta_prealloc(root, qgroup_reserved);
 	return ERR_PTR(ret);
@@ -992,18 +1059,31 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
 
 	if (!trans->block_rsv) {
 		ASSERT(!trans->bytes_reserved);
+		ASSERT(!trans->delayed_refs_bytes_reserved);
 		return;
 	}
 
-	if (!trans->bytes_reserved)
+	if (!trans->bytes_reserved) {
+		ASSERT(!trans->delayed_refs_bytes_reserved);
 		return;
+	}
 
 	ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
 	trace_btrfs_space_reservation(fs_info, "transaction",
 				      trans->transid, trans->bytes_reserved, 0);
 	btrfs_block_rsv_release(fs_info, trans->block_rsv,
 				trans->bytes_reserved, NULL);
 	trans->bytes_reserved = 0;
+
+	if (!trans->delayed_refs_bytes_reserved)
+		return;
+
+	trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
+				      trans->transid,
+				      trans->delayed_refs_bytes_reserved, 0);
+	btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
+				trans->delayed_refs_bytes_reserved, NULL);
+	trans->delayed_refs_bytes_reserved = 0;
 }
 
 static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
```
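The fallback ladder in the new btrfs_reserve_trans_metadata() reduces to three attempts against the metadata space pool. The following is a rough userspace model under stated assumptions: try_reserve and the byte pool are made up for illustration, and the gap is passed in directly (in the real code it is computed, and nonzero only for BTRFS_RESERVE_FLUSH_ALL).

```c
#include <assert.h>

enum flush_mode { FLUSH_ALL, FLUSH_ALL_STEAL };

/* toy space pool: a reservation succeeds iff it fits in `avail` */
static int try_reserve(unsigned long long *avail, unsigned long long bytes)
{
	if (bytes > *avail)
		return -28; /* stand-in for -ENOSPC */
	*avail -= bytes;
	return 0;
}

/*
 * Model of the ladder: 1) ask for everything at once, including the
 * delayed refs reserve's current size/reserved gap; 2) on failure,
 * retry without the gap; 3) for FLUSH_ALL_STEAL, retry without the
 * delayed refs bytes too, since they can later be stolen from the
 * global block reserve.
 */
static int reserve_trans_metadata(unsigned long long *avail,
				  enum flush_mode flush,
				  unsigned long long num_bytes,
				  unsigned long long *delayed_refs_bytes,
				  unsigned long long gap)
{
	unsigned long long bytes = num_bytes + *delayed_refs_bytes + gap;
	int ret = try_reserve(avail, bytes);

	if (ret == 0)
		return 0;
	if (gap > 0) {
		bytes -= gap;
		ret = try_reserve(avail, bytes);
		if (ret == 0)
			return 0;
	}
	if (flush == FLUSH_ALL_STEAL) {
		bytes -= *delayed_refs_bytes;
		*delayed_refs_bytes = 0;
		ret = try_reserve(avail, bytes);
	}
	return ret;
}
```

Zeroing *delayed_refs_bytes on the steal path mirrors how the caller then skips funding the transaction handle's local reserve, leaving those refs to the global reserve as before.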

fs/btrfs/transaction.h

Lines changed: 2 additions & 0 deletions
```diff
@@ -118,6 +118,7 @@ enum {
 struct btrfs_trans_handle {
 	u64 transid;
 	u64 bytes_reserved;
+	u64 delayed_refs_bytes_reserved;
 	u64 chunk_bytes_reserved;
 	unsigned long delayed_ref_updates;
 	unsigned long delayed_ref_csum_deletions;
@@ -140,6 +141,7 @@ struct btrfs_trans_handle {
 	bool in_fsync;
 	struct btrfs_fs_info *fs_info;
 	struct list_head new_bgs;
+	struct btrfs_block_rsv delayed_rsv;
 };
 
 /*
```
