
Commit f8f931b

Hugh Dickins authored and akpm00 committed
mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting.

Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()).

Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately).

Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data to 0 without checking and unqueueing a THP folio from the deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so there is no problem with calling folio_unqueue_deferred_split() before resetting memcg_data.

That goes back to 5.4 commit 87eaceb ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to the deferred queue, but no check on the deferred queue before adding a THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on the deferred queue could have been swapped out), but 6.12 commit dafff3f ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on the deferred queue commonplace.

Keep the check on swapcache before adding to the deferred queue? Yes: it is no longer essential, but it preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment.

Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting the "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first tries to split the THP (splitting of course unqueues), or skips it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case.

The 87eaceb commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list.
As above, false in rare cases, and now commonly false.

Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix corner cases.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 87eaceb ("mm: thp: make deferred split shrinker memcg aware")
Fixes: dafff3f ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
1 parent e66f318 commit f8f931b

8 files changed, +67 -24 lines changed


mm/huge_memory.c

Lines changed: 26 additions & 9 deletions
@@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
         return split_huge_page_to_list_to_order(&folio->page, list, ret);
 }
 
-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
 {
         struct deferred_split *ds_queue;
         unsigned long flags;
+        bool unqueued = false;
+
+        WARN_ON_ONCE(folio_ref_count(folio));
+        WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
 
         ds_queue = get_deferred_split_queue(folio);
         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
                                       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
                 }
                 list_del_init(&folio->_deferred_list);
+                unqueued = true;
         }
         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+        return unqueued;        /* useful for debug warnings */
 }
 
 /* partially_mapped=false won't clear PG_partially_mapped folio flag */
@@ -3627,14 +3647,11 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
                 return;
 
         /*
-         * The try_to_unmap() in page reclaim path might reach here too,
-         * this may cause a race condition to corrupt deferred split queue.
-         * And, if page reclaim is already handling the same folio, it is
-         * unnecessary to handle it again in shrinker.
-         *
-         * Check the swapcache flag to determine if the folio is being
-         * handled by page reclaim since THP swap would add the folio into
-         * swap cache before calling try_to_unmap().
+         * Exclude swapcache: originally to avoid a corrupt deferred split
+         * queue. Nowadays that is fully prevented by mem_cgroup_swapout();
+         * but if page reclaim is already handling the same folio, it is
+         * unnecessary to handle it again in the shrinker, so excluding
+         * swapcache here may still be a useful optimization.
          */
         if (folio_test_swapcache(folio))
                 return;
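Editor's note: the refcount-zero rule in the new comment is the subtle point. deferred_split_scan() splices queued folios onto a private on-stack list before dropping split_queue_lock, so even a locked list_del_init() elsewhere can edit a list the lock no longer guards. Below is a compact userspace model of that splice, for illustration only: the list helpers mimic the kernel's <linux/list.h>, and names like folio_entry are hypothetical.

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }
static void list_add(struct list_head *n, struct list_head *h)
{
        n->next = h->next; n->prev = h;
        h->next->prev = n; h->next = n;
}
static void list_splice_init(struct list_head *list, struct list_head *head)
{
        if (list_empty(list))
                return;
        struct list_head *first = list->next, *last = list->prev;
        first->prev = head; last->next = head->next;
        head->next->prev = last; head->next = first;
        INIT_LIST_HEAD(list);
}

int main(void)
{
        struct list_head split_queue, on_stack, folio_entry;

        INIT_LIST_HEAD(&split_queue);
        INIT_LIST_HEAD(&on_stack);
        INIT_LIST_HEAD(&folio_entry);
        list_add(&folio_entry, &split_queue);   /* deferred_split_folio() */

        /* deferred_split_scan(): under the lock, move everything onto an
         * on-stack list, then drop the lock and work on that list. */
        list_splice_init(&split_queue, &on_stack);

        /* The folio's entry still looks queued, but it now lives on a list
         * that split_queue_lock no longer guards: an unqueue here, even
         * with the lock held, would edit the scanner's private list. */
        printf("entry non-empty: %d, queue empty: %d\n",
               !list_empty(&folio_entry), list_empty(&split_queue));
        return 0;
}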

mm/internal.h

Lines changed: 5 additions & 5 deletions
@@ -639,21 +639,21 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 #endif
 }
 
-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
 {
         if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
-                return;
+                return false;
 
         /*
          * At this point, there is no one trying to add the folio to
          * deferred_list. If folio is not in deferred_list, it's safe
          * to check without acquiring the split_queue_lock.
          */
         if (data_race(list_empty(&folio->_deferred_list)))
-                return;
+                return false;
 
-        __folio_undo_large_rmappable(folio);
+        return __folio_unqueue_deferred_split(folio);
 }
 
 static inline struct folio *page_rmappable_folio(struct page *page)
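Editor's note: the switch to a bool return is what lets call sites double as assertions. Where a folio must already be off the queue, WARN_ON_ONCE(folio_unqueue_deferred_split(folio)) both fixes up any missed case and reports it. A minimal userspace sketch of that pattern follows; warn_on_once and unqueue_deferred_split here are stand-ins for the kernel's, not its actual code.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the kernel's WARN_ON_ONCE(): reports at most once. */
#define warn_on_once(cond) do {                                 \
        static bool warned;                                     \
        if ((cond) && !warned) {                                \
                warned = true;                                  \
                fprintf(stderr, "WARNING: %s\n", #cond);        \
        }                                                       \
} while (0)

/* Models folio_unqueue_deferred_split(): unqueues if queued,
 * and reports whether anything was actually unqueued. */
static bool unqueue_deferred_split(bool *queued)
{
        if (!*queued)
                return false;
        *queued = false;
        return true;
}

int main(void)
{
        bool queued = true;

        /* Normal path: unqueue while it is safe, ignore the result. */
        unqueue_deferred_split(&queued);

        /* Later, where the folio must already be off the queue, the same
         * call doubles as a debug check: it fires only if a case was missed. */
        warn_on_once(unqueue_deferred_split(&queued));
        return 0;
}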

mm/memcontrol-v1.c

Lines changed: 25 additions & 0 deletions
@@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
         css_get(&to->css);
         css_put(&from->css);
 
+        /* Warning should never happen, so don't worry about refcount non-0 */
+        WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
         folio->memcg_data = (unsigned long)to;
 
         __folio_memcg_unlock(from);
@@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
         enum mc_target_type target_type;
         union mc_target target;
         struct folio *folio;
+        bool tried_split_before = false;
 
+retry_pmd:
         ptl = pmd_trans_huge_lock(pmd, vma);
         if (ptl) {
                 if (mc.precharge < HPAGE_PMD_NR) {
@@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
                 target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
                 if (target_type == MC_TARGET_PAGE) {
                         folio = target.folio;
+                        /*
+                         * Deferred split queue locking depends on memcg,
+                         * and unqueue is unsafe unless folio refcount is 0:
+                         * split or skip if on the queue? first try to split.
+                         */
+                        if (!list_empty(&folio->_deferred_list)) {
+                                spin_unlock(ptl);
+                                if (!tried_split_before)
+                                        split_folio(folio);
+                                folio_unlock(folio);
+                                folio_put(folio);
+                                if (tried_split_before)
+                                        return 0;
+                                tried_split_before = true;
+                                goto retry_pmd;
+                        }
+                        /*
+                         * So long as that pmd lock is held, the folio cannot
+                         * be racily added to the _deferred_list, because
+                         * __folio_remove_rmap() will find !partially_mapped.
+                         */
                         if (folio_isolate_lru(folio)) {
                                 if (!mem_cgroup_move_account(folio, true,
                                                              mc.from, mc.to)) {
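Editor's note: since refcount is non-zero on the move-charge path, the hunk above cannot unqueue; it splits once and otherwise skips. A userspace sketch of that control flow follows, with hypothetical names (on_deferred_list, split_folio_model) standing in for the kernel state.

#include <stdbool.h>
#include <stdio.h>

static bool on_deferred_list;   /* models !list_empty(&folio->_deferred_list) */
static bool split_succeeds;     /* toggle to exercise both outcomes */

static void split_folio_model(void)
{
        if (split_succeeds)
                on_deferred_list = false;       /* splitting unqueues */
}

/* Returns 1 if the charge was moved, 0 if the pmd was skipped. */
static int move_charge_model(void)
{
        bool tried_split_before = false;
retry_pmd:
        /* (pmd lock taken here in the kernel) */
        if (on_deferred_list) {
                /* (locks dropped here in the kernel) */
                if (!tried_split_before)
                        split_folio_model();
                if (tried_split_before)
                        return 0;               /* still queued: skip it */
                tried_split_before = true;
                goto retry_pmd;                 /* revalidate from scratch */
        }
        /* Refcount never reaches 0 on this path, so unqueueing is not an
         * option: only a folio off the queue may have its memcg moved. */
        return 1;
}

int main(void)
{
        on_deferred_list = true;
        split_succeeds = true;
        printf("split worked, moved: %d\n", move_charge_model());

        on_deferred_list = true;
        split_succeeds = false;
        printf("split failed, skipped: %d\n", !move_charge_model());
        return 0;
}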

mm/memcontrol.c

Lines changed: 5 additions & 3 deletions
@@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
         struct obj_cgroup *objcg;
 
         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
-        VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
-                        !folio_test_hugetlb(folio) &&
-                        !list_empty(&folio->_deferred_list), folio);
 
         /*
          * Nobody should be changing or seriously looking at
@@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
                 ug->nr_memory += nr_pages;
         ug->pgpgout++;
 
+        WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
         folio->memcg_data = 0;
 }
 
@@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 
         /* Transfer the charge and the css ref */
         commit_charge(new, memcg);
+
+        /* Warning should never happen, so don't worry about refcount non-0 */
+        WARN_ON_ONCE(folio_unqueue_deferred_split(old));
         old->memcg_data = 0;
 }
 
@@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
         VM_BUG_ON_FOLIO(oldid, folio);
         mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
+        folio_unqueue_deferred_split(folio);
         folio->memcg_data = 0;
 
         if (!mem_cgroup_is_root(memcg))
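Editor's note: the mem_cgroup_swapout() hunk is the ordering fix the commit message describes: unqueue while folio->memcg_data still names the memcg whose lock guards the list, and only then clear it. Below is a self-contained model of why the old order picked the wrong lock; the struct names are simplified stand-ins, not kernel types.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct split_queue { int lock_id; };            /* stands in for a spinlock */
struct memcg { struct split_queue queue; };
struct folio { struct memcg *memcg; int on_deferred_list; };

static struct split_queue pgdat_queue = { .lock_id = 0 };

/* Kernel rule: the deferred split queue (and so its lock) is chosen
 * by the folio's current memcg, falling back to the pgdat's queue. */
static struct split_queue *get_queue(struct folio *f)
{
        return f->memcg ? &f->memcg->queue : &pgdat_queue;
}

int main(void)
{
        struct memcg cg = { .queue = { .lock_id = 1 } };
        struct folio f = { .memcg = &cg, .on_deferred_list = 1 };
        struct split_queue *queued_on = get_queue(&f);

        /* Old (buggy) order in mem_cgroup_swapout(): clear memcg first... */
        f.memcg = NULL;
        /* ...so a later unqueue at free time picks the pgdat's lock, while
         * the entry still sits on the memcg's list: list corruption. */
        assert(get_queue(&f) != queued_on);

        /* Fixed order: unqueue under the memcg's lock, then clear memcg. */
        f.memcg = &cg;
        assert(get_queue(&f) == queued_on);
        f.on_deferred_list = 0;         /* folio_unqueue_deferred_split() */
        f.memcg = NULL;                 /* folio->memcg_data = 0 */

        printf("unqueue-before-clearing uses the matching lock\n");
        return 0;
}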

mm/migrate.c

Lines changed: 2 additions & 2 deletions
@@ -490,7 +490,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
             folio_test_large_rmappable(folio)) {
                 if (!folio_ref_freeze(folio, expected_count))
                         return -EAGAIN;
-                folio_undo_large_rmappable(folio);
+                folio_unqueue_deferred_split(folio);
                 folio_ref_unfreeze(folio, expected_count);
         }
 
@@ -515,7 +515,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
         }
 
         /* Take off deferred split queue while frozen and memcg set */
-        folio_undo_large_rmappable(folio);
+        folio_unqueue_deferred_split(folio);
 
         /*
          * Now we know that no one else is looking at the folio:

mm/page_alloc.c

Lines changed: 0 additions & 1 deletion
@@ -2681,7 +2681,6 @@ void free_unref_folios(struct folio_batch *folios)
                 unsigned long pfn = folio_pfn(folio);
                 unsigned int order = folio_order(folio);
 
-                folio_undo_large_rmappable(folio);
                 if (!free_pages_prepare(&folio->page, order))
                         continue;
                 /*

mm/swap.c

Lines changed: 2 additions & 2 deletions
@@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
         }
 
         page_cache_release(folio);
-        folio_undo_large_rmappable(folio);
+        folio_unqueue_deferred_split(folio);
         mem_cgroup_uncharge(folio);
         free_unref_page(&folio->page, folio_order(folio));
 }
@@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
                         free_huge_folio(folio);
                         continue;
                 }
-                folio_undo_large_rmappable(folio);
+                folio_unqueue_deferred_split(folio);
                 __page_cache_release(folio, &lruvec, &flags);
 
                 if (j != i)

mm/vmscan.c

Lines changed: 2 additions & 2 deletions
@@ -1476,7 +1476,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
                  */
                 nr_reclaimed += nr_pages;
 
-                folio_undo_large_rmappable(folio);
+                folio_unqueue_deferred_split(folio);
                 if (folio_batch_add(&free_folios, folio) == 0) {
                         mem_cgroup_uncharge_folios(&free_folios);
                         try_to_unmap_flush();
@@ -1864,7 +1864,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
                 if (unlikely(folio_put_testzero(folio))) {
                         __folio_clear_lru_flags(folio);
 
-                        folio_undo_large_rmappable(folio);
+                        folio_unqueue_deferred_split(folio);
                         if (folio_batch_add(&free_folios, folio) == 0) {
                                 spin_unlock_irq(&lruvec->lru_lock);
                                 mem_cgroup_uncharge_folios(&free_folios);
