
Commit 78fbe90

davidhildenbrand authored and akpm00 committed
mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages
The basic question we would like to have a reliable and efficient answer to is: is this anonymous page exclusive to a single process or might it be shared? We need that information for ordinary/single pages, hugetlb pages, and possibly each subpage of a THP.

Introduce a way to mark an anonymous page as exclusive, with the ultimate goal of teaching our COW logic to not do "wrong COWs", whereby GUP pins lose consistency with the pages mapped into the page table, resulting in reported memory corruptions.

Most pageflags already have semantics for anonymous pages; however, PG_mappedtodisk should never apply to pages in the swapcache, so let's reuse that flag.

As PG_has_hwpoisoned also uses that flag on the second tail page of a compound page, convert it to PG_error instead, which is marked as PF_NO_TAIL, so never used for tail pages.

Use custom page flag modification functions such that we can do additional sanity checks. The semantics we'll put into some kernel doc in the future are:

"
PG_anon_exclusive is *usually* only expressive in combination with a page table entry. Depending on the page table entry type it might store the following information:

	Is what's mapped via this page table entry exclusive to the single
	process and can be mapped writable without further checks? If not,
	it might be shared and we might have to COW.

For now, we only expect PTE-mapped THPs to make use of PG_anon_exclusive in subpages. For other anonymous compound folios (i.e., hugetlb), only the head page is logically mapped and holds this information.

For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive set on the head page. When replacing the PMD by a page table full of PTEs, PG_anon_exclusive, if set on the head page, will be set on all tail pages accordingly. Note that converting from a PTE-mapping to a PMD mapping using the same compound page is currently not possible and consequently doesn't require care.

If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page, it should only pin if the relevant PG_anon_exclusive is set. In that case, the pin will be fully reliable and stay consistent with the pages mapped into the page table, as the bit cannot get cleared (e.g., by fork(), KSM) while the page is pinned. For anonymous pages that are mapped R/W, PG_anon_exclusive can be assumed to always be set because such pages cannot possibly be shared.

The page table lock protecting the page table entry is the primary synchronization mechanism for PG_anon_exclusive; GUP-fast that does not take the PT lock needs special care when trying to clear the flag.

Page table entry types and PG_anon_exclusive:
* Present: PG_anon_exclusive applies.
* Swap: the information is lost. PG_anon_exclusive was cleared.
* Migration: the entry holds this information instead. PG_anon_exclusive was cleared.
* Device private: PG_anon_exclusive applies.
* Device exclusive: PG_anon_exclusive applies.
* HW Poison: PG_anon_exclusive is stale and not changed.

If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is not allowed and the flag will stick around until the page is freed and folio->mapping is cleared.
"

We won't be clearing PG_anon_exclusive on destructive unmapping (i.e., zapping) of page table entries; page freeing code will handle that when also invalidating page->mapping to not indicate PageAnon() anymore. Letting information about exclusivity stick around will be an important property when adding sanity checks to unpinning code.
Note that we properly clear the flag in free_pages_prepare() via PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page, so there is no need to manually clear the flag.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Don Dutile <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Liang Zhang <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Oded Gabbay <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pedro Demarchi Gomes <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
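To make the documented semantics concrete, here is a minimal illustrative sketch (not code from this series; both helper names are hypothetical) of how COW and GUP paths could consult the flag for a present pte, with the page table lock held:

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Hypothetical helpers, for illustration only. Both assume the PT lock
 * is held and that @page is an anonymous page mapped by a present pte.
 */
static inline bool anon_page_needs_cow_on_write(struct page *page)
{
	/*
	 * Exclusive pages can be mapped writable without further checks;
	 * everything else might be shared and must be copied first.
	 */
	return !PageAnonExclusive(page);
}

static inline bool anon_page_pin_is_reliable(struct page *page)
{
	/*
	 * GUP (FOLL_PIN) should only pin if the bit is set: then the bit
	 * cannot get cleared (e.g., by fork() or KSM) while pinned.
	 */
	return PageAnonExclusive(page);
}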
1 parent 5005394 commit 78fbe90

File tree: 6 files changed, +71 −2 lines


include/linux/page-flags.h

Lines changed: 38 additions & 1 deletion
@@ -142,6 +142,15 @@ enum pageflags {

	PG_readahead = PG_reclaim,

+	/*
+	 * Depending on the way an anonymous folio can be mapped into a page
+	 * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
+	 * THP), PG_anon_exclusive may be set only for the head page or for
+	 * tail pages of an anonymous folio. For now, we only expect it to be
+	 * set on tail pages for PTE-mapped THP.
+	 */
+	PG_anon_exclusive = PG_mappedtodisk,
+
	/* Filesystems */
	PG_checked = PG_owner_priv_1,

@@ -176,7 +185,7 @@ enum pageflags {
	 * Indicates that at least one subpage is hwpoisoned in the
	 * THP.
	 */
-	PG_has_hwpoisoned = PG_mappedtodisk,
+	PG_has_hwpoisoned = PG_error,
 #endif

	/* non-lru isolated movable page */
@@ -1002,6 +1011,34 @@ extern bool is_free_buddy_page(struct page *page);

 PAGEFLAG(Isolated, isolated, PF_ANY);

+static __always_inline int PageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void SetPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	set_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void ClearPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page) || PageKsm(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
+static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
+}
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1UL << PG_mlocked)
 #else
mm/hugetlb.c

Lines changed: 2 additions & 0 deletions
@@ -1677,6 +1677,8 @@ void free_huge_page(struct page *page)
	VM_BUG_ON_PAGE(page_mapcount(page), page);

	hugetlb_set_page_subpool(page, NULL);
+	if (PageAnon(page))
+		__ClearPageAnonExclusive(page);
	page->mapping = NULL;
	restore_reserve = HPageRestoreReserve(page);
	ClearHPageRestoreReserve(page);
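Note the ordering in this hunk: __ClearPageAnonExclusive() sanity-checks PageAnon(page), so the flag must be cleared while page->mapping still identifies the page as anonymous, i.e., before the page->mapping = NULL assignment that follows. The non-atomic variant suffices here because the page is being freed and nobody else can concurrently modify its flags.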

mm/memory.c

Lines changed: 11 additions & 0 deletions
@@ -3667,6 +3667,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
		goto out_nomap;
	}

+	/*
+	 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
+	 * must never point at an anonymous page in the swapcache that is
+	 * PG_anon_exclusive. Sanity check that this holds and especially, that
+	 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
+	 * check after taking the PT lock and making sure that nobody
+	 * concurrently faulted in this page and set PG_anon_exclusive.
+	 */
+	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
	/*
	 * Remove the swap entry and conditionally try to free up the swapcache.
	 * We're already holding a reference on the page but haven't mapped it
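For context on why the asserted invariant holds: as the commit message's table notes, exclusivity information is lost for swap entries. A hedged sketch of that swap-out rule (the real clearing belongs to the rmap/unmap path, which this patch does not touch):

#include <linux/page-flags.h>

/*
 * Sketch only: a swap pte cannot encode exclusivity, so the flag is
 * dropped before a present pte is replaced by a swap entry. Any page
 * faulted back in from the swapcache is therefore !PageAnonExclusive,
 * which is exactly what the BUG_ON()s above assert.
 */
static void drop_anon_exclusive_for_swapout(struct page *page)
{
	if (PageAnonExclusive(page))
		ClearPageAnonExclusive(page);
}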

mm/memremap.c

Lines changed: 9 additions & 0 deletions
@@ -459,6 +459,15 @@ void free_zone_device_page(struct page *page)

	mem_cgroup_uncharge(page_folio(page));

+	/*
+	 * Note: we don't expect anonymous compound pages yet. Once supported
+	 * and we could PTE-map them similar to THP, we'd have to clear
+	 * PG_anon_exclusive on all tail pages.
+	 */
+	VM_BUG_ON_PAGE(PageAnon(page) && PageCompound(page), page);
+	if (PageAnon(page))
+		__ClearPageAnonExclusive(page);
+
	/*
	 * When a device managed page is freed, the page->mapping field
	 * may still contain a (stale) mapping value. For example, the
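Per the commit message's table, "Device private: PG_anon_exclusive applies", so device pages keep their exclusivity information while mapped and the flag is only dropped on this free path. As in free_huge_page(), the non-atomic __ClearPageAnonExclusive() suffices because a page being freed has no concurrent flag modifiers.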

mm/swapfile.c

Lines changed: 4 additions & 0 deletions
@@ -1796,6 +1796,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
		goto out;
	}

+	/* See do_swap_page() */
+	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
+	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
+
	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
	get_page(page);

tools/vm/page-types.c

Lines changed: 7 additions & 1 deletion
@@ -80,9 +80,10 @@
 #define KPF_SOFTDIRTY		40
 #define KPF_ARCH_2		41

-/* [48-] take some arbitrary free slots for expanding overloaded flags
+/* [47-] take some arbitrary free slots for expanding overloaded flags
  * not part of kernel API
  */
+#define KPF_ANON_EXCLUSIVE	47
 #define KPF_READAHEAD		48
 #define KPF_SLOB_FREE		49
 #define KPF_SLUB_FROZEN		50
@@ -138,6 +139,7 @@ static const char * const page_flag_names[] = {
	[KPF_SOFTDIRTY]		= "f:softdirty",
	[KPF_ARCH_2]		= "H:arch_2",

+	[KPF_ANON_EXCLUSIVE]	= "d:anon_exclusive",
	[KPF_READAHEAD]		= "I:readahead",
	[KPF_SLOB_FREE]		= "P:slob_free",
	[KPF_SLUB_FROZEN]	= "A:slub_frozen",
@@ -472,6 +474,10 @@ static int bit_mask_ok(uint64_t flags)

 static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
 {
+	/* Anonymous pages overload PG_mappedtodisk */
+	if ((flags & BIT(ANON)) && (flags & BIT(MAPPEDTODISK)))
+		flags ^= BIT(MAPPEDTODISK) | BIT(ANON_EXCLUSIVE);
+
	/* SLOB/SLUB overload several page flags */
	if (flags & BIT(SLAB)) {
		if (flags & BIT(PRIVATE))
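The XOR in expand_overloaded_flags() works because the guard guarantees MAPPEDTODISK is set while the synthetic ANON_EXCLUSIVE bit (slot 47, not part of the kernel API) is never reported by the kernel, so a single XOR clears the former and sets the latter. A standalone userspace demonstration, with flag values as defined in page-types.c and the kpageflags ABI:

#include <assert.h>
#include <stdint.h>

#define BIT(name)		(1ULL << KPF_##name)
#define KPF_ANON		12	/* stable kpageflags ABI */
#define KPF_MAPPEDTODISK	34	/* kernel-hacking flag in page-types.c */
#define KPF_ANON_EXCLUSIVE	47	/* synthetic slot added by this patch */

int main(void)
{
	uint64_t flags = BIT(ANON) | BIT(MAPPEDTODISK);

	/* Same transformation as in expand_overloaded_flags(). */
	if ((flags & BIT(ANON)) && (flags & BIT(MAPPEDTODISK)))
		flags ^= BIT(MAPPEDTODISK) | BIT(ANON_EXCLUSIVE);

	assert(!(flags & BIT(MAPPEDTODISK)));	/* XOR cleared the set bit */
	assert(flags & BIT(ANON_EXCLUSIVE));	/* XOR set the clear bit */
	return 0;
}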
