
Commit 1d65b77

Hugh Dickins authored and akpm00 committed
mm/khugepaged: retract_page_tables() without mmap or vma lock
Simplify shmem and file THP collapse's retract_page_tables(), and relax its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of target_mm, leave that part of the work to madvise_collapse() calling collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s result code to arrange for that. That spares retract_page_tables() four arguments; and since it will be successful in retracting all of the page tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree, but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk() allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for THPs. retract_page_tables() just needs to use those same spinlocks to exclude it briefly, while transitioning pmd from page table to none: so restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail: so retract_page_tables() now has no use for mmap_write_trylock() or vma_try_start_write(). In common with rmap and page_vma_mapped_walk(), it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table, until they unlock and rcu_read_unlock(): so the page table cannot be freed immediately, but rather by the recently added pte_free_defer().

Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt when PAE, and pmdp_collapse_flush() did not already do so: to make sure that the start,pmdp_get_lockless(),end sequence in __pte_offset_map() cannot pick up a pmd entry with mismatched pmd_low and pmd_high.

retract_page_tables() can be enhanced to replace_page_tables(), which inserts the final huge pmd without mmap lock: going through an invalid state instead of pmd_none() followed by fault. But that enhancement does raise some more questions: leave it until a later release.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Russell King <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zack Rusin <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
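For orientation before reading the diff: the per-vma body of the new retract_page_tables() boils down to the sequence below. This is a condensed sketch of the patch hunk further down, not a verbatim copy; the anon_vma, alignment, mm-exit and uffd-wp pre-checks that precede it are elided, the declarations are omitted, and the comments are added here purely as explanation.

        /* condensed sketch of the new per-vma retraction path */
        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
                                addr, addr + HPAGE_PMD_SIZE);
        mmu_notifier_invalidate_range_start(&range);

        pml = pmd_lock(mm, pmd);                /* pmd_lock taken first ... */
        ptl = pte_lockptr(mm, pmd);
        if (ptl != pml)                         /* ... with pte lock nested inside */
                spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

        /* re-check the racy anon_vma and uffd-wp conditions under the spinlocks */
        if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
                skipped_uffd = true;
        } else {
                pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);  /* pmd: page table -> none */
                pmdp_get_lockless_sync();       /* guard PAE readers of pmd_low/pmd_high */
        }

        if (ptl != pml)
                spin_unlock(ptl);
        spin_unlock(pml);

        mmu_notifier_invalidate_range_end(&range);

        if (!skipped_uffd) {
                mm_dec_nr_ptes(mm);
                page_table_check_pte_clear_range(mm, addr, pgt_pmd);
                pte_free_defer(mm, pmd_pgtable(pgt_pmd));       /* deferred free: lockless walkers may still be reading it */
        }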
1 parent 13cf577 commit 1d65b77

File tree

1 file changed: +69 −103 lines

mm/khugepaged.c

Lines changed: 69 additions & 103 deletions
@@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
                 break;
         case SCAN_PMD_NONE:
                 /*
-                 * In MADV_COLLAPSE path, possible race with khugepaged where
-                 * all pte entries have been removed and pmd cleared. If so,
-                 * skip all the pte checks and just update the pmd mapping.
+                 * All pte entries have been removed and pmd cleared.
+                 * Skip all the pte checks and just update the pmd mapping.
                  */
                 goto maybe_install_pmd;
         default:
@@ -1750,123 +1749,88 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
         mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-                               struct mm_struct *target_mm,
-                               unsigned long target_addr, struct page *hpage,
-                               struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
         struct vm_area_struct *vma;
-        int target_result = SCAN_FAIL;
 
-        i_mmap_lock_write(mapping);
+        i_mmap_lock_read(mapping);
         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-                int result = SCAN_FAIL;
-                struct mm_struct *mm = NULL;
-                unsigned long addr = 0;
-                pmd_t *pmd;
-                bool is_target = false;
+                struct mmu_notifier_range range;
+                struct mm_struct *mm;
+                unsigned long addr;
+                pmd_t *pmd, pgt_pmd;
+                spinlock_t *pml;
+                spinlock_t *ptl;
+                bool skipped_uffd = false;
 
                 /*
                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-                 * got written to. These VMAs are likely not worth investing
-                 * mmap_write_lock(mm) as PMD-mapping is likely to be split
-                 * later.
-                 *
-                 * Note that vma->anon_vma check is racy: it can be set up after
-                 * the check but before we took mmap_lock by the fault path.
-                 * But page lock would prevent establishing any new ptes of the
-                 * page, so we are safe.
-                 *
-                 * An alternative would be drop the check, but check that page
-                 * table is clear before calling pmdp_collapse_flush() under
-                 * ptl. It has higher chance to recover THP for the VMA, but
-                 * has higher cost too. It would also probably require locking
-                 * the anon_vma.
+                 * got written to. These VMAs are likely not worth removing
+                 * page tables from, as PMD-mapping is likely to be split later.
                  */
-                if (READ_ONCE(vma->anon_vma)) {
-                        result = SCAN_PAGE_ANON;
-                        goto next;
-                }
+                if (READ_ONCE(vma->anon_vma))
+                        continue;
+
                 addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
                 if (addr & ~HPAGE_PMD_MASK ||
-                    vma->vm_end < addr + HPAGE_PMD_SIZE) {
-                        result = SCAN_VMA_CHECK;
-                        goto next;
-                }
+                    vma->vm_end < addr + HPAGE_PMD_SIZE)
+                        continue;
+
                 mm = vma->vm_mm;
-                is_target = mm == target_mm && addr == target_addr;
-                result = find_pmd_or_thp_or_none(mm, addr, &pmd);
-                if (result != SCAN_SUCCEED)
-                        goto next;
+                if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
+                        continue;
+
+                if (hpage_collapse_test_exit(mm))
+                        continue;
                 /*
-                 * We need exclusive mmap_lock to retract page table.
-                 *
-                 * We use trylock due to lock inversion: we need to acquire
-                 * mmap_lock while holding page lock. Fault path does it in
-                 * reverse order. Trylock is a way to avoid deadlock.
-                 *
-                 * Also, it's not MADV_COLLAPSE's job to collapse other
-                 * mappings - let khugepaged take care of them later.
+                 * When a vma is registered with uffd-wp, we cannot recycle
+                 * the page table because there may be pte markers installed.
+                 * Other vmas can still have the same file mapped hugely, but
+                 * skip this one: it will always be mapped in small page size
+                 * for uffd-wp registered ranges.
                  */
-                result = SCAN_PTE_MAPPED_HUGEPAGE;
-                if ((cc->is_khugepaged || is_target) &&
-                    mmap_write_trylock(mm)) {
-                        /* trylock for the same lock inversion as above */
-                        if (!vma_try_start_write(vma))
-                                goto unlock_next;
+                if (userfaultfd_wp(vma))
+                        continue;
 
-                        /*
-                         * Re-check whether we have an ->anon_vma, because
-                         * collapse_and_free_pmd() requires that either no
-                         * ->anon_vma exists or the anon_vma is locked.
-                         * We already checked ->anon_vma above, but that check
-                         * is racy because ->anon_vma can be populated under the
-                         * mmap lock in read mode.
-                         */
-                        if (vma->anon_vma) {
-                                result = SCAN_PAGE_ANON;
-                                goto unlock_next;
-                        }
-                        /*
-                         * When a vma is registered with uffd-wp, we can't
-                         * recycle the pmd pgtable because there can be pte
-                         * markers installed. Skip it only, so the rest mm/vma
-                         * can still have the same file mapped hugely, however
-                         * it'll always mapped in small page size for uffd-wp
-                         * registered ranges.
-                         */
-                        if (hpage_collapse_test_exit(mm)) {
-                                result = SCAN_ANY_PROCESS;
-                                goto unlock_next;
-                        }
-                        if (userfaultfd_wp(vma)) {
-                                result = SCAN_PTE_UFFD_WP;
-                                goto unlock_next;
-                        }
-                        collapse_and_free_pmd(mm, vma, addr, pmd);
-                        if (!cc->is_khugepaged && is_target)
-                                result = set_huge_pmd(vma, addr, pmd, hpage);
-                        else
-                                result = SCAN_SUCCEED;
+                /* PTEs were notified when unmapped; but now for the PMD? */
+                mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+                                        addr, addr + HPAGE_PMD_SIZE);
+                mmu_notifier_invalidate_range_start(&range);
+
+                pml = pmd_lock(mm, pmd);
+                ptl = pte_lockptr(mm, pmd);
+                if (ptl != pml)
+                        spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
-unlock_next:
-                        mmap_write_unlock(mm);
-                        goto next;
-                }
                 /*
-                 * Calling context will handle target mm/addr. Otherwise, let
-                 * khugepaged try again later.
+                 * Huge page lock is still held, so normally the page table
+                 * must remain empty; and we have already skipped anon_vma
+                 * and userfaultfd_wp() vmas. But since the mmap_lock is not
+                 * held, it is still possible for a racing userfaultfd_ioctl()
+                 * to have inserted ptes or markers. Now that we hold ptlock,
+                 * repeating the anon_vma check protects from one category,
+                 * and repeating the userfaultfd_wp() check from another.
                  */
-                if (!is_target) {
-                        khugepaged_add_pte_mapped_thp(mm, addr);
-                        continue;
+                if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
+                        skipped_uffd = true;
+                } else {
+                        pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
+                        pmdp_get_lockless_sync();
+                }
+
+                if (ptl != pml)
+                        spin_unlock(ptl);
+                spin_unlock(pml);
+
+                mmu_notifier_invalidate_range_end(&range);
+
+                if (!skipped_uffd) {
+                        mm_dec_nr_ptes(mm);
+                        page_table_check_pte_clear_range(mm, addr, pgt_pmd);
+                        pte_free_defer(mm, pmd_pgtable(pgt_pmd));
                 }
-next:
-                if (is_target)
-                        target_result = result;
         }
-        i_mmap_unlock_write(mapping);
-        return target_result;
+        i_mmap_unlock_read(mapping);
 }
 
 /**
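The hunk above is the writer side of the new scheme. The reader side that makes it safe is what the commit message refers to: users of pte_offset_map_lock() and friends now tolerate failure, so retract_page_tables() needs neither mmap_lock nor i_mmap_lock_write() to exclude them. A minimal illustration of that caller pattern (not part of this patch; the surrounding logic is hypothetical):

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return false;   /* page table gone or pmd changed: caller retries or gives up */
        /* ... examine or modify ptes ... */
        pte_unmap_unlock(pte, ptl);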
@@ -2260,9 +2224,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
                 /*
                  * Remove pte page tables, so we can re-fault the page as huge.
+                 * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
                  */
-                result = retract_page_tables(mapping, start, mm, addr, hpage,
-                                             cc);
+                retract_page_tables(mapping, start);
+                if (cc && !cc->is_khugepaged)
+                        result = SCAN_PTE_MAPPED_HUGEPAGE;
                 unlock_page(hpage);
 
                 /*
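As the commit message explains, the MADV_COLLAPSE case no longer installs the huge pmd inside retract_page_tables(); collapse_file() instead reports SCAN_PTE_MAPPED_HUGEPAGE so that madvise_collapse() calls collapse_pte_mapped_thp() afterwards. Roughly, the caller-side handling looks like the sketch below (this switch case is not part of this diff, and the lock mode shown is an assumption of the sketch):

        case SCAN_PTE_MAPPED_HUGEPAGE:
                /* pmd was cleared by retract_page_tables(); install the huge pmd now */
                mmap_read_lock(mm);
                result = collapse_pte_mapped_thp(mm, addr, true);
                mmap_read_unlock(mm);
                break;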
