Commit edf445a

Merge branch 'hugepage-fallbacks' (hugepage patches from David Rientjes)
Merge hugepage allocation updates from David Rientjes:

 "We (mostly Linus, Andrea, and myself) have been discussing offlist how to implement a sane default allocation strategy for hugepages on NUMA platforms.

  With these reverts in place, the page allocator will happily allocate a remote hugepage immediately rather than try to make a local hugepage available. This incurs a substantial performance degradation when memory compaction would have otherwise made a local hugepage available.

  This series reverts those reverts and attempts to propose a more sane default allocation strategy specifically for hugepages. Andrea acknowledges this is likely to fix the swap storms that he originally reported, which resulted in the patches that removed __GFP_THISNODE from hugepage allocations.

  The immediate goal is to return 5.3 to the behavior the kernel has implemented over the past several years, so that remote hugepages are not immediately allocated when local hugepages could have been made available, because the increased access latency is untenable.

  The next goal is to introduce a sane default allocation strategy for hugepage allocations in general, regardless of the configuration of the system, so that we prevent thrashing of local memory when compaction is unlikely to succeed and can prefer remote hugepages over remote native pages when the local node is low on memory."

Note on timing: this reverts the hugepage VM behavior changes that got introduced fairly late in the 5.3 cycle, and that fixed a huge performance regression for certain loads that had been around since 4.18.

Andrea had this note:

 "The regression of 4.18 was that it was taking hours to start a VM where 3.10 was only taking a few seconds. I reported all the details on lkml when it was finally tracked down in August 2018.

  https://lore.kernel.org/linux-mm/[email protected]/

  __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio workload degrade like in the "current upstream" above. And it still would have been that bad as above until 5.3-rc5"

where the bad behavior ends up happening as you fill up a local node, and without that change, you'd get into the nasty swap storm behavior due to compaction working overtime to make room for more memory on the nodes.

As a result 5.3 got the two performance fix reverts in rc5.

However, David Rientjes then noted that those performance fixes in turn regressed performance for other loads - although not quite to the same degree. He suggested reverting the reverts and instead replacing them with two small changes to how hugepage allocations are done (patch descriptions rephrased by me):

 - "avoid expensive reclaim when compaction may not succeed": just admit that the allocation failed when you're trying to allocate a huge-page and compaction wasn't successful.

 - "allow hugepage fallback to remote nodes when madvised": when that node-local huge-page allocation failed, retry without forcing the local node.

But by then I judged it too late to replace the fixes for a 5.3 release, so 5.3 was released with behavior that harked back to the pre-4.18 logic.

But now we're in the merge window for 5.4, and we can see if this alternate model fixes not just the horrendous swap storm behavior, but also restores the performance regression that the late reverts caused.

Fingers crossed.
* emailed patches from David Rientjes <[email protected]>:
  mm, page_alloc: allow hugepage fallback to remote nodes when madvised
  mm, page_alloc: avoid expensive reclaim when compaction may not succeed
  Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
  Revert "Revert "mm, thp: restore node-local hugepage allocations""
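For readers unfamiliar with what "madvised" means in the patch titles above, here is a minimal userspace illustration (not part of the commit; the mapping size and flags are arbitrary). Faults on the memset() below are the ones served by do_huge_pmd_anonymous_page() and the allocation paths changed in the diffs that follow:

#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 512UL << 20;	/* 512 MB of anonymous memory */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Hint that this range should be backed by transparent hugepages. */
	if (madvise(p, len, MADV_HUGEPAGE))
		return 1;

	memset(p, 0, len);	/* first-touch faults may now be handled as THP faults */
	munmap(p, len);
	return 0;
}

Whether those faults end up backed by a local hugepage, a remote hugepage, or base pages is exactly what the reverts and the two new patches decide.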
2 parents a295320 + 76e654c commit edf445a

6 files changed, 92 insertions(+), 42 deletions(-)

include/linux/gfp.h

Lines changed: 8 additions & 4 deletions

@@ -510,18 +510,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node);
+			int node, bool hugepage);
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
+	alloc_pages(gfp_mask, order)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
 	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr) \
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node) \
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);

include/linux/mempolicy.h

Lines changed: 0 additions & 2 deletions

@@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
-		unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);

mm/huge_memory.c

Lines changed: 20 additions & 31 deletions

@@ -659,40 +659,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	gfp_t this_node = 0;
-
-#ifdef CONFIG_NUMA
-	struct mempolicy *pol;
-	/*
-	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
-	 * specified, to express a general desire to stay on the current
-	 * node for optimistic allocation attempts. If the defrag mode
-	 * and/or madvise hint requires the direct reclaim then we prefer
-	 * to fallback to other node rather than node reclaim because that
-	 * can lead to excessive reclaim even though there is free memory
-	 * on other nodes. We expect that NUMA preferences are specified
-	 * by memory policies.
-	 */
-	pol = get_vma_policy(vma, addr);
-	if (pol->mode != MPOL_BIND)
-		this_node = __GFP_THISNODE;
-	mpol_cond_put(pol);
-#endif
 
+	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+
+	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+
+	/* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM | this_node);
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM :
+					__GFP_KSWAPD_RECLAIM);
+
+	/* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     this_node);
-	return GFP_TRANSHUGE_LIGHT | this_node;
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+
+	return GFP_TRANSHUGE_LIGHT;
 }
 
 /* Caller must hold page table lock. */

@@ -764,8 +754,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
+	gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;

@@ -1372,9 +1362,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (__transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-		new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
-				haddr, numa_node_id());
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
 
mm/mempolicy.c

Lines changed: 41 additions & 4 deletions

@@ -1179,8 +1179,8 @@ static struct page *new_page(struct page *page, unsigned long start)
 	} else if (PageTransHuge(page)) {
 		struct page *thp;
 
-		thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
-				address, numa_node_id());
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+					 HPAGE_PMD_ORDER);
 		if (!thp)
 			return NULL;
 		prep_transhuge_page(thp);

@@ -1732,7 +1732,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task. It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);

@@ -2081,6 +2081,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  * @vma:  Pointer to VMA or NULL if not available.
  * @addr: Virtual Address of the allocation. Must be inside the VMA.
  * @node: Which node to prefer for allocation (modulo policy).
+ * @hugepage: for hugepages try only the preferred node if possible
  *
  * This function allocates a page from the kernel page pool and applies
  * a NUMA policy associated with the VMA or the current process.

@@ -2091,7 +2092,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node)
+		unsigned long addr, int node, bool hugepage)
 {
 	struct mempolicy *pol;
 	struct page *page;

@@ -2109,6 +2110,42 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
+		int hpage_node = node;
+
+		/*
+		 * For hugepage allocation and non-interleave policy which
+		 * allows the current node (or other explicitly preferred
+		 * node) we only try to allocate from the current/preferred
+		 * node and don't fall back to other nodes, as the cost of
+		 * remote accesses would likely offset THP benefits.
+		 *
+		 * If the policy is interleave, or does not allow the current
+		 * node in its nodemask, we allocate the standard way.
+		 */
+		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+			hpage_node = pol->v.preferred_node;
+
+		nmask = policy_nodemask(gfp, pol);
+		if (!nmask || node_isset(hpage_node, *nmask)) {
+			mpol_cond_put(pol);
+			page = __alloc_pages_node(hpage_node,
+						gfp | __GFP_THISNODE, order);
+
+			/*
+			 * If hugepage allocations are configured to always
+			 * synchronous compact or the vma has been madvised
+			 * to prefer hugepage backing, retry allowing remote
+			 * memory as well.
+			 */
+			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+				page = __alloc_pages_node(hpage_node,
+						gfp | __GFP_NORETRY, order);
+
+			goto out;
+		}
+	}
+
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
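The hunk above is the heart of the series: try only the local (or policy-preferred) node with __GFP_THISNODE first, and only when that fails and the caller accepted direct reclaim (defrag=always or an MADV_HUGEPAGE vma) retry without pinning the node. Below is a simplified, userspace-compilable sketch of that decision order; every fake_* name and FAKE_GFP_* flag is a stand-in invented here for illustration, not a kernel API:

#include <stdbool.h>
#include <stdio.h>

#define FAKE_GFP_THISNODE	(1u << 0)	/* stand-in for __GFP_THISNODE */
#define FAKE_GFP_DIRECT_RECLAIM	(1u << 1)	/* stand-in for __GFP_DIRECT_RECLAIM */
#define FAKE_GFP_NORETRY	(1u << 2)	/* stand-in for __GFP_NORETRY */

/* Pretend the local node (node 0) has no free hugepages. */
static bool fake_alloc_on_node(int node, unsigned int gfp)
{
	if (node == 0 && (gfp & FAKE_GFP_THISNODE))
		return false;			/* node-local attempt fails */
	return true;				/* falling back to other nodes succeeds */
}

/* Mirrors the decision order of the alloc_pages_vma() hunk above. */
static bool fake_alloc_hugepage(int preferred_node, unsigned int gfp)
{
	/* First attempt: preferred node only, as with __GFP_THISNODE. */
	bool ok = fake_alloc_on_node(preferred_node, gfp | FAKE_GFP_THISNODE);

	/*
	 * Retry without pinning the node only if the caller accepted
	 * direct reclaim (defrag=always or MADV_HUGEPAGE), so a remote
	 * hugepage is preferred over thrashing the local node.
	 */
	if (!ok && (gfp & FAKE_GFP_DIRECT_RECLAIM))
		ok = fake_alloc_on_node(preferred_node, gfp | FAKE_GFP_NORETRY);

	return ok;
}

int main(void)
{
	printf("madvised mapping:   %s\n",
	       fake_alloc_hugepage(0, FAKE_GFP_DIRECT_RECLAIM) ? "hugepage" : "fall back to base pages");
	printf("unmadvised mapping: %s\n",
	       fake_alloc_hugepage(0, 0) ? "hugepage" : "fall back to base pages");
	return 0;
}

In the real kernel, a failed attempt means the fault falls back to base pages (THP_FAULT_FALLBACK); the madvised retry is what lets such allocations use remote hugepages instead of giving up or reclaiming aggressively on the local node.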

mm/page_alloc.c

Lines changed: 22 additions & 0 deletions

@@ -4467,6 +4467,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+		/*
+		 * If allocating entire pageblock(s) and compaction
+		 * failed because all zones are below low watermarks
+		 * or is prohibited because it recently failed at this
+		 * order, fail immediately.
+		 *
+		 * Reclaim is
+		 *  - potentially very expensive because zones are far
+		 *    below their low watermarks or this is part of very
+		 *    bursty high order allocations,
+		 *  - not guaranteed to help because isolate_freepages()
+		 *    may not iterate over freed pages as part of its
+		 *    linear scan, and
+		 *  - unlikely to make entire pageblocks free on its
+		 *    own.
+		 */
+		if (compact_result == COMPACT_SKIPPED ||
+		    compact_result == COMPACT_DEFERRED)
+			goto nopage;
+	}
+
 	/*
 	 * Checks for costly allocations with __GFP_NORETRY, which
 	 * includes THP page fault allocations

mm/shmem.c

Lines changed: 1 addition & 1 deletion

@@ -1481,7 +1481,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
 	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
+			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);
