
Commit aaf14e4

Michal Hocko authored and torvalds committed
mm, hugetlb: unclutter hugetlb allocation layers
Patch series "mm, hugetlb: allow proper node fallback dequeue".

While working on a hugetlb migration issue addressed in a separate patchset [1] I have noticed that the hugetlb allocations from the preallocated pool are quite suboptimal.

[1] //lkml.kernel.org/r/[email protected]

There is no fallback mechanism implemented and no notion of a preferred node. I have tried to work around it, but Vlastimil was right to push back for a more robust solution. It seems that such a solution is to reuse the zonelist approach we use for the page allocator.

This series has 3 patches. The first one tries to make the hugetlb allocation layers clearer. The second one implements the zonelist hugetlb pool allocation and introduces a preferred node semantic which is used by the migration callbacks. The last patch is a cleanup.

This patch (of 3):

The hugetlb allocation path for fresh huge pages is unnecessarily complex and it mixes different interfaces between layers.

__alloc_buddy_huge_page is the central place to perform a new allocation. It checks for the hugetlb overcommit and then relies on __hugetlb_alloc_buddy_huge_page to invoke the page allocator. This is all good except that __alloc_buddy_huge_page pushes vma and address down the callchain, and so __hugetlb_alloc_buddy_huge_page has to deal with two different allocation modes - one for memory policy requests and the other for node specific (or, to make it more obscure, node non-specific) requests. This just screams for a reorganization.

This patch pulls all the vma specific handling up to __alloc_buddy_huge_page_with_mpol where it belongs. __alloc_buddy_huge_page gets a nodemask argument and __hugetlb_alloc_buddy_huge_page becomes a trivial wrapper over the page allocator.

In short:
__alloc_buddy_huge_page_with_mpol - memory policy handling
__alloc_buddy_huge_page - overcommit handling and accounting
__hugetlb_alloc_buddy_huge_page - page allocator layer

Also note that the cpuset retry loop in __hugetlb_alloc_buddy_huge_page is not really needed, because the page allocator already handles cpuset updates.

Finally, __hugetlb_alloc_buddy_huge_page had a special case for node specific allocations (when no policy is applied and a node is given). It relied on __GFP_THISNODE to prevent fallback to a different node. alloc_huge_page_node is the only caller which relies on this behavior, so move the __GFP_THISNODE there.

Not only does this remove quite some code, it should also make those layers easier to follow and clearer with respect to their responsibilities.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Tested-by: Mike Kravetz <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
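To make the resulting layering easier to picture, here is a small, self-contained userspace C sketch (not kernel code) that mirrors the three-layer split described above: policy resolution on top, overcommit accounting in the middle, and a thin wrapper over the allocator at the bottom. The types (gfp_t, nodemask_t, struct hstate, struct page), the malloc()-backed "allocator", the single global overcommit counter, and the alloc_with_policy() entry point are all simplified stand-ins invented for this illustration; the real logic, locking and per-node accounting are in mm/hugetlb.c as shown in the diff below.

/*
 * Userspace sketch of the post-patch hugetlb allocation layering.
 * All types and bookkeeping are heavily simplified stand-ins.
 */
#include <stdio.h>
#include <stdlib.h>

#define NUMA_NO_NODE (-1)

typedef unsigned int gfp_t;       /* stand-in for the kernel's gfp_t */
typedef unsigned long nodemask_t; /* stand-in: one bit per node */

struct page { int nid; };
struct hstate {
        int surplus_huge_pages;       /* pages allocated beyond the pool */
        int nr_overcommit_huge_pages; /* how far overcommit may go */
        gfp_t alloc_mask;
};

/* Bottom layer: a trivial wrapper over the "page allocator". */
static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
                gfp_t gfp_mask, int nid, nodemask_t *nmask)
{
        struct page *page = malloc(sizeof(*page));

        if (page)
                page->nid = (nid == NUMA_NO_NODE) ? 0 : nid;
        return page;
}

/* Middle layer: overcommit accounting around the allocator call. */
static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
                int nid, nodemask_t *nmask)
{
        struct page *page;

        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
                return NULL;             /* overcommit limit reached */
        h->surplus_huge_pages++;

        page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
        if (!page)
                h->surplus_huge_pages--; /* roll the accounting back */
        return page;
}

/* Top layer: resolve a (fake) policy to a preferred node and nodemask. */
static struct page *alloc_with_policy(struct hstate *h, int preferred_nid)
{
        nodemask_t allowed = ~0UL;       /* "all nodes allowed" */

        return __alloc_buddy_huge_page(h, h->alloc_mask, preferred_nid,
                                       &allowed);
}

int main(void)
{
        struct hstate h = { .surplus_huge_pages = 0,
                            .nr_overcommit_huge_pages = 1, .alloc_mask = 0 };
        struct page *page = alloc_with_policy(&h, 0);

        printf("first allocation:  %s\n", page ? "ok" : "failed");
        printf("second allocation: %s\n",
               alloc_with_policy(&h, 0) ? "ok" : "failed");
        free(page);
        return 0;
}

In this toy version the second allocation fails because the overcommit limit is already reached; the point is only that each layer has a single job, which is the responsibility split the patch establishes.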
1 parent 422580c commit aaf14e4

2 files changed: +30, -105 lines


include/linux/hugetlb.h

Lines changed: 1 addition & 1 deletion
@@ -349,7 +349,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
                                unsigned long addr, int avoid_reserve);
-struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask);
+struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
                        pgoff_t idx);
 

mm/hugetlb.c

Lines changed: 29 additions & 104 deletions
@@ -1521,98 +1521,26 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
         return rc;
 }
 
-/*
- * There are 3 ways this can get called:
- * 1. With vma+addr: we use the VMA's memory policy
- * 2. With !vma, but nid=NUMA_NO_NODE: We try to allocate a huge
- *    page from any node, and let the buddy allocator itself figure
- *    it out.
- * 3. With !vma, but nid!=NUMA_NO_NODE. We allocate a huge page
- *    strictly from 'nid'
- */
 static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
-                struct vm_area_struct *vma, unsigned long addr, int nid)
+                gfp_t gfp_mask, int nid, nodemask_t *nmask)
 {
         int order = huge_page_order(h);
-        gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
-        unsigned int cpuset_mems_cookie;
 
-        /*
-         * We need a VMA to get a memory policy. If we do not
-         * have one, we use the 'nid' argument.
-         *
-         * The mempolicy stuff below has some non-inlined bits
-         * and calls ->vm_ops. That makes it hard to optimize at
-         * compile-time, even when NUMA is off and it does
-         * nothing. This helps the compiler optimize it out.
-         */
-        if (!IS_ENABLED(CONFIG_NUMA) || !vma) {
-                /*
-                 * If a specific node is requested, make sure to
-                 * get memory from there, but only when a node
-                 * is explicitly specified.
-                 */
-                if (nid != NUMA_NO_NODE)
-                        gfp |= __GFP_THISNODE;
-                /*
-                 * Make sure to call something that can handle
-                 * nid=NUMA_NO_NODE
-                 */
-                return alloc_pages_node(nid, gfp, order);
-        }
-
-        /*
-         * OK, so we have a VMA. Fetch the mempolicy and try to
-         * allocate a huge page with it. We will only reach this
-         * when CONFIG_NUMA=y.
-         */
-        do {
-                struct page *page;
-                struct mempolicy *mpol;
-                int nid;
-                nodemask_t *nodemask;
-
-                cpuset_mems_cookie = read_mems_allowed_begin();
-                nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
-                mpol_cond_put(mpol);
-                page = __alloc_pages_nodemask(gfp, order, nid, nodemask);
-                if (page)
-                        return page;
-        } while (read_mems_allowed_retry(cpuset_mems_cookie));
-
-        return NULL;
+        gfp_mask |= __GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
+        if (nid == NUMA_NO_NODE)
+                nid = numa_mem_id();
+        return __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
 }
 
-/*
- * There are two ways to allocate a huge page:
- * 1. When you have a VMA and an address (like a fault)
- * 2. When you have no VMA (like when setting /proc/.../nr_hugepages)
- *
- * 'vma' and 'addr' are only for (1). 'nid' is always NUMA_NO_NODE in
- * this case which signifies that the allocation should be done with
- * respect for the VMA's memory policy.
- *
- * For (2), we ignore 'vma' and 'addr' and use 'nid' exclusively. This
- * implies that memory policies will not be taken in to account.
- */
-static struct page *__alloc_buddy_huge_page(struct hstate *h,
-                struct vm_area_struct *vma, unsigned long addr, int nid)
+static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
+                int nid, nodemask_t *nmask)
 {
         struct page *page;
         unsigned int r_nid;
 
         if (hstate_is_gigantic(h))
                 return NULL;
 
-        /*
-         * Make sure that anyone specifying 'nid' is not also specifying a VMA.
-         * This makes sure the caller is picking _one_ of the modes with which
-         * we can call this function, not both.
-         */
-        if (vma || (addr != -1)) {
-                VM_WARN_ON_ONCE(addr == -1);
-                VM_WARN_ON_ONCE(nid != NUMA_NO_NODE);
-        }
         /*
          * Assume we will successfully allocate the surplus page to
          * prevent racing processes from causing the surplus to exceed
@@ -1646,7 +1574,7 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h,
         }
         spin_unlock(&hugetlb_lock);
 
-        page = __hugetlb_alloc_buddy_huge_page(h, vma, addr, nid);
+        page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
 
         spin_lock(&hugetlb_lock);
         if (page) {
@@ -1670,27 +1598,24 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h,
         return page;
 }
 
-/*
- * Allocate a huge page from 'nid'. Note, 'nid' may be
- * NUMA_NO_NODE, which means that it may be allocated
- * anywhere.
- */
-static
-struct page *__alloc_buddy_huge_page_no_mpol(struct hstate *h, int nid)
-{
-        unsigned long addr = -1;
-
-        return __alloc_buddy_huge_page(h, NULL, addr, nid);
-}
-
 /*
  * Use the VMA's mpolicy to allocate a huge page from the buddy.
  */
 static
 struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
                 struct vm_area_struct *vma, unsigned long addr)
 {
-        return __alloc_buddy_huge_page(h, vma, addr, NUMA_NO_NODE);
+        struct page *page;
+        struct mempolicy *mpol;
+        gfp_t gfp_mask = htlb_alloc_mask(h);
+        int nid;
+        nodemask_t *nodemask;
+
+        nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
+        page = __alloc_buddy_huge_page(h, gfp_mask, nid, nodemask);
+        mpol_cond_put(mpol);
+
+        return page;
 }
 
 /*
@@ -1700,21 +1625,26 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
  */
 struct page *alloc_huge_page_node(struct hstate *h, int nid)
 {
+        gfp_t gfp_mask = htlb_alloc_mask(h);
         struct page *page = NULL;
 
+        if (nid != NUMA_NO_NODE)
+                gfp_mask |= __GFP_THISNODE;
+
         spin_lock(&hugetlb_lock);
         if (h->free_huge_pages - h->resv_huge_pages > 0)
                 page = dequeue_huge_page_node(h, nid);
         spin_unlock(&hugetlb_lock);
 
         if (!page)
-                page = __alloc_buddy_huge_page_no_mpol(h, nid);
+                page = __alloc_buddy_huge_page(h, gfp_mask, nid, NULL);
 
         return page;
 }
 
-struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
+struct page *alloc_huge_page_nodemask(struct hstate *h, nodemask_t *nmask)
 {
+        gfp_t gfp_mask = htlb_alloc_mask(h);
         struct page *page = NULL;
         int node;
 
@@ -1731,13 +1661,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, const nodemask_t *nmask)
                 return page;
 
         /* No reservations, try to overcommit */
-        for_each_node_mask(node, *nmask) {
-                page = __alloc_buddy_huge_page_no_mpol(h, node);
-                if (page)
-                        return page;
-        }
-
-        return NULL;
+        return __alloc_buddy_huge_page(h, gfp_mask, NUMA_NO_NODE, nmask);
 }
 
 /*
@@ -1765,7 +1689,8 @@ static int gather_surplus_pages(struct hstate *h, int delta)
 retry:
         spin_unlock(&hugetlb_lock);
         for (i = 0; i < needed; i++) {
-                page = __alloc_buddy_huge_page_no_mpol(h, NUMA_NO_NODE);
+                page = __alloc_buddy_huge_page(h, htlb_alloc_mask(h),
+                                NUMA_NO_NODE, NULL);
                 if (!page) {
                         alloc_ok = false;
                         break;
