
Commit a49bd4d

Michal Hocko authored and torvalds committed
mm, numa: rework do_pages_move
Patch series "unclutter thp migration".

Motivation:

THP migration is hacked into the generic migration path with rather surprising semantics. The migration allocation callback is supposed to check whether a THP can be migrated at once; if that is not the case, it allocates a simple order-0 page to migrate into. unmap_and_move then fixes that up by splitting the THP into small pages while moving the head page to the newly allocated order-0 page. The remaining pages are moved to the LRU list by split_huge_page. The same happens if the THP allocation fails. This is really ugly and error prone [2].

I also believe that splitting a THP onto the LRU lists is inherently wrong, because the tail pages are not migrated at all. Some callers just work around that by retrying (e.g. memory hotplug), but other pfn walkers are simply broken. For example, madvise_inject_error migrates the head and then advances the next pfn by the huge page size. do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind) simply split the THP before migration if THP migration is not supported and then fall back to single-page migration, but they do not handle the tail pages when the THP migration path cannot allocate a fresh THP, so we end up with ENOMEM and fail the whole migration, which is questionable behavior. Page compaction doesn't try to migrate large pages, so it should be immune.

The first patch reworks do_pages_move, which relies on a very ugly calling convention where the return status is pushed to the migration path via a private pointer. It uses pre-allocated, fixed-size batching to achieve that. We simply cannot do the same if a THP is to be split during the migration path, which is done in patch 3. Patch 2 is a follow-up cleanup which removes the mentioned return-status calling-convention ugliness.

On a side note: there are some semantic issues I encountered while working on patch 1, but I am not addressing them here. E.g. trying to move THP tail pages will result in either success or EBUSY (the latter more likely once we isolate the head from the LRU list). Hugetlb reports EACCES on tail pages. Some errors are reported via the status parameter, but migration failures are not, even though the original `reason' argument suggests there was an intention to do so. From a quick look into git history this never worked. I have tried to keep the semantics unchanged. Then there is a relatively minor thing: page isolation might fail because a page is not on the LRU, e.g. because it is sitting in the per-cpu LRU caches. Easily fixable.

This patch (of 3):

do_pages_move is supposed to move user-defined memory (an array of addresses) to user-defined NUMA nodes (an array of nodes, one for each address). The user-provided status array then contains the resulting NUMA node for each address, or an error. The semantics of this function are a little confusing because only some errors are reported back; notably, a migrate_pages error is only reported via the return value. This patch doesn't try to address these semantic nuances but rather changes the underlying implementation.

Currently we process the user input (which can be really large) in batches which are stored in a temporarily allocated page. Each address is resolved to its struct page and stored in a page_to_node structure along with the requested target NUMA node. The array of these structures is then conveyed down the page migration path via the private argument. new_page_node then finds the corresponding structure and allocates the proper target page.
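For context, the user-facing entry point to this path is the move_pages(2) syscall, exposed through libnuma's <numaif.h>. The following is a minimal, hypothetical usage sketch, not part of this patch: it assumes libnuma is installed and that a NUMA node 1 with memory exists, and only illustrates the addresses/nodes/status calling convention described above.

/*
 * Hypothetical demo of the move_pages(2) interface that do_pages_move backs.
 * Assumes libnuma headers and a populated node 1; build with: cc demo.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long count = 4;
	void *pages[4];
	int nodes[4] = { 1, 1, 1, 1 };	/* requested target node per address */
	int status[4];			/* filled with resulting node or -errno */
	void *mem;
	char *buf;
	unsigned long i;

	if (posix_memalign(&mem, page_size, count * page_size))
		return 1;
	buf = mem;

	for (i = 0; i < count; i++) {
		buf[i * page_size] = 1;		/* fault the page in first */
		pages[i] = buf + i * page_size;
	}

	/* pid 0 means the calling process; only the listed pages are moved */
	if (move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");

	for (i = 0; i < count; i++)
		printf("page %lu: status %d\n", i, status[i]);

	free(mem);
	return 0;
}

Note that, as the commit message points out, a failure inside migrate_pages is reported only through the syscall return value, while per-page lookup/isolation errors land in the status array.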
What is the problem with the current implementation and why change it? Apart from being quite ugly, it also doesn't cope with unexpected pages showing up on the migration list inside the migrate_pages path. That doesn't happen currently, but the follow-up patch would like to make the THP migration code clearer, and that would need to split a THP onto the list in some cases.

How does the new implementation work? Instead of batching into a fixed-size array, we simply batch all pages that should be migrated to the same node and isolate them onto a linked list, which doesn't require any additional storage. This should work reasonably well because page migration usually migrates larger ranges of memory to a specific node, so the common case should work just as well as the current implementation. Even if somebody constructs an input where the target NUMA nodes are interleaved, we shouldn't see a large performance impact, because page migration alone doesn't really benefit from batching. The mmap_sem batching for the lookup is quite questionable, and isolate_lru_page, which would benefit from batching, does not use it even in the current implementation.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrea Reale <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
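To make the batching strategy described in the commit message concrete, here is a small, self-contained userspace sketch (not kernel code; the nodes[] values and the flush_batch() helper are invented for illustration). It groups consecutive same-node requests and flushes each run in one go, mirroring how the reworked do_pages_move drains its pagelist via do_move_pages_to_node() whenever the target node changes and once more at the end.

/*
 * Userspace sketch of the per-node batching in the reworked do_pages_move:
 * consecutive requests for the same node are collected and "flushed" together,
 * roughly mirroring do_move_pages_to_node() + store_status(). Inputs invented.
 */
#include <stdio.h>

#define NUMA_NO_NODE	(-1)

/* stand-in for do_move_pages_to_node(): migrate the whole batch at once */
static void flush_batch(int node, int start, int end)
{
	printf("migrate pages [%d..%d) to node %d in one migrate_pages() call\n",
	       start, end, node);
}

int main(void)
{
	int nodes[] = { 0, 0, 0, 1, 1, 0, 2, 2, 2 };	/* requested target nodes */
	int nr_pages = sizeof(nodes) / sizeof(nodes[0]);
	int current_node = NUMA_NO_NODE;
	int start = 0, i;

	for (i = 0; i < nr_pages; i++) {
		if (current_node == NUMA_NO_NODE) {
			current_node = nodes[i];
			start = i;
		} else if (nodes[i] != current_node) {
			/* node changed: flush what has been batched so far */
			flush_batch(current_node, start, i);
			current_node = nodes[i];
			start = i;
		}
		/* the real code would add_page_for_migration() here */
	}
	if (current_node != NUMA_NO_NODE)
		flush_batch(current_node, start, i);	/* final flush (out_flush:) */

	return 0;
}

For the common case of one large range targeted at a single node this degenerates into a single flush, which is why the list-based batching loses nothing against the old fixed-size array.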
1 parent bfc6b1c commit a49bd4d

3 files changed (+139, -175 lines)

mm/internal.h

Lines changed: 1 addition & 0 deletions

@@ -538,4 +538,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 }
 
 void setup_zone_pageset(struct zone *zone);
+extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
 #endif /* __MM_INTERNAL_H */

mm/mempolicy.c

Lines changed: 3 additions & 2 deletions

@@ -942,7 +942,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 	}
 }
 
-static struct page *new_node_page(struct page *page, unsigned long node, int **x)
+/* page allocation callback for NUMA node migration */
+struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
 {
 	if (PageHuge(page))
 		return alloc_huge_page_node(page_hstate(compound_head(page)),
@@ -986,7 +987,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
 
 	if (!list_empty(&pagelist)) {
-		err = migrate_pages(&pagelist, new_node_page, NULL, dest,
+		err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
 					MIGRATE_SYNC, MR_SYSCALL);
 		if (err)
 			putback_movable_pages(&pagelist);

mm/migrate.c

Lines changed: 135 additions & 173 deletions

@@ -1444,141 +1444,103 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 }
 
 #ifdef CONFIG_NUMA
-/*
- * Move a list of individual pages
- */
-struct page_to_node {
-	unsigned long addr;
-	struct page *page;
-	int node;
-	int status;
-};
 
-static struct page *new_page_node(struct page *p, unsigned long private,
-		int **result)
+static int store_status(int __user *status, int start, int value, int nr)
 {
-	struct page_to_node *pm = (struct page_to_node *)private;
-
-	while (pm->node != MAX_NUMNODES && pm->page != p)
-		pm++;
+	while (nr-- > 0) {
+		if (put_user(value, status + start))
+			return -EFAULT;
+		start++;
+	}
 
-	if (pm->node == MAX_NUMNODES)
-		return NULL;
+	return 0;
+}
 
-	*result = &pm->status;
+static int do_move_pages_to_node(struct mm_struct *mm,
+		struct list_head *pagelist, int node)
+{
+	int err;
 
-	if (PageHuge(p))
-		return alloc_huge_page_node(page_hstate(compound_head(p)),
-					pm->node);
-	else if (thp_migration_supported() && PageTransHuge(p)) {
-		struct page *thp;
+	if (list_empty(pagelist))
+		return 0;
 
-		thp = alloc_pages_node(pm->node,
-			(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
-			HPAGE_PMD_ORDER);
-		if (!thp)
-			return NULL;
-		prep_transhuge_page(thp);
-		return thp;
-	} else
-		return __alloc_pages_node(pm->node,
-				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
+	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
+			MIGRATE_SYNC, MR_SYSCALL);
+	if (err)
+		putback_movable_pages(pagelist);
+	return err;
 }
 
 /*
- * Move a set of pages as indicated in the pm array. The addr
- * field must be set to the virtual address of the page to be moved
- * and the node number must contain a valid target node.
- * The pm array ends with node = MAX_NUMNODES.
+ * Resolves the given address to a struct page, isolates it from the LRU and
+ * puts it to the given pagelist.
+ * Returns -errno if the page cannot be found/isolated or 0 when it has been
+ * queued or the page doesn't need to be migrated because it is already on
+ * the target node
  */
-static int do_move_page_to_node_array(struct mm_struct *mm,
-				      struct page_to_node *pm,
-				      int migrate_all)
+static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
+		int node, struct list_head *pagelist, bool migrate_all)
 {
+	struct vm_area_struct *vma;
+	struct page *page;
+	unsigned int follflags;
 	int err;
-	struct page_to_node *pp;
-	LIST_HEAD(pagelist);
 
 	down_read(&mm->mmap_sem);
+	err = -EFAULT;
+	vma = find_vma(mm, addr);
+	if (!vma || addr < vma->vm_start || !vma_migratable(vma))
+		goto out;
 
-	/*
-	 * Build a list of pages to migrate
-	 */
-	for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
-		struct vm_area_struct *vma;
-		struct page *page;
-		struct page *head;
-		unsigned int follflags;
-
-		err = -EFAULT;
-		vma = find_vma(mm, pp->addr);
-		if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
-			goto set_status;
-
-		/* FOLL_DUMP to ignore special (like zero) pages */
-		follflags = FOLL_GET | FOLL_DUMP;
-		if (!thp_migration_supported())
-			follflags |= FOLL_SPLIT;
-		page = follow_page(vma, pp->addr, follflags);
+	/* FOLL_DUMP to ignore special (like zero) pages */
+	follflags = FOLL_GET | FOLL_DUMP;
+	if (!thp_migration_supported())
+		follflags |= FOLL_SPLIT;
+	page = follow_page(vma, addr, follflags);
 
-		err = PTR_ERR(page);
-		if (IS_ERR(page))
-			goto set_status;
+	err = PTR_ERR(page);
+	if (IS_ERR(page))
+		goto out;
 
-		err = -ENOENT;
-		if (!page)
-			goto set_status;
+	err = -ENOENT;
+	if (!page)
+		goto out;
 
-		err = page_to_nid(page);
+	err = 0;
+	if (page_to_nid(page) == node)
+		goto out_putpage;
 
-		if (err == pp->node)
-			/*
-			 * Node already in the right place
-			 */
-			goto put_and_set;
+	err = -EACCES;
+	if (page_mapcount(page) > 1 && !migrate_all)
+		goto out_putpage;
 
-		err = -EACCES;
-		if (page_mapcount(page) > 1 &&
-				!migrate_all)
-			goto put_and_set;
-
-		if (PageHuge(page)) {
-			if (PageHead(page)) {
-				isolate_huge_page(page, &pagelist);
-				err = 0;
-				pp->page = page;
-			}
-			goto put_and_set;
+	if (PageHuge(page)) {
+		if (PageHead(page)) {
+			isolate_huge_page(page, pagelist);
+			err = 0;
 		}
+	} else {
+		struct page *head;
 
-		pp->page = compound_head(page);
 		head = compound_head(page);
 		err = isolate_lru_page(head);
-		if (!err) {
-			list_add_tail(&head->lru, &pagelist);
-			mod_node_page_state(page_pgdat(head),
-				NR_ISOLATED_ANON + page_is_file_cache(head),
-				hpage_nr_pages(head));
-		}
-put_and_set:
-		/*
-		 * Either remove the duplicate refcount from
-		 * isolate_lru_page() or drop the page ref if it was
-		 * not isolated.
-		 */
-		put_page(page);
-set_status:
-		pp->status = err;
-	}
-
-	err = 0;
-	if (!list_empty(&pagelist)) {
-		err = migrate_pages(&pagelist, new_page_node, NULL,
-				(unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
 		if (err)
-			putback_movable_pages(&pagelist);
-	}
+			goto out_putpage;
 
+		err = 0;
+		list_add_tail(&head->lru, pagelist);
+		mod_node_page_state(page_pgdat(head),
+			NR_ISOLATED_ANON + page_is_file_cache(head),
+			hpage_nr_pages(head));
+	}
+out_putpage:
+	/*
+	 * Either remove the duplicate refcount from
+	 * isolate_lru_page() or drop the page ref if it was
+	 * not isolated.
	 */
+	put_page(page);
+out:
 	up_read(&mm->mmap_sem);
 	return err;
 }
@@ -1593,79 +1555,79 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			 const int __user *nodes,
 			 int __user *status, int flags)
 {
-	struct page_to_node *pm;
-	unsigned long chunk_nr_pages;
-	unsigned long chunk_start;
-	int err;
-
-	err = -ENOMEM;
-	pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
-	if (!pm)
-		goto out;
+	int current_node = NUMA_NO_NODE;
+	LIST_HEAD(pagelist);
+	int start, i;
+	int err = 0, err1;
 
 	migrate_prep();
 
-	/*
-	 * Store a chunk of page_to_node array in a page,
-	 * but keep the last one as a marker
-	 */
-	chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1;
-
-	for (chunk_start = 0;
-	     chunk_start < nr_pages;
-	     chunk_start += chunk_nr_pages) {
-		int j;
+	for (i = start = 0; i < nr_pages; i++) {
+		const void __user *p;
+		unsigned long addr;
+		int node;
 
-		if (chunk_start + chunk_nr_pages > nr_pages)
-			chunk_nr_pages = nr_pages - chunk_start;
-
-		/* fill the chunk pm with addrs and nodes from user-space */
-		for (j = 0; j < chunk_nr_pages; j++) {
-			const void __user *p;
-			int node;
-
-			err = -EFAULT;
-			if (get_user(p, pages + j + chunk_start))
-				goto out_pm;
-			pm[j].addr = (unsigned long) p;
-
-			if (get_user(node, nodes + j + chunk_start))
-				goto out_pm;
-
-			err = -ENODEV;
-			if (node < 0 || node >= MAX_NUMNODES)
-				goto out_pm;
-
-			if (!node_state(node, N_MEMORY))
-				goto out_pm;
-
-			err = -EACCES;
-			if (!node_isset(node, task_nodes))
-				goto out_pm;
+		err = -EFAULT;
+		if (get_user(p, pages + i))
+			goto out_flush;
+		if (get_user(node, nodes + i))
+			goto out_flush;
+		addr = (unsigned long)p;
+
+		err = -ENODEV;
+		if (node < 0 || node >= MAX_NUMNODES)
+			goto out_flush;
+		if (!node_state(node, N_MEMORY))
+			goto out_flush;
 
-			pm[j].node = node;
+		err = -EACCES;
+		if (!node_isset(node, task_nodes))
+			goto out_flush;
+
+		if (current_node == NUMA_NO_NODE) {
+			current_node = node;
+			start = i;
+		} else if (node != current_node) {
+			err = do_move_pages_to_node(mm, &pagelist, current_node);
+			if (err)
+				goto out;
+			err = store_status(status, start, current_node, i - start);
+			if (err)
+				goto out;
+			start = i;
+			current_node = node;
 		}
 
-		/* End marker for this chunk */
-		pm[chunk_nr_pages].node = MAX_NUMNODES;
-
-		/* Migrate this chunk */
-		err = do_move_page_to_node_array(mm, pm,
-						 flags & MPOL_MF_MOVE_ALL);
-		if (err < 0)
-			goto out_pm;
+		/*
+		 * Errors in the page lookup or isolation are not fatal and we simply
+		 * report them via status
+		 */
+		err = add_page_for_migration(mm, addr, current_node,
+				&pagelist, flags & MPOL_MF_MOVE_ALL);
+		if (!err)
+			continue;
 
-		/* Return status information */
-		for (j = 0; j < chunk_nr_pages; j++)
-			if (put_user(pm[j].status, status + j + chunk_start)) {
-				err = -EFAULT;
-				goto out_pm;
-			}
-	}
-	err = 0;
+		err = store_status(status, i, err, 1);
+		if (err)
+			goto out_flush;
 
-out_pm:
-	free_page((unsigned long)pm);
+		err = do_move_pages_to_node(mm, &pagelist, current_node);
+		if (err)
+			goto out;
+		if (i > start) {
+			err = store_status(status, start, current_node, i - start);
+			if (err)
+				goto out;
+		}
+		current_node = NUMA_NO_NODE;
+	}
+out_flush:
+	/* Make sure we do not overwrite the existing error */
+	err1 = do_move_pages_to_node(mm, &pagelist, current_node);
+	if (!err1)
+		err1 = store_status(status, start, current_node, i - start);
+	if (!err)
+		err = err1;
 out:
 	return err;
 }
