Commit c9bff3e

Michal Hocko authored and torvalds committed
mm, page_alloc: rip out ZONELIST_ORDER_ZONE
Patch series "cleanup zonelists initialization", v1.

This is aimed at cleaning up the zonelists initialization code we have, but the primary motivation was bug report [2], which got resolved, but the usage of stop_machine is just too ugly to live. Most patches are straightforward but 3 of them need special consideration.

Patch 1 removes zone ordered zonelists completely. I am CCing linux-api because this is a user visible change. As I argue in the patch description, I do not think we have a strong usecase for it these days. I have kept the sysctl in place and warn into the log if somebody tries to configure zonelist ordering. If somebody has a real usecase for it we can revert this patch, but I do not expect anybody will actually notice runtime differences. This patch is not strictly needed for the rest but it made patch 6 easier to implement.

Patch 7 removes stop_machine from build_all_zonelists without adding any special synchronization between iterators and updater, which I _believe_ is acceptable as explained in the changelog. I hope I am not missing anything.

Patch 8 then removes zonelists_mutex, which is kind of ugly as well and not really needed AFAICS, but care should be taken when double checking my thinking.

This patch (of 9):

Supporting zone ordered zonelists costs us just a lot of code while the usefulness is arguable, if existent at all. Mel has already made node ordering default on 64b systems. 32b systems are still using ZONELIST_ORDER_ZONE because it is considered better to fall back to a different NUMA node rather than consume precious lowmem zones.

This argument is, however, weakened by the fact that the memory reclaim has been reworked to be node rather than zone oriented. This means that lowmem requests have to skip over all highmem pages on LRUs already, and so zone ordering doesn't save much reclaim time. So the only advantage of the zone ordering is under light memory pressure, when highmem requests never hit lowmem zones and the lowmem pressure doesn't need to reclaim.

Considering that 32b NUMA systems are rather suboptimal already and it is generally advisable to use a 64b kernel on such HW, I believe we should rather care about the code maintainability and just get rid of ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if somebody tries to set zone ordering either from the kernel command line or the sysctl.

[[email protected]: reading vm.numa_zonelist_order will never terminate]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Abdul Haleem <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
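
To make the difference between the two orderings concrete, here is a minimal userspace sketch in C. It is not kernel code: the toy_ names, the three-zone layout, and the assumption that every node has every zone populated are illustrative only. The loop nesting of the zone-order builder mirrors the removed build_zonelists_in_zone_order(); the node-order walk from the highest zone type down to the lowest within each node is an assumption about how the per-node builder behaves.

```c
#include <stddef.h>

#define TOY_NR_ZONES 3	/* hypothetical zone types, lowest first: 0, 1, 2 */

/* One entry of a toy fallback list: a zone type on a given node. */
struct toy_zoneref {
	int node;
	int zone_type;
};

/*
 * Node order: visit nodes in the given (nearest-first) order and, within
 * each node, visit zone types from highest to lowest.
 * Returns the number of entries written to out.
 */
size_t toy_build_node_order(const int *node_order, size_t nr_nodes,
			    struct toy_zoneref *out)
{
	size_t pos = 0;

	for (size_t j = 0; j < nr_nodes; j++) {
		for (int zt = TOY_NR_ZONES - 1; zt >= 0; zt--) {
			out[pos].node = node_order[j];
			out[pos].zone_type = zt;
			pos++;
		}
	}
	return pos;
}

/*
 * Zone order (the mode this patch removes): visit zone types from highest
 * to lowest and, for each zone type, visit nodes nearest-first.
 * Returns the number of entries written to out.
 */
size_t toy_build_zone_order(const int *node_order, size_t nr_nodes,
			    struct toy_zoneref *out)
{
	size_t pos = 0;

	for (int zt = TOY_NR_ZONES - 1; zt >= 0; zt--) {
		for (size_t j = 0; j < nr_nodes; j++) {
			out[pos].node = node_order[j];
			out[pos].zone_type = zt;
			pos++;
		}
	}
	return pos;
}
```

With node_order = {0, 1}, the node-order list contains all zones of node 0 before any zone of node 1, whereas the zone-order list contains the highest zone of both nodes before it falls back to a lower zone type on either node.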
1 parent 5a47074 commit c9bff3e

5 files changed: +28 -166 lines changed

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 1 addition & 1 deletion
@@ -2783,7 +2783,7 @@
 			Allowed values are enable and disable
 
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
-			one of ['zone', 'node', 'default'] can be specified
+			'node', 'default' can be specified
 			This can be set from sysctl after boot.
 			See Documentation/sysctl/vm.txt for details.
 

Documentation/sysctl/vm.txt

Lines changed: 3 additions & 1 deletion
@@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information.
 
 numa_zonelist_order
 
-This sysctl is only for NUMA.
+This sysctl is only for NUMA and it is deprecated. Anything but
+Node order will fail!
+
 'where the memory is allocated from' is controlled by zonelists.
 (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
  you may be able to read ZONE_DMA as ZONE_DMA32...)

Documentation/vm/numa

Lines changed: 2 additions & 5 deletions
@@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations
 fall back to the same zone type on a different node, or to a different zone
 type on the same node. This is an important consideration because some zones,
 such as DMA or DMA32, represent relatively scarce resources. Linux chooses
-a default zonelist order based on the sizes of the various zone types relative
-to the total memory of the node and the total memory of the system. The
-default zonelist order may be overridden using the numa_zonelist_order kernel
-boot parameter or sysctl. [see Documentation/admin-guide/kernel-parameters.rst and
-Documentation/sysctl/vm.txt]
+a default Node ordered zonelist. This means it tries to fallback to other zones
+from the same node before using remote nodes which are ordered by NUMA distance.
 
 By default, Linux will attempt to satisfy memory allocation requests from the
 node to which the CPU that executes the request is assigned. Specifically,

include/linux/mmzone.h

Lines changed: 1 addition & 1 deletion
@@ -896,7 +896,7 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 extern char numa_zonelist_order[];
-#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
+#define NUMA_ZONELIST_ORDER_LEN	16
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 

mm/page_alloc.c

Lines changed: 21 additions & 158 deletions
@@ -4858,115 +4858,52 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	return nr_zones;
 }
 
-
-/*
- * zonelist_order:
- * 0 = automatic detection of better ordering.
- * 1 = order by ([node] distance, -zonetype)
- * 2 = order by (-zonetype, [node] distance)
- *
- * If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
- * the same zonelist. So only NUMA can configure this param.
- */
-#define ZONELIST_ORDER_DEFAULT 0
-#define ZONELIST_ORDER_NODE 1
-#define ZONELIST_ORDER_ZONE 2
-
-/* zonelist order in the kernel.
- * set_zonelist_order() will set this to NODE or ZONE.
- */
-static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
-static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
-
-
 #ifdef CONFIG_NUMA
-/* The value user specified ....changed by config */
-static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-/* string for sysctl */
-#define NUMA_ZONELIST_ORDER_LEN 16
-char numa_zonelist_order[16] = "default";
-
-/*
- * interface for configure zonelist ordering.
- * command line option "numa_zonelist_order"
- * = "[dD]efault - default, automatic configuration.
- * = "[nN]ode - order by node locality, then by zone within node
- * = "[zZ]one - order by zone, then by locality within zone
- */
 
 static int __parse_numa_zonelist_order(char *s)
 {
-	if (*s == 'd' || *s == 'D') {
-		user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-	} else if (*s == 'n' || *s == 'N') {
-		user_zonelist_order = ZONELIST_ORDER_NODE;
-	} else if (*s == 'z' || *s == 'Z') {
-		user_zonelist_order = ZONELIST_ORDER_ZONE;
-	} else {
-		pr_warn("Ignoring invalid numa_zonelist_order value: %s\n", s);
+	/*
+	 * We used to support different zonlists modes but they turned
+	 * out to be just not useful. Let's keep the warning in place
+	 * if somebody still use the cmd line parameter so that we do
+	 * not fail it silently
+	 */
+	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
+		pr_warn("Ignoring unsupported numa_zonelist_order value: %s\n", s);
 		return -EINVAL;
 	}
 	return 0;
 }
 
 static __init int setup_numa_zonelist_order(char *s)
 {
-	int ret;
-
 	if (!s)
 		return 0;
 
-	ret = __parse_numa_zonelist_order(s);
-	if (ret == 0)
-		strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);
-
-	return ret;
+	return __parse_numa_zonelist_order(s);
 }
 early_param("numa_zonelist_order", setup_numa_zonelist_order);
 
+char numa_zonelist_order[] = "Node";
+
 /*
  * sysctl handler for numa_zonelist_order
  */
 int numa_zonelist_order_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos)
 {
-	char saved_string[NUMA_ZONELIST_ORDER_LEN];
+	char *str;
 	int ret;
-	static DEFINE_MUTEX(zl_order_mutex);
 
-	mutex_lock(&zl_order_mutex);
-	if (write) {
-		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
-			ret = -EINVAL;
-			goto out;
-		}
-		strcpy(saved_string, (char *)table->data);
-	}
-	ret = proc_dostring(table, write, buffer, length, ppos);
-	if (ret)
-		goto out;
-	if (write) {
-		int oldval = user_zonelist_order;
+	if (!write)
+		return proc_dostring(table, write, buffer, length, ppos);
+	str = memdup_user_nul(buffer, 16);
+	if (IS_ERR(str))
+		return PTR_ERR(str);
 
-		ret = __parse_numa_zonelist_order((char *)table->data);
-		if (ret) {
-			/*
-			 * bogus value. restore saved string
-			 */
-			strncpy((char *)table->data, saved_string,
-				NUMA_ZONELIST_ORDER_LEN);
-			user_zonelist_order = oldval;
-		} else if (oldval != user_zonelist_order) {
-			mem_hotplug_begin();
-			mutex_lock(&zonelists_mutex);
-			build_all_zonelists(NULL, NULL);
-			mutex_unlock(&zonelists_mutex);
-			mem_hotplug_done();
-		}
-	}
-out:
-	mutex_unlock(&zl_order_mutex);
+	ret = __parse_numa_zonelist_order(str);
+	kfree(str);
 	return ret;
 }
 
@@ -5075,70 +5012,12 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  */
 static int node_order[MAX_NUMNODES];
 
-static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
-{
-	int pos, j, node;
-	int zone_type; /* needs to be signed */
-	struct zone *z;
-	struct zonelist *zonelist;
-
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	pos = 0;
-	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
-		for (j = 0; j < nr_nodes; j++) {
-			node = node_order[j];
-			z = &NODE_DATA(node)->node_zones[zone_type];
-			if (managed_zone(z)) {
-				zoneref_set_zone(z,
-					&zonelist->_zonerefs[pos++]);
-				check_highest_zone(zone_type);
-			}
-		}
-	}
-	zonelist->_zonerefs[pos].zone = NULL;
-	zonelist->_zonerefs[pos].zone_idx = 0;
-}
-
-#if defined(CONFIG_64BIT)
-/*
- * Devices that require DMA32/DMA are relatively rare and do not justify a
- * penalty to every machine in case the specialised case applies. Default
- * to Node-ordering on 64-bit NUMA machines
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_NODE;
-}
-#else
-/*
- * On 32-bit, the Normal zone needs to be preserved for allocations accessible
- * by the kernel. If processes running on node 0 deplete the low memory zone
- * then reclaim will occur more frequency increasing stalls and potentially
- * be easier to OOM if a large percentage of the zone is under writeback or
- * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set.
- * Hence, default to zone ordering on 32-bit.
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_ZONE;
-}
-#endif /* CONFIG_64BIT */
-
-static void set_zonelist_order(void)
-{
-	if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
-		current_zonelist_order = default_zonelist_order();
-	else
-		current_zonelist_order = user_zonelist_order;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int i, node, load;
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
-	unsigned int order = current_zonelist_order;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
@@ -5168,15 +5047,7 @@ static void build_zonelists(pg_data_t *pgdat)
 
 		prev_node = node;
 		load--;
-		if (order == ZONELIST_ORDER_NODE)
-			build_zonelists_in_node_order(pgdat, node);
-		else
-			node_order[i++] = node; /* remember order */
-	}
-
-	if (order == ZONELIST_ORDER_ZONE) {
-		/* calculate node order -- i.e., DMA last! */
-		build_zonelists_in_zone_order(pgdat, i);
+		build_zonelists_in_node_order(pgdat, node);
 	}
 
 	build_thisnode_zonelists(pgdat);
@@ -5204,11 +5075,6 @@ static void setup_min_unmapped_ratio(void);
 static void setup_min_slab_ratio(void);
 #else /* CONFIG_NUMA */
 
-static void set_zonelist_order(void)
-{
-	current_zonelist_order = ZONELIST_ORDER_ZONE;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
@@ -5348,8 +5214,6 @@ build_all_zonelists_init(void)
  */
 void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 {
-	set_zonelist_order();
-
 	if (system_state == SYSTEM_BOOTING) {
 		build_all_zonelists_init();
 	} else {
@@ -5375,9 +5239,8 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 	else
 		page_group_by_mobility_disabled = 0;
 
-	pr_info("Built %i zonelists in %s order, mobility grouping %s. Total pages: %ld\n",
+	pr_info("Built %i zonelists, mobility grouping %s. Total pages: %ld\n",
 		nr_online_nodes,
-		zonelist_order_name[current_zonelist_order],
 		page_group_by_mobility_disabled ? "off" : "on",
 		vm_total_pages);
 #ifdef CONFIG_NUMA
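
The acceptance rule that __parse_numa_zonelist_order() applies after this patch can be modelled in userspace as follows. This is only a sketch under stated assumptions: toy_parse_numa_zonelist_order is a hypothetical name, and fprintf() to stderr stands in for the kernel's pr_warn(). As in the patched function, only the first character of the value is inspected.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Userspace model of the patched check: values starting with 'd'/'D'
 * (default) or 'n'/'N' (node) are accepted, everything else is rejected
 * with a warning. The name is hypothetical, not a kernel symbol.
 */
int toy_parse_numa_zonelist_order(const char *s)
{
	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
		/* stands in for pr_warn() in the kernel */
		fprintf(stderr,
			"Ignoring unsupported numa_zonelist_order value: %s\n",
			s);
		return -EINVAL;
	}
	return 0;
}
```

Under this rule "node" and "Default" are accepted, "zone" and the empty string return -EINVAL, and a value such as "nonsense" is also accepted because it begins with 'n'.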