
Commit 795ae7a

hnaz authored and torvalds committed
mm: scale kswapd watermarks in proportion to memory
In machines with 140G of memory and enterprise flash storage, we have seen read and write bursts routinely exceed the kswapd watermarks and cause thundering herds in direct reclaim. Unfortunately, the only way to tune kswapd aggressiveness is through adjusting min_free_kbytes - the system's emergency reserves - which is entirely unrelated to the system's latency requirements. In order to get kswapd to maintain a 250M buffer of free memory, the emergency reserves need to be set to 1G. That is a lot of memory wasted for no good reason.

On the other hand, it's reasonable to assume that allocation bursts and overall allocation concurrency scale with memory capacity, so it makes sense to make kswapd aggressiveness a function of that as well.

Change the kswapd watermark scale factor from the currently fixed 25% of the tunable emergency reserve to a tunable 0.1% of memory.

Beyond 1G of memory, this will produce bigger watermark steps than the current formula in default settings. Ensure that the new formula never chooses steps smaller than that, i.e. 25% of the emergency reserve.

On a 140G machine, this raises the default watermark steps - the distance between min and low, and low and high - from 16M to 143M.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 3ed3a4f commit 795ae7a
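
To make the changelog's 140G example concrete, here is a minimal userspace sketch (not part of the commit) that mirrors the new step calculation. The 140G zone size, the 64M min watermark (roughly what the default min_free_kbytes cap yields on large machines), and the 4K page size are assumptions chosen to reproduce the numbers quoted above.

/*
 * Hypothetical userspace sketch, not kernel code: it only reproduces
 * the watermark-step arithmetic described in the changelog for one
 * illustrative zone.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long long page_size = 4096;              /* assumed 4K pages */
	unsigned long long managed_pages = 140ULL << 30 >> 12;  /* ~140G zone */
	unsigned long long min_wmark     = 64ULL << 20 >> 12;   /* assumed 64M min watermark */
	unsigned long long scale_factor  = 10;                  /* default: 0.1% of memory */

	/* Old behaviour: the step was a fixed 25% of the min watermark. */
	unsigned long long old_step = min_wmark >> 2;

	/* New behaviour: 0.1% of managed memory, never below the old step. */
	unsigned long long new_step = managed_pages * scale_factor / 10000;
	if (new_step < old_step)
		new_step = old_step;

	printf("old watermark step: %llu MB\n", old_step * page_size >> 20); /* 16 MB */
	printf("new watermark step: %llu MB\n", new_step * page_size >> 20); /* 143 MB */
	return 0;
}

With the default scale factor the proportional term dominates on large machines, while taking the maximum against 25% of the emergency reserve keeps small systems at their old watermark distances.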

5 files changed: 58 additions, 2 deletions


Documentation/sysctl/vm.txt

Lines changed: 18 additions & 0 deletions

@@ -803,6 +803,24 @@ performance impact. Reclaim code needs to take various locks to find freeable
 directory and inode objects. With vfs_cache_pressure=1000, it will look for
 ten times more freeable objects than there are.
 
+=============================================================
+
+watermark_scale_factor:
+
+This factor controls the aggressiveness of kswapd. It defines the
+amount of memory left in a node/system before kswapd is woken up and
+how much memory needs to be free before kswapd goes back to sleep.
+
+The unit is in fractions of 10,000. The default value of 10 means the
+distances between watermarks are 0.1% of the available memory in the
+node/system. The maximum value is 1000, or 10% of memory.
+
+A high rate of threads entering direct reclaim (allocstall) or kswapd
+going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
+that the number of free pages kswapd maintains for latency reasons is
+too small for the allocation bursts occurring in the system. This knob
+can then be used to tune kswapd aggressiveness accordingly.
+
 ==============================================================
 
 zone_reclaim_mode:
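
As a usage sketch (not part of the commit), the knob documented above is exposed as /proc/sys/vm/watermark_scale_factor once the patch is applied, per the vm_table entry added in kernel/sysctl.c below. The value 100 written here is an arbitrary example meaning 1% of memory, and error handling is deliberately minimal.

/* Minimal sketch of reading and raising vm.watermark_scale_factor. */
#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/vm/watermark_scale_factor";
	FILE *f;
	int val;

	/* Read the current value (the default is 10, i.e. 0.1% of memory). */
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &val) == 1)
		printf("watermark_scale_factor is %d\n", val);
	fclose(f);

	/* Raise it to 1% of memory; needs root, and must stay within 1..1000. */
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "100\n");
	return fclose(f) == 0 ? 0 : 1;
}

The same effect can be had with sysctl vm.watermark_scale_factor=100; either way the write takes effect immediately, since the handler added in mm/page_alloc.c recomputes the per-zone watermarks.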

include/linux/mm.h

Lines changed: 1 addition & 0 deletions

@@ -1889,6 +1889,7 @@ extern void zone_pcp_reset(struct zone *zone);
 
 /* page_alloc.c */
 extern int min_free_kbytes;
+extern int watermark_scale_factor;
 
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;

include/linux/mmzone.h

Lines changed: 2 additions & 0 deletions

@@ -841,6 +841,8 @@ static inline int is_highmem(struct zone *zone)
 struct ctl_table;
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);

kernel/sysctl.c

Lines changed: 10 additions & 0 deletions

@@ -126,6 +126,7 @@ static int __maybe_unused two = 2;
 static int __maybe_unused four = 4;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
+static int one_thousand = 1000;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 10000;
 #endif
@@ -1403,6 +1404,15 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= min_free_kbytes_sysctl_handler,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "watermark_scale_factor",
+		.data		= &watermark_scale_factor,
+		.maxlen		= sizeof(watermark_scale_factor),
+		.mode		= 0644,
+		.proc_handler	= watermark_scale_factor_sysctl_handler,
+		.extra1		= &one,
+		.extra2		= &one_thousand,
+	},
 	{
 		.procname	= "percpu_pagelist_fraction",
 		.data		= &percpu_pagelist_fraction,

mm/page_alloc.c

Lines changed: 27 additions & 2 deletions

@@ -249,6 +249,7 @@ compound_page_dtor * const compound_page_dtors[] = {
 
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
+int watermark_scale_factor = 10;
 
 static unsigned long __meminitdata nr_kernel_pages;
 static unsigned long __meminitdata nr_all_pages;
@@ -6347,8 +6348,17 @@ static void __setup_per_zone_wmarks(void)
 		zone->watermark[WMARK_MIN] = tmp;
 	}
 
-	zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
-	zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
+	/*
+	 * Set the kswapd watermarks distance according to the
+	 * scale factor in proportion to available memory, but
+	 * ensure a minimum size on small systems.
+	 */
+	tmp = max_t(u64, tmp >> 2,
+		    mult_frac(zone->managed_pages,
+			      watermark_scale_factor, 10000));
+
+	zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
+	zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
 	__mod_zone_page_state(zone, NR_ALLOC_BATCH,
 			high_wmark_pages(zone) - low_wmark_pages(zone) -
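
A note on the helper used above: mult_frac(x, numer, denom) computes x * numer / denom, splitting x into quotient and remainder first so the intermediate products stay smaller than a naive x * numer. The sketch below is an illustrative userspace rendering of that arithmetic, not the kernel macro itself.

/*
 * Illustrative stand-in for mult_frac(x, numer, denom): the quotient/
 * remainder split bounds the intermediate products, so scaling
 * zone->managed_pages by watermark_scale_factor / 10000 remains safe
 * even for very large zones.
 */
unsigned long long mult_frac_sketch(unsigned long long x,
				    unsigned long long numer,
				    unsigned long long denom)
{
	unsigned long long quot = x / denom;
	unsigned long long rem  = x % denom;

	return quot * numer + rem * numer / denom;
}

In the hunk above, the result of that scaling is fed through max_t(u64, tmp >> 2, ...), which guarantees the watermark step never drops below the old 25%-of-min behaviour on small systems.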
@@ -6489,6 +6499,21 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	if (write)
+		setup_per_zone_wmarks();
+
+	return 0;
+}
+
 #ifdef CONFIG_NUMA
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
