
Commit a2468cc

Aaron Lu authored and Linus Torvalds committed
swap: choose swap device according to numa node
If the system has more than one swap device and swap device has the node information, we can make use of this information to decide which swap device to use in get_swap_pages() to get better performance. The current code uses a priority based list, swap_avail_list, to decide which swap device to use and if multiple swap devices share the same priority, they are used round robin. This patch changes the previous single global swap_avail_list into a per-numa-node list, i.e. for each numa node, it sees its own priority based list of available swap devices. Swap device's priority can be promoted on its matching node's swap_avail_list. The current swap device's priority is set as: user can set a >=0 value, or the system will pick one starting from -1 then downwards. The priority value in the swap_avail_list is the negated value of the swap device's due to plist being sorted from low to high. The new policy doesn't change the semantics for priority >=0 cases, the previous starting from -1 then downwards now becomes starting from -2 then downwards and -1 is reserved as the promoted value. Take 4-node EX machine as an example, suppose 4 swap devices are available, each sit on a different node: swapA on node 0 swapB on node 1 swapC on node 2 swapD on node 3 After they are all swapped on in the sequence of ABCD. Current behaviour: their priorities will be: swapA: -1 swapB: -2 swapC: -3 swapD: -4 And their position in the global swap_avail_list will be: swapA -> swapB -> swapC -> swapD prio:1 prio:2 prio:3 prio:4 New behaviour: their priorities will be(note that -1 is skipped): swapA: -2 swapB: -3 swapC: -4 swapD: -5 And their positions in the 4 swap_avail_lists[nid] will be: swap_avail_lists[0]: /* node 0's available swap device list */ swapA -> swapB -> swapC -> swapD prio:1 prio:3 prio:4 prio:5 swap_avali_lists[1]: /* node 1's available swap device list */ swapB -> swapA -> swapC -> swapD prio:1 prio:2 prio:4 prio:5 swap_avail_lists[2]: /* node 2's available swap device list */ swapC -> swapA -> swapB -> swapD prio:1 prio:2 prio:3 prio:5 swap_avail_lists[3]: /* node 3's available swap device list */ swapD -> swapA -> swapB -> swapC prio:1 prio:2 prio:3 prio:4 To see the effect of the patch, a test that starts N process, each mmap a region of anonymous memory and then continually write to it at random position to trigger both swap in and out is used. On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives are used as swap devices with each attached to a different node, the result is: runtime=30m/processes=32/total test size=128G/each process mmap region=4G kernel throughput vanilla 13306 auto-binding 15169 +14% runtime=30m/processes=64/total test size=128G/each process mmap region=2G kernel throughput vanilla 11885 auto-binding 14879 +25% [[email protected]: v2] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: use kmalloc_array()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aaron Lu <[email protected]> Cc: "Chen, Tim C" <[email protected]> Cc: Huang Ying <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
1 parent da99ecf commit a2468cc

File tree: 3 files changed, +164 −27 lines

Documentation/vm/swap_numa.txt

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+Automatically bind swap device to numa node
+-------------------------------------------
+
+If the system has more than one swap device and swap device has the node
+information, we can make use of this information to decide which swap
+device to use in get_swap_pages() to get better performance.
+
+
+How to use this feature
+-----------------------
+
+Swap device has priority and that decides the order of it to be used. To make
+use of automatically binding, there is no need to manipulate priority settings
+for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and
+swapB, with swapA attached to node 0 and swapB attached to node 1, are going
+to be swapped on. Simply swap them on by doing:
+# swapon /dev/swapA
+# swapon /dev/swapB
+
+Then node 0 will use the two swap devices in the order of swapA then swapB and
+node 1 will use the two swap devices in the order of swapB then swapA. Note
+that the order of them being swapped on doesn't matter.
+
+A more complex example on a 4 node machine. Assume 6 swap devices are going to
+be swapped on: swapA and swapB are attached to node 0, swapC is attached to
+node 1, swapD and swapE are attached to node 2 and swapF is attached to node 3.
+The way to swap them on is the same as above:
+# swapon /dev/swapA
+# swapon /dev/swapB
+# swapon /dev/swapC
+# swapon /dev/swapD
+# swapon /dev/swapE
+# swapon /dev/swapF
+
+Then node 0 will use them in the order of:
+swapA/swapB -> swapC -> swapD -> swapE -> swapF
+swapA and swapB will be used in a round robin mode before any other swap device.
+
+node 1 will use them in the order of:
+swapC -> swapA -> swapB -> swapD -> swapE -> swapF
+
+node 2 will use them in the order of:
+swapD/swapE -> swapA -> swapB -> swapC -> swapF
+Similarly, swapD and swapE will be used in a round robin mode before any
+other swap devices.
+
+node 3 will use them in the order of:
+swapF -> swapA -> swapB -> swapC -> swapD -> swapE
+
+
+Implementation details
+----------------------
+
+The current code uses a priority based list, swap_avail_list, to decide
+which swap device to use and if multiple swap devices share the same
+priority, they are used round robin. This change here replaces the single
+global swap_avail_list with a per-numa-node list, i.e. for each numa node,
+it sees its own priority based list of available swap devices. Swap
+device's priority can be promoted on its matching node's swap_avail_list.
+
+The current swap device's priority is set as: user can set a >=0 value,
+or the system will pick one starting from -1 then downwards. The priority
+value in the swap_avail_list is the negated value of the swap device's
+due to plist being sorted from low to high. The new policy doesn't change
+the semantics for priority >=0 cases, the previous starting from -1 then
+downwards now becomes starting from -2 then downwards and -1 is reserved
+as the promoted value. So if multiple swap devices are attached to the same
+node, they will all be promoted to priority -1 on that node's plist and will
+be used round robin before any other swap devices.
include/linux/swap.h

Lines changed: 1 addition & 1 deletion
@@ -212,7 +212,7 @@ struct swap_info_struct {
     unsigned long   flags;          /* SWP_USED etc: see above */
     signed short    prio;           /* swap priority of this type */
     struct plist_node list;         /* entry in swap_active_head */
-    struct plist_node avail_list;   /* entry in swap_avail_head */
+    struct plist_node avail_lists[MAX_NUMNODES];/* entry in swap_avail_heads */
     signed char     type;           /* strange name for an index */
     unsigned int    max;            /* extent of the swap_map */
     unsigned char *swap_map;        /* vmalloc'ed array of usage counts */

mm/swapfile.c

Lines changed: 94 additions & 26 deletions
@@ -60,7 +60,7 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-static int least_priority;
+static int least_priority = -1;
 
 static const char Bad_file[] = "Bad swap file entry ";
 static const char Unused_file[] = "Unused swap file entry ";
@@ -85,7 +85,7 @@ PLIST_HEAD(swap_active_head);
  * is held and the locking order requires swap_lock to be taken
  * before any swap_info_struct->lock.
  */
-static PLIST_HEAD(swap_avail_head);
+struct plist_head *swap_avail_heads;
 static DEFINE_SPINLOCK(swap_avail_lock);
 
 struct swap_info_struct *swap_info[MAX_SWAPFILES];
@@ -592,6 +592,21 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
     return found_free;
 }
 
+static void __del_from_avail_list(struct swap_info_struct *p)
+{
+    int nid;
+
+    for_each_node(nid)
+        plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
+}
+
+static void del_from_avail_list(struct swap_info_struct *p)
+{
+    spin_lock(&swap_avail_lock);
+    __del_from_avail_list(p);
+    spin_unlock(&swap_avail_lock);
+}
+
 static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
                              unsigned int nr_entries)
 {
@@ -605,12 +620,22 @@ static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
     if (si->inuse_pages == si->pages) {
         si->lowest_bit = si->max;
         si->highest_bit = 0;
-        spin_lock(&swap_avail_lock);
-        plist_del(&si->avail_list, &swap_avail_head);
-        spin_unlock(&swap_avail_lock);
+        del_from_avail_list(si);
     }
 }
 
+static void add_to_avail_list(struct swap_info_struct *p)
+{
+    int nid;
+
+    spin_lock(&swap_avail_lock);
+    for_each_node(nid) {
+        WARN_ON(!plist_node_empty(&p->avail_lists[nid]));
+        plist_add(&p->avail_lists[nid], &swap_avail_heads[nid]);
+    }
+    spin_unlock(&swap_avail_lock);
+}
+
 static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
                             unsigned int nr_entries)
 {
@@ -623,13 +648,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
         bool was_full = !si->highest_bit;
 
         si->highest_bit = end;
-        if (was_full && (si->flags & SWP_WRITEOK)) {
-            spin_lock(&swap_avail_lock);
-            WARN_ON(!plist_node_empty(&si->avail_list));
-            if (plist_node_empty(&si->avail_list))
-                plist_add(&si->avail_list, &swap_avail_head);
-            spin_unlock(&swap_avail_lock);
-        }
+        if (was_full && (si->flags & SWP_WRITEOK))
+            add_to_avail_list(si);
     }
     atomic_long_add(nr_entries, &nr_swap_pages);
     si->inuse_pages -= nr_entries;
@@ -910,6 +930,7 @@ int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
     struct swap_info_struct *si, *next;
     long avail_pgs;
     int n_ret = 0;
+    int node;
 
     /* Only single cluster request supported */
     WARN_ON_ONCE(n_goal > 1 && cluster);
@@ -929,14 +950,15 @@ int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
     spin_lock(&swap_avail_lock);
 
 start_over:
-    plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
+    node = numa_node_id();
+    plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
         /* requeue si to after same-priority siblings */
-        plist_requeue(&si->avail_list, &swap_avail_head);
+        plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
         spin_unlock(&swap_avail_lock);
         spin_lock(&si->lock);
         if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
             spin_lock(&swap_avail_lock);
-            if (plist_node_empty(&si->avail_list)) {
+            if (plist_node_empty(&si->avail_lists[node])) {
                 spin_unlock(&si->lock);
                 goto nextsi;
             }
@@ -946,7 +968,7 @@ int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
             WARN(!(si->flags & SWP_WRITEOK),
                  "swap_info %d in list but !SWP_WRITEOK\n",
                  si->type);
-            plist_del(&si->avail_list, &swap_avail_head);
+            __del_from_avail_list(si);
             spin_unlock(&si->lock);
             goto nextsi;
         }
@@ -975,7 +997,7 @@ int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
          * swap_avail_head list then try it, otherwise start over
          * if we have not gotten any slots.
          */
-        if (plist_node_empty(&next->avail_list))
+        if (plist_node_empty(&next->avail_lists[node]))
             goto start_over;
     }
 
@@ -2410,10 +2432,24 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
     return generic_swapfile_activate(sis, swap_file, span);
 }
 
+static int swap_node(struct swap_info_struct *p)
+{
+    struct block_device *bdev;
+
+    if (p->bdev)
+        bdev = p->bdev;
+    else
+        bdev = p->swap_file->f_inode->i_sb->s_bdev;
+
+    return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
+}
+
 static void _enable_swap_info(struct swap_info_struct *p, int prio,
                               unsigned char *swap_map,
                               struct swap_cluster_info *cluster_info)
 {
+    int i;
+
     if (prio >= 0)
         p->prio = prio;
     else
@@ -2423,7 +2459,16 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
      * low-to-high, while swap ordering is high-to-low
      */
     p->list.prio = -p->prio;
-    p->avail_list.prio = -p->prio;
+    for_each_node(i) {
+        if (p->prio >= 0)
+            p->avail_lists[i].prio = -p->prio;
+        else {
+            if (swap_node(p) == i)
+                p->avail_lists[i].prio = 1;
+            else
+                p->avail_lists[i].prio = -p->prio;
+        }
+    }
     p->swap_map = swap_map;
     p->cluster_info = cluster_info;
     p->flags |= SWP_WRITEOK;
@@ -2442,9 +2487,7 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
      * swap_info_struct.
      */
     plist_add(&p->list, &swap_active_head);
-    spin_lock(&swap_avail_lock);
-    plist_add(&p->avail_list, &swap_avail_head);
-    spin_unlock(&swap_avail_lock);
+    add_to_avail_list(p);
 }
 
 static void enable_swap_info(struct swap_info_struct *p, int prio,
@@ -2529,17 +2572,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
         spin_unlock(&swap_lock);
         goto out_dput;
     }
-    spin_lock(&swap_avail_lock);
-    plist_del(&p->avail_list, &swap_avail_head);
-    spin_unlock(&swap_avail_lock);
+    del_from_avail_list(p);
     spin_lock(&p->lock);
     if (p->prio < 0) {
         struct swap_info_struct *si = p;
+        int nid;
 
         plist_for_each_entry_continue(si, &swap_active_head, list) {
             si->prio++;
             si->list.prio--;
-            si->avail_list.prio--;
+            for_each_node(nid) {
+                if (si->avail_lists[nid].prio != 1)
+                    si->avail_lists[nid].prio--;
+            }
         }
         least_priority++;
     }
@@ -2783,6 +2828,7 @@ static struct swap_info_struct *alloc_swap_info(void)
 {
     struct swap_info_struct *p;
     unsigned int type;
+    int i;
 
     p = kzalloc(sizeof(*p), GFP_KERNEL);
     if (!p)
@@ -2818,7 +2864,8 @@ static struct swap_info_struct *alloc_swap_info(void)
     }
     INIT_LIST_HEAD(&p->first_swap_extent.list);
     plist_node_init(&p->list, 0);
-    plist_node_init(&p->avail_list, 0);
+    for_each_node(i)
+        plist_node_init(&p->avail_lists[i], 0);
     p->flags = SWP_USED;
     spin_unlock(&swap_lock);
     spin_lock_init(&p->lock);
@@ -3060,6 +3107,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
     if (!capable(CAP_SYS_ADMIN))
         return -EPERM;
 
+    if (!swap_avail_heads)
+        return -ENOMEM;
+
     p = alloc_swap_info();
     if (IS_ERR(p))
         return PTR_ERR(p);
@@ -3645,3 +3695,21 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
         }
     }
 }
+
+static int __init swapfile_init(void)
+{
+    int nid;
+
+    swap_avail_heads = kmalloc_array(nr_node_ids, sizeof(struct plist_head),
+                                     GFP_KERNEL);
+    if (!swap_avail_heads) {
+        pr_emerg("Not enough memory for swap heads, swap is disabled\n");
+        return -ENOMEM;
+    }
+
+    for_each_node(nid)
+        plist_head_init(&swap_avail_heads[nid]);
+
+    return 0;
+}
+subsys_initcall(swapfile_init);
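For anyone wanting to observe the new behaviour on a live system, a rough check is sketched below. The device names are hypothetical and the sysfs path for the NUMA node is an assumption that varies by device type; /proc/swaps and its Priority column are stable. The node a disk hangs off can often be read from a numa_node attribute in sysfs, and after swapon the auto-assigned priorities, which with this patch start at -2, appear in /proc/swaps:

# cat /sys/block/nvme0n1/device/numa_node    # assumed path; varies by device type
0
# swapon /dev/swapA                          # hypothetical device names
# swapon /dev/swapB
# cat /proc/swaps                            # auto-assigned priorities now start at -2
Filename      Type        Size   Used   Priority
/dev/swapA    partition   ...    0      -2
/dev/swapB    partition   ...    0      -3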

0 commit comments
