
Commit bd74fda

yuzhaogoogle authored and akpm00 committed
mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only.

NB: this behavior has nothing in common with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:

1. It tracks the usage of mm_structs between context switches so that page table walkers can skip processes that have been sleeping since the last iteration (see the sketch after the include/linux/mm_types.h changes below).
2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages (see the sketch after the include/linux/mmzone.h changes below).
3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[8, 10]%
                Ops/sec      KB/sec
      patch1-7: 1147696.57   44640.29
      patch1-8: 1245274.91   48435.66

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

    patch1-8
      49.44%  lzo1x_1_do_compress (real work)
       6.19%  page_vma_mapped_walk (overhead)
       5.97%  _raw_spin_unlock_irq
       3.13%  get_pfn_folio
       2.85%  ptep_clear_flush
       2.42%  __zram_bvec_write
       2.08%  do_raw_spin_lock
       1.92%  memmove
       1.44%  alloc_zspage
       1.36%  memset

  Configurations:
    no change

Thanks to the following developers for their efforts [3].
  kernel test robot <[email protected]>

[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Larabel <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
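A note on the concurrency claim in the message above: walkers iterating the same per-memcg list each claim a distinct mm_struct by advancing a shared cursor under lru_gen_mm_list.lock (the cursor is the head field added to struct lru_gen_mm_state in include/linux/mmzone.h below). The following is a minimal user-space model of that claim-under-lock pattern only; the names mm_list_model, get_next_mm and walker are invented for illustration and this is not the mm/vmscan.c implementation from the series.

/* build with: gcc -pthread -o walkers walkers.c */
#include <pthread.h>
#include <stdio.h>

#define NR_MMS		8
#define NR_WALKERS	2

/* toy stand-in for a per-memcg mm_struct list plus its iteration cursor */
struct mm_list_model {
	pthread_mutex_t lock;	/* plays the role of lru_gen_mm_list.lock */
	int head;		/* where the current iteration continues */
	int nr_mms;
};

/*
 * Each walker claims the next unvisited entry under the lock and then
 * "walks" it with the lock dropped, so no two walkers ever process the
 * same entry and they can run concurrently.
 */
static int get_next_mm(struct mm_list_model *list)
{
	int mm = -1;

	pthread_mutex_lock(&list->lock);
	if (list->head < list->nr_mms)
		mm = list->head++;
	pthread_mutex_unlock(&list->lock);

	return mm;
}

static void *walker(void *arg)
{
	struct mm_list_model *list = arg;
	int mm;

	while ((mm = get_next_mm(list)) >= 0)
		printf("walking mm %d\n", mm);

	return NULL;
}

int main(void)
{
	struct mm_list_model list = { .head = 0, .nr_mms = NR_MMS };
	pthread_t threads[NR_WALKERS];

	pthread_mutex_init(&list.lock, NULL);

	for (int i = 0; i < NR_WALKERS; i++)
		pthread_create(&threads[i], NULL, walker, &list);
	for (int i = 0; i < NR_WALKERS; i++)
		pthread_join(threads[i], NULL);

	return 0;
}

Each of the eight entries is printed exactly once, regardless of how the two walkers interleave.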
1 parent 018ee47 commit bd74fda

10 files changed: +1172 −17 lines

fs/exec.c

Lines changed: 2 additions & 0 deletions
@@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	lru_gen_add_mm(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1029,6 +1030,7 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
+	lru_gen_use_mm(mm);

 	if (vfork)
 		timens_on_fork(tsk->nsproxy, tsk);

include/linux/memcontrol.h

Lines changed: 5 additions & 0 deletions
@@ -350,6 +350,11 @@ struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif

+#ifdef CONFIG_LRU_GEN
+	/* per-memcg mm_struct list */
+	struct lru_gen_mm_list mm_list;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };

include/linux/mm_types.h

Lines changed: 76 additions & 0 deletions
@@ -672,6 +672,22 @@ struct mm_struct {
 	 */
 	unsigned long ksm_merging_pages;
 #endif
+#ifdef CONFIG_LRU_GEN
+	struct {
+		/* this mm_struct is on lru_gen_mm_list */
+		struct list_head list;
+		/*
+		 * Set when switching to this mm_struct, as a hint of
+		 * whether it has been used since the last time per-node
+		 * page table walkers cleared the corresponding bits.
+		 */
+		unsigned long bitmap;
+#ifdef CONFIG_MEMCG
+		/* points to the memcg of "owner" above */
+		struct mem_cgroup *memcg;
+#endif
+	} lru_gen;
+#endif /* CONFIG_LRU_GEN */
 } __randomize_layout;

 /*
@@ -698,6 +714,66 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return (struct cpumask *)&mm->cpu_bitmap;
 }

+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+	/* mm_struct list for page table walkers */
+	struct list_head fifo;
+	/* protects the list above */
+	spinlock_t lock;
+};
+
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+	INIT_LIST_HEAD(&mm->lru_gen.list);
+	mm->lru_gen.bitmap = 0;
+#ifdef CONFIG_MEMCG
+	mm->lru_gen.memcg = NULL;
+#endif
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+	/*
+	 * When the bitmap is set, page reclaim knows this mm_struct has been
+	 * used since the last time it cleared the bitmap. So it might be worth
+	 * walking the page tables of this mm_struct to clear the accessed bit.
+	 */
+	WRITE_ONCE(mm->lru_gen.bitmap, -1);
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
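The lru_gen.bitmap field and lru_gen_use_mm() above implement optimization 1 from the commit message: the context-switch hook sets the bitmap whenever a CPU starts using an mm_struct, and page table walkers test-and-clear it so they can skip address spaces that have not run since the last iteration. Below is a minimal user-space model of that test-and-clear pattern, ignoring the per-node bits; mm_model, use_mm and should_walk_mm are hypothetical names, and the real consumer of the bitmap lives in mm/vmscan.c, which is not among the hunks shown here.

#include <stdbool.h>
#include <stdio.h>

/* simplified model of mm->lru_gen.bitmap: set on context switch, cleared by walkers */
struct mm_model {
	unsigned long bitmap;
};

/* called when a CPU switches to this mm (cf. lru_gen_use_mm() above) */
static void use_mm(struct mm_model *mm)
{
	mm->bitmap = -1UL;
}

/*
 * Called by a page table walker before scanning: if the bitmap is still
 * clear, the mm has not run since the last iteration, so its accessed
 * bits cannot have been set and the walk can be skipped. Otherwise clear
 * the bitmap and do the walk.
 */
static bool should_walk_mm(struct mm_model *mm)
{
	if (!mm->bitmap)
		return false;

	mm->bitmap = 0;
	return true;
}

int main(void)
{
	struct mm_model mm = { .bitmap = 0 };

	printf("fresh mm, walk? %d\n", should_walk_mm(&mm));	  /* 0: skip */
	use_mm(&mm);						  /* the task ran */
	printf("after use, walk? %d\n", should_walk_mm(&mm));	  /* 1: walk and clear */
	printf("walked already, walk? %d\n", should_walk_mm(&mm));/* 0: skip again */

	return 0;
}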

include/linux/mmzone.h

Lines changed: 55 additions & 1 deletion
@@ -408,7 +408,7 @@ enum {
  * min_seq behind.
  *
  * The number of pages in each generation is eventually consistent and therefore
- * can be transiently negative.
+ * can be transiently negative when reset_batch_size() is pending.
  */
 struct lru_gen_struct {
 	/* the aging increments the youngest generation number */
@@ -430,6 +430,53 @@ struct lru_gen_struct {
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 };

+enum {
+	MM_LEAF_TOTAL,		/* total leaf entries */
+	MM_LEAF_OLD,		/* old leaf entries */
+	MM_LEAF_YOUNG,		/* young leaf entries */
+	MM_NONLEAF_TOTAL,	/* total non-leaf entries */
+	MM_NONLEAF_FOUND,	/* non-leaf entries found in Bloom filters */
+	MM_NONLEAF_ADDED,	/* non-leaf entries added to Bloom filters */
+	NR_MM_STATS
+};
+
+/* double-buffering Bloom filters */
+#define NR_BLOOM_FILTERS	2
+
+struct lru_gen_mm_state {
+	/* set to max_seq after each iteration */
+	unsigned long seq;
+	/* where the current iteration continues (inclusive) */
+	struct list_head *head;
+	/* where the last iteration ended (exclusive) */
+	struct list_head *tail;
+	/* to wait for the last page table walker to finish */
+	struct wait_queue_head wait;
+	/* Bloom filters flip after each iteration */
+	unsigned long *filters[NR_BLOOM_FILTERS];
+	/* the mm stats for debugging */
+	unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
+	/* the number of concurrent page table walkers */
+	int nr_walkers;
+};
+
+struct lru_gen_mm_walk {
+	/* the lruvec under reclaim */
+	struct lruvec *lruvec;
+	/* unstable max_seq from lru_gen_struct */
+	unsigned long max_seq;
+	/* the next address within an mm to scan */
+	unsigned long next_addr;
+	/* to batch promoted pages */
+	int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* to batch the mm stats */
+	int mm_stats[NR_MM_STATS];
+	/* total batched items */
+	int batched;
+	bool can_swap;
+	bool force_scan;
+};
+
 void lru_gen_init_lruvec(struct lruvec *lruvec);
 void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

@@ -480,6 +527,8 @@ struct lruvec {
 #ifdef CONFIG_LRU_GEN
 	/* evictable pages divided into generations */
 	struct lru_gen_struct lrugen;
+	/* to concurrently iterate lru_gen_mm_list */
+	struct lru_gen_mm_state mm_state;
 #endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
@@ -1176,6 +1225,11 @@ typedef struct pglist_data {

 	unsigned long flags;

+#ifdef CONFIG_LRU_GEN
+	/* kswap mm walk data */
+	struct lru_gen_mm_walk mm_walk;
+#endif
+
 	ZONE_PADDING(_pad2_)

 	/* Per-node vmstats */
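Optimization 2 from the commit message is backed by the two entries of filters[NR_BLOOM_FILTERS] above: during one iteration, walkers query the filter populated by the previous iteration to skip branches that were not recorded as populated, while recording into the other filter, and the roles flip when the iteration ends. The sketch below is a self-contained user-space model of that double-buffering scheme only; the hash, the sizes and the helper names (bloom_set, bloom_test, bloom_flip) are invented here and do not match the mm/vmscan.c implementation, which is not among the hunks shown on this page.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* two filters, as with NR_BLOOM_FILTERS: one queried, one being rebuilt */
#define BLOOM_BITS	(1 << 10)
#define BLOOM_WORDS	(BLOOM_BITS / 64)
#define NR_HASHES	2

static uint64_t filters[2][BLOOM_WORDS];
static int active;	/* index of the filter queried this iteration */

static uint32_t hash(uint64_t key, int seed)
{
	/* simple stand-in hash for the sketch, not what the kernel uses */
	return (uint32_t)((key * 0x9E3779B97F4A7C15ULL) >> (13 + seed * 7)) % BLOOM_BITS;
}

/* record a populated branch (e.g., a non-leaf PMD entry) in one filter */
static void bloom_set(int gen, uint64_t key)
{
	for (int i = 0; i < NR_HASHES; i++) {
		uint32_t bit = hash(key, i);
		filters[gen][bit / 64] |= 1ULL << (bit % 64);
	}
}

static bool bloom_test(int gen, uint64_t key)
{
	for (int i = 0; i < NR_HASHES; i++) {
		uint32_t bit = hash(key, i);
		if (!(filters[gen][bit / 64] & (1ULL << (bit % 64))))
			return false;	/* definitely not recorded */
	}
	return true;	/* probably recorded; false positives are possible */
}

/* at the end of an iteration: query what was just recorded, clear the stale filter */
static void bloom_flip(void)
{
	active = !active;
	memset(filters[!active], 0, sizeof(filters[!active]));
}

int main(void)
{
	uint64_t pmd_key = 0xffff888000200000ULL;	/* a branch seen while walking */

	bloom_set(!active, pmd_key);	/* recorded during this iteration */
	bloom_flip();			/* the next iteration queries that record */
	printf("recorded branch found: %d\n", bloom_test(active, pmd_key));
	printf("unknown branch found:  %d\n", bloom_test(active, 0x1234ULL));

	return 0;
}

Because a filter only accumulates bits, discarding and rebuilding one copy per iteration is what keeps the hints "generational": stale branches age out after one flip instead of lingering forever.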

include/linux/swap.h

Lines changed: 4 additions & 0 deletions
@@ -162,6 +162,10 @@ union swap_header {
  */
 struct reclaim_state {
 	unsigned long reclaimed_slab;
+#ifdef CONFIG_LRU_GEN
+	/* per-thread mm walk data */
+	struct lru_gen_mm_walk *mm_walk;
+#endif
 };

 #ifdef __KERNEL__

kernel/exit.c

Lines changed: 1 addition & 0 deletions
@@ -466,6 +466,7 @@ void mm_update_next_owner(struct mm_struct *mm)
 		goto retry;
 	}
 	WRITE_ONCE(mm->owner, c);
+	lru_gen_migrate_mm(mm);
 	task_unlock(c);
 	put_task_struct(c);
 }

kernel/fork.c

Lines changed: 9 additions & 0 deletions
@@ -1152,6 +1152,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;

 	mm->user_ns = get_user_ns(user_ns);
+	lru_gen_init_mm(mm);
 	return mm;

 fail_nocontext:
@@ -1194,6 +1195,7 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	lru_gen_del_mm(mm);
 	mmdrop(mm);
 }

@@ -2694,6 +2696,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		get_task_struct(p);
 	}

+	if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+		/* lock the task to synchronize with memcg migration */
+		task_lock(p);
+		lru_gen_add_mm(p->mm);
+		task_unlock(p);
+	}
+
 	wake_up_new_task(p);

 	/* forking complete and child started to run, tell ptracer */

kernel/sched/core.c

Lines changed: 1 addition & 0 deletions
@@ -5180,6 +5180,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 * finish_task_switch()'s mmdrop().
 	 */
 	switch_mm_irqs_off(prev->active_mm, next->mm, next);
+	lru_gen_use_mm(next->mm);

 	if (!prev->mm) { // from kernel
 		/* will mmdrop() in finish_task_switch(). */

mm/memcontrol.c

Lines changed: 25 additions & 0 deletions
@@ -6204,6 +6204,30 @@ static void mem_cgroup_move_task(void)
 }
 #endif

+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *css;
+
+	/* find the first leader if there is any */
+	cgroup_taskset_for_each_leader(task, css, tset)
+		break;
+
+	if (!task)
+		return;
+
+	task_lock(task);
+	if (task->mm && READ_ONCE(task->mm->owner) == task)
+		lru_gen_migrate_mm(task->mm);
+	task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
 {
 	if (value == PAGE_COUNTER_MAX)
@@ -6609,6 +6633,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_reset = mem_cgroup_css_reset,
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
+	.attach = mem_cgroup_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
 	.dfl_cftypes = memory_files,
