
Commit ac35a49

yuzhaogoogle authored and akpm00 committed
mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual.

The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. Promotion in the aging path does not involve any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages.

The eviction consumes old generations. Given an lruvec, it increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation.

The protection of pages accessed multiple times through file descriptors takes place in the eviction path. Each generation is divided into multiple tiers. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in folio->flags. The aforementioned feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. In contrast to promotion in the aging path, the protection of a page in the eviction path is achieved by moving this page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages:

1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path.
2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.)
3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads.

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
                IOPS         BW
      5.19-rc1: 2673k        10.2GiB/s
      patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
                Ops/sec      KB/sec
      5.19-rc1: 1161501.04   45177.25
      patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the results.
  patch drivers/block/brd.c <<EOF
  99,100c99,100
  < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
  < 	page = alloc_page(gfp_flags);
  ---
  > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
  > 	page = alloc_pages_node(1, gfp_flags, 0);
  EOF

  cat >>/etc/systemd/system.conf <<EOF
  CPUAffinity=numa
  NUMAPolicy=bind
  NUMAMask=0
  EOF

  cat >>/etc/memcached.conf <<EOF
  -m 184320
  -s /var/run/memcached/memcached.sock
  -a 0766
  -t 36
  -B binary
  EOF

  cat fio.sh
  modprobe brd rd_nr=1 rd_size=113246208
  swapoff -a
  mkfs.ext4 /dev/ram0
  mount -t ext4 /dev/ram0 /mnt

  mkdir /sys/fs/cgroup/user.slice/test
  echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
  echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
  fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=10m --runtime=5m --group_reporting

  cat memcached.sh
  modprobe brd rd_nr=1 rd_size=113246208
  swapoff -a
  mkswap /dev/ram0
  swapon /dev/ram0

  memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -n allkeys --key-minimum=1 \
    --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
    --ratio 1:0 --pipeline 8 -d 2000

  memtier_benchmark -S /var/run/memcached/memcached.sock \
    -P memcache_binary -n allkeys --key-minimum=1 \
    --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
    --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.19-rc1
      40.33%  page_vma_mapped_walk (overhead)
      21.80%  lzo1x_1_do_compress (real work)
       7.53%  do_raw_spin_lock
       3.95%  _raw_spin_unlock_irq
       2.52%  vma_interval_tree_iter_next
       2.37%  folio_referenced_one
       2.28%  vma_interval_tree_subtree_search
       1.97%  anon_vma_interval_tree_iter_first
       1.60%  ptep_clear_flush
       1.06%  __zram_bvec_write

    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    ChromeOS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Larabel <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
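For orientation, the following is a minimal userspace sketch (not kernel code; every name is local to the example) of the max_seq/min_seq bookkeeping described above: the constants mirror the patch, a sequence number maps onto a generation slot via seq % MAX_NR_GENS as in lru_gen_from_seq(), and the age()/evict() helpers are toy stand-ins for the real aging and eviction in mm/vmscan.c.

/*
 * Minimal userspace sketch of the max_seq/min_seq bookkeeping. Constants
 * mirror the patch; age()/evict() are toy stand-ins, not the kernel
 * implementations in mm/vmscan.c.
 */
#include <assert.h>
#include <stdio.h>

#define MIN_NR_GENS	2UL
#define MAX_NR_GENS	4UL

static unsigned long max_seq = MIN_NR_GENS, min_seq;

/* same formula as lru_gen_from_seq(): a seq maps onto a ring slot */
static unsigned long gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* toy aging: open a new (young) generation when too few remain */
static void age(void)
{
	if (max_seq - min_seq + 1 <= MIN_NR_GENS)
		max_seq++;
}

/* toy eviction: retire the oldest generation once it has been emptied */
static void evict(void)
{
	if (max_seq - min_seq + 1 > MIN_NR_GENS)
		min_seq++;
}

int main(void)
{
	for (int i = 0; i < 6; i++) {
		evict();
		age();
		printf("min_seq=%lu (slot %lu)  max_seq=%lu (slot %lu)\n",
		       min_seq, gen_from_seq(min_seq),
		       max_seq, gen_from_seq(max_seq));
		/* the window the patch maintains */
		assert(max_seq - min_seq + 1 >= MIN_NR_GENS);
		assert(max_seq - min_seq + 1 <= MAX_NR_GENS);
	}
	return 0;
}

Running it shows the [min_seq, max_seq] window sliding forward while staying between MIN_NR_GENS and MAX_NR_GENS generations wide.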
1 parent ec1c86b commit ac35a49

File tree

8 files changed: +1025 −11 lines changed

include/linux/mm_inline.h

Lines changed: 36 additions & 0 deletions
@@ -121,6 +121,33 @@ static inline int lru_gen_from_seq(unsigned long seq)
 	return seq % MAX_NR_GENS;
 }
 
+static inline int lru_hist_from_seq(unsigned long seq)
+{
+	return seq % NR_HIST_GENS;
+}
+
+static inline int lru_tier_from_refs(int refs)
+{
+	VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH));
+
+	/* see the comment in folio_lru_refs() */
+	return order_base_2(refs + 1);
+}
+
+static inline int folio_lru_refs(struct folio *folio)
+{
+	unsigned long flags = READ_ONCE(folio->flags);
+	bool workingset = flags & BIT(PG_workingset);
+
+	/*
+	 * Return the number of accesses beyond PG_referenced, i.e., N-1 if the
+	 * total number of accesses is N>1, since N=0,1 both map to the first
+	 * tier. lru_tier_from_refs() will account for this off-by-one. Also see
+	 * the comment on MAX_NR_TIERS.
+	 */
+	return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset;
+}
+
 static inline int folio_lru_gen(struct folio *folio)
 {
 	unsigned long flags = READ_ONCE(folio->flags);
@@ -173,6 +200,15 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
 		__update_lru_size(lruvec, lru, zone, -delta);
 		return;
 	}
+
+	/* promotion */
+	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
+		__update_lru_size(lruvec, lru, zone, -delta);
+		__update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
+	}
+
+	/* demotion requires isolation, e.g., lru_deactivate_fn() */
+	VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
 }
 
 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
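A worked example may help with the off-by-one noted in folio_lru_refs() above. The snippet is a hedged userspace model of the same arithmetic: refs stands for the value folio_lru_refs() would return (accesses beyond PG_referenced), and tier_from_refs() mirrors lru_tier_from_refs(); the folio and LRU_REFS_* bit-field machinery is deliberately replaced by plain integers.

/*
 * Userspace model of the refs -> tier arithmetic. "refs" plays the role
 * of folio_lru_refs(): the number of accesses beyond PG_referenced, so
 * N-1 for N>1 and 0 for N=0,1.
 */
#include <stdio.h>

/* smallest k such that 2^k >= n (and 0 for n <= 1), like order_base_2() */
static int order_base_2(int n)
{
	int k = 0;

	while ((1 << k) < n)
		k++;
	return k;
}

/* mirrors lru_tier_from_refs(): the +1 undoes the off-by-one above */
static int tier_from_refs(int refs)
{
	return order_base_2(refs + 1);
}

int main(void)
{
	for (int n = 0; n <= 8; n++) {
		int refs = n > 1 ? n - 1 : 0;

		printf("N=%d accesses -> refs=%d -> tier %d\n",
		       n, refs, tier_from_refs(refs));
	}
	return 0;
}

The output shows N=0 and N=1 both landing in tier 0, N=2 in tier 1, N=3 and N=4 in tier 2, and so on, i.e., tier order_base_2(N).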

include/linux/mmzone.h

Lines changed: 41 additions & 0 deletions
@@ -350,6 +350,28 @@ enum lruvec_flags {
 #define MIN_NR_GENS		2U
 #define MAX_NR_GENS		4U
 
+/*
+ * Each generation is divided into multiple tiers. A page accessed N times
+ * through file descriptors is in tier order_base_2(N). A page in the first tier
+ * (N=0,1) is marked by PG_referenced unless it was faulted in through page
+ * tables or read ahead. A page in any other tier (N>1) is marked by
+ * PG_referenced and PG_workingset. This implies a minimum of two tiers is
+ * supported without using additional bits in folio->flags.
+ *
+ * In contrast to moving across generations which requires the LRU lock, moving
+ * across tiers only involves atomic operations on folio->flags and therefore
+ * has a negligible cost in the buffered access path. In the eviction path,
+ * comparisons of refaulted/(evicted+protected) from the first tier and the
+ * rest infer whether pages accessed multiple times through file descriptors
+ * are statistically hot and thus worth protecting.
+ *
+ * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the
+ * number of categories of the active/inactive LRU when keeping track of
+ * accesses through file descriptors. This uses MAX_NR_TIERS-2 spare bits in
+ * folio->flags.
+ */
+#define MAX_NR_TIERS		4U
+
 #ifndef __GENERATING_BOUNDS_H
 
 struct lruvec;
@@ -364,6 +386,16 @@ enum {
 	LRU_GEN_FILE,
 };
 
+#define MIN_LRU_BATCH		BITS_PER_LONG
+#define MAX_LRU_BATCH		(MIN_LRU_BATCH * 64)
+
+/* whether to keep historical stats from evicted generations */
+#ifdef CONFIG_LRU_GEN_STATS
+#define NR_HIST_GENS		MAX_NR_GENS
+#else
+#define NR_HIST_GENS		1U
+#endif
+
 /*
  * The youngest generation number is stored in max_seq for both anon and file
  * types as they are aged on an equal footing. The oldest generation numbers are
@@ -386,6 +418,15 @@ struct lru_gen_struct {
 	struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* the multi-gen LRU sizes, eventually consistent */
 	long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* the exponential moving average of refaulted */
+	unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the exponential moving average of evicted+protected */
+	unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
+	/* the first tier doesn't need protection, hence the minus one */
+	unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
+	/* can be modified without holding the LRU lock */
+	atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
+	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
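The avg_refaulted/avg_total and per-tier evicted/refaulted/protected counters added above feed the comparison described in the MAX_NR_TIERS comment. The actual decision logic lives in mm/vmscan.c and is not part of this excerpt, so the following is only a back-of-the-envelope model of the ratio test (protect a tier when its refault ratio exceeds that of the first tier), with invented counter values.

/*
 * Toy model of the per-tier ratio test from the MAX_NR_TIERS comment:
 * protect tier t (t > 0) when
 *     refaulted[t] / (evicted[t] + protected[t])
 * exceeds the same ratio for tier 0. All counter values below are
 * invented; the real code works on the avg_refaulted/avg_total moving
 * averages in struct lru_gen_struct.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_NR_TIERS	4

struct tier_stats {
	unsigned long refaulted;
	unsigned long total;		/* evicted + protected */
};

/* cross-multiply instead of dividing: r/t > R/T  <=>  r*T > R*t */
static bool worth_protecting(const struct tier_stats *base,
			     const struct tier_stats *tier)
{
	return tier->refaulted * base->total > base->refaulted * tier->total;
}

int main(void)
{
	struct tier_stats stats[MAX_NR_TIERS] = {
		{ .refaulted = 100, .total = 10000 },	/* tier 0: 1%, baseline */
		{ .refaulted =  50, .total =  1000 },	/* tier 1: 5% */
		{ .refaulted =   5, .total =  2000 },	/* tier 2: 0.25% */
		{ .refaulted =  30, .total =   300 },	/* tier 3: 10% */
	};

	for (int t = 1; t < MAX_NR_TIERS; t++)
		printf("tier %d: %s\n", t,
		       worth_protecting(&stats[0], &stats[t]) ? "protect" : "evict");
	return 0;
}

Cross-multiplying avoids the division; the exact form used by the patch set may differ.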

include/linux/page-flags-layout.h

Lines changed: 4 additions & 1 deletion
@@ -106,7 +106,10 @@
 #error "Not enough bits in page flags"
 #endif
 
-#define LRU_REFS_WIDTH	0
+/* see the comment on MAX_NR_TIERS */
+#define LRU_REFS_WIDTH	min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \
+			    ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \
+			    NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH)
 
 #endif
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */

kernel/bounds.c

Lines changed: 2 additions & 0 deletions
@@ -24,8 +24,10 @@ int main(void)
 	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
 #ifdef CONFIG_LRU_GEN
 	DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
+	DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2);
 #else
 	DEFINE(LRU_GEN_WIDTH, 0);
+	DEFINE(__LRU_REFS_WIDTH, 0);
 #endif
 	/* End of constants */
 
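Putting the page-flags-layout.h and kernel/bounds.c pieces together: __LRU_REFS_WIDTH is MAX_NR_TIERS - 2 = 2, and LRU_REFS_WIDTH is that value clamped to whatever bits remain in folio->flags. The widths in the sketch below are hypothetical, config-dependent placeholders (a 64-bit build without KASAN tags or last-CPUPID tracking is assumed); only the min() formula and the MAX_NR_TIERS - 2 relation come from the patch.

/*
 * Worked example of the LRU_REFS_WIDTH clamp. Every *_WIDTH value below
 * is a hypothetical, config-dependent placeholder.
 */
#include <stdio.h>

#define MAX_NR_TIERS		4
#define __LRU_REFS_WIDTH	(MAX_NR_TIERS - 2)	/* = 2 */

/* assumed: 64-bit build, no KASAN tags, no last-CPUPID in page flags */
#define BITS_PER_LONG		64
#define NR_PAGEFLAGS		24
#define ZONES_WIDTH		3
#define LRU_GEN_WIDTH		3	/* order_base_2(MAX_NR_GENS + 1) */
#define SECTIONS_WIDTH		0
#define NODES_WIDTH		10
#define KASAN_TAG_WIDTH		0
#define LAST_CPUPID_WIDTH	0

#define MIN(a, b)		((a) < (b) ? (a) : (b))

int main(void)
{
	int spare = BITS_PER_LONG - NR_PAGEFLAGS - ZONES_WIDTH -
		    LRU_GEN_WIDTH - SECTIONS_WIDTH - NODES_WIDTH -
		    KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH;

	printf("spare bits: %d\n", spare);			/* 24 with these values */
	printf("LRU_REFS_WIDTH: %d\n", MIN(__LRU_REFS_WIDTH, spare));	/* 2 */
	return 0;
}

On configurations where the spare-bit budget drops below 2, LRU_REFS_WIDTH shrinks accordingly and fewer tiers can be distinguished, with a floor of the two tiers encoded by PG_referenced and PG_workingset alone.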

mm/Kconfig

Lines changed: 11 additions & 0 deletions
@@ -1118,6 +1118,7 @@ config PTE_MARKER_UFFD_WP
 	  purposes. It is required to enable userfaultfd write protection on
 	  file-backed memory types like shmem and hugetlbfs.
 
+# multi-gen LRU {
 config LRU_GEN
 	bool "Multi-Gen LRU"
 	depends on MMU
@@ -1126,6 +1127,16 @@ config LRU_GEN
 	help
 	  A high performance LRU implementation to overcommit memory.
 
+config LRU_GEN_STATS
+	bool "Full stats for debugging"
+	depends on LRU_GEN
+	help
+	  Do not enable this option unless you plan to look at historical stats
+	  from evicted generations for debugging purpose.
+
+	  This option has a per-memcg and per-node memory overhead.
+# }
+
 source "mm/damon/Kconfig"
 
 endmenu

mm/swap.c

Lines changed: 39 additions & 0 deletions
@@ -428,6 +428,40 @@ static void __lru_cache_activate_folio(struct folio *folio)
 	local_unlock(&cpu_fbatches.lock);
 }
 
+#ifdef CONFIG_LRU_GEN
+static void folio_inc_refs(struct folio *folio)
+{
+	unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
+
+	if (folio_test_unevictable(folio))
+		return;
+
+	if (!folio_test_referenced(folio)) {
+		folio_set_referenced(folio);
+		return;
+	}
+
+	if (!folio_test_workingset(folio)) {
+		folio_set_workingset(folio);
+		return;
+	}
+
+	/* see the comment on MAX_NR_TIERS */
+	do {
+		new_flags = old_flags & LRU_REFS_MASK;
+		if (new_flags == LRU_REFS_MASK)
+			break;
+
+		new_flags += BIT(LRU_REFS_PGOFF);
+		new_flags |= old_flags & ~LRU_REFS_MASK;
+	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
+}
+#else
+static void folio_inc_refs(struct folio *folio)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 /*
  * Mark a page as having seen activity.
  *
@@ -440,6 +474,11 @@ static void __lru_cache_activate_folio(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	if (lru_gen_enabled()) {
+		folio_inc_refs(folio);
+		return;
+	}
+
 	if (!folio_test_referenced(folio)) {
 		folio_set_referenced(folio);
 	} else if (folio_test_unevictable(folio)) {
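To see the shape of the new folio_mark_accessed() fast path, here is a hedged userspace model of folio_inc_refs(): the first two accesses set stand-ins for PG_referenced and PG_workingset, and later accesses bump a saturating counter with a compare-and-swap loop, mirroring the try_cmpxchg() above. Bit positions, the 2-bit counter width, and the C11 atomics are simplifications for the example, not the kernel's layout, and the unevictable check is omitted.

/*
 * Userspace model of folio_inc_refs(). A 64-bit word stands in for
 * folio->flags; only the shape of the logic (referenced -> workingset ->
 * saturating compare-and-swap counter) mirrors the patch.
 */
#include <stdatomic.h>
#include <stdio.h>

#define PG_REFERENCED	(1UL << 0)
#define PG_WORKINGSET	(1UL << 1)
#define REFS_SHIFT	2
#define REFS_MASK	(3UL << REFS_SHIFT)	/* 2-bit saturating counter */

static void inc_refs(_Atomic unsigned long *flags)
{
	unsigned long new_flags, old_flags = atomic_load(flags);

	if (!(old_flags & PG_REFERENCED)) {
		atomic_fetch_or(flags, PG_REFERENCED);
		return;
	}
	if (!(old_flags & PG_WORKINGSET)) {
		atomic_fetch_or(flags, PG_WORKINGSET);
		return;
	}
	/* bump the counter unless it is already saturated */
	do {
		new_flags = old_flags & REFS_MASK;
		if (new_flags == REFS_MASK)
			break;
		new_flags += 1UL << REFS_SHIFT;
		new_flags |= old_flags & ~REFS_MASK;
	} while (!atomic_compare_exchange_weak(flags, &old_flags, new_flags));
}

int main(void)
{
	_Atomic unsigned long flags = 0;

	for (int access = 1; access <= 8; access++) {
		unsigned long f;

		inc_refs(&flags);
		f = atomic_load(&flags);
		printf("access %d: referenced=%d workingset=%d refs=%lu\n",
		       access, !!(f & PG_REFERENCED), !!(f & PG_WORKINGSET),
		       (f & REFS_MASK) >> REFS_SHIFT);
	}
	return 0;
}

Once the counter saturates at REFS_MASK, further accesses are no-ops, matching the early break in the kernel loop.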
