
Commit ec1c86b

yuzhaogoogle authored and akpm00 committed
mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in folio->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0.

There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These techniques are commonly used to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because:

1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads.

There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Brian Geffon <[email protected]>
Acked-by: Jan Alexander Steffens (heftig) <[email protected]>
Acked-by: Oleksandr Natalenko <[email protected]>
Acked-by: Steven Barrett <[email protected]>
Acked-by: Suleiman Souhlal <[email protected]>
Tested-by: Daniel Byrne <[email protected]>
Tested-by: Donald Carr <[email protected]>
Tested-by: Holger Hoffstätte <[email protected]>
Tested-by: Konstantin Kharlamov <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Tested-by: Sofia Trinh <[email protected]>
Tested-by: Vaibhav Jain <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michael Larabel <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
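A small, self-contained C sketch of the gen counter scheme described in the commit message may help; it is not kernel code. The bit offset LRU_GEN_PGOFF below is a made-up placeholder (the real offset is computed in include/linux/page-flags-layout.h and depends on the kernel configuration), and encode_gen()/decode_gen() are hypothetical helper names used only for this illustration.

/*
 * Standalone sketch, not kernel code: it models how a monotonically
 * increasing generation number is truncated to an index into
 * lrugen->lists[] and stored as gen+1 in the gen counter, so that a
 * counter of 0 means "not on any multi-gen LRU list".
 */
#include <stdio.h>

#define MAX_NR_GENS     4UL
#define LRU_GEN_WIDTH   3       /* order_base_2(MAX_NR_GENS + 1) */
#define LRU_GEN_PGOFF   16      /* placeholder bit offset, not the real one */
#define LRU_GEN_MASK    (((1UL << LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)

static unsigned long lru_gen_from_seq(unsigned long seq)
{
        return seq % MAX_NR_GENS;       /* index into lrugen->lists[] */
}

/* hypothetical helper: store gen+1 in the counter bits of flags */
static unsigned long encode_gen(unsigned long flags, unsigned long seq)
{
        unsigned long gen = lru_gen_from_seq(seq);

        flags &= ~LRU_GEN_MASK;
        return flags | ((gen + 1) << LRU_GEN_PGOFF);    /* 1..MAX_NR_GENS */
}

/* hypothetical helper: recover the index, or -1 if the page is off-list */
static long decode_gen(unsigned long flags)
{
        return (long)((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}

int main(void)
{
        unsigned long flags = 0;
        unsigned long seq;

        printf("counter 0 decodes to gen %ld (off-list)\n", decode_gen(flags));
        for (seq = 0; seq < 10; seq++) {
                flags = encode_gen(flags, seq);
                printf("seq %2lu -> list index %lu, counter decodes to %ld\n",
                       seq, lru_gen_from_seq(seq), decode_gen(flags));
        }
        return 0;
}

Storing gen+1 rather than gen is what lets an all-zero counter stand for "not on any lrugen list"; it is also why folio_lru_gen() in this patch subtracts 1 after shifting.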
1 parent aa1b679 commit ec1c86b

15 files changed (+424, -14 lines)


fs/fuse/dev.c

Lines changed: 2 additions & 1 deletion
@@ -776,7 +776,8 @@ static int fuse_check_page(struct page *page)
                1 << PG_active |
                1 << PG_workingset |
                1 << PG_reclaim |
-               1 << PG_waiters))) {
+               1 << PG_waiters |
+               LRU_GEN_MASK | LRU_REFS_MASK))) {
                 dump_page(page, "fuse: trying to steal weird page");
                 return 1;
         }

include/linux/mm_inline.h

Lines changed: 175 additions & 0 deletions
@@ -40,6 +40,9 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 {
         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+        lockdep_assert_held(&lruvec->lru_lock);
+        WARN_ON_ONCE(nr_pages != (int)nr_pages);
+
         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
         __mod_zone_page_state(&pgdat->node_zones[zid],
                               NR_ZONE_LRU_BASE + lru, nr_pages);
@@ -101,11 +104,177 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
         return lru;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+static inline bool lru_gen_enabled(void)
+{
+        return true;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+        return current->in_lru_fault;
+}
+
+static inline int lru_gen_from_seq(unsigned long seq)
+{
+        return seq % MAX_NR_GENS;
+}
+
+static inline int folio_lru_gen(struct folio *folio)
+{
+        unsigned long flags = READ_ONCE(folio->flags);
+
+        return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
+{
+        unsigned long max_seq = lruvec->lrugen.max_seq;
+
+        VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
+
+        /* see the comment on MIN_NR_GENS */
+        return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
+}
+
+static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *folio,
+                                       int old_gen, int new_gen)
+{
+        int type = folio_is_file_lru(folio);
+        int zone = folio_zonenum(folio);
+        int delta = folio_nr_pages(folio);
+        enum lru_list lru = type * LRU_INACTIVE_FILE;
+        struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+        VM_WARN_ON_ONCE(old_gen != -1 && old_gen >= MAX_NR_GENS);
+        VM_WARN_ON_ONCE(new_gen != -1 && new_gen >= MAX_NR_GENS);
+        VM_WARN_ON_ONCE(old_gen == -1 && new_gen == -1);
+
+        if (old_gen >= 0)
+                WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
+                           lrugen->nr_pages[old_gen][type][zone] - delta);
+        if (new_gen >= 0)
+                WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
+                           lrugen->nr_pages[new_gen][type][zone] + delta);
+
+        /* addition */
+        if (old_gen < 0) {
+                if (lru_gen_is_active(lruvec, new_gen))
+                        lru += LRU_ACTIVE;
+                __update_lru_size(lruvec, lru, zone, delta);
+                return;
+        }
+
+        /* deletion */
+        if (new_gen < 0) {
+                if (lru_gen_is_active(lruvec, old_gen))
+                        lru += LRU_ACTIVE;
+                __update_lru_size(lruvec, lru, zone, -delta);
+                return;
+        }
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+        unsigned long seq;
+        unsigned long flags;
+        int gen = folio_lru_gen(folio);
+        int type = folio_is_file_lru(folio);
+        int zone = folio_zonenum(folio);
+        struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+        VM_WARN_ON_ONCE_FOLIO(gen != -1, folio);
+
+        if (folio_test_unevictable(folio))
+                return false;
+        /*
+         * There are three common cases for this page:
+         * 1. If it's hot, e.g., freshly faulted in or previously hot and
+         *    migrated, add it to the youngest generation.
+         * 2. If it's cold but can't be evicted immediately, i.e., an anon page
+         *    not in swapcache or a dirty page pending writeback, add it to the
+         *    second oldest generation.
+         * 3. Everything else (clean, cold) is added to the oldest generation.
+         */
+        if (folio_test_active(folio))
+                seq = lrugen->max_seq;
+        else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
+                 (folio_test_reclaim(folio) &&
+                  (folio_test_dirty(folio) || folio_test_writeback(folio))))
+                seq = lrugen->min_seq[type] + 1;
+        else
+                seq = lrugen->min_seq[type];
+
+        gen = lru_gen_from_seq(seq);
+        flags = (gen + 1UL) << LRU_GEN_PGOFF;
+        /* see the comment on MIN_NR_GENS about PG_active */
+        set_mask_bits(&folio->flags, LRU_GEN_MASK | BIT(PG_active), flags);
+
+        lru_gen_update_size(lruvec, folio, -1, gen);
+        /* for folio_rotate_reclaimable() */
+        if (reclaiming)
+                list_add_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
+        else
+                list_add(&folio->lru, &lrugen->lists[gen][type][zone]);
+
+        return true;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+        unsigned long flags;
+        int gen = folio_lru_gen(folio);
+
+        if (gen < 0)
+                return false;
+
+        VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
+        VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
+
+        /* for folio_migrate_flags() */
+        flags = !reclaiming && lru_gen_is_active(lruvec, gen) ? BIT(PG_active) : 0;
+        flags = set_mask_bits(&folio->flags, LRU_GEN_MASK, flags);
+        gen = ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+
+        lru_gen_update_size(lruvec, folio, gen, -1);
+        list_del(&folio->lru);
+
+        return true;
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline bool lru_gen_enabled(void)
+{
+        return false;
+}
+
+static inline bool lru_gen_in_fault(void)
+{
+        return false;
+}
+
+static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+        return false;
+}
+
+static inline bool lru_gen_del_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+{
+        return false;
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 static __always_inline
 void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
 {
         enum lru_list lru = folio_lru_list(folio);
 
+        if (lru_gen_add_folio(lruvec, folio, false))
+                return;
+
         update_lru_size(lruvec, lru, folio_zonenum(folio),
                         folio_nr_pages(folio));
         if (lru != LRU_UNEVICTABLE)
@@ -123,6 +292,9 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
 {
         enum lru_list lru = folio_lru_list(folio);
 
+        if (lru_gen_add_folio(lruvec, folio, true))
+                return;
+
         update_lru_size(lruvec, lru, folio_zonenum(folio),
                         folio_nr_pages(folio));
         /* This is not expected to be used on LRU_UNEVICTABLE */
@@ -140,6 +312,9 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
 {
         enum lru_list lru = folio_lru_list(folio);
 
+        if (lru_gen_del_folio(lruvec, folio, false))
+                return;
+
         if (lru != LRU_UNEVICTABLE)
                 list_del(&folio->lru);
         update_lru_size(lruvec, lru, folio_zonenum(folio),

include/linux/mmzone.h

Lines changed: 102 additions & 0 deletions
@@ -317,6 +317,102 @@ enum lruvec_flags {
          */
 };
 
+#endif /* !__GENERATING_BOUNDS_H */
+
+/*
+ * Evictable pages are divided into multiple generations. The youngest and the
+ * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
+ * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
+ * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the
+ * corresponding generation. The gen counter in folio->flags stores gen+1 while
+ * a page is on one of lrugen->lists[]. Otherwise it stores 0.
+ *
+ * A page is added to the youngest generation on faulting. The aging needs to
+ * check the accessed bit at least twice before handing this page over to the
+ * eviction. The first check takes care of the accessed bit set on the initial
+ * fault; the second check makes sure this page hasn't been used since then.
+ * This process, AKA second chance, requires a minimum of two generations,
+ * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
+ * LRU, e.g., /proc/vmstat, these two generations are considered active; the
+ * rest of generations, if they exist, are considered inactive. See
+ * lru_gen_is_active().
+ *
+ * PG_active is always cleared while a page is on one of lrugen->lists[] so that
+ * the aging needs not to worry about it. And it's set again when a page
+ * considered active is isolated for non-reclaiming purposes, e.g., migration.
+ * See lru_gen_add_folio() and lru_gen_del_folio().
+ *
+ * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the
+ * number of categories of the active/inactive LRU when keeping track of
+ * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits
+ * in folio->flags.
+ */
+#define MIN_NR_GENS             2U
+#define MAX_NR_GENS             4U
+
+#ifndef __GENERATING_BOUNDS_H
+
+struct lruvec;
+
+#define LRU_GEN_MASK            ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
+#define LRU_REFS_MASK           ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
+
+#ifdef CONFIG_LRU_GEN
+
+enum {
+        LRU_GEN_ANON,
+        LRU_GEN_FILE,
+};
+
+/*
+ * The youngest generation number is stored in max_seq for both anon and file
+ * types as they are aged on an equal footing. The oldest generation numbers are
+ * stored in min_seq[] separately for anon and file types as clean file pages
+ * can be evicted regardless of swap constraints.
+ *
+ * Normally anon and file min_seq are in sync. But if swapping is constrained,
+ * e.g., out of swap space, file min_seq is allowed to advance and leave anon
+ * min_seq behind.
+ *
+ * The number of pages in each generation is eventually consistent and therefore
+ * can be transiently negative.
+ */
+struct lru_gen_struct {
+        /* the aging increments the youngest generation number */
+        unsigned long max_seq;
+        /* the eviction increments the oldest generation numbers */
+        unsigned long min_seq[ANON_AND_FILE];
+        /* the multi-gen LRU lists, lazily sorted on eviction */
+        struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+        /* the multi-gen LRU sizes, eventually consistent */
+        long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+};
+
+void lru_gen_init_lruvec(struct lruvec *lruvec);
+
+#ifdef CONFIG_MEMCG
+void lru_gen_init_memcg(struct mem_cgroup *memcg);
+void lru_gen_exit_memcg(struct mem_cgroup *memcg);
+#endif
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
+{
+}
+#endif
+
+#endif /* CONFIG_LRU_GEN */
+
 struct lruvec {
         struct list_head lists[NR_LRU_LISTS];
         /* per lruvec lru_lock for memcg */
@@ -334,6 +430,10 @@ struct lruvec {
         unsigned long refaults[ANON_AND_FILE];
         /* Various lruvec state flags (enum lruvec_flags) */
         unsigned long flags;
+#ifdef CONFIG_LRU_GEN
+        /* evictable pages divided into generations */
+        struct lru_gen_struct lrugen;
+#endif
 #ifdef CONFIG_MEMCG
         struct pglist_data *pgdat;
 #endif
@@ -749,6 +849,8 @@ static inline bool zone_is_empty(struct zone *zone)
 #define ZONES_PGOFF             (NODES_PGOFF - ZONES_WIDTH)
 #define LAST_CPUPID_PGOFF       (ZONES_PGOFF - LAST_CPUPID_WIDTH)
 #define KASAN_TAG_PGOFF         (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
+#define LRU_GEN_PGOFF           (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
+#define LRU_REFS_PGOFF          (LRU_GEN_PGOFF - LRU_REFS_WIDTH)
 
 /*
  * Define the bit shifts to access each section. For non-existent
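The comment block added to include/linux/mmzone.h above describes max_seq and min_seq as a sliding window of between MIN_NR_GENS and MAX_NR_GENS generations. The toy model below is not kernel code and deliberately tracks a single min_seq; it sketches only the window invariants under that simplifying assumption: the aging may add a generation only while fewer than MAX_NR_GENS exist, and the eviction may retire one only while more than MIN_NR_GENS remain.

/* Toy model of the max_seq/min_seq sliding window; not kernel code. */
#include <stdio.h>

#define MIN_NR_GENS     2UL
#define MAX_NR_GENS     4UL

struct toy_lrugen {
        unsigned long max_seq;  /* youngest generation number */
        unsigned long min_seq;  /* oldest generation number (single type only) */
};

static unsigned long nr_gens(const struct toy_lrugen *g)
{
        return g->max_seq - g->min_seq + 1;
}

/* the aging produces a new youngest generation, if the window allows it */
static int try_age(struct toy_lrugen *g)
{
        if (nr_gens(g) >= MAX_NR_GENS)
                return 0;
        g->max_seq++;
        return 1;
}

/* the eviction retires the oldest generation, if the window allows it */
static int try_evict(struct toy_lrugen *g)
{
        if (nr_gens(g) <= MIN_NR_GENS)
                return 0;       /* keep two generations for the second chance */
        g->min_seq++;
        return 1;
}

int main(void)
{
        struct toy_lrugen g = { .max_seq = MIN_NR_GENS - 1, .min_seq = 0 };

        printf("evict with %lu gens: %s\n", nr_gens(&g),
               try_evict(&g) ? "ok" : "refused");

        while (try_age(&g))     /* grow the window up to MAX_NR_GENS */
                ;
        printf("after aging: [%lu, %lu], %lu generations\n",
               g.min_seq, g.max_seq, nr_gens(&g));

        while (try_evict(&g))   /* shrink it back down to MIN_NR_GENS */
                ;
        printf("after eviction: [%lu, %lu], %lu generations\n",
               g.min_seq, g.max_seq, nr_gens(&g));
        return 0;
}

In the real struct lru_gen_struct, min_seq is kept per type (anon and file), and the decisions to age or evict also depend on generation sizes and, in later patches, refault feedback; the model above captures only the bounds on the window.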

include/linux/page-flags-layout.h

Lines changed: 8 additions & 5 deletions
@@ -55,7 +55,8 @@
 #define SECTIONS_WIDTH 0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
+        <= BITS_PER_LONG - NR_PAGEFLAGS
 #define NODES_WIDTH NODES_SHIFT
 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
 #error "Vmemmap: No space for nodes field in page flags"
@@ -89,8 +90,8 @@
 #define LAST_CPUPID_SHIFT 0
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
-        <= BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+        KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
 #define LAST_CPUPID_WIDTH 0
@@ -100,10 +101,12 @@
 #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
-#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
-        > BITS_PER_LONG - NR_PAGEFLAGS
+#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
+        KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
 #error "Not enough bits in page flags"
 #endif
 
+#define LRU_REFS_WIDTH 0
+
 #endif
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */

include/linux/page-flags.h

Lines changed: 2 additions & 2 deletions
@@ -1058,7 +1058,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
          1UL << PG_private | 1UL << PG_private_2 | \
          1UL << PG_writeback | 1UL << PG_reserved | \
          1UL << PG_slab | 1UL << PG_active | \
-         1UL << PG_unevictable | __PG_MLOCKED)
+         1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
@@ -1069,7 +1069,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP \
-        (PAGEFLAGS_MASK & ~__PG_HWPOISON)
+        ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
 
 #define PAGE_FLAGS_PRIVATE \
         (1UL << PG_private | 1UL << PG_private_2)
