
Commit e7d3248

Muchun Song authored and torvalds committed
mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page
Patch series "Free the 2nd vmemmap page associated with each HugeTLB page", v7.

This series can minimize the overhead of struct page for 2MB HugeTLB
pages significantly. It further reduces the overhead of struct page by
12.5% for a 2MB HugeTLB compared to the previous approach, which means
2GB per 1TB HugeTLB. It is a nice gain. Comments and reviews are
welcome. Thanks.

The main implementation and details can be found in the commit log of
patch 1. In this series, I have changed the following four helpers; the
table below shows the impact on the overhead of those helpers.

	+------------------+-----------------------+
	|       APIs       | head page | tail page |
	+------------------+-----------+-----------+
	|    PageHead()    |     Y     |     N     |
	+------------------+-----------+-----------+
	|    PageTail()    |     Y     |     N     |
	+------------------+-----------+-----------+
	|  PageCompound()  |     N     |     N     |
	+------------------+-----------+-----------+
	|  compound_head() |     Y     |     N     |
	+------------------+-----------+-----------+

	Y: Overhead is increased.
	N: Overhead is _NOT_ increased.

It shows that the overhead of those helpers on a tail page does not
change between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off",
but the overhead on a head page is increased when
"hugetlb_free_vmemmap=on" (except PageCompound()). So I believe that
Matthew Wilcox's folio series will help with this.

The users of PageHead() and PageTail() are much fewer than those of
compound_head(), and most users of PageTail() are VM_BUG_ON(), so I have
done some tests on the overhead of compound_head() on head pages.

I have tested the overhead of calling compound_head() on a head page,
which is 2.11ns (measured by calling compound_head() 10 million times
and averaging). For a head page whose address is not aligned with
PAGE_SIZE, or a non-compound page, the overhead of compound_head() is
2.54ns, an increase of 20%. For a head page whose address is aligned
with PAGE_SIZE, the overhead of compound_head() is 2.97ns, an increase
of 40%. Most pages are the former. I do not think the overhead is
significant since the overhead of compound_head() itself is low.

This patch (of 5):

This patch minimizes the overhead of struct page for 2MB HugeTLB pages
significantly. It further reduces the overhead of struct page by 12.5%
for a 2MB HugeTLB compared to the previous approach, which means 2GB per
1TB HugeTLB (2MB type).

After the feature of "Free some vmemmap pages of HugeTLB page" is
enabled, the mapping of the vmemmap addresses associated with a 2MB
HugeTLB page becomes the figure below.

     HugeTLB                  struct pages(8 pages)         page frame(8 pages)
  +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
  |           |                     |     0     | -------------> |     0     |
  |           |                     +-----------+                +-----------+
  |           |                     |     1     | -------------> |     1     |
  |           |                     +-----------+                +-----------+
  |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
  |           |                     +-----------+                   | | | | |
  |           |                     |     3     | ------------------+ | | | |
  |           |                     +-----------+                     | | | |
  |           |                     |     4     | --------------------+ | | |
  |    2MB    |                     +-----------+                       | | |
  |           |                     |     5     | ----------------------+ | |
  |           |                     +-----------+                         | |
  |           |                     |     6     | ------------------------+ |
  |           |                     +-----------+                           |
  |           |                     |     7     | --------------------------+
  |           |                     +-----------+
  |           |
  |           |
  |           |
  +-----------+

As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and
remapped. However, the 2nd vmemmap page frame can also be freed to the
buddy allocator, so we can change the mapping from the figure above to
the figure below.

     HugeTLB                  struct pages(8 pages)         page frame(8 pages)
  +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
  |           |                     |     0     | -------------> |     0     |
  |           |                     +-----------+                +-----------+
  |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
  |           |                     +-----------+                  | | | | | |
  |           |                     |     2     | -----------------+ | | | | |
  |           |                     +-----------+                    | | | | |
  |           |                     |     3     | -------------------+ | | | |
  |           |                     +-----------+                      | | | |
  |           |                     |     4     | ---------------------+ | | |
  |    2MB    |                     +-----------+                        | | |
  |           |                     |     5     | -----------------------+ | |
  |           |                     +-----------+                          | |
  |           |                     |     6     | -------------------------+ |
  |           |                     +-----------+                            |
  |           |                     |     7     | ---------------------------+
  |           |                     +-----------+
  |           |
  |           |
  |           |
  +-----------+

After we do this, all tail vmemmap pages (1-7) are mapped to the head
vmemmap page frame (0). In other words, there is more than one page
struct with PG_head associated with each HugeTLB page. We __know__ that
there is only one real head page struct; the tail page structs with
PG_head are fake head page structs. We need an approach to distinguish
between those two different types of page structs so that
compound_head(), PageHead() and PageTail() can work properly when the
parameter is a tail page struct that has PG_head set.

The following code snippet describes how to distinguish between real and
fake head page structs.

	if (test_bit(PG_head, &page->flags)) {
		unsigned long head = READ_ONCE(page[1].compound_head);

		if (head & 1) {
			if (head == (unsigned long)page + 1)
				==> head page struct
			else
				==> tail page struct
		} else
			==> head page struct
	}

We can safely access the field of the @page[1] with PG_head because the
@page is a compound page composed of at least two contiguous pages.

[[email protected]: restore lost comment changes]

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Chen Huang <[email protected]>
Cc: Bodeddula Balasubramaniam <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Cc: Fam Zheng <[email protected]>
Cc: Qi Zheng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
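The fake-head check quoted in the changelog is compact, so here is a small userspace sketch of the same decision procedure. Everything in it is illustrative: struct page is a two-field stand-in (bit 0 of flags plays the role of PG_head), the array models a handful of vmemmap slots rather than a real vmemmap, and resolve_head() mirrors the snippet above, not the kernel's actual compound_head() implementation.

	#include <stdio.h>

	struct page {
		unsigned long flags;		/* bit 0 stands in for PG_head */
		unsigned long compound_head;	/* on tail pages: (head address) | 1 */
	};

	/*
	 * Mirror of the snippet above: if a struct page shows PG_head, look at
	 * the struct page after it; if that one carries a tail pointer, decoding
	 * it yields the real head (which is the page itself when the page really
	 * is the head).
	 */
	static struct page *resolve_head(struct page *page)
	{
		if (page->flags & 1) {
			unsigned long head = page[1].compound_head;

			if (head & 1)
				return (struct page *)(head - 1);
		}
		return page;
	}

	int main(void)
	{
		struct page vmemmap[4] = { { 0 } };

		/* Slot 0: the real head.  Slot 1: its first tail. */
		vmemmap[0].flags = 1;
		vmemmap[1].compound_head = (unsigned long)&vmemmap[0] | 1;

		/*
		 * Slot 2: a "fake head".  After the remapping shown above, a
		 * page-aligned tail struct page reads back the head frame's bytes,
		 * so it shows PG_head and is followed (slot 3) by bytes that look
		 * like the head's first tail.
		 */
		vmemmap[2].flags = 1;
		vmemmap[3].compound_head = (unsigned long)&vmemmap[0] | 1;

		printf("real head resolves to itself:    %d\n",
		       resolve_head(&vmemmap[0]) == &vmemmap[0]);
		printf("fake head resolves to real head: %d\n",
		       resolve_head(&vmemmap[2]) == &vmemmap[0]);
		return 0;
	}

Built with any C compiler, this should print 1 for both checks: a real head resolves to itself and a fake head resolves to the real head.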
1 parent 5c2a956 commit e7d3248

4 files changed, 130 insertions(+), 33 deletions(-)

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 1 addition & 1 deletion
@@ -1625,7 +1625,7 @@
 			[KNL] Requires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 			enabled.
 			Allows heavy hugetlb users to free up some more
-			memory (6 * PAGE_SIZE for each 2MB hugetlb page).
+			memory (7 * PAGE_SIZE for each 2MB hugetlb page).
 			Format: { on | off (default) }
 
 			on: enable the feature

include/linux/page-flags.h

Lines changed: 74 additions & 4 deletions
@@ -190,13 +190,69 @@ enum pageflags {
 
 #ifndef __GENERATING_BOUNDS_H
 
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+extern bool hugetlb_free_vmemmap_enabled;
+
+/*
+ * If the feature of freeing some vmemmap pages associated with each HugeTLB
+ * page is enabled, the head vmemmap page frame is reused and all of the tail
+ * vmemmap addresses map to the head vmemmap page frame (further details can
+ * refer to the figure at the head of the mm/hugetlb_vmemmap.c).  In other
+ * words, there are more than one page struct with PG_head associated with each
+ * HugeTLB page.  We __know__ that there is only one head page struct, the tail
+ * page structs with PG_head are fake head page structs.  We need an approach
+ * to distinguish between those two different types of page structs so that
+ * compound_head() can return the real head page struct when the parameter is
+ * the tail page struct but with PG_head.
+ *
+ * The page_fixed_fake_head() returns the real head page struct if the @page is
+ * fake page head, otherwise, returns @page which can either be a true page
+ * head or tail.
+ */
+static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
+{
+	if (!hugetlb_free_vmemmap_enabled)
+		return page;
+
+	/*
+	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
+	 * struct page.  The alignment check aims to avoid access the fields (
+	 * e.g. compound_head) of the @page[1].  It can avoid touch a (possibly)
+	 * cold cacheline in some cases.
+	 */
+	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
+	    test_bit(PG_head, &page->flags)) {
+		/*
+		 * We can safely access the field of the @page[1] with PG_head
+		 * because the @page is a compound page composed with at least
+		 * two contiguous pages.
+		 */
+		unsigned long head = READ_ONCE(page[1].compound_head);
+
+		if (likely(head & 1))
+			return (const struct page *)(head - 1);
+	}
+	return page;
+}
+#else
+static inline const struct page *page_fixed_fake_head(const struct page *page)
+{
+	return page;
+}
+#endif
+
+static __always_inline int page_is_fake_head(struct page *page)
+{
+	return page_fixed_fake_head(page) != page;
+}
+
 static inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long head = READ_ONCE(page->compound_head);
 
 	if (unlikely(head & 1))
 		return head - 1;
-	return (unsigned long)page;
+	return (unsigned long)page_fixed_fake_head(page);
 }
 
 #define compound_head(page)	((typeof(page))_compound_head(page))
@@ -231,12 +287,13 @@ static inline unsigned long _compound_head(const struct page *page)
 
 static __always_inline int PageTail(struct page *page)
 {
-	return READ_ONCE(page->compound_head) & 1;
+	return READ_ONCE(page->compound_head) & 1 || page_is_fake_head(page);
 }
 
 static __always_inline int PageCompound(struct page *page)
 {
-	return test_bit(PG_head, &page->flags) || PageTail(page);
+	return test_bit(PG_head, &page->flags) ||
+	       READ_ONCE(page->compound_head) & 1;
 }
 
 #define PAGE_POISON_PATTERN	-1l
@@ -695,7 +752,20 @@ static inline bool test_set_page_writeback(struct page *page)
 	return set_page_writeback(page);
 }
 
-__PAGEFLAG(Head, head, PF_ANY) CLEARPAGEFLAG(Head, head, PF_ANY)
+static __always_inline bool folio_test_head(struct folio *folio)
+{
+	return test_bit(PG_head, folio_flags(folio, FOLIO_PF_ANY));
+}
+
+static __always_inline int PageHead(struct page *page)
+{
+	PF_POISONED_CHECK(page);
+	return test_bit(PG_head, &page->flags) && !page_is_fake_head(page);
+}
+
+__SETPAGEFLAG(Head, head, PF_ANY)
+__CLEARPAGEFLAG(Head, head, PF_ANY)
+CLEARPAGEFLAG(Head, head, PF_ANY)
 
 /**
  * folio_test_large() - Does this folio contain more than one page?
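The changelog measures the cost of the reworked compound_head() by calling it 10 million times and averaging. Below is a rough userspace re-creation of that methodology, using the same mock struct page as the sketch after the changelog; the absolute numbers will not match the 2.11-2.97 ns figures quoted there, but it shows how such a per-call average can be taken.

	#include <stdio.h>
	#include <time.h>

	struct page {
		unsigned long flags;		/* bit 0 stands in for PG_head */
		unsigned long compound_head;	/* on tail pages: (head address) | 1 */
	};

	static const struct page *fixed_fake_head(const struct page *page)
	{
		if (page->flags & 1) {
			unsigned long head = page[1].compound_head;

			if (head & 1)
				return (const struct page *)(head - 1);
		}
		return page;
	}

	int main(void)
	{
		enum { ITERS = 10 * 1000 * 1000 };	/* same count as the changelog */
		struct page vmemmap[2] = { { 0 } };
		const struct page *volatile sink;	/* keeps the loop from being folded away */
		struct timespec t0, t1;
		long i;

		vmemmap[0].flags = 1;			/* a real head page */
		vmemmap[1].compound_head = (unsigned long)&vmemmap[0] | 1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < ITERS; i++)
			sink = fixed_fake_head(&vmemmap[0]);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		(void)sink;

		printf("avg %.2f ns per call\n",
		       ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS);
		return 0;
	}

Compile with optimizations (e.g. -O2); the volatile sink prevents the compiler from deleting the measured loop.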

mm/hugetlb_vmemmap.c

Lines changed: 34 additions & 28 deletions
@@ -124,40 +124,40 @@
  * page of page structs (page 0) associated with the HugeTLB page contains the 4
  * page structs necessary to describe the HugeTLB. The only use of the remaining
  * pages of page structs (page 1 to page 7) is to point to page->compound_head.
- * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
+ * Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of page structs
  * will be used for each HugeTLB page. This will allow us to free the remaining
- * 6 pages to the buddy allocator.
+ * 7 pages to the buddy allocator.
  *
  * Here is how things look after remapping.
  *
  *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
  * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
  * |           |                     |     0     | -------------> |     0     |
  * |           |                     +-----------+                +-----------+
- * |           |                     |     1     | -------------> |     1     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
- * |           |                     +-----------+                   | | | | |
- * |           |                     |     3     | ------------------+ | | | |
- * |           |                     +-----------+                     | | | |
- * |           |                     |     4     | --------------------+ | | |
- * |    PMD    |                     +-----------+                       | | |
- * |   level   |                     |     5     | ----------------------+ | |
- * |  mapping  |                     +-----------+                         | |
- * |           |                     |     6     | ------------------------+ |
- * |           |                     +-----------+                           |
- * |           |                     |     7     | --------------------------+
+ * |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
+ * |           |                     +-----------+                  | | | | | |
+ * |           |                     |     2     | -----------------+ | | | | |
+ * |           |                     +-----------+                    | | | | |
+ * |           |                     |     3     | -------------------+ | | | |
+ * |           |                     +-----------+                      | | | |
+ * |           |                     |     4     | ---------------------+ | | |
+ * |    PMD    |                     +-----------+                        | | |
+ * |   level   |                     |     5     | -----------------------+ | |
+ * |  mapping  |                     +-----------+                          | |
+ * |           |                     |     6     | -------------------------+ |
+ * |           |                     +-----------+                            |
+ * |           |                     |     7     | ---------------------------+
  * |           |                     +-----------+
  * |           |
  * |           |
  * |           |
  * +-----------+
  *
- * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
+ * When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
  * vmemmap pages and restore the previous mapping relationship.
  *
  * For the HugeTLB page of the pud level mapping. It is similar to the former.
- * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
+ * We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages.
  *
  * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
  * (e.g. aarch64) provides a contiguous bit in the translation table entries
@@ -166,7 +166,13 @@
  *
  * The contiguous bit is used to increase the mapping size at the pmd and pte
  * (last) level. So this type of HugeTLB page can be optimized only when its
- * size of the struct page structs is greater than 2 pages.
+ * size of the struct page structs is greater than 1 page.
+ *
+ * Notice: The head vmemmap page is not freed to the buddy allocator and all
+ * tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
+ * more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
+ * associated with each HugeTLB page. The compound_head() can handle this
+ * correctly (more details refer to the comment above compound_head()).
  */
 #define pr_fmt(fmt)	"HugeTLB: " fmt
 
@@ -175,19 +181,21 @@
 /*
  * There are a lot of struct page structures associated with each HugeTLB page.
  * For tail pages, the value of compound_head is the same. So we can reuse first
- * page of tail page structures. We map the virtual addresses of the remaining
- * pages of tail page structures to the first tail page struct, and then free
- * these page frames. Therefore, we need to reserve two pages as vmemmap areas.
+ * page of head page structures. We map the virtual addresses of all the pages
+ * of tail page structures to the head page struct, and then free these page
+ * frames. Therefore, we need to reserve one pages as vmemmap areas.
  */
-#define RESERVE_VMEMMAP_NR		2U
+#define RESERVE_VMEMMAP_NR		1U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 
-bool hugetlb_free_vmemmap_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
+bool hugetlb_free_vmemmap_enabled __read_mostly =
+	IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
+EXPORT_SYMBOL(hugetlb_free_vmemmap_enabled);
 
 static int __init early_hugetlb_free_vmemmap_param(char *buf)
 {
 	/* We cannot optimize if a "struct page" crosses page boundaries. */
-	if ((!is_power_of_2(sizeof(struct page)))) {
+	if (!is_power_of_2(sizeof(struct page))) {
 		pr_warn("cannot free vmemmap pages because \"struct page\" crosses page boundaries\n");
 		return 0;
 	}
@@ -236,7 +244,6 @@ int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
 	 */
 	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
 				  GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
-
 	if (!ret)
 		ClearHPageVmemmapOptimized(head);
 
@@ -282,9 +289,8 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
 
 	vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
 	/*
-	 * The head page and the first tail page are not to be freed to buddy
-	 * allocator, the other pages will map to the first tail page, so they
-	 * can be freed.
+	 * The head page is not to be freed to buddy allocator, the other tail
+	 * pages will map to the head page, so they can be freed.
 	 *
 	 * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true
 	 * on some architectures (e.g. aarch64). See Documentation/arm64/
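To connect the RESERVE_VMEMMAP_NR change above to the numbers quoted in the changelog, here is a back-of-the-envelope calculation as a standalone C program. It assumes 4 KB base pages and sizeof(struct page) == 64, which are typical x86_64 values but are configuration dependent.

	#include <stdio.h>

	#define PAGE_SHIFT		12			/* 4 KB base pages */
	#define PAGE_SIZE		(1UL << PAGE_SHIFT)
	#define RESERVE_VMEMMAP_NR	1UL			/* this patch; it was 2 before */
	#define STRUCT_PAGE_SIZE	64UL			/* assumed, configuration dependent */

	int main(void)
	{
		unsigned long nr_pages = (2UL << 20) / PAGE_SIZE;	/* 512 base pages per 2MB HugeTLB */
		unsigned long vmemmap_pages = (nr_pages * STRUCT_PAGE_SIZE) >> PAGE_SHIFT;
		unsigned long freed = vmemmap_pages - RESERVE_VMEMMAP_NR;
		unsigned long hugepages_per_tb = (1UL << 40) / (2UL << 20);

		printf("vmemmap pages per 2MB HugeTLB:  %lu\n", vmemmap_pages);	/* 8 */
		printf("pages freed to buddy allocator: %lu\n", freed);		/* 7, was 6 */
		printf("freed per 1TB of HugeTLB:       %lu MB\n",
		       (hugepages_per_tb * freed * PAGE_SIZE) >> 20);		/* 14336 MB */
		return 0;
	}

With those assumptions it prints 8 vmemmap pages per 2 MB HugeTLB page, 7 of them freed (one more than the previous scheme), and 14336 MB returned per 1 TB of HugeTLB, 2 GB more than before, which matches the changelog's claim.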

mm/sparse-vmemmap.c

Lines changed: 21 additions & 0 deletions
@@ -245,6 +245,26 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	set_pte_at(&init_mm, addr, pte, entry);
 }
 
+/*
+ * How many struct page structs need to be reset. When we reuse the head
+ * struct page, the special metadata (e.g. page->flags or page->mapping)
+ * cannot copy to the tail struct page structs. The invalid value will be
+ * checked in the free_tail_pages_check(). In order to avoid the message
+ * of "corrupted mapping in tail page". We need to reset at least 3 (one
+ * head struct page struct and two tail struct page structs) struct page
+ * structs.
+ */
+#define NR_RESET_STRUCT_PAGE	3
+
+static inline void reset_struct_pages(struct page *start)
+{
+	int i;
+	struct page *from = start + NR_RESET_STRUCT_PAGE;
+
+	for (i = 0; i < NR_RESET_STRUCT_PAGE; i++)
+		memcpy(start + i, from, sizeof(*from));
+}
+
 static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 				struct vmemmap_remap_walk *walk)
 {
@@ -258,6 +278,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 	list_del(&page->lru);
 	to = page_to_virt(page);
 	copy_page(to, (void *)walk->reuse_addr);
+	reset_struct_pages(to);
 
 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
 }
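reset_struct_pages() is easiest to see with concrete values, so the sketch below replays its effect in userspace: after copy_page() fills a restored tail frame with the head frame's bytes, the first NR_RESET_STRUCT_PAGE slots still carry head-style metadata, and they are overwritten with a copy of slot 3, a plain tail. The struct page stand-in and the literal values are illustrative only, not the kernel layout.

	#include <stdio.h>
	#include <string.h>

	#define NR_RESET_STRUCT_PAGE	3

	struct page {			/* stand-in with just enough fields to show the idea */
		unsigned long flags;
		unsigned long compound_head;
		void *mapping;
	};

	/* Same loop as the kernel helper above: overwrite the first
	 * NR_RESET_STRUCT_PAGE slots with the contents of slot 3, a plain tail. */
	static void reset_struct_pages(struct page *start)
	{
		int i;
		struct page *from = start + NR_RESET_STRUCT_PAGE;

		for (i = 0; i < NR_RESET_STRUCT_PAGE; i++)
			memcpy(start + i, from, sizeof(*from));
	}

	int main(void)
	{
		struct page frame[8] = { { 0 } };
		int i;

		/* Pretend this is a restored tail frame right after copy_page():
		 * slot 0 still mirrors the head (stale PG_head and a head-only
		 * mapping), which free_tail_pages_check() would later reject on
		 * a tail page; the later slots already look like plain tails. */
		frame[0].flags = 1;
		frame[0].mapping = (void *)0xdead;
		for (i = 3; i < 8; i++)
			frame[i].compound_head = 0x1001;	/* illustrative (head | 1) value */

		reset_struct_pages(frame);

		printf("slot 0 flags:         %lu\n", frame[0].flags);		/* 0 */
		printf("slot 0 mapping:       %p\n", frame[0].mapping);		/* NULL */
		printf("slot 0 compound_head: %#lx\n", frame[0].compound_head);	/* 0x1001 */
		return 0;
	}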
