Commit 2ec74c3

Sagi Grimberg authored and torvalds committed
mm: move all mmu notifier invocations to be done outside the PT lock
In order to allow sleeping during mmu notifier calls, we need to avoid invoking them under the page table spinlock. This patch solves the problem by calling the invalidate_page notification after releasing the lock (but before freeing the page itself), or by wrapping the page invalidation with calls to invalidate_range_start and invalidate_range_end.

To prevent accidental changes to the invalidate_range_end arguments after the call to invalidate_range_start, the patch introduces a convention of saving the arguments in consistently named locals:

	unsigned long mmun_start;	/* For mmu_notifiers */
	unsigned long mmun_end;		/* For mmu_notifiers */

	...

	mmun_start = ...
	mmun_end = ...
	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

	...

	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

The patch changes the code to use this convention for all calls to mmu_notifier_invalidate_range_start/end, except those where the calls are close enough together that anyone who glances at the code can see the values aren't changing.

This patchset is a preliminary step towards the on-demand paging design to be added to the RDMA stack.

Why do we want on-demand paging for InfiniBand?

Applications register memory with an RDMA adapter using system calls, and subsequently post IO operations that refer to the corresponding virtual addresses directly to HW. Until now, this was achieved by pinning the memory during the registration calls. The goal of on-demand paging is to avoid pinning the pages of registered memory regions (MRs). This will allow users the same flexibility they get when swapping any other part of their processes' address spaces. Instead of requiring the entire MR to fit in physical memory, we can allow the MR to be larger, and only fit the current working set in physical memory.

Why should anyone care? What problems are users currently experiencing?

This can make programming with RDMA much simpler. Today, developers that are working with more data than their RAM can hold need either to deregister and reregister memory regions throughout their process's life, or to keep a single memory region and copy the data to it. On-demand paging will allow these developers to register a single MR at the beginning of their process's life, and let the operating system manage which pages need to be fetched at a given time. In the future, we might be able to provide a single memory access key for each process that would provide the entire process's address space as one large memory region, and the developers wouldn't need to register memory regions at all.

Is there any prospect that any other subsystems will utilise these infrastructural changes? If so, which and how, etc?

As for other subsystems, I understand that XPMEM wanted to sleep in MMU notifiers, as Christoph Lameter wrote at http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and perhaps Andrea knows about other use cases.

Being able to schedule inside mmu notifier handlers is required because we need to synchronize the hardware with changes to the secondary page tables. A TLB flush of an IO device is inherently slower than a CPU TLB flush, so our design works by sending the invalidation request to the device, and waiting for an interrupt before exiting the mmu notifier handler.

Avi said:

  kvm may be a buyer. kvm::mmu_lock, which serializes guest page faults, also protects long operations such as destroying large ranges. It would be good to convert it into a mutex, but as it is used inside mmu notifiers, this cannot be done.
  (There are alternatives, such as keeping the spinlock and using a generation counter to do the teardown in O(1), which is what the "may" is doing up there.)

[[email protected]: speed tweak in hugetlb_cow(), cleanups]
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Haggai Eran <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Xiao Guangrong <[email protected]>
Cc: Or Gerlitz <[email protected]>
Cc: Haggai Eran <[email protected]>
Cc: Shachar Raindel <[email protected]>
Cc: Liran Liss <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
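For readers skimming the diffs below, the first technique described above (notify after dropping the page table lock, but before freeing the page) boils down to the ordering sketched here. This is an illustrative, schematic snippet only, not code from this commit; example_zap_pte() and its parameters are invented names:

	#include <linux/mm.h>
	#include <linux/rmap.h>
	#include <linux/pagemap.h>
	#include <linux/mmu_notifier.h>

	/* Illustrative only: the caller is assumed to have looked up the PTE
	 * with pte_offset_map_lock() and to hold a reference on page. */
	static void example_zap_pte(struct vm_area_struct *vma,
				    unsigned long address, pte_t *pte,
				    spinlock_t *ptl, struct page *page)
	{
		struct mm_struct *mm = vma->vm_mm;

		/* CPU side: clear the PTE and flush the CPU TLB under the PT lock */
		ptep_clear_flush(vma, address, pte);
		page_remove_rmap(page);
		pte_unmap_unlock(pte, ptl);

		/* Secondary-MMU side: the notifier may now sleep (e.g. wait for
		 * a device TLB flush), but it must still run before the page
		 * itself is freed. */
		mmu_notifier_invalidate_page(mm, address);
		page_cache_release(page);
	}

The second technique simply brackets the whole page table update with mmu_notifier_invalidate_range_start()/_end(), using the saved mmun_start/mmun_end locals shown in the convention above.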
1 parent 36e4f20 commit 2ec74c3

7 files changed: +92 -76 lines changed


include/linux/mmu_notifier.h

Lines changed: 0 additions & 47 deletions
@@ -246,50 +246,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 	__mmu_notifier_mm_destroy(mm);
 }
 
-/*
- * These two macros will sometime replace ptep_clear_flush.
- * ptep_clear_flush is implemented as macro itself, so this also is
- * implemented as a macro until ptep_clear_flush will converted to an
- * inline function, to diminish the risk of compilation failure. The
- * invalidate_page method over time can be moved outside the PT lock
- * and these two macros can be later removed.
- */
-#define ptep_clear_flush_notify(__vma, __address, __ptep)		\
-({									\
-	pte_t __pte;							\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	__pte = ptep_clear_flush(___vma, ___address, __ptep);		\
-	mmu_notifier_invalidate_page(___vma->vm_mm, ___address);	\
-	__pte;								\
-})
-
-#define pmdp_clear_flush_notify(__vma, __address, __pmdp)		\
-({									\
-	pmd_t __pmd;							\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
-	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
-					    (__address)+HPAGE_PMD_SIZE);\
-	__pmd = pmdp_clear_flush(___vma, ___address, __pmdp);		\
-	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
-					  (__address)+HPAGE_PMD_SIZE);	\
-	__pmd;								\
-})
-
-#define pmdp_splitting_flush_notify(__vma, __address, __pmdp)		\
-({									\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
-	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
-					    (__address)+HPAGE_PMD_SIZE);\
-	pmdp_splitting_flush(___vma, ___address, __pmdp);		\
-	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
-					  (__address)+HPAGE_PMD_SIZE);	\
-})
-
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)	\
 ({									\
 	int __young;							\
@@ -380,9 +336,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-#define pmdp_clear_flush_notify pmdp_clear_flush
-#define pmdp_splitting_flush_notify pmdp_splitting_flush
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
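With these macros removed, no notifier callback is ever invoked with the page table spinlock held, so a consumer that mirrors a process's page tables into a device (the RDMA, XPMEM and KVM use cases named in the commit message) may block in its handlers. The sketch below is a hypothetical consumer written against the mmu_notifier API of this kernel version, purely to illustrate the point; my_mirror, my_dev_flush_tlb() and the completion are invented, not part of this commit:

	#include <linux/mmu_notifier.h>
	#include <linux/completion.h>

	struct my_mirror {
		struct mmu_notifier mn;
		struct completion flush_done;	/* completed from the device IRQ handler */
	};

	/* hypothetical device hook: post an invalidation request to the HW */
	void my_dev_flush_tlb(struct my_mirror *mir,
			      unsigned long start, unsigned long end);

	static void my_invalidate_range_start(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end)
	{
		struct my_mirror *mir = container_of(mn, struct my_mirror, mn);

		/*
		 * Ask the device to drop its translations for [start, end) and
		 * sleep until its interrupt signals completion.  Sleeping here
		 * is only legal because the core MM no longer calls the
		 * notifier under the page table spinlock.
		 */
		my_dev_flush_tlb(mir, start, end);
		wait_for_completion(&mir->flush_done);
	}

	static const struct mmu_notifier_ops my_mmu_notifier_ops = {
		.invalidate_range_start	= my_invalidate_range_start,
		/* .invalidate_page, .invalidate_range_end, .release as needed */
	};

	static int my_mirror_register(struct my_mirror *mir, struct mm_struct *mm)
	{
		init_completion(&mir->flush_done);
		mir->mn.ops = &my_mmu_notifier_ops;
		return mmu_notifier_register(&mir->mn, mm);
	}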

mm/filemap_xip.c

Lines changed: 3 additions & 1 deletion
@@ -192,11 +192,13 @@ __xip_unmap (struct address_space * mapping,
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush_notify(vma, address, pte);
+			pteval = ptep_clear_flush(vma, address, pte);
 			page_remove_rmap(page);
 			dec_mm_counter(mm, MM_FILEPAGES);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
+			/* must invalidate_page _before_ freeing the page */
+			mmu_notifier_invalidate_page(mm, address);
 			page_cache_release(page);
 		}
 	}

mm/huge_memory.c

Lines changed: 36 additions & 6 deletions
@@ -787,6 +787,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	pmd_t _pmd;
 	int ret = 0, i;
 	struct page **pages;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
 			GFP_KERNEL);
@@ -823,12 +825,16 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_free_pages;
 	VM_BUG_ON(!PageHead(page));
 
-	pmdp_clear_flush_notify(vma, haddr, pmd);
+	pmdp_clear_flush(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
 	pgtable = pgtable_trans_huge_withdraw(mm);
@@ -851,6 +857,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(&mm->page_table_lock);
 
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
 
@@ -859,6 +867,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 out_free_pages:
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
@@ -875,6 +884,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret = 0;
 	struct page *page, *new_page;
 	unsigned long haddr;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	VM_BUG_ON(!vma->anon_vma);
 	spin_lock(&mm->page_table_lock);
@@ -925,31 +936,39 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
 	spin_lock(&mm->page_table_lock);
 	put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(&mm->page_table_lock);
 		mem_cgroup_uncharge_page(new_page);
 		put_page(new_page);
-		goto out;
+		goto out_mn;
 	} else {
 		pmd_t entry;
 		VM_BUG_ON(!PageHead(page));
 		entry = mk_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		entry = pmd_mkhuge(entry);
-		pmdp_clear_flush_notify(vma, haddr, pmd);
+		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache(vma, address, pmd);
 		page_remove_rmap(page);
 		put_page(page);
 		ret |= VM_FAULT_WRITE;
 	}
-out_unlock:
 	spin_unlock(&mm->page_table_lock);
+out_mn:
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 out:
 	return ret;
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
 }
 
 struct page *follow_trans_huge_pmd(struct mm_struct *mm,
@@ -1162,7 +1181,11 @@ static int __split_huge_page_splitting(struct page *page,
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
 	int ret = 0;
+	/* For mmu_notifiers */
+	const unsigned long mmun_start = address;
+	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	spin_lock(&mm->page_table_lock);
 	pmd = page_check_address_pmd(page, mm, address,
 				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
@@ -1174,10 +1197,11 @@
 		 * and it won't wait on the anon_vma->root->mutex to
 		 * serialize against split_huge_page*.
		 */
-		pmdp_splitting_flush_notify(vma, address, pmd);
+		pmdp_splitting_flush(vma, address, pmd);
 		ret = 1;
 	}
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	return ret;
 }
@@ -1898,6 +1922,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1952,15 +1978,19 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pte = pte_offset_map(pmd, address);
 	ptl = pte_lockptr(mm, pmd);
 
+	mmun_start = address;
+	mmun_end   = address + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	spin_lock(&mm->page_table_lock); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
 	 * any huge TLB entry from the CPU so we won't allow
 	 * huge and small TLB entries for the same virtual address
 	 * to avoid the risk of CPU bugs in that area.
 	 */
-	_pmd = pmdp_clear_flush_notify(vma, address, pmd);
+	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	spin_lock(ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
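The relabelled exits in do_huge_pmd_wp_page() above follow a single rule: every path that runs after mmu_notifier_invalidate_range_start() must reach mmu_notifier_invalidate_range_end() (the new out_mn label), while paths that bail out earlier must not. A self-contained sketch of that shape, with invented helpers (example_prepare_new_page(), example_install_page()) standing in for the real allocation and rmap work:

	#include <linux/mm.h>
	#include <linux/mmu_notifier.h>
	#include <linux/huge_mm.h>

	/* hypothetical helpers, stand-ins for the allocation/copy and rmap steps */
	struct page *example_prepare_new_page(struct vm_area_struct *vma,
					      unsigned long haddr);
	void example_install_page(struct vm_area_struct *vma, pmd_t *pmd,
				  unsigned long haddr, struct page *page);

	static int example_replace_huge_page(struct mm_struct *mm,
					     struct vm_area_struct *vma,
					     unsigned long haddr,
					     pmd_t *pmd, pmd_t orig_pmd)
	{
		unsigned long mmun_start;	/* For mmu_notifiers */
		unsigned long mmun_end;		/* For mmu_notifiers */
		struct page *new_page;
		int ret = 0;

		new_page = example_prepare_new_page(vma, haddr);
		if (!new_page)
			return VM_FAULT_OOM;	/* before range_start: no range_end owed */

		mmun_start = haddr;
		mmun_end   = haddr + HPAGE_PMD_SIZE;
		mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

		spin_lock(&mm->page_table_lock);
		if (unlikely(!pmd_same(*pmd, orig_pmd))) {
			/* lost a race, but the range_start above must still be paired */
			spin_unlock(&mm->page_table_lock);
			put_page(new_page);
			goto out_mn;
		}
		pmdp_clear_flush(vma, haddr, pmd);	/* plain flush, no _notify */
		example_install_page(vma, pmd, haddr, new_page);
		spin_unlock(&mm->page_table_lock);
		ret = VM_FAULT_WRITE;
	out_mn:
		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
		return ret;
	}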

mm/hugetlb.c

Lines changed: 13 additions & 8 deletions
@@ -2355,13 +2355,15 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	const unsigned long mmun_start = start;	/* For mmu_notifiers */
+	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
@@ -2425,7 +2427,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2525,6 +2527,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *old_page, *new_page;
 	int avoidcopy;
 	int outside_reserve = 0;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_page = pte_page(pte);
 
@@ -2611,6 +2615,9 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
 
+	mmun_start = address & huge_page_mask(h);
+	mmun_end = mmun_start + huge_page_size(h);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	/*
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
@@ -2619,20 +2626,18 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
-		mmu_notifier_invalidate_range_start(mm,
-			address & huge_page_mask(h),
-			(address & huge_page_mask(h)) + huge_page_size(h));
 		huge_ptep_clear_flush(vma, address, ptep);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
 		page_remove_rmap(old_page);
 		hugepage_add_new_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
-		mmu_notifier_invalidate_range_end(mm,
-			address & huge_page_mask(h),
-			(address & huge_page_mask(h)) + huge_page_size(h));
 	}
+	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	/* Caller expects lock to be held */
+	spin_lock(&mm->page_table_lock);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;

mm/memory.c

Lines changed: 19 additions & 9 deletions
@@ -712,7 +712,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 	add_taint(TAINT_BAD_PAGE);
 }
 
-static inline int is_cow_mapping(vm_flags_t flags)
+static inline bool is_cow_mapping(vm_flags_t flags)
 {
 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 }
@@ -1039,6 +1039,9 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
+	bool is_cow;
 	int ret;
 
 	/*
@@ -1072,8 +1075,12 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * parent mm. And a permission downgrade will only happen if
 	 * is_cow_mapping() returns true.
 	 */
-	if (is_cow_mapping(vma->vm_flags))
-		mmu_notifier_invalidate_range_start(src_mm, addr, end);
+	is_cow = is_cow_mapping(vma->vm_flags);
+	mmun_start = addr;
+	mmun_end   = end;
+	if (is_cow)
+		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
+						    mmun_end);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1089,9 +1096,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
-	if (is_cow_mapping(vma->vm_flags))
-		mmu_notifier_invalidate_range_end(src_mm,
-						  vma->vm_start, end);
+	if (is_cow)
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
 	return ret;
 }
 
@@ -2516,7 +2522,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		spinlock_t *ptl, pte_t orig_pte)
 	__releases(ptl)
 {
-	struct page *old_page, *new_page;
+	struct page *old_page, *new_page = NULL;
 	pte_t entry;
 	int ret = 0;
 	int page_mkwrite = 0;
@@ -2760,10 +2766,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	} else
 		mem_cgroup_uncharge_page(new_page);
 
-	if (new_page)
-		page_cache_release(new_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (new_page) {
+		if (new_page == old_page)
+			/* cow happened, notify before releasing old_page */
+			mmu_notifier_invalidate_page(mm, address);
+		page_cache_release(new_page);
+	}
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,

mm/mremap.c

Lines changed: 6 additions & 2 deletions
@@ -149,11 +149,15 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
 	bool need_flush = false;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
-	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
+	mmun_start = old_addr;
+	mmun_end   = old_end;
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -197,7 +201,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, old_end-len, old_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
 
 	return len + old_addr - old_end;	/* how much done */
 }
