
Commit 22061a1

Hugh Dickins authored and torvalds committed
mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
There is a race between THP unmapping and truncation, when truncate sees pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it, but before its page_remove_rmap() gets to decrement compound_mapcount: generating false "BUG: Bad page cache" reports that the page is still mapped when deleted. This commit fixes that, but not in the way I hoped.

The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK) instead of unmap_mapping_range() in truncate_cleanup_page(): it has often been an annoyance that we usually call unmap_mapping_range() with no pages locked, but there apply it to a single locked page. try_to_unmap() looks more suitable for a single locked page.

However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte, page): it is used to insert THP migration entries, but not used to unmap THPs. Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB needs are different, I'm too ignorant of the DAX cases, and couldn't decide how far to go for anon+swap. Set that aside.

The second attempt took a different tack: make no change in truncate.c, but modify zap_huge_pmd() to insert an invalidated huge pmd instead of clearing it initially, then pmd_clear() between page_remove_rmap() and unlocking at the end. Nice. But powerpc blows that approach out of the water, with its serialize_against_pte_lookup(), and interesting pgtable usage. It would need serious help to get working on powerpc (with a minor optimization issue on s390 too). Set that aside.

Just add an "if (page_mapped(page)) synchronize_rcu();" or other such delay, after unmapping in truncate_cleanup_page()? Perhaps, but though that's likely to reduce or eliminate the number of incidents, it would give less assurance of whether we had identified the problem correctly.

This successful iteration introduces "unmap_mapping_page(page)" instead of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route, with an addition to details. Then zap_pmd_range() watches for this case, and does spin_unlock(ptl) if so - just like page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but safe.

Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to assert its interface; but currently that's only used to make sure that page->mapping is stable, and zap_pmd_range() doesn't care if the page is locked or not. Along these lines, in invalidate_inode_pages2_range() move the initial unmap_mapping_range() out from under page lock, before then calling unmap_mapping_page() under page lock if still mapped.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: fc127da ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jue Wang <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Wang Yugui <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
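To make the fix easier to follow, here is a hedged sketch, condensed from the commit message and from the mm/memory.c hunk below; it is an illustration of the race and of the added synchronization, not verbatim or complete kernel code.

/*
 * Simplified timeline of the race (illustration only):
 *
 *   munmap: zap_huge_pmd()             truncate: truncate_cleanup_page()
 *   ----------------------             ---------------------------------
 *   clears *pmd
 *                                      unmap walk sees pmd_none(), skips it
 *                                      delete_from_page_cache() sees the page
 *                                      still mapped -> false
 *                                      "BUG: Bad page cache" report
 *   page_remove_rmap() decrements
 *   compound_mapcount (too late)
 */

/* Key synchronization added in zap_pmd_range(), condensed from the diff below: */
	} else if (details && details->single_page &&
		   PageTransCompound(details->single_page) &&
		   next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
		/*
		 * Take and drop the pmd lock: a racing zap_huge_pmd() holds it
		 * until its page_remove_rmap() is done, so we cannot return early.
		 */
		spinlock_t *ptl = pmd_lock(tlb->mm, pmd);
		spin_unlock(ptl);
	}

Truncation and invalidation reach this path via the new unmap_mapping_page(), which passes the locked page through zap_details.single_page.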
1 parent 3165717 commit 22061a1

3 files changed: 63 additions, 24 deletions


include/linux/mm.h

Lines changed: 3 additions & 0 deletions
@@ -1719,6 +1719,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t	last_index;			/* Highest page->index to unmap */
+	struct page *single_page;		/* Locked page to be unmapped */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1766,6 +1767,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 extern int fixup_user_fault(struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
+void unmap_mapping_page(struct page *page);
 void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows);
 void unmap_mapping_range(struct address_space *mapping,
@@ -1786,6 +1788,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address,
 	BUG();
 	return -EFAULT;
 }
+static inline void unmap_mapping_page(struct page *page) { }
 static inline void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows) { }
 static inline void unmap_mapping_range(struct address_space *mapping,

mm/memory.c

Lines changed: 41 additions & 0 deletions
@@ -1361,7 +1361,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
+		} else if (details && details->single_page &&
+			   PageTransCompound(details->single_page) &&
+			   next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
+			spinlock_t *ptl = pmd_lock(tlb->mm, pmd);
+			/*
+			 * Take and drop THP pmd lock so that we cannot return
+			 * prematurely, while zap_huge_pmd() has cleared *pmd,
+			 * but not yet decremented compound_mapcount().
+			 */
+			spin_unlock(ptl);
 		}
+
 		/*
 		 * Here there can be other concurrent MADV_DONTNEED or
 		 * trans huge page faults running, and if the pmd is
@@ -3236,6 +3247,36 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
 	}
 }
 
+/**
+ * unmap_mapping_page() - Unmap single page from processes.
+ * @page: The locked page to be unmapped.
+ *
+ * Unmap this page from any userspace process which still has it mmaped.
+ * Typically, for efficiency, the range of nearby pages has already been
+ * unmapped by unmap_mapping_pages() or unmap_mapping_range(). But once
+ * truncation or invalidation holds the lock on a page, it may find that
+ * the page has been remapped again: and then uses unmap_mapping_page()
+ * to unmap it finally.
+ */
+void unmap_mapping_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct zap_details details = { };
+
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageTail(page));
+
+	details.check_mapping = mapping;
+	details.first_index = page->index;
+	details.last_index = page->index + thp_nr_pages(page) - 1;
+	details.single_page = page;
+
+	i_mmap_lock_write(mapping);
+	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+	i_mmap_unlock_write(mapping);
+}
+
 /**
  * unmap_mapping_pages() - Unmap pages from processes.
  * @mapping: The address space containing pages to be unmapped.
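As the kernel-doc above notes, the caller must hold the page lock (which keeps page->mapping and page->index stable) and must pass a head page, never a tail. A hedged sketch of the caller-side pattern, mirroring the mm/truncate.c changes in the next file (illustrative only, not verbatim kernel code):

	lock_page(page);
	wait_on_page_writeback(page);
	if (page_mapped(page))
		unmap_mapping_page(page);	/* page locked, head page only */
	BUG_ON(page_mapped(page));		/* as invalidate_inode_pages2_range() asserts */
	unlock_page(page);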

mm/truncate.c

Lines changed: 19 additions & 24 deletions
@@ -167,13 +167,10 @@ void do_invalidatepage(struct page *page, unsigned int offset,
  * its lock, b) when a concurrent invalidate_mapping_pages got there first and
  * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
  */
-static void
-truncate_cleanup_page(struct address_space *mapping, struct page *page)
+static void truncate_cleanup_page(struct page *page)
 {
-	if (page_mapped(page)) {
-		unsigned int nr = thp_nr_pages(page);
-		unmap_mapping_pages(mapping, page->index, nr, false);
-	}
+	if (page_mapped(page))
+		unmap_mapping_page(page);
 
 	if (page_has_private(page))
 		do_invalidatepage(page, 0, thp_size(page));
@@ -218,7 +215,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return -EIO;
 
-	truncate_cleanup_page(mapping, page);
+	truncate_cleanup_page(page);
 	delete_from_page_cache(page);
 	return 0;
 }
@@ -325,7 +322,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		index = indices[pagevec_count(&pvec) - 1] + 1;
 		truncate_exceptional_pvec_entries(mapping, &pvec, indices);
 		for (i = 0; i < pagevec_count(&pvec); i++)
-			truncate_cleanup_page(mapping, pvec.pages[i]);
+			truncate_cleanup_page(pvec.pages[i]);
 		delete_from_page_cache_batch(mapping, &pvec);
 		for (i = 0; i < pagevec_count(&pvec); i++)
 			unlock_page(pvec.pages[i]);
@@ -639,30 +636,28 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 			continue;
 		}
 
+		if (!did_range_unmap && page_mapped(page)) {
+			/*
+			 * If page is mapped, before taking its lock,
+			 * zap the rest of the file in one hit.
+			 */
+			unmap_mapping_pages(mapping, index,
+					(1 + end - index), false);
+			did_range_unmap = 1;
+		}
+
 		lock_page(page);
 		WARN_ON(page_to_index(page) != index);
 		if (page->mapping != mapping) {
 			unlock_page(page);
 			continue;
 		}
 		wait_on_page_writeback(page);
-		if (page_mapped(page)) {
-			if (!did_range_unmap) {
-				/*
-				 * Zap the rest of the file in one hit.
-				 */
-				unmap_mapping_pages(mapping, index,
-						(1 + end - index), false);
-				did_range_unmap = 1;
-			} else {
-				/*
-				 * Just zap this page
-				 */
-				unmap_mapping_pages(mapping, index,
-						1, false);
-			}
-		}
+
+		if (page_mapped(page))
+			unmap_mapping_page(page);
 		BUG_ON(page_mapped(page));
+
 		ret2 = do_launder_page(mapping, page);
 		if (ret2 == 0) {
 			if (!invalidate_complete_page2(mapping, page))