
Commit 5beda7d

amluto authored and KAGA-KOKO committed
x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
Neil Berrington reported a double-fault on a VM with 768GB of RAM that uses large amounts of vmalloc space with PTI enabled.

The cause is that load_new_mm_cr3() was never fixed to take the 5-level pgd folding code into account, so, on a 4-level kernel, the pgd synchronization logic compiles away to exactly nothing.

Interestingly, the problem doesn't trigger with nopti. I assume this is because the kernel is mapped with global pages if we boot with nopti.

The sequence of operations when we create a new task is that we first load its mm while still running on the old stack (which crashes if the old stack is unmapped in the new mm unless the TLB saves us), then we call prepare_switch_to(), and then we switch to the new stack. prepare_switch_to() pokes the new stack directly, which will populate the mapping through vmalloc_fault(). I assume that we're getting lucky on non-PTI systems -- the old stack's TLB entry stays alive long enough to make it all the way through prepare_switch_to() and switch_to() so that we make it to a valid stack.

Fixes: b50858c ("x86/mm/vmalloc: Add 5-level paging support")
Reported-and-tested-by: Neil Berrington <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: [email protected]
Cc: Dave Hansen <[email protected]>
Cc: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/346541c56caed61abbe693d7d2742b4a380c5001.1516914529.git.luto@kernel.org
1 parent 1d080f0 commit 5beda7d
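
Note: the reason the old synchronization logic could compile away on 4-level kernels is the folded-p4d helpers, which make the pgd-level check trivially false. A minimal illustrative paraphrase follows (assumed from the folded-p4d definitions in include/asm-generic/pgtable-nop4d.h; not verbatim kernel code):

/*
 * Illustrative sketch, assuming CONFIG_PGTABLE_LEVELS == 4 with the p4d
 * level folded into the pgd; paraphrased, not verbatim kernel code.
 * A folded pgd entry can never be "none":
 */
static inline int pgd_none(pgd_t pgd)    { return 0; }
static inline int pgd_present(pgd_t pgd) { return 1; }

/*
 * Consequently, the old vmap-stack sync in switch_mm_irqs_off(),
 *
 *     if (unlikely(pgd_none(*pgd)))
 *             set_pgd(pgd, init_mm.pgd[index]);
 *
 * reduces to "if (0) ..." and is dropped by the compiler, so the new
 * mm's top-level entry covering the vmalloc'ed stack is never copied in.
 */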

File tree

1 file changed (+29, -5)


arch/x86/mm/tlb.c

Lines changed: 29 additions & 5 deletions
@@ -151,6 +151,34 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	local_irq_restore(flags);
 }
 
+static void sync_current_stack_to_mm(struct mm_struct *mm)
+{
+	unsigned long sp = current_stack_pointer;
+	pgd_t *pgd = pgd_offset(mm, sp);
+
+	if (CONFIG_PGTABLE_LEVELS > 4) {
+		if (unlikely(pgd_none(*pgd))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+
+			set_pgd(pgd, *pgd_ref);
+		}
+	} else {
+		/*
+		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
+		 * the p4d.  This compiles to approximately the same code as
+		 * the 5-level case.
+		 */
+		p4d_t *p4d = p4d_offset(pgd, sp);
+
+		if (unlikely(p4d_none(*p4d))) {
+			pgd_t *pgd_ref = pgd_offset_k(sp);
+			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
+
+			set_p4d(p4d, *p4d_ref);
+		}
+	}
+}
+
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
@@ -226,11 +254,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * mapped in the new pgd, we'll double-fault.  Forcibly
 		 * map it.
 		 */
-		unsigned int index = pgd_index(current_stack_pointer);
-		pgd_t *pgd = next->pgd + index;
-
-		if (unlikely(pgd_none(*pgd)))
-			set_pgd(pgd, init_mm.pgd[index]);
+		sync_current_stack_to_mm(next);
 	}
 
 	/* Stop remote flushes for the previous mm */
