
Commit 4819e15

joergroedel authored and Ingo Molnar committed
x86/mm/32: Bring back vmalloc faulting on x86_32
One cannot simply remove vmalloc faulting on x86-32. Upstream commit 7f0a002 ("x86/mm: remove vmalloc faulting") removed it on x86 altogether because the arch_sync_kernel_mappings() interface had been introduced earlier. This interface synchronizes vmalloc/ioremap page-table updates to all page-tables in the system at creation time, and was thought to make vmalloc faulting obsolete.

But that assumption was incredibly naive.

It turned out that there is a race window between the time the vmalloc or ioremap code establishes a mapping and the time it synchronizes this change to the other page-tables in the system. During this race window another CPU or thread can establish a vmalloc mapping which uses the same intermediate page-table entries (e.g. PMD or PUD) and does no synchronization in the end, because it found all necessary mappings already present in the kernel reference page-table. But while those intermediate page-table entries are not yet synchronized, the other CPU or thread will continue with a vmalloc address that is not yet mapped in the page-table it currently uses, causing an unhandled page fault and an oops like the one below:

	BUG: unable to handle page fault for address: fe80c000
	#PF: supervisor write access in kernel mode
	#PF: error_code(0x0002) - not-present page
	*pde = 33183067 *pte = a8648163
	Oops: 0002 [#1] SMP
	CPU: 1 PID: 13514 Comm: cve-2017-17053 Tainted: G ...
	Call Trace:
	 ldt_dup_context+0x66/0x80
	 dup_mm+0x2b3/0x480
	 copy_process+0x133b/0x15c0
	 _do_fork+0x94/0x3e0
	 __ia32_sys_clone+0x67/0x80
	 __do_fast_syscall_32+0x3f/0x70
	 do_fast_syscall_32+0x29/0x60
	 do_SYSENTER_32+0x15/0x20
	 entry_SYSENTER_32+0x9f/0xf2
	EIP: 0xb7eef549

So the arch_sync_kernel_mappings() interface is racy, but removing it would mean re-introducing the vmalloc_sync_all() interface, which is even more awful. Keep arch_sync_kernel_mappings() in place and catch the race condition in the page-fault handler instead.

Do a partial revert of the above commit to get vmalloc faulting on x86-32 back in place.

Fixes: 7f0a002 ("x86/mm: remove vmalloc faulting")
Reported-by: Naresh Kamboju <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
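To make the race window concrete, the interleaving described in the message can be sketched as follows (an illustration distilled from the text above, not part of the commit itself):

	CPU/thread A (first vmalloc)          CPU/thread B (second vmalloc)
	----------------------------          -----------------------------
	installs a new PMD entry in the
	kernel reference page-table
	                                      finds that PMD entry already
	                                      present in the reference
	                                      page-table, so it skips the
	                                      final synchronization step
	                                      dereferences its new vmalloc
	                                      address via the current task's
	                                      page-table, which never received
	                                      the PMD entry -> unhandled #PF
	synchronizes the new PMD entry to
	all page-tables (too late for B)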
1 parent aef0148 commit 4819e15

File tree

1 file changed (+78, -0)


arch/x86/mm/fault.c

Lines changed: 78 additions & 0 deletions
@@ -190,6 +190,53 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 	return pmd_k;
 }
 
+/*
+ * Handle a fault on the vmalloc or module mapping area
+ *
+ * This is needed because there is a race condition between the time
+ * when the vmalloc mapping code updates the PMD to the point in time
+ * where it synchronizes this update with the other page-tables in the
+ * system.
+ *
+ * In this race window another thread/CPU can map an area on the same
+ * PMD, finds it already present and does not synchronize it with the
+ * rest of the system yet. As a result v[mz]alloc might return areas
+ * which are not mapped in every page-table in the system, causing an
+ * unhandled page-fault when they are accessed.
+ */
+static noinline int vmalloc_fault(unsigned long address)
+{
+	unsigned long pgd_paddr;
+	pmd_t *pmd_k;
+	pte_t *pte_k;
+
+	/* Make sure we are in vmalloc area: */
+	if (!(address >= VMALLOC_START && address < VMALLOC_END))
+		return -1;
+
+	/*
+	 * Synchronize this task's top level page-table
+	 * with the 'reference' page table.
+	 *
+	 * Do _not_ use "current" here. We might be inside
+	 * an interrupt in the middle of a task switch..
+	 */
+	pgd_paddr = read_cr3_pa();
+	pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
+	if (!pmd_k)
+		return -1;
+
+	if (pmd_large(*pmd_k))
+		return 0;
+
+	pte_k = pte_offset_kernel(pmd_k, address);
+	if (!pte_present(*pte_k))
+		return -1;
+
+	return 0;
+}
+NOKPROBE_SYMBOL(vmalloc_fault);
+
 void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
 {
 	unsigned long addr;
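The helper vmalloc_sync_one() is only partially visible in this hunk (its signature in the hunk header and its closing "return pmd_k;" as context). For orientation, here is a simplified sketch of what such a helper has to do; the body below is inferred from that signature and from how vmalloc_fault() calls it, not copied from the commit:

	/*
	 * Sketch: walk the faulting page-table and the init_mm reference
	 * page-table in lockstep down to the PMD level, copying the
	 * reference PMD entry over if the local one is missing.
	 */
	static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
	{
		pgd_t *pgd_k = init_mm.pgd + pgd_index(address);  /* reference PGD */
		p4d_t *p4d, *p4d_k;
		pud_t *pud, *pud_k;
		pmd_t *pmd, *pmd_k;

		pgd += pgd_index(address);
		if (!pgd_present(*pgd_k))
			return NULL;		/* not mapped even in the reference */

		p4d   = p4d_offset(pgd, address);
		p4d_k = p4d_offset(pgd_k, address);
		if (!p4d_present(*p4d_k))
			return NULL;

		pud   = pud_offset(p4d, address);
		pud_k = pud_offset(p4d_k, address);
		if (!pud_present(*pud_k))
			return NULL;

		pmd   = pmd_offset(pud, address);
		pmd_k = pmd_offset(pud_k, address);
		if (!pmd_present(*pmd))
			set_pmd(pmd, *pmd_k);	/* copy the entry from the reference */

		return pmd_present(*pmd_k) ? pmd_k : NULL;
	}

A non-NULL return tells vmalloc_fault() that the PMD is now consistent with the reference page-table, after which only the PTE level remains to be checked.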
@@ -1110,6 +1157,37 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 	 */
 	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
 
+#ifdef CONFIG_X86_32
+	/*
+	 * We can fault-in kernel-space virtual memory on-demand. The
+	 * 'reference' page table is init_mm.pgd.
+	 *
+	 * NOTE! We MUST NOT take any locks for this case. We may
+	 * be in an interrupt or a critical region, and should
+	 * only copy the information from the master page table,
+	 * nothing more.
+	 *
+	 * Before doing this on-demand faulting, ensure that the
+	 * fault is not any of the following:
+	 * 1. A fault on a PTE with a reserved bit set.
+	 * 2. A fault caused by a user-mode access.  (Do not demand-
+	 *    fault kernel memory due to user-mode accesses).
+	 * 3. A fault caused by a page-level protection violation.
+	 *    (A demand fault would be on a non-present page which
+	 *     would have X86_PF_PROT==0).
+	 *
+	 * This is only needed to close a race condition on x86-32 in
+	 * the vmalloc mapping/unmapping code. See the comment above
+	 * vmalloc_fault() for details. On x86-64 the race does not
+	 * exist as the vmalloc mappings don't need to be synchronized
+	 * there.
+	 */
+	if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
+		if (vmalloc_fault(address) >= 0)
+			return;
+	}
+#endif
+
 	/* Was the fault spurious, caused by lazy TLB invalidation? */
 	if (spurious_kernel_fault(hw_error_code, address))
 		return;
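The hw_error_code test above is the entire filter. As a stand-alone illustration of which faults get through (a sketch: the bit values are assumed to follow the x86 hardware #PF error-code layout, and may_demand_fault() is a hypothetical helper, not kernel code):

	#define X86_PF_PROT	0x1	/* page-level protection violation */
	#define X86_PF_WRITE	0x2	/* fault was a write */
	#define X86_PF_USER	0x4	/* fault came from user mode */
	#define X86_PF_RSVD	0x8	/* reserved page-table bit was set */

	/* Hypothetical helper mirroring the condition in do_kern_addr_fault(). */
	static int may_demand_fault(unsigned long hw_error_code)
	{
		/* Only kernel-mode accesses to non-present pages qualify. */
		return !(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT));
	}

The oops in the commit message reports error_code(0x0002), a kernel-mode write to a not-present page; may_demand_fault(0x0002) is true, so exactly that class of fault is now handed to vmalloc_fault() instead of oopsing.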
