Commit 40548c6

Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti updates from Thomas Gleixner:
 "This contains:

   - a PTI bugfix to avoid setting reserved CR3 bits when PCID is
     disabled. This seems to cause issues on a virtual machine at
     least and is incorrect according to the AMD manual.

   - a PTI bugfix which disables the perf BTS facility if PTI is
     enabled. The BTS AUX buffer is not globally visible and causes
     the CPU to fault when the mapping disappears on switching CR3
     to user space. A full fix which restores BTS on PTI is non
     trivial and will be worked on.

   - PTI bugfixes for EFI and trusted boot which make sure that the
     user space visible page table entries have the NX bit cleared

   - removal of dead code in the PTI pagetable setup functions

   - add PTI documentation

   - add a selftest for vsyscall to verify that the kernel actually
     implements what it advertises.

   - a sysfs interface to expose vulnerability and mitigation
     information so there is a coherent way for users to retrieve
     the status.

   - the initial spectre_v2 mitigations, aka retpoline:

      + The necessary ASM thunk and compiler support

      + The ASM variants of retpoline and the conversion of affected
        ASM code

      + Make LFENCE serializing on AMD so it can be used as
        speculation trap

      + The RSB fill after vmexit

   - initial objtool support for retpoline

  As I said in the status mail this is the most of the set of patches
  which should go into 4.15 except two straight forward patches still
  on hold:

   - the retpoline add on of LFENCE which waits for ACKs

   - the RSB fill after context switch

  Both should be ready to go early next week and with that we'll have
  covered the major holes of spectre_v2 and go back to normality"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  x86,perf: Disable intel_bts when PTI
  security/Kconfig: Correct the Documentation reference for PTI
  x86/pti: Fix !PCID and sanitize defines
  selftests/x86: Add test_vsyscall
  x86/retpoline: Fill return stack buffer on vmexit
  x86/retpoline/irq32: Convert assembler indirect jumps
  x86/retpoline/checksum32: Convert assembler indirect jumps
  x86/retpoline/xen: Convert Xen hypercall indirect jumps
  x86/retpoline/hyperv: Convert assembler indirect jumps
  x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
  x86/retpoline/entry: Convert entry assembler indirect jumps
  x86/retpoline/crypto: Convert crypto assembler indirect jumps
  x86/spectre: Add boot time option to select Spectre v2 mitigation
  x86/retpoline: Add initial retpoline support
  objtool: Allow alternatives to be ignored
  objtool: Detect jumps to retpoline thunks
  x86/pti: Make unpoison of pgd for trusted boot work for real
  x86/alternatives: Fix optimize_nops() checking
  sysfs/cpu: Fix typos in vulnerability documentation
  x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
  ...
2 parents 2c1cfa4 + 99a9dc9 commit 40548c6

44 files changed, 1525 additions and 100 deletions

Documentation/ABI/testing/sysfs-devices-system-cpu

Lines changed: 16 additions & 0 deletions
@@ -375,3 +375,19 @@ Contact: Linux kernel mailing list <[email protected]>
 Description:    information about CPUs heterogeneity.

                 cpu_capacity: capacity of cpu#.
+
+What:           /sys/devices/system/cpu/vulnerabilities
+                /sys/devices/system/cpu/vulnerabilities/meltdown
+                /sys/devices/system/cpu/vulnerabilities/spectre_v1
+                /sys/devices/system/cpu/vulnerabilities/spectre_v2
+Date:           January 2018
+Contact:        Linux kernel mailing list <[email protected]>
+Description:    Information about CPU vulnerabilities
+
+                The files are named after the code names of CPU
+                vulnerabilities. The output of those files reflects the
+                state of the CPUs in the system. Possible output values:
+
+                "Not affected" CPU is not affected by the vulnerability
+                "Vulnerable" CPU is affected and no mitigation in effect
+                "Mitigation: $M" CPU is affected and mitigation $M is in effect

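The three output values documented above lend themselves to a small userspace reader. The following is only a sketch under stated assumptions: the names `classify_vuln_status`, `read_vuln_file` and the `vuln_state` enum are hypothetical and not part of this commit, and the prefix matching is an illustrative choice rather than a documented parsing rule.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of the three documented output values. */
enum vuln_state { VULN_UNKNOWN, VULN_NOT_AFFECTED, VULN_VULNERABLE, VULN_MITIGATED };

/*
 * Classify one line of sysfs output. For "Mitigation: $M", *mitigation
 * (if non-NULL) is set to point at $M inside the caller's string.
 */
enum vuln_state classify_vuln_status(const char *line, const char **mitigation)
{
    static const char not_affected[] = "Not affected";
    static const char vulnerable[] = "Vulnerable";
    static const char mitigated[] = "Mitigation: ";

    assert(line != NULL);
    if (mitigation)
        *mitigation = NULL;

    if (strncmp(line, not_affected, sizeof(not_affected) - 1) == 0)
        return VULN_NOT_AFFECTED;
    if (strncmp(line, vulnerable, sizeof(vulnerable) - 1) == 0)
        return VULN_VULNERABLE;
    if (strncmp(line, mitigated, sizeof(mitigated) - 1) == 0) {
        if (mitigation)
            *mitigation = line + sizeof(mitigated) - 1;
        return VULN_MITIGATED;
    }
    return VULN_UNKNOWN;
}

/*
 * Read the first line of one file in the vulnerabilities directory,
 * e.g. name = "meltdown". Returns 0 on success, -1 on any failure
 * (for example on a kernel that predates this interface).
 */
int read_vuln_file(const char *name, char *buf, size_t len)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/vulnerabilities/%s", name);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}
```

A caller would typically pass the buffer filled by `read_vuln_file("spectre_v2", ...)` straight to `classify_vuln_status()` and treat `VULN_UNKNOWN` as "format not recognised" rather than as "safe".
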
Documentation/admin-guide/kernel-parameters.txt

Lines changed: 42 additions & 7 deletions
@@ -2623,6 +2623,11 @@
     nosmt           [KNL,S390] Disable symmetric multithreading (SMT).
                     Equivalent to smt=1.

+    nospectre_v2    [X86] Disable all mitigations for the Spectre variant 2
+                    (indirect branch prediction) vulnerability. System may
+                    allow data leaks with this option, which is equivalent
+                    to spectre_v2=off.
+
     noxsave         [BUGS=X86] Disables x86 extended register state save
                     and restore using xsave. The kernel will fallback to
                     enabling legacy floating-point and sse state.
@@ -2709,8 +2714,6 @@
                     steal time is computed, but won't influence scheduler
                     behaviour

-    nopti           [X86-64] Disable kernel page table isolation
-
     nolapic         [X86-32,APIC] Do not enable or use the local APIC.

     nolapic_timer   [X86-32,APIC] Do not use the local APIC timer.
@@ -3291,11 +3294,20 @@
     pt.             [PARIDE]
                     See Documentation/blockdev/paride.txt.

-    pti=            [X86_64]
-                    Control user/kernel address space isolation:
-                    on - enable
-                    off - disable
-                    auto - default setting
+    pti=            [X86_64] Control Page Table Isolation of user and
+                    kernel address spaces. Disabling this feature
+                    removes hardening, but improves performance of
+                    system calls and interrupts.
+
+                    on - unconditionally enable
+                    off - unconditionally disable
+                    auto - kernel detects whether your CPU model is
+                           vulnerable to issues that PTI mitigates
+
+                    Not specifying this option is equivalent to pti=auto.
+
+    nopti           [X86_64]
+                    Equivalent to pti=off

     pty.legacy_count=
                     [KNL] Number of legacy pty's. Overwrites compiled-in
@@ -3946,6 +3958,29 @@
     sonypi.*=       [HW] Sony Programmable I/O Control Device driver
                     See Documentation/laptops/sonypi.txt

+    spectre_v2=     [X86] Control mitigation of Spectre variant 2
+                    (indirect branch speculation) vulnerability.
+
+                    on - unconditionally enable
+                    off - unconditionally disable
+                    auto - kernel detects whether your CPU model is
+                           vulnerable
+
+                    Selecting 'on' will, and 'auto' may, choose a
+                    mitigation method at run time according to the
+                    CPU, the available microcode, the setting of the
+                    CONFIG_RETPOLINE configuration option, and the
+                    compiler with which the kernel was built.
+
+                    Specific mitigations can also be selected manually:
+
+                    retpoline - replace indirect branches
+                    retpoline,generic - google's original retpoline
+                    retpoline,amd - AMD-specific minimal thunk
+
+                    Not specifying this option is equivalent to
+                    spectre_v2=auto.
+
     spia_io_base=   [HW,MTD]
     spia_fio_base=
     spia_pedr=
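
The value set documented for spectre_v2= and its nospectre_v2 alias can be illustrated with a small userspace parser. This is a sketch, not the kernel's own command-line handling: the `v2_choice` enum and `parse_spectre_v2()` are hypothetical names, and two behaviours are assumptions made for the example (the last matching token wins, and an unrecognised value leaves the previous choice in place).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enum covering the documented spectre_v2= values. */
enum v2_choice {
    V2_AUTO, V2_ON, V2_OFF,
    V2_RETPOLINE, V2_RETPOLINE_GENERIC, V2_RETPOLINE_AMD
};

static const struct {
    const char *value;
    enum v2_choice choice;
} v2_values[] = {
    { "auto",              V2_AUTO },
    { "on",                V2_ON },
    { "off",               V2_OFF },
    { "retpoline",         V2_RETPOLINE },
    { "retpoline,generic", V2_RETPOLINE_GENERIC },
    { "retpoline,amd",     V2_RETPOLINE_AMD },
};

/* Scan a whitespace-separated command line without modifying it. */
enum v2_choice parse_spectre_v2(const char *cmdline)
{
    static const char key[] = "spectre_v2=";
    static const char alias[] = "nospectre_v2";
    const size_t key_len = sizeof(key) - 1;
    const size_t alias_len = sizeof(alias) - 1;
    enum v2_choice choice = V2_AUTO;  /* absent option == spectre_v2=auto */
    const char *p = cmdline;

    assert(cmdline != NULL);
    for (;;) {
        size_t len, i;

        p += strspn(p, " \t\n");      /* skip separators */
        len = strcspn(p, " \t\n");    /* length of the current token */
        if (len == 0)
            break;

        if (len == alias_len && strncmp(p, alias, alias_len) == 0) {
            choice = V2_OFF;          /* documented as spectre_v2=off */
        } else if (len > key_len && strncmp(p, key, key_len) == 0) {
            const char *val = p + key_len;
            size_t vlen = len - key_len;

            for (i = 0; i < sizeof(v2_values) / sizeof(v2_values[0]); i++) {
                if (strlen(v2_values[i].value) == vlen &&
                    strncmp(val, v2_values[i].value, vlen) == 0)
                    choice = v2_values[i].choice;
            }
        }
        p += len;
    }
    return choice;
}
```

Comparing the full token length before matching keeps "retpoline" from being mistaken for a prefix of "retpoline,amd".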

Documentation/x86/pti.txt

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
+Overview
+========
+
+Page Table Isolation (pti, previously known as KAISER[1]) is a
+countermeasure against attacks on the shared user/kernel address
+space such as the "Meltdown" approach[2].
+
+To mitigate this class of attacks, we create an independent set of
+page tables for use only when running userspace applications. When
+the kernel is entered via syscalls, interrupts or exceptions, the
+page tables are switched to the full "kernel" copy. When the system
+switches back to user mode, the user copy is used again.
+
+The userspace page tables contain only a minimal amount of kernel
+data: only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor table
+(IDT). There are a few strictly unnecessary things that get mapped
+such as the first C function when entering an interrupt (see
+comments in pti.c).
+
+This approach helps to ensure that side-channel attacks leveraging
+the paging structures do not function when PTI is enabled. It can be
+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+Once enabled at compile-time, it can be disabled at boot with the
+'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
+
+Page Table Management
+=====================
+
+When PTI is enabled, the kernel manages two sets of page tables.
+The first set is very similar to the single set which is present in
+kernels without PTI. This includes a complete mapping of userspace
+that the kernel can use for things like copy_to_user().
+
+Although _complete_, the user portion of the kernel page tables is
+crippled by setting the NX bit in the top level. This ensures
+that any missed kernel->user CR3 switch will immediately crash
+userspace upon executing its first instruction.
+
+The userspace page tables map only the kernel data needed to enter
+and exit the kernel. This data is entirely contained in the 'struct
+cpu_entry_area' structure which is placed in the fixmap which gives
+each CPU's copy of the area a compile-time-fixed virtual address.
+
+For new userspace mappings, the kernel makes the entries in its
+page tables like normal. The only difference is when the kernel
+makes entries in the top (PGD) level. In addition to setting the
+entry in the main kernel PGD, a copy of the entry is made in the
+userspace page tables' PGD.
+
+This sharing at the PGD level also inherently shares all the lower
+layers of the page tables. This leaves a single, shared set of
+userspace page tables to manage. One PTE to lock, one set of
+accessed bits, dirty bits, etc...
+
+Overhead
+========
+
+Protection against side-channel attacks is important. But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes an additional 4k per process).
+  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
+     aligned so that it can be mapped by setting a single PMD
+     entry. This consumes nearly 2MB of RAM once the kernel
+     is decompressed, but no space in the kernel image itself.
+
+2. Runtime Cost
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though.) Moves to CR3 are on the order of a hundred
+     cycles, and are required at every entry and exit.
+  b. A "trampoline" must be used for SYSCALL entry. This
+     trampoline depends on a smaller set of resources than the
+     non-PTI SYSCALL entry code, so requires mapping fewer
+     things into the userspace page tables. The downside is
+     that stacks must be switched at entry time.
+  d. Global pages are disabled for all kernel structures not
+     mapped into both kernel and userspace page tables. This
+     feature of the MMU allows different processes to share TLB
+     entries mapping the kernel. Losing the feature means more
+     TLB misses after a context switch. The actual loss of
+     performance is very small, however, never exceeding 1%.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when switching page
+     tables by setting a special bit in CR3 when the page tables
+     are changed. This makes switching the page tables (at context
+     switch, or kernel entry/exit) cheaper. But, on systems with
+     PCID support, the context switch code must flush both the user
+     and kernel entries out of the TLB. The user PCID TLB flush is
+     deferred until the exit to userspace, minimizing the cost.
+     See intel.com/sdm for the gory PCID/INVPCID details.
+  e. The userspace page tables must be populated for each new
+     process. Even without PTI, the shared kernel mappings
+     are created by copying top-level (PGD) entries into each
+     new process. But, with PTI, there are now *two* kernel
+     mappings: one in the kernel page tables that maps everything
+     and one for the entry/exit structures. At fork(), we need to
+     copy both.
+  f. In addition to the fork()-time copying, there must also
+     be an update to the userspace PGD any time a set_pgd() is done
+     on a PGD used to map userspace. This ensures that the kernel
+     and userspace copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB. That means that each syscall, interrupt
+     or exception flushes the TLB.
+  h. INVPCID is a TLB-flushing instruction which allows flushing
+     of TLB entries for non-current PCIDs. Some systems support
+     PCIDs, but do not support INVPCID. On these systems, addresses
+     can only be flushed from the TLB for the current PCID. When
+     flushing a kernel address, we need to flush all PCIDs, so a
+     single kernel address flush will require a TLB-flushing CR3
+     write upon the next use of every PCID.
+
+Possible Future Work
+====================
+1. We can be more careful about not actually writing to CR3
+   unless its value is actually changed.
+2. Allow PTI to be enabled/disabled at runtime in addition to the
+   boot-time switching.
+
+Testing
+========
+
+To test stability of PTI, the following test procedure is recommended,
+ideally doing all of these in parallel:
+
+1. Set CONFIG_DEBUG_ENTRY=y
+2. Run several copies of all of the tools/testing/selftests/x86/ tests
+   (excluding MPX and protection_keys) in a loop on multiple CPUs for
+   several minutes. These tests frequently uncover corner cases in the
+   kernel entry code. In general, old kernels might cause these tests
+   themselves to crash, but they should never crash the kernel.
+3. Run the 'perf' tool in a mode (top or record) that generates many
+   frequent performance monitoring non-maskable interrupts (see "NMI"
+   in /proc/interrupts). This exercises the NMI entry/exit code which
+   is known to trigger bugs in code paths that did not expect to be
+   interrupted, including nested NMIs. Using "-c" boosts the rate of
+   NMIs, and using two -c with separate counters encourages nested NMIs
+   and less deterministic behavior.
+
+	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
+
+4. Launch a KVM virtual machine.
+5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
+   This has been a lightly-tested code path and needs extra scrutiny.
+
+Debugging
+=========
+
+Bugs in PTI cause a few different signatures of crashes
+that are worth noting here.
+
+ * Failures of the selftests/x86 code. Usually a bug in one of the
+   more obscure corners of entry_64.S
+ * Crashes in early boot, especially around CPU bringup. Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt. Caused by bugs in entry_64.S,
+   like screwing up a page table switch. Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI. The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts. Also caused by incorrectly mapping NMI
+   code. NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace. entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults. Caused by touching non-pti-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not pti-mapped.
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs. These have
+   tended to be TLB invalidation issues. Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
+1. https://gruss.cc/files/kaiser.pdf
+2. https://meltdownattack.com/meltdown.pdf
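
The PGD handling described under "Page Table Management" (an order-1 PGD per process, a copy of each userspace entry in the user PGD, and NX set on the user portion of the kernel copy) can be modelled in plain userspace C. This is a simulation only, not kernel code. Every name here is hypothetical, and the layout is an assumption made for the example: the kernel copy sits in the first 4k page of an 8k-aligned allocation and the user copy in the second, so setting bit 12 of a pointer into the first page yields the matching slot in the second. Leaving the upper-half (kernel) entries unmirrored is likewise a simplification.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE     4096u
#define SKETCH_PTRS_PER_PGD  512u          /* 4096 bytes / 8-byte entries */
#define SKETCH_USER_ENTRIES  256u          /* lower half maps userspace */
#define SKETCH_NX            (1ull << 63)  /* no-execute bit of an entry */

typedef uint64_t pgd_entry;

/* Allocate a zeroed, 8k-aligned pair: kernel PGD page, then user PGD page. */
pgd_entry *pgd_alloc_pair(void)
{
    pgd_entry *pgd = aligned_alloc(2 * SKETCH_PAGE_SIZE, 2 * SKETCH_PAGE_SIZE);

    if (pgd)
        memset(pgd, 0, 2 * SKETCH_PAGE_SIZE);
    return pgd;
}

/* Map a pointer into the kernel copy to the same slot in the user copy. */
pgd_entry *kernel_to_user(pgd_entry *kernel_pgdp)
{
    return (pgd_entry *)((uintptr_t)kernel_pgdp | SKETCH_PAGE_SIZE);
}

/*
 * Install a top-level entry. Entries that map userspace are copied into
 * the user PGD as given, while the kernel copy gets NX set so that a
 * missed kernel->user switch faults on the first user instruction.
 */
void set_pgd_mirrored(pgd_entry *kernel_pgd, unsigned int idx, pgd_entry val)
{
    assert(idx < SKETCH_PTRS_PER_PGD);
    if (idx < SKETCH_USER_ENTRIES) {
        kernel_to_user(kernel_pgd)[idx] = val;
        kernel_pgd[idx] = val | SKETCH_NX;
    } else {
        kernel_pgd[idx] = val;   /* kernel half: not mirrored in this sketch */
    }
}
```

The second page of the pair is the "additional 4k per process" listed under Increased Memory Use.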

arch/x86/Kconfig

Lines changed: 14 additions & 0 deletions
@@ -88,6 +88,7 @@ config X86
 	select GENERIC_CLOCKEVENTS_MIN_ADJUST
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_CPU_AUTOPROBE
+	select GENERIC_CPU_VULNERABILITIES
 	select GENERIC_EARLY_IOREMAP
 	select GENERIC_FIND_FIRST_BIT
 	select GENERIC_IOMAP
@@ -428,6 +429,19 @@ config GOLDFISH
 	def_bool y
 	depends on X86_GOLDFISH

+config RETPOLINE
+	bool "Avoid speculative indirect branches in kernel"
+	default y
+	help
+	  Compile kernel with the retpoline compiler options to guard against
+	  kernel-to-user data leaks by avoiding speculative indirect
+	  branches. Requires a compiler with -mindirect-branch=thunk-extern
+	  support for full protection. The kernel may run slower.
+
+	  Without compiler support, at least indirect branches in assembler
+	  code are eliminated. Since this includes the syscall entry path,
+	  it is not entirely pointless.
+
 config INTEL_RDT
 	bool "Intel Resource Director Technology support"
 	default n

arch/x86/Makefile

Lines changed: 10 additions & 0 deletions
@@ -230,6 +230,16 @@ KBUILD_CFLAGS += -Wno-sign-compare
 #
 KBUILD_CFLAGS += -fno-asynchronous-unwind-tables

+# Avoid indirect branches in kernel to deal with Spectre
+ifdef CONFIG_RETPOLINE
+    RETPOLINE_CFLAGS += $(call cc-option,-mindirect-branch=thunk-extern -mindirect-branch-register)
+    ifneq ($(RETPOLINE_CFLAGS),)
+        KBUILD_CFLAGS += $(RETPOLINE_CFLAGS) -DRETPOLINE
+    else
+        $(warning CONFIG_RETPOLINE=y, but not supported by the compiler. Toolchain update recommended.)
+    endif
+endif
+
 archscripts: scripts_basic
 	$(Q)$(MAKE) $(build)=arch/x86/tools relocs
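
The compiler flags above target indirect branches, which in C usually come from calls through function pointers. The fragment below is a minimal userspace illustration of such a call site and is not taken from the kernel; `binop_fn`, `ops` and `dispatch` are made-up names. As an assumption to verify against your toolchain: a GCC that accepts -mindirect-branch=thunk should route the call in `dispatch()` through a compiler-generated thunk, whereas the thunk-extern variant used in the Makefile expects the thunks to be supplied separately, as the kernel does.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*binop_fn)(int, int);

int add(int a, int b) { return a + b; }
int mul(int a, int b) { return a * b; }

/* Table of handlers, similar in spirit to an ops structure. */
static const binop_fn ops[] = { add, mul };

int dispatch(size_t which, int a, int b)
{
    assert(which < sizeof(ops) / sizeof(ops[0]));
    /*
     * The target is only known at run time, so this is typically
     * compiled to an indirect call (e.g. "call *%rax" on x86-64).
     */
    return ops[which](a, b);
}
```

Comparing the output of `objdump -d` for builds with and without the option is one way to check whether the call site was rewritten.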

arch/x86/crypto/aesni-intel_asm.S

Lines changed: 3 additions & 2 deletions
@@ -32,6 +32,7 @@
 #include <linux/linkage.h>
 #include <asm/inst.h>
 #include <asm/frame.h>
+#include <asm/nospec-branch.h>

 /*
  * The following macros are used to move an (un)aligned 16 byte value to/from
@@ -2884,7 +2885,7 @@ ENTRY(aesni_xts_crypt8)
 	pxor INC, STATE4
 	movdqu IV, 0x30(OUTP)

-	call *%r11
+	CALL_NOSPEC %r11

 	movdqu 0x00(OUTP), INC
 	pxor INC, STATE1
@@ -2929,7 +2930,7 @@ ENTRY(aesni_xts_crypt8)
 	_aesni_gf128mul_x_ble()
 	movups IV, (IVP)

-	call *%r11
+	CALL_NOSPEC %r11

 	movdqu 0x40(OUTP), INC
 	pxor INC, STATE1

arch/x86/crypto/camellia-aesni-avx-asm_64.S

Lines changed: 2 additions & 1 deletion
@@ -17,6 +17,7 @@

 #include <linux/linkage.h>
 #include <asm/frame.h>
+#include <asm/nospec-branch.h>

 #define CAMELLIA_TABLE_BYTE_LEN 272

@@ -1227,7 +1228,7 @@ camellia_xts_crypt_16way:
 	vpxor 14 * 16(%rax), %xmm15, %xmm14;
 	vpxor 15 * 16(%rax), %xmm15, %xmm15;

-	call *%r9;
+	CALL_NOSPEC %r9;

 	addq $(16 * 16), %rsp;

arch/x86/crypto/camellia-aesni-avx2-asm_64.S

Lines changed: 2 additions & 1 deletion
@@ -12,6 +12,7 @@

 #include <linux/linkage.h>
 #include <asm/frame.h>
+#include <asm/nospec-branch.h>

 #define CAMELLIA_TABLE_BYTE_LEN 272

@@ -1343,7 +1344,7 @@ camellia_xts_crypt_32way:
 	vpxor 14 * 32(%rax), %ymm15, %ymm14;
 	vpxor 15 * 32(%rax), %ymm15, %ymm15;

-	call *%r9;
+	CALL_NOSPEC %r9;

 	addq $(16 * 32), %rsp;
13491350
