Commit 01c9b17

hansendc authored and KAGA-KOKO committed
x86/Documentation: Add PTI description
Add some details about how PTI works, what some of the downsides
are, and how to debug it when things go wrong.

Also document the kernel parameter: 'pti/nopti'.

Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Moritz Lipp <[email protected]>
Cc: Daniel Gruss <[email protected]>
Cc: Michael Schwarz <[email protected]>
Cc: Richard Fellner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andi Lutomirsky <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
1 parent de53c37 commit 01c9b17

2 files changed, +200 -7 lines changed

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 14 additions & 7 deletions
@@ -2685,8 +2685,6 @@
 			steal time is computed, but won't influence scheduler
 			behaviour
 
-	nopti		[X86-64] Disable kernel page table isolation
-
 	nolapic		[X86-32,APIC] Do not enable or use the local APIC.
 
 	nolapic_timer	[X86-32,APIC] Do not use the local APIC timer.
@@ -3255,11 +3253,20 @@
 	pt.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
-	pti=		[X86_64]
-			Control user/kernel address space isolation:
-			on - enable
-			off - disable
-			auto - default setting
+	pti=		[X86_64] Control Page Table Isolation of user and
+			kernel address spaces. Disabling this feature
+			removes hardening, but improves performance of
+			system calls and interrupts.
+
+			on   - unconditionally enable
+			off  - unconditionally disable
+			auto - kernel detects whether your CPU model is
+			       vulnerable to issues that PTI mitigates
+
+			Not specifying this option is equivalent to pti=auto.
+
+	nopti		[X86_64]
+			Equivalent to pti=off
 
 	pty.legacy_count=
 			[KNL] Number of legacy pty's. Overwrites compiled-in
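
The parameter semantics in this hunk ('nopti' is equivalent to 'pti=off', and no option is equivalent to 'pti=auto') can be sketched as a small shell helper. This is an illustration only: the function name pti_requested_mode is hypothetical, and letting the last matching option win when several are given is an assumption of this sketch, not behaviour documented here.

```shell
# Hypothetical helper: report which PTI mode a kernel command line requests.
# Prints "on", "off" or "auto".
# ASSUMPTION: if several pti=/nopti options appear, the last one wins.
pti_requested_mode() {
    printf '%s\n' "$1" | awk '
        BEGIN { mode = "auto" }            # no option given means pti=auto
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "pti=on")
                    mode = "on"
                else if ($i == "pti=off" || $i == "nopti")
                    mode = "off"           # nopti is equivalent to pti=off
                else if ($i == "pti=auto")
                    mode = "auto"
            }
        }
        END { print mode }'
}
```

On a running system it could be called as: pti_requested_mode "$(cat /proc/cmdline)". The result is only what was requested at boot; with "auto" the kernel still decides based on the CPU model.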

Documentation/x86/pti.txt

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
+Overview
+========
+
+Page Table Isolation (pti, previously known as KAISER[1]) is a
+countermeasure against attacks on the shared user/kernel address
+space such as the "Meltdown" approach[2].
+
+To mitigate this class of attacks, we create an independent set of
+page tables for use only when running userspace applications. When
+the kernel is entered via syscalls, interrupts or exceptions, the
+page tables are switched to the full "kernel" copy. When the system
+switches back to user mode, the user copy is used again.
+
+The userspace page tables contain only a minimal amount of kernel
+data: only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor table
+(IDT). There are a few strictly unnecessary things that get mapped
+such as the first C function when entering an interrupt (see
+comments in pti.c).
+
+This approach helps to ensure that side-channel attacks leveraging
+the paging structures do not function when PTI is enabled. It can be
+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+Once enabled at compile-time, it can be disabled at boot with the
+'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
+
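
As a rough illustration of the compile-time switch named above, the following sketch checks a kernel config for CONFIG_PAGE_TABLE_ISOLATION=y. The function name pti_compiled_in is hypothetical, and where the config text comes from is an assumption that varies between distributions.

```shell
# Hypothetical helper: read kernel config text on stdin and succeed (exit 0)
# only if PTI support was compiled in, i.e. the option is set to "y".
pti_compiled_in() {
    grep -q '^CONFIG_PAGE_TABLE_ISOLATION=y$'
}
```

Possible use, assuming the distribution installs the config under /boot: pti_compiled_in < "/boot/config-$(uname -r)" && echo "PTI compiled in". A successful check says nothing about whether PTI was disabled at boot.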
+Page Table Management
+=====================
+
+When PTI is enabled, the kernel manages two sets of page tables.
+The first set is very similar to the single set which is present in
+kernels without PTI. This includes a complete mapping of userspace
+that the kernel can use for things like copy_to_user().
+
+Although _complete_, the user portion of the kernel page tables is
+crippled by setting the NX bit in the top level. This ensures
+that any missed kernel->user CR3 switch will immediately crash
+userspace upon executing its first instruction.
+
+The userspace page tables map only the kernel data needed to enter
+and exit the kernel. This data is entirely contained in the 'struct
+cpu_entry_area' structure which is placed in the fixmap which gives
+each CPU's copy of the area a compile-time-fixed virtual address.
+
+For new userspace mappings, the kernel makes the entries in its
+page tables like normal. The only difference is when the kernel
+makes entries in the top (PGD) level. In addition to setting the
+entry in the main kernel PGD, a copy of the entry is made in the
+userspace page tables' PGD.
+
+This sharing at the PGD level also inherently shares all the lower
+layers of the page tables. This leaves a single, shared set of
+userspace page tables to manage. One PTE to lock, one set of
+accessed bits, dirty bits, etc...
+
+Overhead
+========
+
+Protection against side-channel attacks is important. But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes an additional 4k per process).
+  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
+     aligned so that it can be mapped by setting a single PMD
+     entry. This consumes nearly 2MB of RAM once the kernel
+     is decompressed, but no space in the kernel image itself.
+
+2. Runtime Cost
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though.) Moves to CR3 are on the order of a hundred
+     cycles, and are required at every entry and exit.
+  b. A "trampoline" must be used for SYSCALL entry. This
+     trampoline depends on a smaller set of resources than the
+     non-PTI SYSCALL entry code, so requires mapping fewer
+     things into the userspace page tables. The downside is
+     that stacks must be switched at entry time.
+  c. Global pages are disabled for all kernel structures not
+     mapped into both kernel and userspace page tables. This
+     feature of the MMU allows different processes to share TLB
+     entries mapping the kernel. Losing the feature means more
+     TLB misses after a context switch. The actual loss of
+     performance is very small, however, never exceeding 1%.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when switching page
+     tables by setting a special bit in CR3 when the page tables
+     are changed. This makes switching the page tables (at context
+     switch, or kernel entry/exit) cheaper. But, on systems with
+     PCID support, the context switch code must flush both the user
+     and kernel entries out of the TLB. The user PCID TLB flush is
+     deferred until the exit to userspace, minimizing the cost.
+     See intel.com/sdm for the gory PCID/INVPCID details.
+  e. The userspace page tables must be populated for each new
+     process. Even without PTI, the shared kernel mappings
+     are created by copying top-level (PGD) entries into each
+     new process. But, with PTI, there are now *two* kernel
+     mappings: one in the kernel page tables that maps everything
+     and one for the entry/exit structures. At fork(), we need to
+     copy both.
+  f. In addition to the fork()-time copying, there must also
+     be an update to the userspace PGD any time a set_pgd() is done
+     on a PGD used to map userspace. This ensures that the kernel
+     and userspace copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB. That means that each syscall, interrupt
+     or exception flushes the TLB.
+  h. INVPCID is a TLB-flushing instruction which allows flushing
+     of TLB entries for non-current PCIDs. Some systems support
+     PCIDs, but do not support INVPCID. On these systems, addresses
+     can only be flushed from the TLB for the current PCID. When
+     flushing a kernel address, we need to flush all PCIDs, so a
+     single kernel address flush will require a TLB-flushing CR3
+     write upon the next use of every PCID.
+
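
Which of the PCID-related runtime costs above applies depends on the CPU features. The sketch below classifies a CPU "flags" string into the three cases the list distinguishes. The function name pti_tlb_flush_class and its output labels are hypothetical. The sketch assumes the "pcid" and "invpcid" flag names reported in /proc/cpuinfo, and it only reflects what the CPU advertises, not whether the running kernel actually uses PCIDs.

```shell
# Hypothetical helper: classify a space-separated CPU flags string.
#   no-pcid       every CR3 write flushes the entire TLB
#   pcid-only     PCIDs available, but only the current PCID can be flushed
#   pcid+invpcid  TLB entries of non-current PCIDs can be flushed as well
pti_tlb_flush_class() {
    # Pad with spaces so that only whole flag names match
    # (for example, "invpcid_single" must not count as "invpcid").
    case " $1 " in
        *" pcid "*)
            case " $1 " in
                *" invpcid "*) echo "pcid+invpcid" ;;
                *)             echo "pcid-only" ;;
            esac ;;
        *)
            echo "no-pcid" ;;
    esac
}
```

Possible use: pti_tlb_flush_class "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)".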
+Possible Future Work
+====================
+1. We can be more careful about not actually writing to CR3
+   unless its value is actually changed.
+2. Allow PTI to be enabled/disabled at runtime in addition to the
+   boot-time switching.
+
+Testing
+=======
+
+To test stability of PTI, the following test procedure is recommended,
+ideally doing all of these in parallel:
+
+1. Set CONFIG_DEBUG_ENTRY=y
+2. Run several copies of all of the tools/testing/selftests/x86/ tests
+   (excluding MPX and protection_keys) in a loop on multiple CPUs for
+   several minutes. These tests frequently uncover corner cases in the
+   kernel entry code. In general, old kernels might cause these tests
+   themselves to crash, but they should never crash the kernel.
+3. Run the 'perf' tool in a mode (top or record) that generates many
+   frequent performance monitoring non-maskable interrupts (see "NMI"
+   in /proc/interrupts). This exercises the NMI entry/exit code which
+   is known to trigger bugs in code paths that did not expect to be
+   interrupted, including nested NMIs. Using "-c" boosts the rate of
+   NMIs, and using two -c with separate counters encourages nested NMIs
+   and less deterministic behavior.
+
+	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
+
+4. Launch a KVM virtual machine.
+5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
+   This has been a lightly-tested code path and needs extra scrutiny.
+
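
To confirm that the perf step really generates NMIs, the "NMI" line of /proc/interrupts can be sampled before and after a run. The sketch below sums the per-CPU counters on that line. The function name nmi_total is hypothetical, and the parsing assumes the usual layout of a label, one count per CPU, then a text description.

```shell
# Hypothetical helper: read /proc/interrupts-style text on stdin and print
# the sum of the per-CPU counts on the "NMI:" line (nothing if absent).
nmi_total() {
    awk '$1 == "NMI:" {
            total = 0
            # Counts come first; stop at the first non-numeric field,
            # which starts the trailing description.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                total += $i
            print total
        }'
}
```

Possible use: record nmi_total < /proc/interrupts before and after one iteration of the perf loop; the difference between the two values should be large if the NMI entry/exit path is being exercised.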
+Debugging
+=========
+
+Bugs in PTI cause a few different signatures of crashes
+that are worth noting here.
+
+ * Failures of the selftests/x86 code. Usually a bug in one of the
+   more obscure corners of entry_64.S
+ * Crashes in early boot, especially around CPU bringup. Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt. Caused by bugs in entry_64.S,
+   like screwing up a page table switch. Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI. The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts. Also caused by incorrectly mapping NMI
+   code. NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace. entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults. Caused by touching non-pti-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not pti-mapped.
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs. These have
+   tended to be TLB invalidation issues. Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
+1. https://gruss.cc/files/kaiser.pdf
+2. https://meltdownattack.com/meltdown.pdf
