
Commit b956575

amluto authored and Ingo Molnar committed
x86/mm: Flush more aggressively in lazy TLB mode
Since commit:

  94b1b03 ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel
thread (including the idle thread), the kernel keeps using the last
user mm's page tables without attempting to maintain user TLB
coherence at all.  From a pure semantic perspective, this is fine --
kernel threads won't attempt to access user pages, so having stale TLB
entries doesn't matter.

Unfortunately, I forgot about a subtlety.  By skipping TLB flushes, we
also allow any paging-structure caches that may exist on the CPU to
become incoherent.  This means that we can have a paging-structure
cache entry that references a freed page table, and the CPU is within
its rights to do a speculative page walk starting at the freed page
table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses.  I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB.  Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it
   notices these bogus entries.  I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that
> paging-structure entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB-DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory.
> In such cases, the NB will generate an MCE if it sees a mismatch
> between the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode.  With this patch
applied, we do it in one of two ways:

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed.  Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures to
implement the improved flush logic.
Reported-by: Markus Trippelsdorf <[email protected]>
Reported-by: Adam Borowski <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Johannes Hirte <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Kagan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 94b1b03 ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
1 parent 616dd58 commit b956575
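
For benchmarking, the debug switch named above can be flipped and read back at runtime. Below is a minimal userspace sketch of doing that -- my own illustration, not part of the patch; it assumes debugfs is mounted at /sys/kernel/debug, a kernel carrying this change, and root privileges:

/* Illustrative helper for the lazy-TLB debug switch; not part of this patch. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TLB_LAZY_PATH "/sys/kernel/debug/x86/tlb_use_lazy_mode"

int main(int argc, char **argv)
{
	char buf[2] = { 0 };
	int fd;

	if (argc > 1) {
		/* Write "0" or "1" to override the default mode. */
		fd = open(TLB_LAZY_PATH, O_WRONLY);
		if (fd < 0 || write(fd, argv[1], strlen(argv[1])) < 0) {
			perror(TLB_LAZY_PATH);
			return EXIT_FAILURE;
		}
		close(fd);
	}

	/* Read the current setting back ('0' or '1'). */
	fd = open(TLB_LAZY_PATH, O_RDONLY);
	if (fd < 0 || read(fd, buf, 1) < 0) {
		perror(TLB_LAZY_PATH);
		return EXIT_FAILURE;
	}
	close(fd);

	printf("tlb_use_lazy_mode = %c\n", buf[0]);
	return EXIT_SUCCESS;
}

Writing "1" selects lazy mode and "0" forces the immediate switch to init_mm; the value read back reflects the current setting of the tlb_use_lazy_mode static key.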

3 files changed: +136 −49


arch/x86/include/asm/mmu_context.h

Lines changed: 1 addition & 7 deletions
@@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 	DEBUG_LOCKS_WARN_ON(preemptible());
 }
 
-static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-	int cpu = smp_processor_id();
-
-	if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
-		cpumask_clear_cpu(cpu, mm_cpumask(mm));
-}
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)

arch/x86/include/asm/tlbflush.h

Lines changed: 24 additions & 0 deletions
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm.  Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
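
As a rough aid to reading the comment above, the three per-CPU states can be modeled in a few lines of plain userspace C. This is only an illustration of the state machine; struct mm, init_mm_model and classify() are stand-in names, not kernel interfaces:

/* Toy model of the tlb_state lazy-mode states; illustrative only. */
#include <stdbool.h>
#include <stdio.h>

struct mm { const char *name; };                /* stand-in for struct mm_struct */
static struct mm init_mm_model = { "init_mm" }; /* stand-in for init_mm */

enum tlb_cpu_state {
	CPU_ACTIVE_MM,  /* real mm loaded, our bit set in mm_cpumask(), is_lazy == false */
	CPU_NO_REAL_MM, /* loaded_mm == &init_mm, our bit clear, is_lazy == false */
	CPU_LAZY_MM,    /* real mm still loaded, our bit still set, is_lazy == true */
};

static enum tlb_cpu_state classify(const struct mm *loaded_mm, bool is_lazy)
{
	if (loaded_mm == &init_mm_model)
		return CPU_NO_REAL_MM;  /* kernel thread running on init_mm's tables */
	return is_lazy ? CPU_LAZY_MM : CPU_ACTIVE_MM;
}

int main(void)
{
	struct mm user_mm = { "some user mm" };

	printf("%d %d %d\n",
	       classify(&init_mm_model, false), /* CPU_NO_REAL_MM */
	       classify(&user_mm, false),       /* CPU_ACTIVE_MM  */
	       classify(&user_mm, true));       /* CPU_LAZY_MM    */
	return 0;
}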

arch/x86/mm/tlb.c

Lines changed: 111 additions & 42 deletions
@@ -30,6 +30,8 @@
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
+DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 			    u16 *new_asid, bool *need_flush)
 {
@@ -80,7 +82,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
+	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -142,45 +144,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	if (real_prev == next) {
 		VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			  next->context.ctx_id);
 
-		if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
-			/*
-			 * There's nothing to do: we weren't lazy, and we
-			 * aren't changing our mm.  We don't need to flush
-			 * anything, nor do we need to update CR3, CR4, or
-			 * LDTR.
-			 */
-			return;
-		}
-
-		/* Resume remote flushes and then read tlb_gen. */
-		cpumask_set_cpu(cpu, mm_cpumask(next));
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
-		    next_tlb_gen) {
-			/*
-			 * Ideally, we'd have a flush_tlb() variant that
-			 * takes the known CR3 value as input.  This would
-			 * be faster on Xen PV and on hypothetical CPUs
-			 * on which INVPCID is fast.
-			 */
-			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
-				       next_tlb_gen);
-			write_cr3(build_cr3(next, prev_asid));
-			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
-					TLB_FLUSH_ALL);
-		}
-
 		/*
-		 * We just exited lazy mode, which means that CR4 and/or LDTR
-		 * may be stale.  (Changes to the required CR4 and LDTR states
-		 * are not reflected in tlb_gen.)
+		 * We don't currently support having a real mm loaded without
+		 * our cpu set in mm_cpumask().  We have all the bookkeeping
+		 * in place to figure out whether we would need to flush
+		 * if our cpu were cleared in mm_cpumask(), but we don't
+		 * currently use it.
 		 */
+		if (WARN_ON_ONCE(real_prev != &init_mm &&
+				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
+			cpumask_set_cpu(cpu, mm_cpumask(next));
+
+		return;
 	} else {
 		u16 new_asid;
 		bool need_flush;
@@ -199,10 +180,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		}
 
 		/* Stop remote flushes for the previous mm */
-		if (cpumask_test_cpu(cpu, mm_cpumask(real_prev)))
-			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
-
-		VM_WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+		VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+				real_prev != &init_mm);
+		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
 		/*
 		 * Start remote flushes and then read tlb_gen.
@@ -232,6 +212,37 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	switch_ldt(real_prev, next);
 }
 
+/*
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm.  Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row.  It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
+void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+		return;
+
+	if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+		/*
+		 * There's a significant optimization that may be possible
+		 * here.  We have accurate enough TLB flush tracking that we
+		 * don't need to maintain coherence of TLB per se when we're
+		 * lazy.  We do, however, need to maintain coherence of
+		 * paging-structure caches.  We could, in principle, leave our
+		 * old mm loaded and only switch to init_mm when
+		 * tlb_remove_page() happens.
+		 */
+		this_cpu_write(cpu_tlbstate.is_lazy, true);
+	} else {
+		switch_mm(NULL, &init_mm, NULL);
+	}
+}
+
 /*
  * Call this when reinitializing a CPU.  It fixes the following potential
  * problems:
@@ -303,16 +314,20 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
 
+	if (unlikely(loaded_mm == &init_mm))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
+	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
 		/*
-		 * We're in lazy mode -- don't flush.  We can get here on
-		 * remote flushes due to races and on local flushes if a
-		 * kernel thread coincidentally flushes the mm it's lazily
-		 * still using.
+		 * We're in lazy mode.  We need to at least flush our
+		 * paging-structure cache to avoid speculatively reading
+		 * garbage into our TLB.  Since switching to init_mm is barely
+		 * slower than a minimal flush, just switch to init_mm.
 		 */
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
 	}
 
@@ -611,3 +626,57 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
+				 size_t count, loff_t *ppos)
+{
+	char buf[2];
+
+	buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+	buf[1] = '\n';
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+}
+
+static ssize_t tlblazy_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	bool val;
+
+	if (kstrtobool_from_user(user_buf, count, &val))
+		return -EINVAL;
+
+	if (val)
+		static_branch_enable(&tlb_use_lazy_mode);
+	else
+		static_branch_disable(&tlb_use_lazy_mode);
+
+	return count;
+}
+
+static const struct file_operations fops_tlblazy = {
+	.read = tlblazy_read_file,
+	.write = tlblazy_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init init_tlb_use_lazy_mode(void)
+{
+	if (boot_cpu_has(X86_FEATURE_PCID)) {
+		/*
+		 * Heuristic: with PCID on, switching to and from
+		 * init_mm is reasonably fast, but remote flush IPIs
+		 * as expensive as ever, so turn off lazy TLB mode.
+		 *
+		 * We can't do this in setup_pcid() because static keys
+		 * haven't been initialized yet, and it would blow up
+		 * badly.
+		 */
		static_branch_disable(&tlb_use_lazy_mode);
+	}
+
+	debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_tlblazy);
+	return 0;
+}
+late_initcall(init_tlb_use_lazy_mode);
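
One crude way to compare the two modes via the debugfs switch registered above is a pipe-based context-switch ping-pong: with the two processes pinned to different CPUs (e.g. with taskset), each CPU goes idle while its process waits, so it repeatedly enters and leaves lazy TLB mode. This is a rough sketch using only standard POSIX calls, not a tuned benchmark, and the iteration count is arbitrary:

/* Rough context-switch ping-pong for comparing the two TLB modes; illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1000000

int main(void)
{
	int ping[2], pong[2];
	char byte = 'x';
	struct timespec start, end;
	pid_t child;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return EXIT_FAILURE;
	}

	child = fork();
	if (child < 0) {
		perror("fork");
		return EXIT_FAILURE;
	}
	if (child == 0) {
		/* Child: echo every byte it receives. */
		for (int i = 0; i < ITERATIONS; i++) {
			if (read(ping[0], &byte, 1) != 1 ||
			    write(pong[1], &byte, 1) != 1)
				_exit(1);
		}
		_exit(0);
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < ITERATIONS; i++) {
		if (write(ping[1], &byte, 1) != 1 ||
		    read(pong[0], &byte, 1) != 1) {
			perror("ping-pong");
			return EXIT_FAILURE;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	wait(NULL);

	double secs = (end.tv_sec - start.tv_sec) +
		      (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%d round trips in %.3f s (%.0f ns each)\n",
	       ITERATIONS, secs, secs * 1e9 / ITERATIONS);
	return 0;
}

Run it once with tlb_use_lazy_mode set to 1 and once with 0, and compare the per-round-trip times.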
