Skip to content

Commit 6f1a489

Browse files
committed
x86/apic/msi: Plug non-maskable MSI affinity race
Evan tracked down a subtle race between the update of the MSI message and the device raising an interrupt internally on PCI devices which do not support MSI masking. The update of the MSI message is non-atomic and consists of either 2 or 3 sequential 32bit wide writes to the PCI config space. - Write address low 32bits - Write address high 32bits (If supported by device) - Write data When an interrupt is migrated then both address and data might change, so the kernel attempts to mask the MSI interrupt first. But for MSI masking is optional, so there exist devices which do not provide it. That means that if the device raises an interrupt internally between the writes then a MSI message is sent built from half updated state. On x86 this can lead to spurious interrupts on the wrong interrupt vector when the affinity setting changes both address and data. As a consequence the device interrupt can be lost causing the device to become stuck or malfunctioning. Evan tried to handle that by disabling MSI accross an MSI message update. That's not feasible because disabling MSI has issues on its own: If MSI is disabled the PCI device is routing an interrupt to the legacy INTx mechanism. The INTx delivery can be disabled, but the disablement is not working on all devices. Some devices lose interrupts when both MSI and INTx delivery are disabled. Another way to solve this would be to enforce the allocation of the same vector on all CPUs in the system for this kind of screwed devices. That could be done, but it would bring back the vector space exhaustion problems which got solved a few years ago. Fortunately the high address (if supported by the device) is only relevant when X2APIC is enabled which implies interrupt remapping. In the interrupt remapping case the affinity setting is happening at the interrupt remapping unit and the PCI MSI message is programmed only once when the PCI device is initialized. That makes it possible to solve it with a two step update: 1) Target the MSI msg to the new vector on the current target CPU 2) Target the MSI msg to the new vector on the new target CPU In both cases writing the MSI message is only changing a single 32bit word which prevents the issue of inconsistency. After writing the final destination it is necessary to check whether the device issued an interrupt while the intermediate state #1 (new vector, current CPU) was in effect. This is possible because the affinity change is always happening on the current target CPU. The code runs with interrupts disabled, so the interrupt can be detected by checking the IRR of the local APIC. If the vector is pending in the IRR then the interrupt is retriggered on the new target CPU by sending an IPI for the associated vector on the target CPU. This can cause spurious interrupts on both the local and the new target CPU. 1) If the new vector is not in use on the local CPU and the device affected by the affinity change raised an interrupt during the transitional state (step #1 above) then interrupt entry code will ignore that spurious interrupt. The vector is marked so that the 'No irq handler for vector' warning is supressed once. 2) If the new vector is in use already on the local CPU then the IRR check might see an pending interrupt from the device which is using this vector. The IPI to the new target CPU will then invoke the handler of the device, which got the affinity change, even if that device did not issue an interrupt 3) If the new vector is in use already on the local CPU and the device affected by the affinity change raised an interrupt during the transitional state (step #1 above) then the handler of the device which uses that vector on the local CPU will be invoked. expose issues in device driver interrupt handlers which are not prepared to handle a spurious interrupt correctly. This not a regression, it's just exposing something which was already broken as spurious interrupts can happen for a lot of reasons and all driver handlers need to be able to deal with them. Reported-by: Evan Green <[email protected]> Debugged-by: Evan Green <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Evan Green <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
1 parent 2b73ea3 commit 6f1a489

File tree

6 files changed

+163
-4
lines changed

6 files changed

+163
-4
lines changed

arch/x86/include/asm/apic.h

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -454,6 +454,14 @@ static inline void ack_APIC_irq(void)
454454
apic_eoi();
455455
}
456456

457+
458+
static inline bool lapic_vector_set_in_irr(unsigned int vector)
459+
{
460+
u32 irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
461+
462+
return !!(irr & (1U << (vector % 32)));
463+
}
464+
457465
static inline unsigned default_get_apic_id(unsigned long x)
458466
{
459467
unsigned int ver = GET_APIC_VERSION(apic_read(APIC_LVR));

arch/x86/kernel/apic/msi.c

Lines changed: 125 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -23,10 +23,8 @@
2323

2424
static struct irq_domain *msi_default_domain;
2525

26-
static void irq_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
26+
static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg)
2727
{
28-
struct irq_cfg *cfg = irqd_cfg(data);
29-
3028
msg->address_hi = MSI_ADDR_BASE_HI;
3129

3230
if (x2apic_enabled())
@@ -47,6 +45,127 @@ static void irq_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
4745
MSI_DATA_VECTOR(cfg->vector);
4846
}
4947

48+
static void irq_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
49+
{
50+
__irq_msi_compose_msg(irqd_cfg(data), msg);
51+
}
52+
53+
static void irq_msi_update_msg(struct irq_data *irqd, struct irq_cfg *cfg)
54+
{
55+
struct msi_msg msg[2] = { [1] = { }, };
56+
57+
__irq_msi_compose_msg(cfg, msg);
58+
irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);
59+
}
60+
61+
static int
62+
msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
63+
{
64+
struct irq_cfg old_cfg, *cfg = irqd_cfg(irqd);
65+
struct irq_data *parent = irqd->parent_data;
66+
unsigned int cpu;
67+
int ret;
68+
69+
/* Save the current configuration */
70+
cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd));
71+
old_cfg = *cfg;
72+
73+
/* Allocate a new target vector */
74+
ret = parent->chip->irq_set_affinity(parent, mask, force);
75+
if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
76+
return ret;
77+
78+
/*
79+
* For non-maskable and non-remapped MSI interrupts the migration
80+
* to a different destination CPU and a different vector has to be
81+
* done careful to handle the possible stray interrupt which can be
82+
* caused by the non-atomic update of the address/data pair.
83+
*
84+
* Direct update is possible when:
85+
* - The MSI is maskable (remapped MSI does not use this code path)).
86+
* The quirk bit is not set in this case.
87+
* - The new vector is the same as the old vector
88+
* - The old vector is MANAGED_IRQ_SHUTDOWN_VECTOR (interrupt starts up)
89+
* - The new destination CPU is the same as the old destination CPU
90+
*/
91+
if (!irqd_msi_nomask_quirk(irqd) ||
92+
cfg->vector == old_cfg.vector ||
93+
old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
94+
cfg->dest_apicid == old_cfg.dest_apicid) {
95+
irq_msi_update_msg(irqd, cfg);
96+
return ret;
97+
}
98+
99+
/*
100+
* Paranoia: Validate that the interrupt target is the local
101+
* CPU.
102+
*/
103+
if (WARN_ON_ONCE(cpu != smp_processor_id())) {
104+
irq_msi_update_msg(irqd, cfg);
105+
return ret;
106+
}
107+
108+
/*
109+
* Redirect the interrupt to the new vector on the current CPU
110+
* first. This might cause a spurious interrupt on this vector if
111+
* the device raises an interrupt right between this update and the
112+
* update to the final destination CPU.
113+
*
114+
* If the vector is in use then the installed device handler will
115+
* denote it as spurious which is no harm as this is a rare event
116+
* and interrupt handlers have to cope with spurious interrupts
117+
* anyway. If the vector is unused, then it is marked so it won't
118+
* trigger the 'No irq handler for vector' warning in do_IRQ().
119+
*
120+
* This requires to hold vector lock to prevent concurrent updates to
121+
* the affected vector.
122+
*/
123+
lock_vector_lock();
124+
125+
/*
126+
* Mark the new target vector on the local CPU if it is currently
127+
* unused. Reuse the VECTOR_RETRIGGERED state which is also used in
128+
* the CPU hotplug path for a similar purpose. This cannot be
129+
* undone here as the current CPU has interrupts disabled and
130+
* cannot handle the interrupt before the whole set_affinity()
131+
* section is done. In the CPU unplug case, the current CPU is
132+
* about to vanish and will not handle any interrupts anymore. The
133+
* vector is cleaned up when the CPU comes online again.
134+
*/
135+
if (IS_ERR_OR_NULL(this_cpu_read(vector_irq[cfg->vector])))
136+
this_cpu_write(vector_irq[cfg->vector], VECTOR_RETRIGGERED);
137+
138+
/* Redirect it to the new vector on the local CPU temporarily */
139+
old_cfg.vector = cfg->vector;
140+
irq_msi_update_msg(irqd, &old_cfg);
141+
142+
/* Now transition it to the target CPU */
143+
irq_msi_update_msg(irqd, cfg);
144+
145+
/*
146+
* All interrupts after this point are now targeted at the new
147+
* vector/CPU.
148+
*
149+
* Drop vector lock before testing whether the temporary assignment
150+
* to the local CPU was hit by an interrupt raised in the device,
151+
* because the retrigger function acquires vector lock again.
152+
*/
153+
unlock_vector_lock();
154+
155+
/*
156+
* Check whether the transition raced with a device interrupt and
157+
* is pending in the local APICs IRR. It is safe to do this outside
158+
* of vector lock as the irq_desc::lock of this interrupt is still
159+
* held and interrupts are disabled: The check is not accessing the
160+
* underlying vector store. It's just checking the local APIC's
161+
* IRR.
162+
*/
163+
if (lapic_vector_set_in_irr(cfg->vector))
164+
irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
165+
166+
return ret;
167+
}
168+
50169
/*
51170
* IRQ Chip for MSI PCI/PCI-X/PCI-Express Devices,
52171
* which implement the MSI or MSI-X Capability Structure.
@@ -58,6 +177,7 @@ static struct irq_chip pci_msi_controller = {
58177
.irq_ack = irq_chip_ack_parent,
59178
.irq_retrigger = irq_chip_retrigger_hierarchy,
60179
.irq_compose_msi_msg = irq_msi_compose_msg,
180+
.irq_set_affinity = msi_set_affinity,
61181
.flags = IRQCHIP_SKIP_SET_WAKE,
62182
};
63183

@@ -146,6 +266,8 @@ void __init arch_init_msi_domain(struct irq_domain *parent)
146266
}
147267
if (!msi_default_domain)
148268
pr_warn("failed to initialize irqdomain for MSI/MSI-x.\n");
269+
else
270+
msi_default_domain->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
149271
}
150272

151273
#ifdef CONFIG_IRQ_REMAP

include/linux/irq.h

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -209,6 +209,8 @@ struct irq_data {
209209
* IRQD_SINGLE_TARGET - IRQ allows only a single affinity target
210210
* IRQD_DEFAULT_TRIGGER_SET - Expected trigger already been set
211211
* IRQD_CAN_RESERVE - Can use reservation mode
212+
* IRQD_MSI_NOMASK_QUIRK - Non-maskable MSI quirk for affinity change
213+
* required
212214
*/
213215
enum {
214216
IRQD_TRIGGER_MASK = 0xf,
@@ -231,6 +233,7 @@ enum {
231233
IRQD_SINGLE_TARGET = (1 << 24),
232234
IRQD_DEFAULT_TRIGGER_SET = (1 << 25),
233235
IRQD_CAN_RESERVE = (1 << 26),
236+
IRQD_MSI_NOMASK_QUIRK = (1 << 27),
234237
};
235238

236239
#define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -390,6 +393,21 @@ static inline bool irqd_can_reserve(struct irq_data *d)
390393
return __irqd_to_state(d) & IRQD_CAN_RESERVE;
391394
}
392395

396+
static inline void irqd_set_msi_nomask_quirk(struct irq_data *d)
397+
{
398+
__irqd_to_state(d) |= IRQD_MSI_NOMASK_QUIRK;
399+
}
400+
401+
static inline void irqd_clr_msi_nomask_quirk(struct irq_data *d)
402+
{
403+
__irqd_to_state(d) &= ~IRQD_MSI_NOMASK_QUIRK;
404+
}
405+
406+
static inline bool irqd_msi_nomask_quirk(struct irq_data *d)
407+
{
408+
return __irqd_to_state(d) & IRQD_MSI_NOMASK_QUIRK;
409+
}
410+
393411
#undef __irqd_to_state
394412

395413
static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)

include/linux/irqdomain.h

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -206,6 +206,13 @@ enum {
206206
/* Irq domain implements MSI remapping */
207207
IRQ_DOMAIN_FLAG_MSI_REMAP = (1 << 5),
208208

209+
/*
210+
* Quirk to handle MSI implementations which do not provide
211+
* masking. Currently known to affect x86, but partially
212+
* handled in core code.
213+
*/
214+
IRQ_DOMAIN_MSI_NOMASK_QUIRK = (1 << 6),
215+
209216
/*
210217
* Flags starting from IRQ_DOMAIN_FLAG_NONCORE are reserved
211218
* for implementation specific purposes and ignored by the

kernel/irq/debugfs.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -114,6 +114,7 @@ static const struct irq_bit_descr irqdata_states[] = {
114114
BIT_MASK_DESCR(IRQD_AFFINITY_MANAGED),
115115
BIT_MASK_DESCR(IRQD_MANAGED_SHUTDOWN),
116116
BIT_MASK_DESCR(IRQD_CAN_RESERVE),
117+
BIT_MASK_DESCR(IRQD_MSI_NOMASK_QUIRK),
117118

118119
BIT_MASK_DESCR(IRQD_FORWARDED_TO_VCPU),
119120

kernel/irq/msi.c

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -453,8 +453,11 @@ int msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
453453
continue;
454454

455455
irq_data = irq_domain_get_irq_data(domain, desc->irq);
456-
if (!can_reserve)
456+
if (!can_reserve) {
457457
irqd_clr_can_reserve(irq_data);
458+
if (domain->flags & IRQ_DOMAIN_MSI_NOMASK_QUIRK)
459+
irqd_set_msi_nomask_quirk(irq_data);
460+
}
458461
ret = irq_domain_activate_irq(irq_data, can_reserve);
459462
if (ret)
460463
goto cleanup;

0 commit comments

Comments
 (0)