
Commit c0b8dca

yanzhao56 authored and bonzini committed
KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
Use a separate subclass when acquiring KVM's per-CPU posted interrupts
wakeup lock in the scheduled out path, i.e. when adding a vCPU to the list
of vCPUs to wake, to work around a false positive deadlock.

  Chain exists of:
    &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)

  Possible unsafe locking scenario:

        CPU0                        CPU1
        ----                        ----
   lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
                                    lock(&rq->__lock);
                                    lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
   lock(&p->pi_lock);

   *** DEADLOCK ***

In the wakeup handler, the callchain is *always*:

  sysvec_kvm_posted_intr_wakeup_ipi()
  |
  --> pi_wakeup_handler()
      |
      --> kvm_vcpu_wake_up()
          |
          --> try_to_wake_up(),

and the lock order is:

  &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock

For the schedule out path, the callchain is always (for all intents and
purposes; if the kernel is preemptible, kvm_sched_out() can be called from
something other than schedule(), but the beginning of the callchain will
be the same point in vcpu_block()):

  vcpu_block()
  |
  --> schedule()
      |
      --> kvm_sched_out()
          |
          --> vmx_vcpu_put()
              |
              --> vmx_vcpu_pi_put()
                  |
                  --> pi_enable_wakeup_handler()

and the lock order is:

  &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)

I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
deadlock can't occur between schedule out and the wakeup handler as they
are mutually exclusive.  The entirety of the schedule out code that runs
with the problematic scheduler locks held does so with IRQs disabled, i.e.
it can't run concurrently with the wakeup handler.

Use a subclass instead of disabling lockdep entirely, and tell lockdep that
both subclasses are being acquired when loading a vCPU, as the sched_out
and sched_in paths are NOT mutually exclusive, e.g.

      CPU 0                 CPU 1
  ---------------       ---------------
  vCPU0 sched_out
  vCPU1 sched_in
  vCPU1 sched_out
                        vCPU0 sched_in

where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup
list+lock.

Signed-off-by: Yan Zhao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
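For context, below is a minimal, self-contained sketch (not part of the patch; the demo_* lock and functions are invented for illustration, only the lockdep APIs are real) of the two annotations the change relies on: raw_spin_lock_nested() to record the schedule-out acquisition against a separate subclass, and a spin_acquire()/spin_release() pair to tell lockdep that the same subclass is also considered held in the vCPU load path.

/* Hypothetical, kernel-style sketch of the lockdep annotations used by the
 * patch.  demo_wakeup_lock and the demo_*() functions are invented; only
 * the lockdep/spinlock APIs themselves are real.
 */
#include <linux/lockdep.h>
#include <linux/spinlock.h>

#define DEMO_LOCK_SCHED_OUT	SINGLE_DEPTH_NESTING

static DEFINE_RAW_SPINLOCK(demo_wakeup_lock);

/* Wakeup-handler (IRQ) context: the lock is taken under the default
 * subclass 0, and lockdep learns demo_wakeup_lock --> p->pi_lock via
 * try_to_wake_up(). */
static void demo_wakeup_path(void)
{
	raw_spin_lock(&demo_wakeup_lock);
	/* ... walk the wakeup list, kvm_vcpu_wake_up() ... */
	raw_spin_unlock(&demo_wakeup_lock);
}

/* Schedule-out context: the scheduler's locks (e.g. rq->__lock) are
 * already held and IRQs are off.  Acquiring under a separate subclass
 * keeps lockdep from chaining rq->__lock --> demo_wakeup_lock with the
 * subclass-0 ordering above, which is what produced the false positive. */
static void demo_sched_out_path(void)
{
	raw_spin_lock_nested(&demo_wakeup_lock, DEMO_LOCK_SCHED_OUT);
	/* ... add the vCPU to the wakeup list ... */
	raw_spin_unlock(&demo_wakeup_lock);
}

/* vCPU load context: the lock is really taken once (subclass 0); the
 * spin_acquire()/spin_release() pair only informs lockdep that the
 * "sched out" subclass is held as well, because this path can race with
 * both the wakeup handler and a concurrent schedule-out on the previous
 * CPU's wakeup list. */
static void demo_vcpu_load_path(void)
{
	raw_spin_lock(&demo_wakeup_lock);
	spin_acquire(&demo_wakeup_lock.dep_map, DEMO_LOCK_SCHED_OUT, 0, _RET_IP_);
	/* ... remove the vCPU from the previous CPU's wakeup list ... */
	spin_release(&demo_wakeup_lock.dep_map, _RET_IP_);
	raw_spin_unlock(&demo_wakeup_lock);
}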
1 parent 6bad6ec commit c0b8dca

1 file changed, +29 -3 lines changed


arch/x86/kvm/vmx/posted_intr.c

Lines changed: 29 additions & 3 deletions
@@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
  */
 static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
 
+#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING
+
 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
         return &(to_vmx(vcpu)->pi_desc);
@@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
          * current pCPU if the task was migrated.
          */
         if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
-                raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+                raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
+
+                /*
+                 * In addition to taking the wakeup lock for the regular/IRQ
+                 * context, tell lockdep it is being taken for the "sched out"
+                 * context as well.  vCPU loads happens in task context, and
+                 * this is taking the lock of the *previous* CPU, i.e. can race
+                 * with both the scheduler and the wakeup handler.
+                 */
+                raw_spin_lock(spinlock);
+                spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
                 list_del(&vmx->pi_wakeup_list);
-                raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+                spin_release(&spinlock->dep_map, _RET_IP_);
+                raw_spin_unlock(spinlock);
         }
 
         dest = cpu_physical_id(cpu);
@@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 
         lockdep_assert_irqs_disabled();
 
-        raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+        /*
+         * Acquire the wakeup lock using the "sched out" context to workaround
+         * a lockdep false positive.  When this is called, schedule() holds
+         * various per-CPU scheduler locks.  When the wakeup handler runs, it
+         * holds this CPU's wakeup lock while calling try_to_wake_up(), which
+         * can eventually take the aforementioned scheduler locks, which causes
+         * lockdep to assume there is deadlock.
+         *
+         * Deadlock can't actually occur because IRQs are disabled for the
+         * entirety of the sched_out critical section, i.e. the wakeup handler
+         * can't run while the scheduler locks are held.
+         */
+        raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
+                             PI_LOCK_SCHED_OUT);
         list_add_tail(&vmx->pi_wakeup_list,
                       &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
         raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
