
Commit 0df9dab

sean-jc authored and bonzini committed
KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
Stop zapping invalidated TDP MMU roots via work queue now that KVM preserves TDP MMU roots until they are explicitly invalidated. Zapping roots asynchronously was effectively a workaround to avoid stalling a vCPU for an extended duration if a vCPU unloaded a root, which at the time happened whenever the guest toggled CR0.WP (a frequent operation for some guest kernels).

While a clever hack, zapping roots via an unbound worker had subtle, unintended consequences on host scheduling, especially when zapping multiple roots, e.g. as part of a memslot deletion. Because the work of zapping a root is no longer bound to the task that initiated the zap, things like the CPU affinity and priority of the original task get lost. Losing the affinity and priority can be especially problematic if unbound workqueues aren't affined to a small number of CPUs, as zapping multiple roots can cause KVM to heavily utilize the majority of CPUs in the system, *beyond* the CPUs KVM is already using to run vCPUs.

When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root zap can result in KVM occupying all logical CPUs for ~8ms, and result in high priority tasks not being scheduled in a timely manner. In v5.15, which doesn't preserve unloaded roots, the issues were even more noticeable as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.

Consuming all CPUs for an extended duration can lead to significant jitter throughout the system. On ChromeOS with virtio-gpu, for example, deleting memslots is a semi-frequent operation as memslots are deleted and recreated with different host virtual addresses to react to host GPU drivers allocating and freeing GPU blobs. There, the jitter manifests as audio blips during games due to the audio server's tasks not getting scheduled in promptly, despite those tasks having a high realtime priority.

Deleting memslots isn't exactly a fast path and should be avoided when possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the memslot shenanigans, but KVM is squarely in the wrong. Not to mention that removing the async zapping eliminates a non-trivial amount of complexity.

Note, one of the subtle behaviors hidden behind the async zapping is that KVM would zap invalidated roots only once (ignoring partial zaps from things like mmu_notifier events). Preserve this behavior by adding a flag to identify roots that are scheduled to be zapped versus roots that have already been zapped but not yet freed.

Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can encounter invalid roots, as it's not at all obvious why zapping invalidated roots shouldn't simply zap all invalid roots.

Reported-by: Pattara Teerapong <[email protected]>
Cc: David Stevens <[email protected]>
Cc: Yiwei Zhang <[email protected]>
Cc: Paul Hsia <[email protected]>
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
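For readers skimming past the full diff, the once-only zap guarantee described above boils down to the following condensed sketch of the two functions touched in arch/x86/kvm/mmu/tdp_mmu.c. This is trimmed from the hunks below and is not a drop-in replacement: the KVM_BUG_ON() sanity check, the TLB-flush commentary, and the yielding details are omitted.

/* Condensed sketch of the new flow; see the tdp_mmu.c diff for the real code. */

/* Called with mmu_lock held for write, or during VM destruction when no
 * other references to @kvm can exist. */
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
        struct kvm_mmu_page *root;

        list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
                if (!root->role.invalid) {
                        /* Flag the root so that it is zapped exactly once. */
                        root->tdp_mmu_scheduled_root_to_zap = true;
                        root->role.invalid = true;
                }
        }
}

/* Runs in the invalidating task's own context, i.e. inherits its CPU
 * affinity and priority, instead of being punted to an unbound worker. */
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
        struct kvm_mmu_page *root;

        read_lock(&kvm->mmu_lock);

        for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
                if (!root->tdp_mmu_scheduled_root_to_zap)
                        continue;       /* already zapped, but not yet freed */

                root->tdp_mmu_scheduled_root_to_zap = false;
                tdp_mmu_zap_root(kvm, root, true);

                /* Drop the TDP MMU's reference, taken at root allocation. */
                kvm_tdp_mmu_put_root(kvm, root, true);
        }

        read_unlock(&kvm->mmu_lock);
}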
1 parent 441a5df commit 0df9dab

6 files changed: +68, -103 lines changed

arch/x86/include/asm/kvm_host.h

Lines changed: 1 addition & 2 deletions
@@ -1419,7 +1419,6 @@ struct kvm_arch {
         * the thread holds the MMU lock in write mode.
         */
        spinlock_t tdp_mmu_pages_lock;
-       struct workqueue_struct *tdp_mmu_zap_wq;
 #endif /* CONFIG_X86_64 */
 
        /*
@@ -1835,7 +1834,7 @@ void kvm_mmu_vendor_module_exit(void);
 
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
-int kvm_mmu_init_vm(struct kvm *kvm);
+void kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);

arch/x86/kvm/mmu/mmu.c

Lines changed: 3 additions & 10 deletions
@@ -6167,20 +6167,15 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
        return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
 }
 
-int kvm_mmu_init_vm(struct kvm *kvm)
+void kvm_mmu_init_vm(struct kvm *kvm)
 {
-       int r;
-
        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
        INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
        INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
        spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
 
-       if (tdp_mmu_enabled) {
-               r = kvm_mmu_init_tdp_mmu(kvm);
-               if (r < 0)
-                       return r;
-       }
+       if (tdp_mmu_enabled)
+               kvm_mmu_init_tdp_mmu(kvm);
 
        kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
        kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
@@ -6189,8 +6184,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 
        kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
        kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
-
-       return 0;
 }
 
 static void mmu_free_vm_memory_caches(struct kvm *kvm)

arch/x86/kvm/mmu/mmu_internal.h

Lines changed: 7 additions & 8 deletions
@@ -58,7 +58,12 @@ struct kvm_mmu_page {
 
        bool tdp_mmu_page;
        bool unsync;
-       u8 mmu_valid_gen;
+       union {
+               u8 mmu_valid_gen;
+
+               /* Only accessed under slots_lock. */
+               bool tdp_mmu_scheduled_root_to_zap;
+       };
 
        /*
         * The shadow page can't be replaced by an equivalent huge page
@@ -100,13 +105,7 @@ struct kvm_mmu_page {
                struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
                tdp_ptep_t ptep;
        };
-       union {
-               DECLARE_BITMAP(unsync_child_bitmap, 512);
-               struct {
-                       struct work_struct tdp_mmu_async_work;
-                       void *tdp_mmu_async_data;
-               };
-       };
+       DECLARE_BITMAP(unsync_child_bitmap, 512);
 
        /*
         * Tracks shadow pages that, if zapped, would allow KVM to create an NX
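The new tdp_mmu_scheduled_root_to_zap flag shares storage with mmu_valid_gen, which is safe because TDP MMU pages do not use the MMU generation; likewise, the second union could be dropped because TDP MMU pages never have unsync children, so the async-work fields were the only other users of that storage. A hypothetical accessor (illustrative only, not part of the patch) makes the aliasing rule explicit:

/* Hypothetical helper, for illustration only: the flag aliases mmu_valid_gen,
 * which is used by the shadow MMU, so it is only meaningful for TDP MMU pages. */
static bool tdp_mmu_root_scheduled_to_zap(struct kvm_mmu_page *root)
{
        WARN_ON_ONCE(!root->tdp_mmu_page);

        return root->tdp_mmu_scheduled_root_to_zap;
}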

arch/x86/kvm/mmu/tdp_mmu.c

Lines changed: 55 additions & 78 deletions
@@ -12,18 +12,10 @@
 #include <trace/events/kvm.h>
 
 /* Initializes the TDP MMU for the VM, if enabled. */
-int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 {
-       struct workqueue_struct *wq;
-
-       wq = alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
-       if (!wq)
-               return -ENOMEM;
-
        INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
        spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
-       kvm->arch.tdp_mmu_zap_wq = wq;
-       return 1;
 }
 
 /* Arbitrarily returns true so that this may be used in if statements. */
@@ -46,20 +38,15 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
         * ultimately frees all roots.
         */
        kvm_tdp_mmu_invalidate_all_roots(kvm);
-
-       /*
-        * Destroying a workqueue also first flushes the workqueue, i.e. no
-        * need to invoke kvm_tdp_mmu_zap_invalidated_roots().
-        */
-       destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
+       kvm_tdp_mmu_zap_invalidated_roots(kvm);
 
        WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
        WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
 
        /*
         * Ensure that all the outstanding RCU callbacks to free shadow pages
-        * can run before the VM is torn down.  Work items on tdp_mmu_zap_wq
-        * can call kvm_tdp_mmu_put_root and create new callbacks.
+        * can run before the VM is torn down.  Putting the last reference to
+        * zapped roots will create new callbacks.
         */
        rcu_barrier();
 }
@@ -86,46 +73,6 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
        tdp_mmu_free_sp(sp);
 }
 
-static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
-                            bool shared);
-
-static void tdp_mmu_zap_root_work(struct work_struct *work)
-{
-       struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
-                                                tdp_mmu_async_work);
-       struct kvm *kvm = root->tdp_mmu_async_data;
-
-       read_lock(&kvm->mmu_lock);
-
-       /*
-        * A TLB flush is not necessary as KVM performs a local TLB flush when
-        * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
-        * to a different pCPU.  Note, the local TLB flush on reuse also
-        * invalidates any paging-structure-cache entries, i.e. TLB entries for
-        * intermediate paging structures, that may be zapped, as such entries
-        * are associated with the ASID on both VMX and SVM.
-        */
-       tdp_mmu_zap_root(kvm, root, true);
-
-       /*
-        * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
-        * avoiding an infinite loop.  By design, the root is reachable while
-        * it's being asynchronously zapped, thus a different task can put its
-        * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
-        * asynchronously zapped root is unavoidable.
-        */
-       kvm_tdp_mmu_put_root(kvm, root, true);
-
-       read_unlock(&kvm->mmu_lock);
-}
-
-static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root)
-{
-       root->tdp_mmu_async_data = kvm;
-       INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_work);
-       queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
-}
-
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
                          bool shared)
 {
@@ -211,11 +158,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 #define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
        __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared, true)
 
-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root) \
-       for (_root = tdp_mmu_next_root(_kvm, NULL, false, false); \
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _shared) \
+       for (_root = tdp_mmu_next_root(_kvm, NULL, _shared, false); \
             _root; \
-            _root = tdp_mmu_next_root(_kvm, _root, false, false)) \
-               if (!kvm_lockdep_assert_mmu_lock_held(_kvm, false)) { \
+            _root = tdp_mmu_next_root(_kvm, _root, _shared, false)) \
+               if (!kvm_lockdep_assert_mmu_lock_held(_kvm, _shared)) { \
                } else
 
 /*
@@ -296,7 +243,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
         * by a memslot update or by the destruction of the VM.  Initialize the
         * refcount to two; one reference for the vCPU, and one reference for
         * the TDP MMU itself, which is held until the root is invalidated and
-        * is ultimately put by tdp_mmu_zap_root_work().
+        * is ultimately put by kvm_tdp_mmu_zap_invalidated_roots().
         */
        refcount_set(&root->tdp_mmu_root_count, 2);
 
@@ -885,7 +832,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
 {
        struct kvm_mmu_page *root;
 
-       for_each_tdp_mmu_root_yield_safe(kvm, root)
+       for_each_tdp_mmu_root_yield_safe(kvm, root, false)
                flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
 
        return flush;
@@ -907,7 +854,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
         * is being destroyed or the userspace VMM has exited.  In both cases,
         * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
         */
-       for_each_tdp_mmu_root_yield_safe(kvm, root)
+       for_each_tdp_mmu_root_yield_safe(kvm, root, false)
                tdp_mmu_zap_root(kvm, root, false);
 }
 
@@ -917,18 +864,47 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
  */
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 {
-       flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
+       struct kvm_mmu_page *root;
+
+       read_lock(&kvm->mmu_lock);
+
+       for_each_tdp_mmu_root_yield_safe(kvm, root, true) {
+               if (!root->tdp_mmu_scheduled_root_to_zap)
+                       continue;
+
+               root->tdp_mmu_scheduled_root_to_zap = false;
+               KVM_BUG_ON(!root->role.invalid, kvm);
+
+               /*
+                * A TLB flush is not necessary as KVM performs a local TLB
+                * flush when allocating a new root (see kvm_mmu_load()), and
+                * when migrating a vCPU to a different pCPU.  Note, the local
+                * TLB flush on reuse also invalidates paging-structure-cache
+                * entries, i.e. TLB entries for intermediate paging structures,
+                * that may be zapped, as such entries are associated with the
+                * ASID on both VMX and SVM.
+                */
+               tdp_mmu_zap_root(kvm, root, true);
+
+               /*
+                * The reference needs to be put *after* zapping the root, as
+                * the root must be reachable by mmu_notifiers while it's being
+                * zapped.
+                */
+               kvm_tdp_mmu_put_root(kvm, root, true);
+       }
+
+       read_unlock(&kvm->mmu_lock);
 }
 
 /*
  * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
  * is about to be zapped, e.g. in response to a memslots update.  The actual
- * zapping is performed asynchronously.  Using a separate workqueue makes it
- * easy to ensure that the destruction is performed before the "fast zap"
- * completes, without keeping a separate list of invalidated roots; the list is
- * effectively the list of work items in the workqueue.
+ * zapping is done separately so that it happens with mmu_lock held for read,
+ * whereas invalidating roots must be done with mmu_lock held for write (unless
+ * the VM is being destroyed).
  *
- * Note, the asynchronous worker is gifted the TDP MMU's reference.
+ * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
  * See kvm_tdp_mmu_get_vcpu_root_hpa().
  */
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
@@ -953,19 +929,20 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
        /*
         * As above, mmu_lock isn't held when destroying the VM!  There can't
         * be other references to @kvm, i.e. nothing else can invalidate roots
-        * or be consuming roots, but walking the list of roots does need to be
-        * guarded against roots being deleted by the asynchronous zap worker.
+        * or get/put references to roots.
         */
-       rcu_read_lock();
-
-       list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+       list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+               /*
+                * Note, invalid roots can outlive a memslot update!  Invalid
+                * roots must be *zapped* before the memslot update completes,
+                * but a different task can acquire a reference and keep the
+                * root alive after it's been zapped.
+                */
                if (!root->role.invalid) {
+                       root->tdp_mmu_scheduled_root_to_zap = true;
                        root->role.invalid = true;
-                       tdp_mmu_schedule_zap_root(kvm, root);
                }
        }
-
-       rcu_read_unlock();
 }
 
 /*
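For context, and not shown in this diff: the main user of the pair above is the memslot "fast zap" path in arch/x86/kvm/mmu/mmu.c. A rough sketch of that caller, with the shadow-MMU details elided, shows where the invalidate and zap steps now run; invalidation happens under mmu_lock held for write, and the zap runs after the write lock is dropped, in the task that triggered the memslot update.

/* Sketch of the existing caller (not part of this diff); details elided. */
static void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
        write_lock(&kvm->mmu_lock);

        /* ... bump mmu_valid_gen, mark shadow-MMU pages obsolete ... */

        if (tdp_mmu_enabled)
                kvm_tdp_mmu_invalidate_all_roots(kvm);

        /* ... zap obsolete shadow-MMU pages ... */

        write_unlock(&kvm->mmu_lock);

        /*
         * Zap the invalidated TDP MMU roots in this task's context, under
         * mmu_lock held for read, rather than in an unbound worker.
         */
        if (tdp_mmu_enabled)
                kvm_tdp_mmu_zap_invalidated_roots(kvm);
}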

arch/x86/kvm/mmu/tdp_mmu.h

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 
 #include "spte.h"
 
-int kvm_mmu_init_tdp_mmu(struct kvm *kvm);
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
 
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);

arch/x86/kvm/x86.c

Lines changed: 1 addition & 4 deletions
@@ -12308,9 +12308,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
        if (ret)
                goto out;
 
-       ret = kvm_mmu_init_vm(kvm);
-       if (ret)
-               goto out_page_track;
+       kvm_mmu_init_vm(kvm);
 
        ret = static_call(kvm_x86_vm_init)(kvm);
        if (ret)
@@ -12355,7 +12353,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 out_uninit_mmu:
        kvm_mmu_uninit_vm(kvm);
-out_page_track:
        kvm_page_track_cleanup(kvm);
 out:
        return ret;
