Commit 86eb1ae

Merge branch 'kvm-mirror-page-tables' into HEAD
As part of enabling TDX virtual machines, support separation of private/shared EPT into separate roots.

Confidential computing solutions almost invariably have concepts of private and shared memory, but they may differ a lot in the details. In SEV, for example, the bit is handled more like a permission bit as far as the page tables are concerned: the private/shared bit is not included in the physical address. For TDX, instead, the bit is more like a physical address bit, with the host mapping private memory in one half of the address space and shared memory in the other. Furthermore, the two halves are mapped by different EPT roots and only the shared half is managed by KVM; the private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module via SEAMCALLs.

As a result, the operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a mirror of the private EPT in host memory. This allows KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs.

There are thus three sets of EPT page tables: external, mirror and direct. In the case of TDX (the only user of this framework) the first two cover private memory, whereas the third manages shared memory:

  external EPT - Hidden within the TDX module, modified via TDX module calls.
  mirror EPT   - Bookkeeping tree used as an optimization by KVM, not used by the processor.
  direct EPT   - Normal EPT that maps unencrypted shared memory. Managed like the EPT of a normal VM.

Modifying external EPT
----------------------

Modifications to the mirrored page tables need to also perform the same operations on the private page tables, which will be handled via kvm_x86_ops. Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the groundwork for doing so.

In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module; however, the locking is more complicated because inserting a single PTE can no longer be done atomically with a single CMPXCHG. For this reason, the existing FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides protecting the operation of KVM, it limits the set of cases in which the TDX module will encounter contention on its own PTE locks.

Zapping external EPT
--------------------

While the framework tries to be relatively generic, and to be understandable without knowing TDX much in detail, some requirements of TDX sometimes leak; for example the private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().

For normal VMs, guest memory is zapped for several reasons: user memory getting paged out by the guest, memslots getting deleted, passthrough of devices with non-coherent DMA. Confidential computing adds to these the conversion of memory between shared and private. These operations must not zap any private memory that is in use by the guest.

This is possible because the only zapping that is out of the control of KVM/userspace is paging out userspace memory, which cannot apply to guestmemfd operations. Thus a TDX VM will only zap private memory from memslot deletion and from conversion between private and shared memory, which is triggered by the guest.

To avoid zapping too much memory, enums are introduced so that operations can choose to target only private or shared memory, and thus only direct or mirror EPT (see the illustrative sketch after this message). For example:

  Memslot deletion           - Private and shared
  MMU notifier based zapping - Shared only
  Conversion to shared       - Private only
  Conversion to private      - Shared only

Other cases of zapping will not be supported for KVM, for example APICv update or non-coherent DMA status update; for the latter, TDX will simply require that the CPU supports self-snoop and honor guest PAT unconditionally for shared memory.
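Below is a minimal, illustrative sketch of how that per-operation root targeting can be expressed in C. KVM_DIRECT_ROOTS is the identifier that appears in the mmu.c hunk further down; KVM_MIRROR_ROOTS, KVM_ALL_ROOTS and the zap_targets() helper are assumptions made for this example, not code from the series.

#include <stdbool.h>

/* Which TDP roots an operation should touch (illustrative values). */
enum kvm_tdp_mmu_root_types {
	KVM_DIRECT_ROOTS = 1 << 0,	/* direct EPT: shared memory */
	KVM_MIRROR_ROOTS = 1 << 1,	/* KVM's mirror of the private (external) EPT */
	KVM_ALL_ROOTS    = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
};

/* Encode the table above: which roots each kind of zap may target. */
static inline enum kvm_tdp_mmu_root_types
zap_targets(bool memslot_deletion, bool mmu_notifier, bool convert_to_private)
{
	if (memslot_deletion)
		return KVM_ALL_ROOTS;		/* private and shared */
	if (mmu_notifier)
		return KVM_DIRECT_ROOTS;	/* shared only */
	/* Attribute conversion: zap the half being converted away from. */
	return convert_to_private ? KVM_DIRECT_ROOTS : KVM_MIRROR_ROOTS;
}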
2 parents 3eba032 + 7c54803 commit 86eb1ae

15 files changed: +554 −127 lines changed

arch/x86/include/asm/kvm-x86-ops.h

Lines changed: 4 additions & 0 deletions

@@ -93,6 +93,10 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL(link_external_spt)
+KVM_X86_OP_OPTIONAL(set_external_spte)
+KVM_X86_OP_OPTIONAL(free_external_spt)
+KVM_X86_OP_OPTIONAL(remove_external_spte)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)

arch/x86/include/asm/kvm_host.h

Lines changed: 28 additions & 3 deletions

@@ -313,10 +313,11 @@ struct kvm_kernel_irq_routing_entry;
  * the number of unique SPs that can theoretically be created is 2^n, where n
  * is the number of bits that are used to compute the role.
  *
- * But, even though there are 19 bits in the mask below, not all combinations
+ * But, even though there are 20 bits in the mask below, not all combinations
  * of modes and flags are possible:
  *
- * - invalid shadow pages are not accounted, so the bits are effectively 18
+ * - invalid shadow pages are not accounted, mirror pages are not shadowed,
+ *   so the bits are effectively 18.
  *
  * - quadrant will only be used if has_4_byte_gpte=1 (non-PAE paging);
  *   execonly and ad_disabled are only used for nested EPT which has

@@ -349,7 +350,8 @@ union kvm_mmu_page_role {
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
 		unsigned passthrough:1;
-		unsigned :5;
+		unsigned is_mirror:1;
+		unsigned :4;
 
 		/*
 		 * This is left at the top of the word so that

@@ -457,6 +459,7 @@ struct kvm_mmu {
 	int (*sync_spte)(struct kvm_vcpu *vcpu,
 			 struct kvm_mmu_page *sp, int i);
 	struct kvm_mmu_root_info root;
+	hpa_t mirror_root_hpa;
 	union kvm_cpu_role cpu_role;
 	union kvm_mmu_page_role root_role;
 

@@ -830,6 +833,11 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	/*
+	 * This cache is to allocate external page table. E.g. private EPT used
+	 * by the TDX module.
+	 */
+	struct kvm_mmu_memory_cache mmu_external_spt_cache;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.

@@ -1549,6 +1557,8 @@
 	 */
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
 	struct kvm_mmu_memory_cache split_desc_cache;
+
+	gfn_t gfn_direct_bits;
 };
 
 struct kvm_vm_stat {

@@ -1761,6 +1771,21 @@ struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
+	/* Update external mapping with page table link. */
+	int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				 void *external_spt);
+	/* Update the external page table from spte getting set. */
+	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				 kvm_pfn_t pfn_for_gfn);
+
+	/* Update external page tables for page table about to be freed. */
+	int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				 void *external_spt);
+
+	/* Update external page table from spte getting removed, and flush TLB. */
+	int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				    kvm_pfn_t pfn_for_gfn);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
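For orientation, here is a hypothetical sketch of how a backend could wire up the four optional hooks declared above. The example_* functions and example_x86_ops are stand-ins written against these signatures and are not the actual TDX implementation; a real backend would issue SEAMCALLs into the TDX module from these callbacks.

/*
 * Hypothetical stubs, assuming <linux/kvm_host.h> for struct kvm, gfn_t,
 * kvm_pfn_t and enum pg_level.  A real backend would replace the
 * -EOPNOTSUPP returns with calls into the TDX module (e.g. adding an
 * S-EPT page when linking, mapping a page when setting a leaf SPTE).
 */
static int example_link_external_spt(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level, void *external_spt)
{
	return -EOPNOTSUPP;
}

static int example_set_external_spte(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level, kvm_pfn_t pfn_for_gfn)
{
	return -EOPNOTSUPP;
}

static int example_free_external_spt(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level, void *external_spt)
{
	return -EOPNOTSUPP;
}

static int example_remove_external_spte(struct kvm *kvm, gfn_t gfn,
					enum pg_level level, kvm_pfn_t pfn_for_gfn)
{
	return -EOPNOTSUPP;
}

static struct kvm_x86_ops example_x86_ops = {
	/* ... the existing mandatory ops ... */
	.link_external_spt	= example_link_external_spt,
	.set_external_spte	= example_set_external_spte,
	.free_external_spt	= example_free_external_spt,
	.remove_external_spte	= example_remove_external_spte,
};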

arch/x86/include/uapi/asm/kvm.h

Lines changed: 1 addition & 0 deletions

@@ -925,5 +925,6 @@ struct kvm_hyperv_eventfd {
 #define KVM_X86_SEV_VM 2
 #define KVM_X86_SEV_ES_VM 3
 #define KVM_X86_SNP_VM 4
+#define KVM_X86_TDX_VM 5
 
 #endif /* _ASM_X86_KVM_H */

arch/x86/kvm/mmu.h

Lines changed: 31 additions & 0 deletions

@@ -104,6 +104,15 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 
 static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * Checking root.hpa is sufficient even when KVM has mirror root.
+	 * We can have either:
+	 * (1) mirror_root_hpa = INVALID_PAGE, root.hpa = INVALID_PAGE
+	 * (2) mirror_root_hpa = root, root.hpa = INVALID_PAGE
+	 * (3) mirror_root_hpa = root1, root.hpa = root2
+	 * We don't ever have:
+	 *     mirror_root_hpa = INVALID_PAGE, root.hpa = root
+	 */
 	if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
 		return 0;
 

@@ -287,4 +296,26 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
 		return gpa;
 	return translate_nested_gpa(vcpu, gpa, access, exception);
 }
+
+static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
+{
+	return kvm->arch.gfn_direct_bits;
+}
+
+static inline bool kvm_is_addr_direct(struct kvm *kvm, gpa_t gpa)
+{
+	gpa_t gpa_direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(kvm));
+
+	return !gpa_direct_bits || (gpa & gpa_direct_bits);
+}
+
+static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
+{
+	return gfn & kvm_gfn_direct_bits(kvm);
+}
 #endif
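A standalone illustration of how the kvm_gfn_direct_bits()/kvm_is_addr_direct() logic above splits the GPA space. The choice of bit 51 as the shared bit is an assumption for this example (the real position depends on the guest physical address width configured for the TD), and gfn_direct_bits is zero for non-TDX VMs, in which case every address is treated as direct.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
typedef uint64_t gpa_t;
typedef uint64_t gfn_t;

/* Stands in for kvm->arch.gfn_direct_bits; zero means no private/shared split. */
static gfn_t gfn_direct_bits;

static bool is_addr_direct(gpa_t gpa)
{
	gpa_t gpa_direct_bits = gfn_direct_bits << PAGE_SHIFT;	/* gfn_to_gpa() */

	return !gpa_direct_bits || (gpa & gpa_direct_bits);
}

int main(void)
{
	gpa_t private_gpa = 0x12345000ULL;
	gpa_t shared_gpa  = private_gpa | (1ULL << 51);

	/* Assume GPA bit 51 is the shared bit for this example. */
	gfn_direct_bits = 1ULL << (51 - PAGE_SHIFT);

	printf("%#llx -> %s EPT\n", (unsigned long long)private_gpa,
	       is_addr_direct(private_gpa) ? "direct (shared)" : "mirror/external (private)");
	printf("%#llx -> %s EPT\n", (unsigned long long)shared_gpa,
	       is_addr_direct(shared_gpa) ? "direct (shared)" : "mirror/external (private)");
	return 0;
}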

arch/x86/kvm/mmu/mmu.c

Lines changed: 51 additions & 9 deletions

@@ -599,6 +599,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
 	if (r)
 		return r;
+	if (kvm_has_mirrored_tdp(vcpu->kvm)) {
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_external_spt_cache,
+					       PT64_ROOT_MAX_LEVEL);
+		if (r)
+			return r;
+	}
 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
 				       PT64_ROOT_MAX_LEVEL);
 	if (r)

@@ -618,6 +624,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_external_spt_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 

@@ -3656,8 +3663,13 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	unsigned i;
 	int r;
 
-	if (tdp_mmu_enabled)
-		return kvm_tdp_mmu_alloc_root(vcpu);
+	if (tdp_mmu_enabled) {
+		if (kvm_has_mirrored_tdp(vcpu->kvm) &&
+		    !VALID_PAGE(mmu->mirror_root_hpa))
+			kvm_tdp_mmu_alloc_root(vcpu, true);
+		kvm_tdp_mmu_alloc_root(vcpu, false);
+		return 0;
+	}
 
 	write_lock(&vcpu->kvm->mmu_lock);
 	r = make_mmu_pages_available(vcpu);

@@ -4379,8 +4391,12 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 			       struct kvm_page_fault *fault, unsigned int access)
 {
 	struct kvm_memory_slot *slot = fault->slot;
+	struct kvm *kvm = vcpu->kvm;
 	int ret;
 
+	if (KVM_BUG_ON(kvm_is_gfn_alias(kvm, fault->gfn), kvm))
+		return -EFAULT;
+
 	/*
 	 * Note that the mmu_invalidate_seq also serves to detect a concurrent
 	 * change in attributes. is_page_fault_stale() will detect an

@@ -4394,7 +4410,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	 * Now that we have a snapshot of mmu_invalidate_seq we can check for a
 	 * private vs. shared mismatch.
 	 */
-	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+	if (fault->is_private != kvm_mem_is_private(kvm, fault->gfn)) {
 		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
 		return -EFAULT;
 	}

@@ -4456,7 +4472,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
 	 * to detect retry guarantees the worst case latency for the vCPU.
 	 */
-	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
+	if (mmu_invalidate_retry_gfn_unsafe(kvm, fault->mmu_seq, fault->gfn))
 		return RET_PF_RETRY;
 
 	ret = __kvm_mmu_faultin_pfn(vcpu, fault);

@@ -4476,7 +4492,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	 * overall cost of failing to detect the invalidation until after
 	 * mmu_lock is acquired.
 	 */
-	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
+	if (mmu_invalidate_retry_gfn_unsafe(kvm, fault->mmu_seq, fault->gfn)) {
 		kvm_mmu_finish_page_fault(vcpu, fault, RET_PF_RETRY);
 		return RET_PF_RETRY;
 	}

@@ -6095,8 +6111,16 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 	else if (r == RET_PF_SPURIOUS)
 		vcpu->stat.pf_spurious++;
 
+	/*
+	 * None of handle_mmio_page_fault(), kvm_mmu_do_page_fault(), or
+	 * kvm_mmu_write_protect_fault() return RET_PF_CONTINUE.
+	 * kvm_mmu_do_page_fault() only uses RET_PF_CONTINUE internally to
+	 * indicate continuing the page fault handling until to the final
+	 * page table mapping phase.
+	 */
+	WARN_ON_ONCE(r == RET_PF_CONTINUE);
 	if (r != RET_PF_EMULATE)
-		return 1;
+		return r;
 
 emulate:
 	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,

@@ -6272,6 +6296,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 
 	mmu->root.hpa = INVALID_PAGE;
 	mmu->root.pgd = 0;
+	mmu->mirror_root_hpa = INVALID_PAGE;
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
 

@@ -6441,8 +6466,13 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	 * write and in the same critical section as making the reload request,
 	 * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
 	 */
-	if (tdp_mmu_enabled)
-		kvm_tdp_mmu_invalidate_all_roots(kvm);
+	if (tdp_mmu_enabled) {
+		/*
+		 * External page tables don't support fast zapping, therefore
+		 * their mirrors must be invalidated separately by the caller.
+		 */
+		kvm_tdp_mmu_invalidate_roots(kvm, KVM_DIRECT_ROOTS);
+	}
 
 	/*
 	 * Notify all vcpus to reload its shadow page table and flush TLB.

@@ -6467,7 +6497,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	 * lead to use-after-free.
 	 */
 	if (tdp_mmu_enabled)
-		kvm_tdp_mmu_zap_invalidated_roots(kvm);
+		kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)

@@ -7220,6 +7250,12 @@ int kvm_mmu_vendor_module_init(void)
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_unload(vcpu);
+	if (tdp_mmu_enabled) {
+		read_lock(&vcpu->kvm->mmu_lock);
+		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->mirror_root_hpa,
+				   NULL);
+		read_unlock(&vcpu->kvm->mmu_lock);
+	}
 	free_mmu_pages(&vcpu->arch.root_mmu);
 	free_mmu_pages(&vcpu->arch.guest_mmu);
 	mmu_free_memory_caches(vcpu);

@@ -7452,6 +7488,12 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
 		return false;
 
+	/* Unmap the old attribute page. */
+	if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+		range->attr_filter = KVM_FILTER_SHARED;
+	else
+		range->attr_filter = KVM_FILTER_PRIVATE;
+
 	return kvm_unmap_gfn_range(kvm, range);
 }
 
