Commit 540745d

Sean Christopherson authored and suryasaimadhu committed
x86/sgx: Introduce virtual EPC for use by KVM guests
Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw"
Enclave Page Cache (EPC) without an associated enclave. The intended
and only known use case for raw EPC allocation is to expose EPC to a
KVM guest, hence the 'vepc' moniker, virt.{c,h} files and X86_SGX_KVM
Kconfig.

The SGX driver uses the misc device /dev/sgx_enclave to support
userspace in creating an enclave. Each file descriptor returned from
opening /dev/sgx_enclave represents an enclave. Unlike the SGX driver,
KVM doesn't control how the guest uses the EPC, therefore EPC
allocated to a KVM guest is not associated with an enclave, and
/dev/sgx_enclave is not suitable for allocating EPC for a KVM guest.

Having separate device nodes for the SGX driver and KVM virtual EPC
also allows separate permission control for running host SGX enclaves
and KVM SGX guests.

To use /dev/sgx_vepc to allocate a virtual EPC instance with a
particular size, the hypervisor opens /dev/sgx_vepc and invokes mmap()
with the intended size to get an address range of virtual EPC. Then it
may use the address range to create one KVM memory slot as virtual EPC
for a guest.

Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
/dev/sgx_vepc rather than in KVM. Doing so has two major advantages:

  - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
    just another memory backend for guests.

  - EPC management is wholly contained in the SGX subsystem, e.g. SGX
    does not have to export any symbols, changes to reclaim flows don't
    need to be routed through KVM, SGX's dirty laundry doesn't have to
    get aired out for the world to see, and so on and so forth.

The virtual EPC pages allocated to guests are currently not
reclaimable. Reclaiming an EPC page used by an enclave requires a
special reclaim mechanism separate from normal page reclaim, and that
mechanism is not supported for virtual EPC pages. Due to the
complications of handling reclaim conflicts between guest and host,
reclaiming virtual EPC pages is significantly more complex than basic
support for SGX virtualization.

  [ bp: - Massage commit message and comments
        - use cpu_feature_enabled()
        - vertically align struct members init
        - massage Virtual EPC clarification text
        - move Kconfig prompt to Virtualization ]

Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Acked-by: Jarkko Sakkinen <[email protected]>
Link: https://lkml.kernel.org/r/0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.huang@intel.com
1 parent 231d3db commit 540745d

File tree: 5 files changed, +297 −0 lines changed


Documentation/x86/sgx.rst

Lines changed: 16 additions & 0 deletions

@@ -234,3 +234,19 @@ As a result, when this happens, the user should stop running any new
 SGX workloads, (or just any new workloads), and migrate all valuable
 workloads. Although a machine reboot can recover all EPC memory, the bug
 should be reported to Linux developers.
+
+Virtual EPC
+===========
+
+The implementation also has a virtual EPC driver to support SGX enclaves
+in guests. Unlike the SGX driver, an EPC page allocated by the virtual
+EPC driver doesn't have a specific enclave associated with it. This is
+because KVM doesn't track how a guest uses EPC pages.
+
+As a result, the SGX core page reclaimer doesn't support reclaiming EPC
+pages allocated to KVM guests through the virtual EPC driver. If the
+user wants to deploy SGX applications both on the host and in guests
+on the same machine, the user should reserve enough EPC (by subtracting
+the total virtual EPC size of all SGX VMs from the physical EPC size) for
+host SGX applications so they can run with acceptable performance.

arch/x86/kernel/cpu/sgx/Makefile

Lines changed: 1 addition & 0 deletions

@@ -3,3 +3,4 @@ obj-y += \
 	encl.o \
 	ioctl.o \
 	main.o
+obj-$(CONFIG_X86_SGX_KVM) += virt.o

arch/x86/kernel/cpu/sgx/sgx.h

Lines changed: 9 additions & 0 deletions

@@ -84,4 +84,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);

+#ifdef CONFIG_X86_SGX_KVM
+int __init sgx_vepc_init(void);
+#else
+static inline int __init sgx_vepc_init(void)
+{
+	return -ENODEV;
+}
+#endif
+
 #endif /* _X86_SGX_H */

arch/x86/kernel/cpu/sgx/virt.c

Lines changed: 259 additions & 0 deletions (new file)
@@ -0,0 +1,259 @@
1+
// SPDX-License-Identifier: GPL-2.0
2+
/*
3+
* Device driver to expose SGX enclave memory to KVM guests.
4+
*
5+
* Copyright(c) 2021 Intel Corporation.
6+
*/
7+
8+
#include <linux/miscdevice.h>
9+
#include <linux/mm.h>
10+
#include <linux/mman.h>
11+
#include <linux/sched/mm.h>
12+
#include <linux/sched/signal.h>
13+
#include <linux/slab.h>
14+
#include <linux/xarray.h>
15+
#include <asm/sgx.h>
16+
#include <uapi/asm/sgx.h>
17+
18+
#include "encls.h"
19+
#include "sgx.h"
20+
21+
struct sgx_vepc {
22+
struct xarray page_array;
23+
struct mutex lock;
24+
};
25+
26+
/*
27+
* Temporary SECS pages that cannot be EREMOVE'd due to having child in other
28+
* virtual EPC instances, and the lock to protect it.
29+
*/
30+
static struct mutex zombie_secs_pages_lock;
31+
static struct list_head zombie_secs_pages;
32+
33+
static int __sgx_vepc_fault(struct sgx_vepc *vepc,
34+
struct vm_area_struct *vma, unsigned long addr)
35+
{
36+
struct sgx_epc_page *epc_page;
37+
unsigned long index, pfn;
38+
int ret;
39+
40+
WARN_ON(!mutex_is_locked(&vepc->lock));
41+
42+
/* Calculate index of EPC page in virtual EPC's page_array */
43+
index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
44+
45+
epc_page = xa_load(&vepc->page_array, index);
46+
if (epc_page)
47+
return 0;
48+
49+
epc_page = sgx_alloc_epc_page(vepc, false);
50+
if (IS_ERR(epc_page))
51+
return PTR_ERR(epc_page);
52+
53+
ret = xa_err(xa_store(&vepc->page_array, index, epc_page, GFP_KERNEL));
54+
if (ret)
55+
goto err_free;
56+
57+
pfn = PFN_DOWN(sgx_get_epc_phys_addr(epc_page));
58+
59+
ret = vmf_insert_pfn(vma, addr, pfn);
60+
if (ret != VM_FAULT_NOPAGE) {
61+
ret = -EFAULT;
62+
goto err_delete;
63+
}
64+
65+
return 0;
66+
67+
err_delete:
68+
xa_erase(&vepc->page_array, index);
69+
err_free:
70+
sgx_free_epc_page(epc_page);
71+
return ret;
72+
}
73+
74+
static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct sgx_vepc *vepc = vma->vm_private_data;
	int ret;

	mutex_lock(&vepc->lock);
	ret = __sgx_vepc_fault(vepc, vma, vmf->address);
	mutex_unlock(&vepc->lock);

	if (!ret)
		return VM_FAULT_NOPAGE;

	if (ret == -EBUSY && (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
		mmap_read_unlock(vma->vm_mm);
		return VM_FAULT_RETRY;
	}

	return VM_FAULT_SIGBUS;
}

const struct vm_operations_struct sgx_vepc_vm_ops = {
	.fault = sgx_vepc_fault,
};

static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct sgx_vepc *vepc = file->private_data;

	if (!(vma->vm_flags & VM_SHARED))
		return -EINVAL;

	vma->vm_ops = &sgx_vepc_vm_ops;
	/* Don't copy VMA in fork() */
	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY;
	vma->vm_private_data = vepc;

	return 0;
}

static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
{
	int ret;

	/*
	 * Take a previously guest-owned EPC page and return it to the
	 * general EPC page pool.
	 *
	 * Guests can not be trusted to have left this page in a good
	 * state, so run EREMOVE on the page unconditionally.  In the
	 * case that a guest properly EREMOVE'd this page, a superfluous
	 * EREMOVE is harmless.
	 */
	ret = __eremove(sgx_get_epc_virt_addr(epc_page));
	if (ret) {
		/*
		 * Only SGX_CHILD_PRESENT is expected, which happens when
		 * EREMOVE'ing an SECS that still has children.  It can be
		 * handled by EREMOVE'ing the SECS again after all pages in
		 * virtual EPC have been EREMOVE'd.  See comments below in
		 * sgx_vepc_release().
		 *
		 * The user of virtual EPC (KVM) needs to guarantee that no
		 * logical processor is still running in the enclave in the
		 * guest, otherwise EREMOVE will get SGX_ENCLAVE_ACT which
		 * cannot be handled here.
		 */
		WARN_ONCE(ret != SGX_CHILD_PRESENT, EREMOVE_ERROR_MESSAGE,
			  ret, ret);
		return ret;
	}

	sgx_free_epc_page(epc_page);

	return 0;
}

static int sgx_vepc_release(struct inode *inode, struct file *file)
{
	struct sgx_vepc *vepc = file->private_data;
	struct sgx_epc_page *epc_page, *tmp, *entry;
	unsigned long index;

	LIST_HEAD(secs_pages);

	xa_for_each(&vepc->page_array, index, entry) {
		/*
		 * Remove all normal, child pages.  sgx_vepc_free_page()
		 * will fail if EREMOVE fails, but this is OK and expected on
		 * SECS pages.  Those can only be EREMOVE'd *after* all their
		 * child pages.  Retries below will clean them up.
		 */
		if (sgx_vepc_free_page(entry))
			continue;

		xa_erase(&vepc->page_array, index);
	}

	/*
	 * Retry EREMOVE'ing pages.  This will clean up any SECS pages that
	 * only had children in this 'epc' area.
	 */
	xa_for_each(&vepc->page_array, index, entry) {
		epc_page = entry;
		/*
		 * An EREMOVE failure here means that the SECS page still
		 * has children.  But, since all children in this 'sgx_vepc'
		 * have been removed, the SECS page must have a child on
		 * another instance.
		 */
		if (sgx_vepc_free_page(epc_page))
			list_add_tail(&epc_page->list, &secs_pages);

		xa_erase(&vepc->page_array, index);
	}

	/*
	 * SECS pages are "pinned" by child pages, and "unpinned" once all
	 * children have been EREMOVE'd.  A child page in this instance
	 * may have pinned an SECS page encountered in an earlier release(),
	 * creating a zombie.  Since some children were EREMOVE'd above,
	 * try to EREMOVE all zombies in the hopes that one was unpinned.
	 */
	mutex_lock(&zombie_secs_pages_lock);
	list_for_each_entry_safe(epc_page, tmp, &zombie_secs_pages, list) {
		/*
		 * Speculatively remove the page from the list of zombies;
		 * if the page is successfully EREMOVE'd it will be added to
		 * the list of free pages.  If EREMOVE fails, throw the page
		 * on the local list, which will be spliced on at the end.
		 */
		list_del(&epc_page->list);

		if (sgx_vepc_free_page(epc_page))
			list_add_tail(&epc_page->list, &secs_pages);
	}

	if (!list_empty(&secs_pages))
		list_splice_tail(&secs_pages, &zombie_secs_pages);
	mutex_unlock(&zombie_secs_pages_lock);

	kfree(vepc);

	return 0;
}

static int sgx_vepc_open(struct inode *inode, struct file *file)
{
	struct sgx_vepc *vepc;

	vepc = kzalloc(sizeof(struct sgx_vepc), GFP_KERNEL);
	if (!vepc)
		return -ENOMEM;
	mutex_init(&vepc->lock);
	xa_init(&vepc->page_array);

	file->private_data = vepc;

	return 0;
}

static const struct file_operations sgx_vepc_fops = {
	.owner		= THIS_MODULE,
	.open		= sgx_vepc_open,
	.release	= sgx_vepc_release,
	.mmap		= sgx_vepc_mmap,
};

static struct miscdevice sgx_vepc_dev = {
	.minor		= MISC_DYNAMIC_MINOR,
	.name		= "sgx_vepc",
	.nodename	= "sgx_vepc",
	.fops		= &sgx_vepc_fops,
};

int __init sgx_vepc_init(void)
{
	/* SGX virtualization requires KVM to work */
	if (!cpu_feature_enabled(X86_FEATURE_VMX))
		return -ENODEV;

	INIT_LIST_HEAD(&zombie_secs_pages);
	mutex_init(&zombie_secs_pages_lock);

	return misc_register(&sgx_vepc_dev);
}

arch/x86/kvm/Kconfig

Lines changed: 12 additions & 0 deletions

@@ -84,6 +84,18 @@ config KVM_INTEL
 	  To compile this as a module, choose M here: the module
 	  will be called kvm-intel.

+config X86_SGX_KVM
+	bool "Software Guard eXtensions (SGX) Virtualization"
+	depends on X86_SGX && KVM_INTEL
+	help
+
+	  Enables KVM guests to create SGX enclaves.
+
+	  This includes support to expose "raw" unreclaimable enclave memory to
+	  guests via a device node, e.g. /dev/sgx_vepc.
+
+	  If unsure, say N.
+
 config KVM_AMD
 	tristate "KVM for AMD processors support"
 	depends on KVM
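Given the dependency chain above, a plausible .config fragment for a host with vEPC support would look like the following (CONFIG_X86_SGX and CONFIG_KVM come from the wider config and are shown here as assumptions, not part of this diff):

```
CONFIG_X86_SGX=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=m
CONFIG_X86_SGX_KVM=y
```

Note that X86_SGX_KVM is a bool, not a tristate: the vEPC code is built into the kernel image itself, alongside the rest of the core-SGX subsystem, even when kvm-intel is a module.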
