Commit 9876cfe

cyphar authored and akpm00 committed
memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy

This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and thus
it should not be possible to disable. Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):

	% cat mount-memfd.c
	#define _GNU_SOURCE	/* for asprintf() */
	#include <assert.h>
	#include <fcntl.h>
	#include <string.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mount.h>	/* fsopen() etc.; assumes glibc >= 2.36 */

	#define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

	int main(void)
	{
		int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
		assert(fsfd >= 0);
		/* FSCONFIG_CMD_CREATE takes no key, value or aux argument. */
		assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));

		int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
		assert(dfd >= 0);

		/* Create the file with an executable mode. */
		int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
		assert(execfd >= 0);
		assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
		assert(!close(execfd));

		char *execpath = NULL;
		char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
		execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
		assert(execfd >= 0);
		assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
		assert(!execve(execpath, argv, envp));
	}
	% ./mount-memfd
	this file was executed from this totally private tmpfs: /proc/self/fd/5
	%

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this. There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.

/* New API */

The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.

So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you. The new lower limit you can set
vm.memfd_noexec to is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.

However, it should be noted that the setting is now completely
hierarchical. Previously, a cloned pidns would just copy the current pidns
setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:

 * The sysctl is very new, having been merged in 6.3.
 * Several aspects of the sysctl were broken up until this patchset and
   the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace issues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 105ff53 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
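
The hierarchy rules described in the commit message can be sketched in plain userspace C. This is only a toy model, not kernel code: `struct model_pidns`, `model_effective_scope()` and `model_unshare()` are hypothetical names invented for illustration, and the scope values 0, 1 and 2 are assumed to correspond to the three vm.memfd_noexec levels.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-ins for the three vm.memfd_noexec levels. */
enum { MODEL_SCOPE_EXEC = 0, MODEL_SCOPE_NOEXEC_SEAL = 1, MODEL_SCOPE_NOEXEC_ENFORCED = 2 };

/* Toy pid namespace: only the fields that matter for this sysctl. */
struct model_pidns {
	struct model_pidns *parent;	/* NULL for the root namespace */
	int scope;			/* value stored in this namespace */
};

/* Effective setting: the most restrictive value on the path to the root. */
static int model_effective_scope(const struct model_pidns *ns)
{
	int scope = MODEL_SCOPE_EXEC;

	for (; ns; ns = ns->parent)
		if (ns->scope > scope)
			scope = ns->scope;
	return scope;
}

/* A new namespace starts with a copy of its parent's effective setting. */
static struct model_pidns model_unshare(struct model_pidns *parent)
{
	assert(parent != NULL);

	struct model_pidns child = {
		.parent = parent,
		.scope = model_effective_scope(parent),
	};
	return child;
}
```

In this model, raising the stored value of a parent immediately raises the effective value seen by every descendant, while lowering the parent leaves a child at whatever value it copied when it was created.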
1 parent 434ed33 commit 9876cfe

5 files changed, +42 -21 lines changed

include/linux/pid_namespace.h

Lines changed: 22 additions & 1 deletion
@@ -39,7 +39,6 @@ struct pid_namespace {
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-	/* sysctl for vm.memfd_noexec */
 	int memfd_noexec_scope;
 #endif
 } __randomize_layout;
@@ -56,6 +55,23 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	int scope = MEMFD_NOEXEC_SCOPE_EXEC;
+
+	for (; ns; ns = ns->parent)
+		scope = max(scope, READ_ONCE(ns->memfd_noexec_scope));
+
+	return scope;
+}
+#else
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	return 0;
+}
+#endif
+
 extern struct pid_namespace *copy_pid_ns(unsigned long flags,
 	struct user_namespace *user_ns, struct pid_namespace *ns);
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
@@ -70,6 +86,11 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	return 0;
+}
+
 static inline struct pid_namespace *copy_pid_ns(unsigned long flags,
 	struct user_namespace *user_ns, struct pid_namespace *ns)
 {

kernel/pid.c

Lines changed: 3 additions & 0 deletions
@@ -83,6 +83,9 @@ struct pid_namespace init_pid_ns = {
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	.memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
+#endif
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 

kernel/pid_namespace.c

Lines changed: 3 additions & 3 deletions
@@ -110,9 +110,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->user_ns = get_user_ns(user_ns);
 	ns->ucounts = ucounts;
 	ns->pid_allocated = PIDNS_ADDING;
-
-	initialize_memfd_noexec_scope(ns);
-
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
+#endif
 	return ns;
 
 out_free_idr:

kernel/pid_sysctl.h

Lines changed: 12 additions & 16 deletions
@@ -5,33 +5,30 @@
 #include <linux/pid_namespace.h>
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns)
-{
-	ns->memfd_noexec_scope =
-		task_active_pid_ns(current)->memfd_noexec_scope;
-}
-
 static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table,
 	int write, void *buf, size_t *lenp, loff_t *ppos)
 {
 	struct pid_namespace *ns = task_active_pid_ns(current);
 	struct ctl_table table_copy;
+	int err, scope, parent_scope;
 
 	if (write && !ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	table_copy = *table;
-	if (ns != &init_pid_ns)
-		table_copy.data = &ns->memfd_noexec_scope;
 
-	/*
-	 * set minimum to current value, the effect is only bigger
-	 * value is accepted.
-	 */
-	if (*(int *)table_copy.data > *(int *)table_copy.extra1)
-		table_copy.extra1 = table_copy.data;
+	/* You cannot set a lower enforcement value than your parent. */
+	parent_scope = pidns_memfd_noexec_scope(ns->parent);
+	/* Equivalent to pidns_memfd_noexec_scope(ns). */
+	scope = max(READ_ONCE(ns->memfd_noexec_scope), parent_scope);
+
+	table_copy.data = &scope;
+	table_copy.extra1 = &parent_scope;
 
-	return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+	err = proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+	if (!err && write)
+		WRITE_ONCE(ns->memfd_noexec_scope, scope);
+	return err;
 }
 
 static struct ctl_table pid_ns_ctl_table_vm[] = {
@@ -51,7 +48,6 @@ static inline void register_pid_ns_sysctl_table_vm(void)
 	register_sysctl("vm", pid_ns_ctl_table_vm);
 }
 #else
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) {}
 static inline void register_pid_ns_sysctl_table_vm(void) {}
 #endif
 
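
The write path of the handler above swaps the old ratchet for a floor taken from the parent. A rough userspace model of that rule might look like the following. All names here (`model_pidns`, `model_floor()`, `model_write_scope()`) are hypothetical. The upper bound of 2 and the `-EINVAL` result for out-of-range values are assumptions about the sysctl table and `proc_dointvec_minmax()`, neither of which appears in this diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_SCOPE_MAX 2	/* assumed upper bound of the sysctl */

/* Toy pid namespace, as in the kernel struct but reduced to two fields. */
struct model_pidns {
	struct model_pidns *parent;	/* NULL for the root namespace */
	int scope;			/* value stored in this namespace */
};

/* Most restrictive value among the ancestors; 0 if there are none. */
static int model_floor(const struct model_pidns *ns)
{
	int floor = 0;

	assert(ns != NULL);
	for (ns = ns->parent; ns; ns = ns->parent)
		if (ns->scope > floor)
			floor = ns->scope;
	return floor;
}

/*
 * Model of a privileged write: values below the parent's effective
 * setting (or above the maximum) are rejected, anything else is stored.
 * Unlike the old behaviour, the namespace's own current value is not
 * used as the minimum, so lowering it again is allowed.
 */
static int model_write_scope(struct model_pidns *ns, int requested)
{
	if (requested < model_floor(ns) || requested > MODEL_SCOPE_MAX)
		return -EINVAL;
	ns->scope = requested;
	return 0;
}
```

With a root at 1, a child in this model can move between 1 and 2 freely but cannot write 0 until the root itself has been lowered to 0.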

mm/memfd.c

Lines changed: 2 additions & 1 deletion
@@ -271,7 +271,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 static int check_sysctl_memfd_noexec(unsigned int *flags)
 {
 #ifdef CONFIG_SYSCTL
-	int sysctl = task_active_pid_ns(current)->memfd_noexec_scope;
+	struct pid_namespace *ns = task_active_pid_ns(current);
+	int sysctl = pidns_memfd_noexec_scope(ns);
 
 	if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
 		if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)