Commit 7f192e3

fork: add clone3
This adds the clone3 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumbers Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.

The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.

We know there have been various creative proposals for what a new process
creation syscall or even api should look like. Some people even go so far
as to argue that the traditional fork()+exec() split should be abandoned in
favor of an in-kernel version of spawn(). Independent of whether or not we
personally think spawn() is a good idea, this patchset has nothing to do
with it and does not want to. One stance we take is that there's no real
good alternative to clone()+exec() and we need and want to support this
model going forward; independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
  in clone() into a dedicated struct clone_args
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
  - use a separate kernel internal struct kernel_clone_args that is
    properly typed according to current kernel conventions in fork.c and is
    different from the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
  syscalls such as fork(), vfork(), clone(), and clone3() behave
  identically (Arnd suggested that we can probably also port do_fork()
  itself in a separate patchset.)
- ease of transition for userspace from clone() to clone3()
  This very much means that we do *not* remove functionality that userspace
  currently relies on, as removing it is a good way of creating a syscall
  that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible

In accordance with Linus' suggestions (cf. [11]), clone3() has the following
signature:

/* uapi */
struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

/* kernel internal */
struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
};

long sys_clone3(struct clone_args __user *uargs, size_t size)

clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. The new clone3() syscall
supports all of the old workloads and opens up the ability to add new
features, which should make switching to it more appealing for userspace.
In essence, glibc can just write a simple wrapper to switch from clone() to
clone3().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
  It is superseded by a dedicated "exit_signal" argument in struct
  clone_args, freeing up space for additional flags.
  This is based on a suggestion from Andrei and Linus (cf. [9] and [10])

/* References */
[1]: b3e5838
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/[email protected]/
[6]: https://lore.kernel.org/lkml/[email protected]/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/[email protected]/
[9]: https://lore.kernel.org/lkml/[email protected]/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Adrian Reber <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: [email protected]
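
To make the signature described in the commit message concrete, here is a minimal userspace sketch of a fork-like clone3() call. It is an illustration under stated assumptions, not part of the patch: the struct name clone3_args_v0 and the helper clone3_fork_and_wait() are hypothetical, the struct is a local mirror of the uapi layout so the sketch builds against older headers, and the fallback syscall number 435 is assumed to match the target architecture (it does on x86-64).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: only used when the libc headers predate clone3. */
#ifndef SYS_clone3
#define SYS_clone3 435
#endif

/* Hypothetical local mirror of the uapi struct clone_args in this commit. */
struct clone3_args_v0 {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
};

/*
 * Hypothetical helper: create a child the way fork() would, let it exit
 * with `code`, and return that exit status (or -errno on failure).
 */
int clone3_fork_and_wait(int code)
{
	struct clone3_args_v0 args = {
		.flags       = 0,       /* share nothing, like fork() */
		.exit_signal = SIGCHLD, /* dedicated field instead of CSIGNAL bits */
	};
	int status;
	long pid;

	/* Eight 64-bit fields: the size passed as the second argument. */
	assert(sizeof(args) == 64);

	pid = syscall(SYS_clone3, &args, sizeof(args));
	if (pid < 0)
		return -errno;  /* e.g. ENOSYS on kernels without clone3 */
	if (pid == 0)
		_exit(code);    /* child */
	if (waitpid((pid_t)pid, &status, 0) < 0)
		return -errno;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -ECHILD;
}
```

Calling clone3_fork_and_wait(7) on a kernel that carries this patch should return 7. On older kernels or under a restrictive seccomp filter the helper reports the error instead, so a caller can fall back to clone().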
1 parent a188339 commit 7f192e3

5 files changed: +199 -51 lines changed

arch/x86/ia32/sys_ia32.c

Lines changed: 10 additions & 2 deletions
@@ -237,6 +237,14 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
 		       unsigned long, newsp, int __user *, parent_tidptr,
 		       unsigned long, tls_val, int __user *, child_tidptr)
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
-			tls_val);
+	struct kernel_clone_args args = {
+		.flags		= (clone_flags & ~CSIGNAL),
+		.child_tid	= child_tidptr,
+		.parent_tid	= parent_tidptr,
+		.exit_signal	= (clone_flags & CSIGNAL),
+		.stack		= newsp,
+		.tls		= tls_val,
+	};
+
+	return _do_fork(&args);
 }
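
The hunk above separates the exit signal, which legacy clone() carries in the low byte of its flags word, from the remaining flag bits. A small userspace sketch of that split follows. The names LEGACY_CSIGNAL_MASK, struct split_flags, and split_legacy_clone_flags() are hypothetical and exist only for illustration; the mask value mirrors CSIGNAL (0x000000ff) from the uapi headers.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors CSIGNAL: the low byte of the legacy clone() flags word. */
#define LEGACY_CSIGNAL_MASK 0x000000ffULL

/* Hypothetical stand-in for the two kernel_clone_args fields involved. */
struct split_flags {
	uint64_t flags;
	int exit_signal;
};

/* Split a legacy flags word the same way the compat wrapper does. */
struct split_flags split_legacy_clone_flags(unsigned long clone_flags)
{
	struct split_flags out = {
		.flags       = clone_flags & ~LEGACY_CSIGNAL_MASK,
		.exit_signal = (int)(clone_flags & LEGACY_CSIGNAL_MASK),
	};

	/* After the split, no signal bits remain in the flags field. */
	assert((out.flags & LEGACY_CSIGNAL_MASK) == 0);
	return out;
}
```

For example, a legacy word of 0x00000100 | 17 (CLONE_VM plus signal number 17, which is SIGCHLD on x86) splits into flags 0x100 and exit_signal 17.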

include/linux/sched/task.h

Lines changed: 16 additions & 1 deletion
@@ -8,11 +8,26 @@
  */
 
 #include <linux/sched.h>
+#include <linux/uaccess.h>
 
 struct task_struct;
 struct rusage;
 union thread_union;
 
+/* All the bits taken by the old clone syscall. */
+#define CLONE_LEGACY_FLAGS 0xffffffffULL
+
+struct kernel_clone_args {
+	u64 flags;
+	int __user *pidfd;
+	int __user *child_tid;
+	int __user *parent_tid;
+	int exit_signal;
+	unsigned long stack;
+	unsigned long stack_size;
+	unsigned long tls;
+};
+
 /*
  * This serializes "schedule()" and also protects
  * the run-queue from deletions/modifications (but
@@ -73,7 +88,7 @@ extern void do_group_exit(int);
 extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(struct kernel_clone_args *kargs);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
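
CLONE_LEGACY_FLAGS marks the 32 bits that the old clone() already uses, while the flags field itself is now 64 bits wide. The sketch below shows one way such a mask could be used to validate clone3() flags, following the user-visible differences listed in the commit message (CLONE_DETACHED is rejected, CSIGNAL bits are no longer accepted in flags). The function name and the exact rules are assumptions for illustration; the kernel's real check lives in kernel/fork.c, which is not shown in this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LEGACY_FLAGS_MASK  0xffffffffULL /* same value as CLONE_LEGACY_FLAGS */
#define DETACHED_BIT       0x00400000ULL /* same value as CLONE_DETACHED */
#define SIGNAL_MASK        0x000000ffULL /* same value as CSIGNAL */

/*
 * Hypothetical validity check for a clone3() flags word:
 *  - no bits above the legacy 32 are defined yet, so reject them
 *  - CLONE_DETACHED and the CSIGNAL byte are not accepted by clone3()
 */
bool clone3_flags_look_valid(uint64_t flags)
{
	if (flags & ~LEGACY_FLAGS_MASK)
		return false;
	if (flags & (DETACHED_BIT | SIGNAL_MASK))
		return false;
	return true;
}
```

Under these assumptions, 0x00000100 (CLONE_VM) passes, while 0x00400000, any value with low-byte signal bits set, and any value with a bit at position 32 or above are rejected, which a real implementation would report as EINVAL.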

include/linux/syscalls.h

Lines changed: 4 additions & 0 deletions
@@ -70,6 +70,7 @@ struct sigaltstack;
 struct rseq;
 union bpf_attr;
 struct io_uring_params;
+struct clone_args;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -852,6 +853,9 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
 	       int __user *, unsigned long);
 #endif
 #endif
+
+asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
+
 asmlinkage long sys_execve(const char __user *filename,
 		const char __user *const __user *argv,
 		const char __user *const __user *envp);

include/uapi/linux/sched.h

Lines changed: 16 additions & 0 deletions
@@ -2,6 +2,8 @@
 #ifndef _UAPI_LINUX_SCHED_H
 #define _UAPI_LINUX_SCHED_H
 
+#include <linux/types.h>
+
 /*
  * cloning flags:
  */
@@ -31,6 +33,20 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+/*
+ * Arguments for the clone3 syscall
+ */
+struct clone_args {
+	__aligned_u64 flags;
+	__aligned_u64 pidfd;
+	__aligned_u64 child_tid;
+	__aligned_u64 parent_tid;
+	__aligned_u64 exit_signal;
+	__aligned_u64 stack;
+	__aligned_u64 stack_size;
+	__aligned_u64 tls;
+};
+
 /*
  * Scheduling policies
  */
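
Because every member of struct clone_args is an 8-byte-aligned 64-bit integer, the layout contains no padding and is identical for 32-bit and 64-bit callers. The following sketch checks that property on a userspace mirror of the struct. The names aligned_u64, clone_args_mirror, and ptr_to_u64() are hypothetical; pointers such as pidfd or child_tid are assumed to be passed by widening them to 64 bits, as the __aligned_u64 field types suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for __aligned_u64: a u64 forced to 8-byte alignment. */
typedef uint64_t aligned_u64 __attribute__((aligned(8)));

/* Hypothetical mirror of the uapi struct added in this hunk. */
struct clone_args_mirror {
	aligned_u64 flags;
	aligned_u64 pidfd;
	aligned_u64 child_tid;
	aligned_u64 parent_tid;
	aligned_u64 exit_signal;
	aligned_u64 stack;
	aligned_u64 stack_size;
	aligned_u64 tls;
};

/* Eight fields of eight bytes each, with no padding in between. */
static_assert(sizeof(struct clone_args_mirror) == 64, "unexpected size");
static_assert(offsetof(struct clone_args_mirror, exit_signal) == 32,
	      "unexpected exit_signal offset");
static_assert(offsetof(struct clone_args_mirror, tls) == 56,
	      "unexpected tls offset");

/* Widen a userspace pointer so it fits a 64-bit field on any word size. */
uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}
```

A caller would fill a pointer-carrying field with, for instance, args.pidfd = ptr_to_u64(&pidfd_out), where pidfd_out is an int owned by the caller.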
