Commit 83fa805

Merge tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread management updates from Christian Brauner:

"Sargun Dhillon over the last cycle has worked on the pidfd_getfd() syscall.

This syscall allows for the retrieval of file descriptors of a process based on its pidfd. A task needs to have ptrace_may_access() permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and Andy) on the target.

One of the main use-cases is in combination with seccomp's user notification feature. As a reminder, seccomp's user notification feature was made available in v5.0. It allows a task to retrieve a file descriptor for its seccomp filter. The file descriptor is usually handed off to a more privileged supervising process. The supervisor can then listen for syscall events caught by the seccomp filter of the supervisee and perform actions in lieu of the supervisee, usually emulating syscalls. pidfd_getfd() is needed to expand its uses.

There are currently two major users that wait on pidfd_getfd() and one future user:

- Netflix, Sargun said, is working on a service mesh where users should be able to connect to a dns-based VIP. When a user connects to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be redirected to an envoy process. This service mesh uses seccomp user notifications and pidfd to intercept all connect calls and instead of connecting them to 1.2.3.4:80 connects them to e.g. 127.0.0.1:8080.

- LXD uses the seccomp notifier heavily to intercept and emulate mknod() and mount() syscalls for unprivileged containers/processes. With pidfd_getfd() more use-cases, e.g. bridging socket connections, will be possible.

- The patchset has also seen some interest from the browser corner. Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a broker process. In the future glibc will start blocking all signals during dlopen(), rendering this type of sandbox impossible. Hence, in the future Firefox will switch to a seccomp-user-notification based sandbox which also makes use of file descriptor retrieval. The thread for this can be found at https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html

With pidfd_getfd() it is e.g. possible to bridge socket connections for the supervisee (binding to a privileged port) and to take actions on file descriptors on behalf of the supervisee in general.

Sargun's first version was using an ioctl on pidfds but various people pushed for it to be a proper syscall, which he duly implemented as well over various review cycles. Selftests are of course included. I've also added instructions on how to deal with merge conflicts below.

There's also a small fix coming from the kernel mentee project to correctly annotate struct sighand_struct with __rcu to fix various sparse warnings. We've received a few more such fixes and even though they are mostly trivial I've decided to postpone them until after -rc1 since they came in rather late and I don't want to risk introducing build warnings.

Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is needed to avoid allocation recursions triggerable by storage drivers that have userspace parts that run in the IO path (e.g. dm-multipath, iscsi, etc). These allocation recursions deadlock the device.

The new prctl() allows such privileged userspace components to avoid allocation recursions by setting the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE flags. The patch carries the necessary acks from the relevant maintainers and is routed here as part of prctl() thread-management."

* tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
  sched.h: Annotate sighand_struct with __rcu
  test: Add test for pidfd getfd
  arch: wire up pidfd_getfd syscall
  pid: Implement pidfd_getfd syscall
  vfs, fdtable: Add fget_task helper
2 parents 896f8d2 + 8d19f1c commit 83fa805
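
The supervisor pattern described in the message can be sketched from userspace roughly as follows. This is a minimal illustration, not code from the merge: the wrapper names `pidfd_open_raw()` and `pidfd_getfd_raw()` and the function `bridge_child_fd_demo()` are hypothetical, and the fallback syscall numbers are an assumption that only holds for ABIs using the x86-64 / asm-generic numbering (alpha, for example, uses 548 for pidfd_getfd).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: x86-64 / asm-generic numbers; other ABIs may differ. */
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_pidfd_getfd
#define __NR_pidfd_getfd 438
#endif

/* Hypothetical thin wrappers around the raw syscalls. */
static int pidfd_open_raw(pid_t pid)
{
	return (int)syscall(__NR_pidfd_open, pid, 0U);
}

static int pidfd_getfd_raw(int pidfd, int targetfd)
{
	/* The flags argument is reserved and must be 0. */
	return (int)syscall(__NR_pidfd_getfd, pidfd, targetfd, 0U);
}

/*
 * Fork a "supervisee" that inherits a pipe, then fetch its write end
 * through pidfd_getfd() and push one byte through the copy.
 * Returns 0 on success, 1 if the kernel or sandbox refuses the calls
 * (ENOSYS or EPERM), and -1 on any other failure.
 */
int bridge_child_fd_demo(void)
{
	int pipefd[2];
	int pidfd = -1, copy = -1, result = -1;
	char byte = 0;
	pid_t child;

	if (pipe(pipefd) < 0)
		return -1;

	child = fork();
	if (child < 0) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}
	if (child == 0) {
		/* Supervisee: keep the inherited descriptors open until killed. */
		for (;;)
			pause();
	}

	pidfd = pidfd_open_raw(child);
	if (pidfd >= 0)
		copy = pidfd_getfd_raw(pidfd, pipefd[1]);

	if (copy < 0) {
		if (errno == ENOSYS || errno == EPERM)
			result = 1;
	} else {
		/* Drop our own write end so only the retrieved copy is used. */
		close(pipefd[1]);
		pipefd[1] = -1;
		if (write(copy, "x", 1) == 1 &&
		    read(pipefd[0], &byte, 1) == 1 && byte == 'x')
			result = 0;
		close(copy);
	}

	if (pipefd[1] >= 0)
		close(pipefd[1]);
	close(pipefd[0]);
	if (pidfd >= 0)
		close(pidfd);
	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return result;
}
```

A real supervisor would typically learn the target fd number from a seccomp notification rather than from a shared pipe; the pipe here only keeps the sketch self-contained.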

32 files changed: +427 -7 lines changed

arch/alpha/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -476,3 +476,4 @@
 544 common pidfd_open sys_pidfd_open
 # 545 reserved for clone3
 547 common openat2 sys_openat2
+548 common pidfd_getfd sys_pidfd_getfd

arch/arm/tools/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -450,3 +450,4 @@
 434 common pidfd_open sys_pidfd_open
 435 common clone3 sys_clone3
 437 common openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd

arch/arm64/include/asm/unistd.h

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls 438
+#define __NR_compat_syscalls 439
 #endif
 
 #define __ARCH_WANT_SYS_CLONE

arch/arm64/include/asm/unistd32.h

Lines changed: 2 additions & 0 deletions
@@ -881,6 +881,8 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 __SYSCALL(__NR_clone3, sys_clone3)
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
+#define __NR_pidfd_getfd 438
+__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 
 /*
  * Please add new compat syscalls above this comment and update

arch/ia64/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -357,3 +357,4 @@
 434 common pidfd_open sys_pidfd_open
 # 435 reserved for clone3
 437 common openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd

arch/m68k/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -436,3 +436,4 @@
 434 common pidfd_open sys_pidfd_open
 435 common clone3 __sys_clone3
 437 common openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd

arch/microblaze/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -442,3 +442,4 @@
 434 common pidfd_open sys_pidfd_open
 435 common clone3 sys_clone3
 437 common openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd

arch/mips/kernel/syscalls/syscall_n32.tbl

Lines changed: 1 addition & 0 deletions
@@ -375,3 +375,4 @@
 434 n32 pidfd_open sys_pidfd_open
 435 n32 clone3 __sys_clone3
 437 n32 openat2 sys_openat2
+438 n32 pidfd_getfd sys_pidfd_getfd

arch/mips/kernel/syscalls/syscall_n64.tbl

Lines changed: 1 addition & 0 deletions
@@ -351,3 +351,4 @@
 434 n64 pidfd_open sys_pidfd_open
 435 n64 clone3 __sys_clone3
 437 n64 openat2 sys_openat2
+438 n64 pidfd_getfd sys_pidfd_getfd

arch/mips/kernel/syscalls/syscall_o32.tbl

Lines changed: 1 addition & 0 deletions
@@ -424,3 +424,4 @@
 434 o32 pidfd_open sys_pidfd_open
 435 o32 clone3 __sys_clone3
 437 o32 openat2 sys_openat2
+438 o32 pidfd_getfd sys_pidfd_getfd

arch/parisc/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -434,3 +434,4 @@
 434 common pidfd_open sys_pidfd_open
 435 common clone3 sys_clone3_wrapper
 437 common openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd

arch/powerpc/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -518,3 +518,4 @@
 434 common pidfd_open sys_pidfd_open
 435 nospu clone3 ppc_clone3
 437 common openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd

arch/s390/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -439,3 +439,4 @@
 434 common pidfd_open sys_pidfd_open sys_pidfd_open
 435 common clone3 sys_clone3 sys_clone3
 437 common openat2 sys_openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd

arch/sh/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -439,3 +439,4 @@
 434 common pidfd_open sys_pidfd_open
 # 435 reserved for clone3
 437 common openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd

arch/sparc/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -482,3 +482,4 @@
 434 common pidfd_open sys_pidfd_open
 # 435 reserved for clone3
 437 common openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd

arch/x86/entry/syscalls/syscall_32.tbl

Lines changed: 1 addition & 0 deletions
@@ -441,3 +441,4 @@
 434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
 435 i386 clone3 sys_clone3 __ia32_sys_clone3
 437 i386 openat2 sys_openat2 __ia32_sys_openat2
+438 i386 pidfd_getfd sys_pidfd_getfd __ia32_sys_pidfd_getfd

arch/x86/entry/syscalls/syscall_64.tbl

Lines changed: 1 addition & 0 deletions
@@ -358,6 +358,7 @@
 434 common pidfd_open __x64_sys_pidfd_open
 435 common clone3 __x64_sys_clone3/ptregs
 437 common openat2 __x64_sys_openat2
+438 common pidfd_getfd __x64_sys_pidfd_getfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact

arch/xtensa/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -407,3 +407,4 @@
 434 common pidfd_open sys_pidfd_open
 435 common clone3 sys_clone3
 437 common openat2 sys_openat2
+438 common pidfd_getfd sys_pidfd_getfd

fs/file.c

Lines changed: 20 additions & 2 deletions
@@ -708,9 +708,9 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
+static struct file *__fget_files(struct files_struct *files, unsigned int fd,
+				 fmode_t mask, unsigned int refs)
 {
-	struct files_struct *files = current->files;
 	struct file *file;
 
 	rcu_read_lock();
@@ -731,6 +731,12 @@ static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
 	return file;
 }
 
+static inline struct file *__fget(unsigned int fd, fmode_t mask,
+				  unsigned int refs)
+{
+	return __fget_files(current->files, fd, mask, refs);
+}
+
 struct file *fget_many(unsigned int fd, unsigned int refs)
 {
 	return __fget(fd, FMODE_PATH, refs);
@@ -748,6 +754,18 @@ struct file *fget_raw(unsigned int fd)
 }
 EXPORT_SYMBOL(fget_raw);
 
+struct file *fget_task(struct task_struct *task, unsigned int fd)
+{
+	struct file *file = NULL;
+
+	task_lock(task);
+	if (task->files)
+		file = __fget_files(task->files, fd, 0, 1);
+	task_unlock(task);
+
+	return file;
+}
+
 /*
  * Lightweight file lookup - no refcnt increment if fd table isn't shared.
  *

include/linux/file.h

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,7 @@ extern void fput(struct file *);
 extern void fput_many(struct file *, unsigned int);
 
 struct file_operations;
+struct task_struct;
 struct vfsmount;
 struct dentry;
 struct inode;
@@ -47,6 +48,7 @@ static inline void fdput(struct fd fd)
 extern struct file *fget(unsigned int fd);
 extern struct file *fget_many(unsigned int fd, unsigned int refs);
 extern struct file *fget_raw(unsigned int fd);
+extern struct file *fget_task(struct task_struct *task, unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
 extern unsigned long __fdget_raw(unsigned int fd);
 extern unsigned long __fdget_pos(unsigned int fd);

include/linux/sched.h

Lines changed: 1 addition & 1 deletion
@@ -917,7 +917,7 @@ struct task_struct {
 
 	/* Signal handlers: */
 	struct signal_struct *signal;
-	struct sighand_struct *sighand;
+	struct sighand_struct __rcu *sighand;
 	sigset_t blocked;
 	sigset_t real_blocked;
 	/* Restored if set_restore_sigmask() was used: */

include/linux/syscalls.h

Lines changed: 1 addition & 0 deletions
@@ -1002,6 +1002,7 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
 asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				      siginfo_t __user *info,
 				      unsigned int flags);
+asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
 
 /*
  * Architecture-specific system calls

include/uapi/asm-generic/unistd.h

Lines changed: 3 additions & 1 deletion
@@ -853,9 +853,11 @@ __SYSCALL(__NR_clone3, sys_clone3)
 
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
+#define __NR_pidfd_getfd 438
+__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 
 #undef __NR_syscalls
-#define __NR_syscalls 438
+#define __NR_syscalls 439
 
 /*
  * 32 bit systems traditionally used different

include/uapi/linux/capability.h

Lines changed: 1 addition & 0 deletions
@@ -301,6 +301,7 @@ struct vfs_ns_cap_data {
 /* Allow more than 64hz interrupts from the real-time clock */
 /* Override max number of consoles on console allocation */
 /* Override max number of keymaps */
+/* Control memory reclaim behavior */
 
 #define CAP_SYS_RESOURCE 24
 

include/uapi/linux/prctl.h

Lines changed: 4 additions & 0 deletions
@@ -234,4 +234,8 @@ struct prctl_mm_map {
 #define PR_GET_TAGGED_ADDR_CTRL 56
 # define PR_TAGGED_ADDR_ENABLE (1UL << 0)
 
+/* Control reclaim behavior when allocating memory */
+#define PR_SET_IO_FLUSHER 57
+#define PR_GET_IO_FLUSHER 58
+
 #endif /* _LINUX_PRCTL_H */

kernel/pid.c

Lines changed: 90 additions & 0 deletions
@@ -578,3 +578,93 @@ void __init pid_idr_init(void)
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
 }
+
+static struct file *__pidfd_fget(struct task_struct *task, int fd)
+{
+	struct file *file;
+	int ret;
+
+	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	if (ret)
+		return ERR_PTR(ret);
+
+	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
+		file = fget_task(task, fd);
+	else
+		file = ERR_PTR(-EPERM);
+
+	mutex_unlock(&task->signal->cred_guard_mutex);
+
+	return file ?: ERR_PTR(-EBADF);
+}
+
+static int pidfd_getfd(struct pid *pid, int fd)
+{
+	struct task_struct *task;
+	struct file *file;
+	int ret;
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (!task)
+		return -ESRCH;
+
+	file = __pidfd_fget(task, fd);
+	put_task_struct(task);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = security_file_receive(file);
+	if (ret) {
+		fput(file);
+		return ret;
+	}
+
+	ret = get_unused_fd_flags(O_CLOEXEC);
+	if (ret < 0)
+		fput(file);
+	else
+		fd_install(ret, file);
+
+	return ret;
+}
+
+/**
+ * sys_pidfd_getfd() - Get a file descriptor from another process
+ *
+ * @pidfd:	the pidfd file descriptor of the process
+ * @fd:		the file descriptor number to get
+ * @flags:	flags on how to get the fd (reserved)
+ *
+ * This syscall gets a copy of a file descriptor from another process
+ * based on the pidfd, and file descriptor number. It requires that
+ * the calling process has the ability to ptrace the process represented
+ * by the pidfd. The process which is having its file descriptor copied
+ * is otherwise unaffected.
+ *
+ * Return: On success, a cloexec file descriptor is returned.
+ *         On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE3(pidfd_getfd, int, pidfd, int, fd,
+		unsigned int, flags)
+{
+	struct pid *pid;
+	struct fd f;
+	int ret;
+
+	/* flags is currently unused - make sure it's unset */
+	if (flags)
+		return -EINVAL;
+
+	f = fdget(pidfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_pid(f.file);
+	if (IS_ERR(pid))
+		ret = PTR_ERR(pid);
+	else
+		ret = pidfd_getfd(pid, fd);
+
+	fdput(f);
+	return ret;
+}
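
For callers, the error paths in the kernel/pid.c hunk above can be summarized in a small userspace helper. This is only a reading aid derived from that hunk: `pidfd_getfd_strerror()` is a hypothetical name, the EMFILE case assumes get_unused_fd_flags() fails because the caller's descriptor limit is reached, and codes coming from security_file_receive() (LSM-dependent) simply fall through to strerror().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: explain an errno value returned by pidfd_getfd(). */
const char *pidfd_getfd_strerror(int err)
{
	switch (err) {
	case EINVAL:
		return "flags was non-zero (reserved, must be 0)";
	case EBADF:
		return "pidfd is not a valid pidfd, or fd is not open in the target";
	case ESRCH:
		return "the target process no longer exists";
	case EPERM:
		return "caller lacks PTRACE_MODE_ATTACH_REALCREDS access to the target";
	case EMFILE:
		return "caller has no free file descriptor slot";
	default:
		/* e.g. an LSM denial from security_file_receive() */
		return strerror(err);
	}
}
```

A caller might log `pidfd_getfd_strerror(errno)` right after a failed call, before any other library call can overwrite errno.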

kernel/signal.c

Lines changed: 1 addition & 1 deletion
@@ -1383,7 +1383,7 @@ struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
 		 * must see ->sighand == NULL.
 		 */
 		spin_lock_irqsave(&sighand->siglock, *flags);
-		if (likely(sighand == tsk->sighand))
+		if (likely(sighand == rcu_access_pointer(tsk->sighand)))
 			break;
 		spin_unlock_irqrestore(&sighand->siglock, *flags);
 	}

kernel/sys.c

Lines changed: 25 additions & 0 deletions
@@ -2261,6 +2261,8 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 	return -EINVAL;
 }
 
+#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2488,6 +2490,29 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = GET_TAGGED_ADDR_CTRL();
 		break;
+	case PR_SET_IO_FLUSHER:
+		if (!capable(CAP_SYS_RESOURCE))
+			return -EPERM;
+
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+
+		if (arg2 == 1)
+			current->flags |= PR_IO_FLUSHER;
+		else if (!arg2)
+			current->flags &= ~PR_IO_FLUSHER;
+		else
+			return -EINVAL;
+		break;
+	case PR_GET_IO_FLUSHER:
+		if (!capable(CAP_SYS_RESOURCE))
+			return -EPERM;
+
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+
+		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
+		break;
 	default:
 		error = -EINVAL;
 		break;
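
From userspace, the two new prctl() commands in the hunk above could be driven roughly like this. It is a sketch under assumptions: `io_flusher_set()` and `io_flusher_get()` are hypothetical helper names, and the fallback defines simply mirror the values 57 and 58 added to include/uapi/linux/prctl.h for builds against older headers.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values taken from the prctl.h hunk. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * state must be 0 (clear) or 1 (set); any other value is rejected with
 * EINVAL. The caller needs CAP_SYS_RESOURCE, otherwise EPERM is returned.
 * Returns 0 on success, -1 with errno set on failure.
 */
int io_flusher_set(unsigned long state)
{
	/* arg3..arg5 must be zero or the kernel returns EINVAL. */
	return prctl(PR_SET_IO_FLUSHER, state, 0UL, 0UL, 0UL);
}

/* Returns 1 if both flags are set, 0 if not, -1 with errno set on failure. */
int io_flusher_get(void)
{
	return prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);
}
```

A storage daemon in the IO path would presumably call `io_flusher_set(1)` once at startup, before it begins servicing requests. Note that the get command also requires CAP_SYS_RESOURCE, so an unprivileged query fails with EPERM instead of reporting 0.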

tools/testing/selftests/pidfd/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ pidfd_open_test
 pidfd_poll_test
 pidfd_test
 pidfd_wait
+pidfd_getfd_test
Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -g -I../../../../usr/include/ -pthread
 
-TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait
+TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait pidfd_getfd_test
 
 include ../lib.mk
 

tools/testing/selftests/pidfd/pidfd.h

Lines changed: 9 additions & 0 deletions
@@ -36,6 +36,10 @@
 #define __NR_clone3 -1
 #endif
 
+#ifndef __NR_pidfd_getfd
+#define __NR_pidfd_getfd -1
+#endif
+
 /*
  * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
  * That means, when it wraps around any pid < 300 will be skipped.
@@ -84,4 +88,9 @@ static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
 	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
 }
 
+static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
+{
+	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
+}
+
 #endif /* __PIDFD_H */
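
A wrapper of this shape can be exercised without a second process, since a task is allowed to inspect itself. The sketch below is not part of the selftests: `getfd_is_cloexec()` is a hypothetical check that the returned descriptor carries FD_CLOEXEC even when the original does not, as the kernel-doc comment states. The -1 fallbacks follow the header's convention, so on headers without these syscall numbers the calls fail with ENOSYS; the `__NR_pidfd_open` fallback is an assumption modeled on the ones shown in the hunk.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same convention as pidfd.h: an unknown syscall number becomes -1. */
#ifndef __NR_pidfd_open
#define __NR_pidfd_open -1
#endif
#ifndef __NR_pidfd_getfd
#define __NR_pidfd_getfd -1
#endif

static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
{
	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
}

/*
 * Returns 1 if the descriptor obtained via pidfd_getfd() has FD_CLOEXEC
 * set, 0 if it does not, and -1 if any step fails (for example because
 * the kernel or a sandbox does not permit the syscalls).
 */
int getfd_is_cloexec(void)
{
	int pipefd[2];
	int pidfd, copy, fdflags, result = -1;

	/* pipe() descriptors are created without FD_CLOEXEC. */
	if (pipe(pipefd) < 0)
		return -1;

	pidfd = syscall(__NR_pidfd_open, getpid(), 0U);
	if (pidfd >= 0) {
		copy = sys_pidfd_getfd(pidfd, pipefd[0], 0);
		if (copy >= 0) {
			fdflags = fcntl(copy, F_GETFD);
			if (fdflags >= 0)
				result = (fdflags & FD_CLOEXEC) ? 1 : 0;
			close(copy);
		}
		close(pidfd);
	}

	close(pipefd[0]);
	close(pipefd[1]);
	return result;
}
```

On a kernel with this merge applied and no sandbox restrictions, the function is expected to take the success path and report that the flag is set.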
