Commit 76ce2c9

nickalcock authored and jfvogel committed
waitfd: new syscall implementing waitpid() over fds
This syscall, originally due to Casey Dahlin but significantly modified since, is called quite like waitid():

	fd = waitfd(P_PID, some_pid, WEXITED | WSTOPPED, 0);

This returns a file descriptor which becomes ready whenever waitpid() would return and which, when read(), returns the value waitpid() would have returned. (Alternatively, you can use it as a pure indication that waitpid() is callable without hanging, and then call waitpid().) See the example in tools/testing/selftests/waitfd/.

The original reason for rejection of this patch back in 2009 was that it was redundant to waitpid()ing in a separate thread and transmitting process information to another thread that polls. But this is only the case for the conventional child-process use of waitpid(). Other waitpid() uses, such as ptrace() returns, are targeted at a single thread, so without waitfd or something like it, it is impossible to have a thread that both accepts requests for servicing from other threads over an fd *and* manipulates the state of a ptrace()d process in response to those requests without ugly CPU-chewing polling (accepting requests requires blocking in poll() or select(); handling the ptraced process requires blocking in waitpid()).

There is one ugliness in this patch which I would appreciate suggestions to improve (due to me, not due to Casey, don't blame him). The poll() machinery expects to be used with files, or things enough like files that the wake_up key contains an indication as to whether this wakeup corresponds to a POLLIN / POLLOUT / POLLERR event on this fd. You can override this in your poll_queue_proc, but the poll() and epoll() queue procs both have this interpretation. Unfortunately, this is not true for waitfds, which wait on the wait_chldexit waitqueue, whose key is a pointer to the task_struct of the task being killed. We can't do anything with this key, but we certainly don't want the poll machinery treating it as a bitmask and checking it against poll events!

So we introduce a new poll_wait() analogue, poll_wait_fixed(). This is used for poll_wait() calls which know they must wait on waitqueues whose keys are not a typecast representation of poll events, and passes in an extra argument to the poll_queue_proc, which if nonzero is the event which a wakeup on this waitqueue should be considered as equivalent to. The poll_queue_proc can then skip adding entirely if that fixed event is not included in the set to be caught by this poll().

We also add a new poll_table_entry.fixed_key. The poll_queue_proc can record the fixed key it is passed in here, and reuse it at wakeup time to track that a nonzero fixed key was passed in to poll_wait_fixed() and that the key should be ignored in preference to fixed_key.

With this in place, you can say, e.g. (as waitfd does)

	poll_wait_fixed(file, &current->signal->wait_chldexit, wait, POLLIN);

and the key passed to wakeups on the wait_chldexit waitqueue will be ignored: the fd will always be treated as having raised POLLIN, waking up poll()s and epoll()s that have specified that event. (Obviously, a poll function that calls this should return the same value from the poll function as was passed to poll_wait_fixed(), or, as usual, zero if this was a spurious wakeup.)

I do not like this scheme: it's sufficiently arcane that I had to go back to my old commit messages to figure out what it was doing and why. But I don't see another way to cause poll() to return on appropriate activity on waitqueues that do not actually correspond to files. (I do wonder how signalfd works. It doesn't seem to need any of this and I don't understand why not. I would be overjoyed to remove the whole invasive poll_wait_fixed() mess, but I'm not sure what to replace it with.)

Orabug: 30544408
Signed-off-by: Nick Alcock <[email protected]>
Signed-off-by: Tomas Jedlicka <[email protected]>
Reviewed-by: Kris Van Hees <[email protected]>
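The use case described in the message (one thread blocking on both a request fd and process state changes) might look roughly like the following user-space sketch. It is not taken from the selftest. `waitfd_open()`, `wait_for_either()` and the `READY_*` constants are hypothetical names. The syscall number 473 is the temporary x86 number from the tables in this commit, so on a kernel without this patch the wrapper is expected to fail (typically with ENOSYS).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Temporary syscall number from this commit's x86 syscall tables. */
#define WAITFD_SYSCALL_NR 473

#define READY_WAIT    1	/* the waitfd (or stand-in fd) is readable */
#define READY_REQUEST 2	/* the request fd is readable */

/*
 * Hypothetical wrapper around the raw syscall. Returns a new fd, or -1
 * with errno set (for example ENOSYS on kernels without the patch).
 */
static int waitfd_open(pid_t pid, int options)
{
	return (int)syscall(WAITFD_SYSCALL_NR, P_PID, pid, options, 0);
}

/*
 * Block in a single poll() on both descriptors. Returns a READY_* bitmask,
 * 0 on timeout, or -1 on error. Any pollable fd works for wait_fd, which
 * makes the function usable with pipes when no waitfd is available.
 */
static int wait_for_either(int wait_fd, int request_fd, int timeout_ms)
{
	struct pollfd fds[2] = {
		{ .fd = wait_fd, .events = POLLIN },
		{ .fd = request_fd, .events = POLLIN },
	};
	int ready = 0;
	int n;

	assert(wait_fd >= 0 && request_fd >= 0);

	do {
		n = poll(fds, 2, timeout_ms);
	} while (n < 0 && errno == EINTR);
	if (n < 0)
		return -1;

	if (fds[0].revents & POLLIN)
		ready |= READY_WAIT;
	if (fds[1].revents & POLLIN)
		ready |= READY_REQUEST;
	return ready;
}
```

A caller would open the waitfd once, then loop on `wait_for_either()`, servicing requests when `READY_REQUEST` is set and reading statuses (or calling waitpid()) when `READY_WAIT` is set.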
1 parent 6b71b0c commit 76ce2c9

21 files changed, +372 -18 lines changed

arch/x86/entry/syscalls/syscall_32.tbl

Lines changed: 3 additions & 0 deletions
@@ -440,3 +440,6 @@
 433	i386	fspick	sys_fspick	__ia32_sys_fspick
 434	i386	pidfd_open	sys_pidfd_open	__ia32_sys_pidfd_open
 435	i386	clone3	sys_clone3	__ia32_sys_clone3
+# This one is a temporary number, designed for no clashes.
+# Nothing but DTrace should use it.
+473	i386	waitfd	sys_waitfd	__ia32_sys_waitfd

arch/x86/entry/syscalls/syscall_64.tbl

Lines changed: 3 additions & 0 deletions
@@ -357,6 +357,9 @@
 433	common	fspick	__x64_sys_fspick
 434	common	pidfd_open	__x64_sys_pidfd_open
 435	common	clone3	__x64_sys_clone3/ptregs
+# This one is a temporary number, designed for no clashes.
+# Nothing but DTrace should use it.
+473	common	waitfd	__x64_sys_waitfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact

drivers/vfio/virqfd.c

Lines changed: 2 additions & 1 deletion
@@ -76,7 +76,8 @@ static int virqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void
 }
 
 static void virqfd_ptable_queue_proc(struct file *file,
-				     wait_queue_head_t *wqh, poll_table *pt)
+				     wait_queue_head_t *wqh, poll_table *pt,
+				     unsigned long unused)
 {
 	struct virqfd *virqfd = container_of(pt, struct virqfd, pt);
 	add_wait_queue(wqh, &virqfd->wait);

drivers/vhost/vhost.c

Lines changed: 1 addition & 1 deletion
@@ -156,7 +156,7 @@ static void vhost_flush_work(struct vhost_work *work)
 }
 
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
-			    poll_table *pt)
+			    poll_table *pt, unsigned long unused)
 {
 	struct vhost_poll *poll;
 

fs/Makefile

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
 obj-$(CONFIG_TIMERFD) += timerfd.o
 obj-$(CONFIG_EVENTFD) += eventfd.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
+obj-$(CONFIG_WAITFD) += waitfd.o
 obj-$(CONFIG_AIO) += aio.o
 obj-$(CONFIG_IO_URING) += io_uring.o
 obj-$(CONFIG_FS_DAX) += dax.o

fs/aio.c

Lines changed: 1 addition & 1 deletion
@@ -1699,7 +1699,7 @@ struct aio_poll_table {
 
 static void
 aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
-		struct poll_table_struct *p)
+		struct poll_table_struct *p, unsigned long fixed_event)
 {
 	struct aio_poll_table *pt = container_of(p, struct aio_poll_table, pt);
 

fs/eventpoll.c

Lines changed: 19 additions & 3 deletions
@@ -157,6 +157,9 @@ struct epitem {
 	/* Number of active wait queue attached to poll operations */
 	int nwait;
 
+	/* fd always raises this fixed event. */
+	unsigned long fixed_event;
+
 	/* List containing poll wait queues */
 	struct list_head pwqlist;
 
@@ -874,7 +877,7 @@ static int ep_eventpoll_release(struct inode *inode, struct file *file)
 static __poll_t ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
 				   void *priv);
 static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
-				 poll_table *pt);
+				 poll_table *pt, unsigned long fixed_event);
 
 /*
  * Differs from ep_eventpoll_poll() in that internal callers already have
@@ -1290,6 +1293,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	if (!(epi->event.events & EPOLLEXCLUSIVE))
 		ewake = 1;
 
+	/*
+	 * If this fd type has a hardwired event which should override the key
+	 * (e.g. if it is waiting on a non-file waitqueue), jam it in here.
+	 */
+	if (epi->fixed_event)
+		key = (void *)epi->fixed_event;
+
 	if (pollflags & POLLFREE) {
 		/*
 		 * If we race with ep_remove_wait_queue() it can miss
@@ -1314,11 +1324,17 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
  * target file wakeup lists.
  */
 static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
-				 poll_table *pt)
+				 poll_table *pt, unsigned long fixed_event)
 {
 	struct epitem *epi = ep_item_from_epqueue(pt);
 	struct eppoll_entry *pwq;
 
+	if (fixed_event && !(epi->event.events & fixed_event))
+		return;
+
+	if (fixed_event)
+		epi->fixed_event = fixed_event;
+
 	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
@@ -1518,6 +1534,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	ep_set_ffd(&epi->ffd, tfile, fd);
 	epi->event = *event;
 	epi->nwait = 0;
+	epi->fixed_event = 0;
 	epi->next = EP_UNACTIVE_PTR;
 	if (epi->event.events & EPOLLWAKEUP) {
 		error = ep_create_wakeup_source(epi);
@@ -2379,7 +2396,6 @@ static int __init eventpoll_init(void)
 	 * We can have many thousands of epitems, so prevent this from
 	 * using an extra cache line on 64-bit (and smaller) CPUs
 	 */
-	BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128);
 
 	/* Allocates slab cache used to allocate "struct epitem" items */
 	epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),

fs/io_uring.c

Lines changed: 1 addition & 1 deletion
@@ -1815,7 +1815,7 @@ struct io_poll_table {
 };
 
 static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
-			       struct poll_table_struct *p)
+			       struct poll_table_struct *p, unsigned long fixed_event)
 {
 	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
 

fs/select.c

Lines changed: 18 additions & 3 deletions
@@ -116,7 +116,7 @@ struct poll_table_page {
  * poll table.
  */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
-		       poll_table *p);
+		       poll_table *p, unsigned long fixed_event);
 
 void poll_initwait(struct poll_wqueues *pwq)
 {
@@ -212,22 +212,37 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key
 	struct poll_table_entry *entry;
 
 	entry = container_of(wait, struct poll_table_entry, wait);
+
+	/*
+	 * If this fd type has a hardwired key which should override the key
+	 * (e.g. if it is waiting on a non-file waitqueue), jam it in here.
+	 */
+	if (entry->fixed_key)
+		key = (void *)entry->fixed_key;
+
 	if (key && !(key_to_poll(key) & entry->key))
 		return 0;
 	return __pollwake(wait, mode, sync, key);
 }
 
 /* Add a new entry */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
-				poll_table *p)
+				poll_table *p, unsigned long fixed_event)
 {
 	struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
-	struct poll_table_entry *entry = poll_get_entry(pwq);
+	struct poll_table_entry *entry;
+
+	if (fixed_event && !(p->_key & fixed_event))
+		return;
+
+	entry = poll_get_entry(pwq);
 	if (!entry)
 		return;
+
 	entry->filp = get_file(filp);
 	entry->wait_address = wait_address;
 	entry->key = p->_key;
+	entry->fixed_key = fixed_event;
 	init_waitqueue_func_entry(&entry->wait, pollwake);
 	entry->wait.private = pwq;
 	add_wait_queue(wait_address, &entry->wait);
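The two checks this hunk adds to `__pollwait()` and `pollwake()` can be tried out in user space. The sketch below is only a model, not kernel code. `struct model_entry`, `model_queue()` and `model_wake()` are hypothetical stand-ins for `struct poll_table_entry` and the two kernel functions. It shows why a wakeup key that is really a pointer gets rejected unless a fixed key overrides it.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the fields of struct poll_table_entry that matter here. */
struct model_entry {
	unsigned long key;	 /* events the poller asked for (p->_key) */
	unsigned long fixed_key; /* nonzero if queued via poll_wait_fixed() */
};

/*
 * Queue-time filter, modelled on __pollwait(): an fd with a fixed event is
 * not queued at all when the poller did not ask for that event.
 * Returns false when the entry would be skipped.
 */
static bool model_queue(struct model_entry *e, unsigned long requested,
			unsigned long fixed_event)
{
	assert(e != NULL);

	if (fixed_event && !(requested & fixed_event))
		return false;

	e->key = requested;
	e->fixed_key = fixed_event;
	return true;
}

/*
 * Wake-time filter, modelled on pollwake(): a nonzero fixed key replaces
 * whatever key the waker passed before the usual bitmask comparison.
 * Returns true when the poller would be woken.
 */
static bool model_wake(const struct model_entry *e, unsigned long wake_key)
{
	assert(e != NULL);

	if (e->fixed_key)
		wake_key = e->fixed_key;
	if (wake_key && !(wake_key & e->key))
		return false;
	return true;
}
```

With a pointer-like wakeup key such as `0x12345600` and a poller interested in `POLLIN`, `model_wake()` rejects the wakeup when the entry was queued with a fixed event of 0 and accepts it when the fixed event is `POLLIN`.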

fs/waitfd.c

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * fs/waitfd.c
+ *
+ * Copyright (C) 2008 Red Hat, Casey Dahlin <[email protected]>
+ *
+ * Largely derived from fs/signalfd.c
+ */
+
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/signal.h>
+#include <linux/list.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+
+long kernel_wait4(pid_t upid, int __user *stat_addr,
+		  int options, struct rusage __user *ru);
+
+struct waitfd_ctx {
+	int	options;
+	pid_t	upid;
+};
+
+static int waitfd_release(struct inode *inode, struct file *file)
+{
+	kfree(file->private_data);
+	return 0;
+}
+
+static unsigned int waitfd_poll(struct file *file, poll_table *wait)
+{
+	struct waitfd_ctx *ctx = file->private_data;
+	long value;
+
+	poll_wait_fixed(file, &current->signal->wait_chldexit, wait,
+		POLLIN);
+
+	value = kernel_wait4(ctx->upid, NULL, ctx->options | WNOHANG | WNOWAIT,
+			     NULL);
+	if (value > 0 || value == -ECHILD)
+		return POLLIN | POLLRDNORM;
+
+	return 0;
+}
+
+/*
+ * Returns a multiple of the size of a stat_addr, or a negative error code. The
+ * "count" parameter must be at least sizeof(int).
+ */
+static ssize_t waitfd_read(struct file *file, char __user *buf, size_t count,
+			   loff_t *ppos)
+{
+	struct waitfd_ctx *ctx = file->private_data;
+	int __user *stat_addr = (int *)buf;
+	int flags = ctx->options;
+	ssize_t ret, total = 0;
+
+	count /= sizeof(int);
+	if (!count)
+		return -EINVAL;
+
+	if (file->f_flags & O_NONBLOCK)
+		flags |= WNOHANG;
+
+	do {
+		ret = kernel_wait4(ctx->upid, stat_addr, flags, NULL);
+		if (ret == 0)
+			ret = -EAGAIN;
+		if (ret == -ECHILD)
+			ret = 0;
+		if (ret <= 0)
+			break;
+
+		stat_addr++;
+		total += sizeof(int);
+	} while (--count);
+
+	return total ? total : ret;
+}
+
+static const struct file_operations waitfd_fops = {
+	.release	= waitfd_release,
+	.poll		= waitfd_poll,
+	.read		= waitfd_read,
+	.llseek		= noop_llseek,
+};
+
+SYSCALL_DEFINE4(waitfd, int __maybe_unused, which, pid_t, upid, int, options,
+		int __maybe_unused, flags)
+{
+	int ufd;
+	struct waitfd_ctx *ctx;
+
+	/*
+	 * Options validation from kernel_wait4(), minus WNOWAIT, which is
+	 * only used by our polling implementation.  If WEXITED or WSTOPPED
+	 * are provided, silently remove them (for backward compatibility with
+	 * older callers).
+	 */
+	options &= ~(WEXITED | WSTOPPED);
+	if (options & ~(WNOHANG|WUNTRACED|WCONTINUED|
+			__WNOTHREAD|__WCLONE|__WALL))
+		return -EINVAL;
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->options = options;
+	ctx->upid = upid;
+
+	ufd = anon_inode_getfd("[waitfd]", &waitfd_fops, ctx,
+			       O_RDWR | flags | ((options & WNOHANG) ?
+						 O_NONBLOCK | 0 : 0));
+	/*
+	 * Use the fd's nonblocking state from now on, since that can change.
+	 */
+	ctx->options &= ~WNOHANG;
+
+	if (ufd < 0)
+		kfree(ctx);
+
+	return ufd;
+}
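The poll and read paths above can be approximated in user space with the ordinary wait calls, which runs on any Linux kernel. The sketch below is an analogue only, under these assumptions: `status_readable()` and `read_statuses()` are hypothetical names, errors are reported via `-1` and `errno` instead of negative return codes, and the peek handles only a specific child PID (the kernel code passes `upid` straight to `kernel_wait4()`).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Analogue of waitfd_poll(): peek with WNOHANG | WNOWAIT so that nothing is
 * reaped. As in the kernel code, "no such child" (ECHILD) also counts as
 * readable, because a read would then return 0.
 */
static bool status_readable(pid_t pid)
{
	siginfo_t info = { 0 };

	if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOHANG | WNOWAIT) < 0)
		return errno == ECHILD;
	return info.si_pid != 0;	/* stays 0 when nothing is pending */
}

/*
 * Analogue of waitfd_read(): store one int status per reaped child until the
 * buffer is full or nothing more can be reaped. Returns the number of bytes
 * stored, 0 when there are no children left, or -1 with errno set.
 */
static ssize_t read_statuses(pid_t pid, int *buf, size_t count, bool nonblock)
{
	int flags = nonblock ? WNOHANG : 0;
	ssize_t ret = 0, total = 0;

	assert(buf != NULL);

	count /= sizeof(int);
	if (!count) {
		errno = EINVAL;
		return -1;
	}

	do {
		pid_t got = waitpid(pid, buf, flags);

		if (got == 0) {		/* WNOHANG and nothing pending */
			errno = EAGAIN;
			ret = -1;
			break;
		}
		if (got < 0) {		/* ECHILD maps to end-of-file */
			ret = (errno == ECHILD) ? 0 : -1;
			break;
		}
		buf++;
		total += (ssize_t)sizeof(int);
	} while (--count);

	return total ? total : ret;
}
```

As in `waitfd_read()`, a partial result wins over an error: if at least one status was stored, the byte count is returned and the error surfaces on the next call.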

include/linux/poll.h

Lines changed: 12 additions & 2 deletions
@@ -34,7 +34,8 @@ struct poll_table_struct;
 /*
  * structures and helpers for f_op->poll implementations
  */
-typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
+typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *,
+				struct poll_table_struct *, unsigned long fixed_event);
 
 /*
  * Do not touch the structure directly, use the access functions
@@ -48,7 +49,15 @@ typedef struct poll_table_struct {
 static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
 {
 	if (p && p->_qproc && wait_address)
-		p->_qproc(filp, wait_address, p);
+		p->_qproc(filp, wait_address, p, 0);
+}
+
+static inline void poll_wait_fixed(struct file *filp,
+				   wait_queue_head_t *wait_address, poll_table *p,
+				   unsigned long fixed_event)
+{
+	if (p && p->_qproc && wait_address)
+		p->_qproc(filp, wait_address, p, fixed_event);
 }
 
 /*
@@ -93,6 +102,7 @@ static inline __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
 struct poll_table_entry {
 	struct file *filp;
 	__poll_t key;
+	unsigned long fixed_key;
 	wait_queue_entry_t wait;
 	wait_queue_head_t *wait_address;
 };

include/linux/syscalls.h

Lines changed: 1 addition & 0 deletions
@@ -1420,5 +1420,6 @@ long ksys_old_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);
 long compat_ksys_semtimedop(int semid, struct sembuf __user *tsems,
 			    unsigned int nsops,
 			    const struct old_timespec32 __user *timeout);
+long sys_waitfd(int which, pid_t upid, int options, int flags);
 
 #endif
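The `options` argument in this prototype is filtered by the syscall body in fs/waitfd.c before it is stored. A user-space model of that filtering, using the flag definitions from `<sys/wait.h>`, might look like the sketch below. `waitfd_filter_options()` is a hypothetical name, and the function only mirrors the mask logic shown in the diff.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/wait.h>

/*
 * Mirror of the option handling in the waitfd syscall body: WEXITED and
 * WSTOPPED are dropped silently, and anything outside the wait4()-style set
 * is rejected. On success the filtered value is stored in *filtered and 0 is
 * returned. On failure -EINVAL is returned.
 *
 * Note: the Linux headers give WSTOPPED and WUNTRACED the same value, so
 * dropping WSTOPPED drops WUNTRACED as well.
 */
static int waitfd_filter_options(int options, int *filtered)
{
	assert(filtered != NULL);

	options &= ~(WEXITED | WSTOPPED);
	if (options & ~(WNOHANG | WUNTRACED | WCONTINUED |
			__WNOTHREAD | __WCLONE | __WALL))
		return -EINVAL;

	*filtered = options;
	return 0;
}
```

For example, `WEXITED | WNOHANG` filters down to `WNOHANG`, while `WNOWAIT` is rejected because it is not part of the accepted set.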

include/uapi/asm-generic/unistd.h

Lines changed: 4 additions & 1 deletion
@@ -851,8 +851,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 __SYSCALL(__NR_clone3, sys_clone3)
 #endif
 
+#define __NR_waitfd 473
+__SYSCALL(__NR_waitfd, sys_waitfd)
+
 #undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 474
 
 /*
  * 32 bit systems traditionally used different

init/Kconfig

Lines changed: 16 additions & 0 deletions
@@ -1517,6 +1517,22 @@ config EPOLL
 	  Disabling this option will cause the kernel to be built without
 	  support for epoll family of system calls.
 
+config WAITFD
+	bool "Enable waitfd() system call" if EXPERT
+	select ANON_INODES
+	default n
+	help
+	  Enable the waitfd() system call that allows receiving child state
+	  changes from a file descriptor.  This permits use of poll() to
+	  monitor waitpid() output simultaneously with other fd state changes,
+	  even if the waitpid() output is coming from thread-targeted sources
+	  such as ptrace().
+
+	  Note: this system call is not upstream: its syscall number is not
+	  finalized, so the call itself should only be used with caution.
+
+	  If unsure, say N.
+
 config SIGNALFD
 	bool "Enable signalfd() system call" if EXPERT
 	default y
