
Commit a16ceb1

Benjamin Segall authored and akpm00 committed
epoll: autoremove wakers even more aggressively
If a process is killed or otherwise exits while having active network
connections and many threads waiting on epoll_wait, the threads will all
be woken immediately, but not removed from ep->wq.  Then when network
traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will
not remove the entries from the list.

This means that the cost of the wakeup attempt is far higher than usual,
does not decrease, and this also competes with the dying threads trying
to actually make progress and remove themselves from the wq.

Handle this by removing visited epoll wq entries unconditionally, rather
than only when the wakeup succeeds - the structure of ep_poll means that
the only potential loss is the timed_out->eavail heuristic, which now can
race and result in a redundant ep_send_events attempt.  (But only when
incoming data and a timeout actually race, not on every timeout.)

Shakeel added:

: We are seeing this issue in production with real workloads and it has
: caused hard lockups.  Particularly network heavy workloads with a lot
: of threads in epoll_wait() can easily trigger this issue if they get
: killed (oom-killed in our case).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ben Segall <[email protected]>
Tested-by: Shakeel Butt <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Roman Penyaev <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Khazhismel Kumykov <[email protected]>
Cc: Heiher <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
1 parent 2c795fb commit a16ceb1
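To make the triggering pattern concrete, here is a hypothetical userspace sketch. It is not from the commit or the lore thread: NR_WAITERS, the socketpair traffic source, and the fork/SIGKILL timing are all illustrative choices, and it only sets up the shape of the workload; the hard lockups reported above needed production-scale thread counts and real network traffic.

/*
 * Hypothetical reproducer sketch (illustrative only, not from the
 * commit).  Many threads block in epoll_wait() on one epoll instance,
 * each parking a wait_queue_entry on ep->wq; the whole process is then
 * SIGKILLed while a surviving child keeps the watched fd active.
 * Build with: cc -pthread -o repro repro.c
 */
#include <pthread.h>
#include <signal.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define NR_WAITERS 512	/* arbitrary; a longer ep->wq makes scans worse */

static int epfd;

static void *waiter(void *arg)
{
	struct epoll_event ev;

	/* Every thread blocked here has a wait_queue_entry on ep->wq. */
	for (;;)
		epoll_wait(epfd, &ev, 1, -1);
	return NULL;
}

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	pthread_t tid;
	int sv[2], i;

	socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
	epfd = epoll_create1(0);
	ev.data.fd = sv[0];
	epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);

	for (i = 0; i < NR_WAITERS; i++)
		pthread_create(&tid, NULL, waiter, NULL);

	if (fork() == 0) {
		/*
		 * The child survives the parent's SIGKILL (and keeps the
		 * inherited epfd alive), so each write keeps driving the
		 * wakeup path over ep->wq while the killed waiters are,
		 * before this fix, still queued on it.
		 */
		for (;;)
			write(sv[1], "x", 1);
	}

	sleep(1);
	/* Kill every waiter at once, as the OOM killer would. */
	kill(getpid(), SIGKILL);
	return 0;
}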

File tree

1 file changed (+22, -0 lines)


fs/eventpoll.c

Lines changed: 22 additions & 0 deletions

@@ -1747,6 +1747,21 @@ static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
 	return to;
 }
 
+/*
+ * autoremove_wake_function, but remove even on failure to wake up, because we
+ * know that default_wake_function/ttwu will only fail if the thread is already
+ * woken, and in that case the ep_poll loop will remove the entry anyways, not
+ * try to reuse it.
+ */
+static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
+				       unsigned int mode, int sync, void *key)
+{
+	int ret = default_wake_function(wq_entry, mode, sync, key);
+
+	list_del_init(&wq_entry->entry);
+	return ret;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
@@ -1828,8 +1843,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 * normal wakeup path no need to call __remove_wait_queue()
 		 * explicitly, thus ep->lock is not taken, which halts the
 		 * event delivery.
+		 *
+		 * In fact, we now use an even more aggressive function that
+		 * unconditionally removes, because we don't reuse the wait
+		 * entry between loop iterations. This lets us also avoid the
+		 * performance issue if a process is killed, causing all of its
+		 * threads to wake up without being removed normally.
 		 */
 		init_wait(&wait);
+		wait.func = ep_autoremove_wake_function;
 
 		write_lock_irq(&ep->lock);
 		/*
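For comparison, the stock autoremove_wake_function() that init_wait()-based code pairs with autoremove semantics lives in kernel/sched/wait.c and unlinks the entry only when the wakeup succeeds (shown here as a reference sketch of contemporaneous kernels, not part of this diff):

int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode,
			     int sync, void *key)
{
	int ret = default_wake_function(wq_entry, mode, sync, key);

	/* Unlink only on a successful wakeup - the behavior this commit drops. */
	if (ret)
		list_del_init(&wq_entry->entry);
	return ret;
}

A failed default_wake_function() means the task was already running, and in that case the ep_poll loop unlinks its own entry under ep->lock rather than reusing it, which is why the new helper can delete unconditionally without leaving dead waiters on ep->wq.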
