
Commit 3201de4

Merge branch 'report-rcu-qs-for-busy-network-kthreads'
Yan Zhai says:

====================
Report RCU QS for busy network kthreads

This changeset fixes a common problem for busy networking kthreads. These threads, e.g. NAPI threads, typically will:

* poll a batch of packets
* if there is more work, call cond_resched() to allow scheduling
* continue to poll more packets when the rx queue is not empty

We observed this being a problem in production, since it can block RCU tasks from making progress under heavy load. Investigation indicates that merely calling cond_resched() is insufficient for RCU tasks to reach quiescent states. It also has the side effect of frequently clearing the TIF_NEED_RESCHED flag on voluntary-preempt kernels. As a result, schedule() will not be called in these circumstances, even though schedule() does in fact provide the required quiescent states. This affects at least NAPI threads, napi_busy_loop, and the cpumap kthread.

By reporting RCU QSes in these kthreads periodically before cond_resched(), the blocked RCU waiters can correctly make progress. Instead of reporting a QS for RCU Tasks alone, this code shares the same concern noted in commit d28139c ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe"), so a consolidated QS is reported for safety.

It is worth noting that, although this problem is reproducible in napi_busy_loop, it only shows up when the polling interval is set as high as 2ms, which is far larger than the 50us-100us recommended in the documentation. So napi_busy_loop is left untouched.

Lastly, this does not affect RT kernels, which do not enter the scheduler through cond_resched(). Without the side effect mentioned above, schedule() will be called from time to time, clearing the RCU task holdouts.
V4: https://lore.kernel.org/bpf/[email protected]/
V3: https://lore.kernel.org/lkml/[email protected]/t/
V2: https://lore.kernel.org/bpf/[email protected]/
V1: https://lore.kernel.org/lkml/[email protected]/#t
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2 parents f7bf0ec + 00bf631 commit 3201de4

File tree

3 files changed: +37 -0 lines changed

include/linux/rcupdate.h

Lines changed: 31 additions & 0 deletions
@@ -247,6 +247,37 @@ do { \
 		cond_resched(); \
 } while (0)
 
+/**
+ * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
+ * @old_ts: jiffies at start of processing.
+ *
+ * This helper is for long-running softirq handlers, such as NAPI threads in
+ * networking. The caller should initialize the variable passed in as @old_ts
+ * at the beginning of the softirq handler. When invoked frequently, this macro
+ * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will
+ * provide both RCU and RCU-Tasks quiescent states. Note that this macro
+ * modifies its old_ts argument.
+ *
+ * Because regions of code that have disabled softirq act as RCU read-side
+ * critical sections, this macro should be invoked with softirq (and
+ * preemption) enabled.
+ *
+ * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
+ * have more chance to invoke schedule() calls and provide necessary quiescent
+ * states. As a contrast, calling cond_resched() only won't achieve the same
+ * effect because cond_resched() does not provide RCU-Tasks quiescent states.
+ */
+#define rcu_softirq_qs_periodic(old_ts) \
+do { \
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
+	    time_after(jiffies, (old_ts) + HZ / 10)) { \
+		preempt_disable(); \
+		rcu_softirq_qs(); \
+		preempt_enable(); \
+		(old_ts) = jiffies; \
+	} \
+} while (0)
+
 /*
  * Infrastructure to implement the synchronize_() primitives in
  * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.

kernel/bpf/cpumap.c

Lines changed: 3 additions & 0 deletions
@@ -263,6 +263,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 static int cpu_map_kthread_run(void *data)
 {
 	struct bpf_cpu_map_entry *rcpu = data;
+	unsigned long last_qs = jiffies;
 
 	complete(&rcpu->kthread_running);
 	set_current_state(TASK_INTERRUPTIBLE);
@@ -288,10 +289,12 @@ static int cpu_map_kthread_run(void *data)
 		if (__ptr_ring_empty(rcpu->queue)) {
 			schedule();
 			sched = 1;
+			last_qs = jiffies;
 		} else {
 			__set_current_state(TASK_RUNNING);
 		}
 	} else {
+		rcu_softirq_qs_periodic(last_qs);
 		sched = cond_resched();
 	}
net/core/dev.c

Lines changed: 3 additions & 0 deletions
@@ -6743,6 +6743,8 @@ static int napi_threaded_poll(void *data)
 	void *have;
 
 	while (!napi_thread_wait(napi)) {
+		unsigned long last_qs = jiffies;
+
 		for (;;) {
 			bool repoll = false;
 
@@ -6767,6 +6769,7 @@ static int napi_threaded_poll(void *data)
 			if (!repoll)
 				break;
 
+			rcu_softirq_qs_periodic(last_qs);
 			cond_resched();
 		}
 	}
