ktask: let helpers finish asynchronously to avoid deadlock
UEK6 testing uncovered a rare hang during deferred page init in a KVM
guest. The hang reproduces more readily with some CPU counts and memory
sizes than others; for example, on a 72-CPU system, 46-CPU guests
would hit it every ten to one hundred tries, but 44-CPU, 32-CPU, 16-CPU,
and 8-CPU guests survived hundreds of reboots. UEK5 is not affected.
The problem is a deadlock that arises because the main page init thread
(pgdatinitN) holds the spinlock node_size_lock for the whole operation.
It can happen two ways.
First, when the main thread calls into ktask, ktask queues one work item
per helper thread. When the workqueue layer wakes up a worker or
kthreadd (to create a worker), the scheduler occasionally places it on
the same runqueue as the main thread. The worker or kthreadd then never
runs, because the main thread is busy-waiting for either to finish.
Second, an interrupt handler can pin a worker or kthreadd, exhaust the
memory in the deferred init zone by attempting a large allocation, and
spin in deferred_grow_zone() on node_size_lock, which the main
thread can release only when kthreadd or the worker finishes.
The first scenario was observed on a UEK6 kernel; the second is
theoretically possible.
Meanwhile the rest of the system can be idle at this phase of boot, in
which case the scheduler can't move threads waiting on the runqueue
elsewhere.
The proper fix will probably involve scheduler changes, but as a stopgap
due to the timing of the UEK6 release, relieve the main ktask thread
from having to wait on its helpers and avoid the deadlock by refcounting
a dynamically allocated ktask_task object. With this approach, page
init may use fewer threads than requested, about as often as the hang
was previously observed.
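
The refcounting idea can be illustrated with a minimal userspace sketch.
It is not the ktask code: struct task_sketch, task_alloc(), task_put(),
and puts_until_freed() are hypothetical names, C11 atomics stand in for
the kernel's refcount helpers, and the helpers are simulated
sequentially rather than run as real threads. The main thread and each
helper hold one reference; the main thread drops its reference and
returns without waiting, and the object is freed by whoever drops the
last one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a dynamically allocated ktask_task. */
struct task_sketch {
	atomic_int refs;	/* one per holder: main thread + each helper */
	int *freed_flag;	/* demo only: set when the object is freed */
};

static struct task_sketch *task_alloc(int nr_helpers, int *freed_flag)
{
	struct task_sketch *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	/* Take every reference up front, before any helper can run. */
	atomic_init(&t->refs, nr_helpers + 1);
	t->freed_flag = freed_flag;
	*freed_flag = 0;
	return t;
}

/* Drop one reference; the last holder frees the object. */
static void task_put(struct task_sketch *t)
{
	int old = atomic_fetch_sub(&t->refs, 1);

	assert(old > 0);
	if (old == 1) {
		*t->freed_flag = 1;
		free(t);
	}
}

/*
 * Simulate the main thread leaving first, then each helper finishing.
 * Returns how many puts it took before the object was freed, or -1 on
 * allocation failure.
 */
int puts_until_freed(int nr_helpers)
{
	int freed = 0;
	int puts = 0;
	struct task_sketch *t = task_alloc(nr_helpers, &freed);

	if (!t)
		return -1;

	task_put(t);		/* main thread returns without waiting */
	puts++;

	for (int i = 0; i < nr_helpers && !freed; i++) {
		task_put(t);	/* a helper finishes asynchronously */
		puts++;
	}
	return freed ? puts : -1;
}
```

With three helpers, the object outlives the main thread's put and is
freed only on the fourth put, when the last helper finishes. Because no
holder has to wait for another, the main thread never spins on a helper
that cannot be scheduled.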
Thanks to Alejandro Jimenez for his assistance with this bug.
Orabug: 30835752
Reported-by: Gerald Gibson <[email protected]>
Signed-off-by: Daniel Jordan <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>