
Commit 95913d9

Peter Zijlstra authored and Ingo Molnar committed
sched/core: Fix TASK_DEAD race in finish_task_switch()
So the problem this patch is trying to address is as follows:

        CPU0                            CPU1

        context_switch(A, B)
                                        ttwu(A)
                                          LOCK A->pi_lock
                                          A->on_cpu == 0
        finish_task_switch(A)
          prev_state = A->state  <-.
          WMB                      |
          A->on_cpu = 0;           |
          UNLOCK rq0->lock         |
                                   |    context_switch(C, A)
                                   `--  A->state = TASK_DEAD
                                          prev_state == TASK_DEAD
                                            put_task_struct(A)
                        context_switch(A, C)
                          finish_task_switch(A)
                            A->state == TASK_DEAD
                              put_task_struct(A)

The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.

Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: <[email protected]> # v3.1+
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Fixes: e4a52bc ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
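To make the ordering argument concrete, here is a minimal userspace model of the fix, using C11 atomics and pthreads rather than the kernel's barrier macros. The names on_cpu, task_state, prev_state, cpu0() and cpu1() are stand-ins for A->on_cpu, A->state and the two CPUs in the diagram; this is a sketch of the release/acquire pairing, not kernel code. The release store plays the role of smp_store_release(&A->on_cpu, 0), and the acquire spin plays the role of the ->on_cpu wait plus smp_rmb() in try_to_wake_up(). With this pairing the "CPU0" thread can never sample the DEAD state.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Stand-ins for A->on_cpu and A->state (1 models TASK_DEAD). */
static atomic_int on_cpu = 1;
static atomic_int task_state = 0;
static int prev_state = -1;

/* Models CPU0: finish_task_switch(A). */
static void *cpu0(void *arg)
{
	/* Sample A->state first ... */
	prev_state = atomic_load_explicit(&task_state, memory_order_relaxed);
	/*
	 * ... then publish "switch finished" with release semantics, the
	 * userspace analogue of smp_store_release(&A->on_cpu, 0).  A plain
	 * store-store barrier (smp_wmb) would not keep the load above from
	 * drifting below this store, which is exactly the bug being fixed.
	 */
	atomic_store_explicit(&on_cpu, 0, memory_order_release);
	return NULL;
}

/* Models CPU1: ttwu(A), then A running again and marking itself DEAD. */
static void *cpu1(void *arg)
{
	/* Wait for the previous switch to finish; acquire pairs with release. */
	while (atomic_load_explicit(&on_cpu, memory_order_acquire))
		;
	/* Only now can A run again and set TASK_DEAD. */
	atomic_store_explicit(&task_state, 1, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	/* The release/acquire pairing guarantees CPU0 never sampled DEAD. */
	printf("prev_state = %d (always 0, never 1)\n", prev_state);
	return 0;
}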
1 parent 049e6dd commit 95913d9

2 files changed: +8, -7 lines

kernel/sched/core.c

Lines changed: 5 additions & 5 deletions
@@ -2517,11 +2517,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
 	 * schedule one last time. The schedule call will never return, and
 	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 * Manfred Spraul <[email protected]>
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could race with its RUNNING -> DEAD
+	 * transition, resulting in a double drop.
 	 */
 	prev_state = prev->state;
 	vtime_task_switch(prev);
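The hunk above only changes the comment; the load it documents is the prev_state = prev->state in the last context lines. For orientation, the consumer of that sampled value a little further down in finish_task_switch() looks roughly like this (paraphrased from memory of this kernel version, not part of the diff); it is where the double put_task_struct() in the race diagram would come from:

	if (unlikely(prev_state == TASK_DEAD)) {
		if (prev->sched_class->task_dead)
			prev->sched_class->task_dead(prev);

		/*
		 * If prev_state were sampled after the remote CPU's store of
		 * TASK_DEAD became visible, both this invocation and the one
		 * after A really dies would take this branch and drop the
		 * task reference twice.
		 */
		put_task_struct(prev);
	}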

kernel/sched/sched.h

Lines changed: 3 additions & 2 deletions
@@ -1078,9 +1078,10 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
-	smp_wmb();
-	prev->on_cpu = 0;
+	smp_store_release(&prev->on_cpu, 0);
 #endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
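For the other half of the pairing named in the new comment, the waiter side in try_to_wake_up() at this point in time looks roughly as follows (paraphrased, not part of this diff): the spin on ->on_cpu is the control dependency and the smp_rmb() is the read barrier that the release store in finish_lock_switch() now pairs with.

	/*
	 * If the owning (remote) CPU is still in the middle of schedule()
	 * with this task as prev, wait until it is done referencing the task.
	 */
	while (p->on_cpu)
		cpu_relax();
	/*
	 * Together with the release store of ->on_cpu in finish_lock_switch(),
	 * this ensures the waker cannot let the task run (and set TASK_DEAD)
	 * before the old CPU has finished sampling prev->state.
	 */
	smp_rmb();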
