Skip to content

Commit 806486c

Browse files
gormanmIngo Molnar
authored andcommitted
sched/fair: Do not migrate if the prev_cpu is idle
wake_affine_idle() prefers to move a task to the current CPU if the wakeup is due to an interrupt. The expectation is that the interrupt data is cache hot and relevant to the waking task as well as avoiding a search. However, there is no way to determine if there was cache hot data on the previous CPU that may exceed the interrupt data. Furthermore, round-robin delivery of interrupts can migrate tasks around a socket where each CPU is under-utilised. This can interact badly with cpufreq which makes decisions based on per-cpu data. It has been observed on machines with HWP that p-states are not boosted to their maximum levels even though the workload is latency and throughput sensitive. This patch uses the previous CPU for the task if it's idle and cache-affine with the current CPU even if the current CPU is idle due to the wakup being related to the interrupt. This reduces migrations at the cost of the interrupt data not being cache hot when the task wakes. A variety of workloads were tested on various machines and no adverse impact was noticed that was outside noise. dbench on ext4 on UMA showed roughly 10% reduction in the number of CPU migrations and it is a case where interrupts are frequent for IO competions. In most cases, the difference in performance is quite small but variability is often reduced. For example, this is the result for pgbench running on a UMA machine with different numbers of clients. 4.15.0-rc9 4.15.0-rc9 baseline waprev-v1 Hmean 1 22096.28 ( 0.00%) 22734.86 ( 2.89%) Hmean 4 74633.42 ( 0.00%) 75496.77 ( 1.16%) Hmean 7 115017.50 ( 0.00%) 113030.81 ( -1.73%) Hmean 12 126209.63 ( 0.00%) 126613.40 ( 0.32%) Hmean 16 131886.91 ( 0.00%) 130844.35 ( -0.79%) Stddev 1 636.38 ( 0.00%) 417.11 ( 34.46%) Stddev 4 614.64 ( 0.00%) 583.24 ( 5.11%) Stddev 7 542.46 ( 0.00%) 435.45 ( 19.73%) Stddev 12 173.93 ( 0.00%) 171.50 ( 1.40%) Stddev 16 671.42 ( 0.00%) 680.30 ( -1.32%) CoeffVar 1 2.88 ( 0.00%) 1.83 ( 36.26%) Note that the different in performance is marginal but for low utilisation, there is less variability. Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matt Fleming <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
1 parent 3b76c4a commit 806486c

File tree

1 file changed

+7
-1
lines changed

1 file changed

+7
-1
lines changed

kernel/sched/fair.c

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5700,9 +5700,15 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
57005700
* context. Only allow the move if cache is shared. Otherwise an
57015701
* interrupt intensive workload could force all tasks onto one
57025702
* node depending on the IO topology or IRQ affinity settings.
5703+
*
5704+
* If the prev_cpu is idle and cache affine then avoid a migration.
5705+
* There is no guarantee that the cache hot data from an interrupt
5706+
* is more important than cache hot data on the prev_cpu and from
5707+
* a cpufreq perspective, it's better to have higher utilisation
5708+
* on one CPU.
57035709
*/
57045710
if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
5705-
return this_cpu;
5711+
return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
57065712

57075713
if (sync && cpu_rq(this_cpu)->nr_running == 1)
57085714
return this_cpu;

0 commit comments

Comments
 (0)