|
1 |
| -Intel P-state driver |
| 1 | +Intel P-State driver |
2 | 2 | --------------------
|
3 | 3 |
|
4 |
| -This driver provides an interface to control the P state selection for |
5 |
| -SandyBridge+ Intel processors. The driver can operate two different |
6 |
| -modes based on the processor model, legacy mode and Hardware P state (HWP) |
7 |
| -mode. |
8 |
| - |
9 |
| -In legacy mode, the Intel P-state implements two internal governors, |
10 |
| -performance and powersave, that differ from the general cpufreq governors of |
11 |
| -the same name (the general cpufreq governors implement target(), whereas the |
12 |
| -internal Intel P-state governors implement setpolicy()). The internal |
13 |
| -performance governor sets the max_perf_pct and min_perf_pct to 100; that is, |
14 |
| -the governor selects the highest available P state to maximize the performance |
15 |
| -of the core. The internal powersave governor selects the appropriate P state |
16 |
| -based on the current load on the CPU. |
17 |
| - |
18 |
| -In HWP mode P state selection is implemented in the processor |
19 |
| -itself. The driver provides the interfaces between the cpufreq core and |
20 |
| -the processor to control P state selection based on user preferences |
21 |
| -and reporting frequency to the cpufreq core. In this mode the |
22 |
| -internal Intel P-state governor code is disabled. |
23 |
| - |
24 |
| -In addition to the interfaces provided by the cpufreq core for |
25 |
| -controlling frequency the driver provides sysfs files for |
26 |
| -controlling P state selection. These files have been added to |
27 |
| -/sys/devices/system/cpu/intel_pstate/ |
28 |
| - |
29 |
| - max_perf_pct: limits the maximum P state that will be requested by |
30 |
| - the driver stated as a percentage of the available performance. The |
31 |
| - available (P states) performance may be reduced by the no_turbo |
| 4 | +This driver provides an interface to control the P-State selection for the |
| 5 | +SandyBridge+ Intel processors. |
| 6 | + |
| 7 | +The following document explains P-States: |
| 8 | +http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf |
| 9 | +As stated in the document, P-State doesn’t exactly mean a frequency. However, for |
| 10 | +the sake of the relationship with cpufreq, P-State and frequency are used |
| 11 | +interchangeably. |
| 12 | + |
| 13 | +Understanding the cpufreq core governors and policies is important before |
| 14 | +discussing more details about the Intel P-State driver. Based on what callbacks |
| 15 | +a cpufreq driver provides to the cpufreq core, it can support two types of |
| 16 | +drivers: |
| 17 | +- with target_index() callback: In this mode, the drivers using cpufreq core |
| 18 | +simply provide the minimum and maximum frequency limits and an additional |
| 19 | +interface target_index() to set the current frequency. The cpufreq subsystem |
| 20 | +has a number of scaling governors ("performance", "powersave", "ondemand", |
| 21 | +etc.). Depending on which governor is in use, cpufreq core will call for |
| 22 | +transitions to a specific frequency using target_index() callback. |
| 23 | +- with setpolicy() callback: In this mode, drivers do not provide target_index() |
| 24 | +callback, so cpufreq core can't request a transition to a specific frequency. |
| 25 | +The driver provides minimum and maximum frequency limits and callbacks to set a |
| 26 | +policy. The policy in cpufreq sysfs is referred to as the "scaling governor". |
| 27 | +The cpufreq core can request the driver to operate in either of two policies: |
| 28 | +"performance" and "powersave". The driver decides which frequency to use based |
| 29 | +on the above policy selection considering minimum and maximum frequency limits. |
| 30 | + |
| 31 | +The Intel P-State driver falls under the latter category, which implements the |
| 32 | +setpolicy() callback. This driver decides what P-State to use based on the |
| 33 | +requested policy from the cpufreq core. If the processor is capable of |
| 34 | +selecting its next P-State internally, then the driver will offload this |
| 35 | +responsibility to the processor (aka HWP: Hardware P-States). If not, the |
| 36 | +driver implements algorithms to select the next P-State. |
| 37 | + |
| 38 | +Since these policies are implemented in the driver, they are not the same as the |
| 39 | +cpufreq scaling governors implementation, even if they have the same name in |
| 40 | +the cpufreq sysfs (scaling_governor). For example, the "performance" policy is |
| 41 | +similar to cpufreq’s "performance" governor, but "powersave" is completely |
| 42 | +different from the cpufreq "powersave" governor. The strategy here is similar |
| 43 | +to cpufreq "ondemand", where the requested P-State is related to the system load. |
| 44 | + |
| 45 | +Sysfs Interface |
| 46 | + |
| 47 | +In addition to the frequency-controlling interfaces provided by the cpufreq |
| 48 | +core, the driver provides its own sysfs files to control the P-State selection. |
| 49 | +These files have been added to /sys/devices/system/cpu/intel_pstate/. |
| 50 | +Any changes made to these files are applicable to all CPUs (even in a |
| 51 | +multi-package system). |
| 52 | + |
| 53 | + max_perf_pct: Limits the maximum P-State that will be requested by |
| 54 | + the driver. The limit is stated as a percentage of the available performance. The |
| 55 | + available (P-State) performance may be reduced by the no_turbo |
32 | 56 | setting described below.
|
33 | 57 |
|
34 |
| - min_perf_pct: limits the minimum P state that will be requested by |
35 |
| - the driver stated as a percentage of the max (non-turbo) |
| 58 | + min_perf_pct: Limits the minimum P-State that will be requested by |
| 59 | + the driver. The limit is stated as a percentage of the max (non-turbo) |
36 | 60 | performance level.
|
37 | 61 |
|
38 |
| - no_turbo: limits the driver to selecting P states below the turbo |
| 62 | + no_turbo: Limits the driver to selecting P-States below the turbo |
39 | 63 | frequency range.
|
40 | 64 |
|
41 |
| - turbo_pct: displays the percentage of the total performance that |
42 |
| - is supported by hardware that is in the turbo range. This number |
| 65 | + turbo_pct: Displays the percentage of the total performance that |
| 66 | + is supported by hardware that is in the turbo range. This number |
43 | 67 | is independent of whether turbo has been disabled or not.
|
44 | 68 |
|
45 |
| - num_pstates: displays the number of pstates that are supported |
46 |
| - by hardware. This number is independent of whether turbo has |
| 69 | + num_pstates: Displays the number of P-States that are supported |
| 70 | + by hardware. This number is independent of whether turbo has |
47 | 71 | been disabled or not.
|
48 | 72 |
|
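The attribute files listed above can be read like any other sysfs file. The following is a minimal sketch, assuming each file holds a single integer; the helper name `read_pstate_attrs` is hypothetical and not part of the driver or of any library. The directory and attribute names are the ones given in the text.

```python
from pathlib import Path

# Directory and attribute names as given in the documentation text.
INTEL_PSTATE_DIR = "/sys/devices/system/cpu/intel_pstate"
ATTRIBUTES = ("max_perf_pct", "min_perf_pct", "no_turbo", "turbo_pct", "num_pstates")


def read_pstate_attrs(base_dir=INTEL_PSTATE_DIR):
    """Return {attribute: int} for every attribute file found under base_dir.

    Hypothetical helper: assumes each file contains one integer value.
    Attributes that are not present (for example on a system that does not
    use this driver) are simply skipped.
    """
    values = {}
    for name in ATTRIBUTES:
        path = Path(base_dir) / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


if __name__ == "__main__":
    for name, value in read_pstate_attrs().items():
        print(f"{name}: {value}")
```

Passing a different `base_dir` makes it possible to try the helper against a directory of sample files instead of the live system.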
| 73 | +For example, if a system has these parameters: |
| 74 | + Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State) |
| 75 | + Max non turbo ratio: 0x17 |
| 76 | + Minimum ratio : 0x08 (Here the ratio is called max efficiency ratio) |
| 77 | + |
| 78 | +Sysfs will show : |
| 79 | + max_perf_pct:100, which corresponds to 1 core ratio |
| 80 | + min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio |
| 81 | + no_turbo:0, turbo is not disabled |
| 82 | + num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1) |
| 83 | + turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates |
| 84 | + |
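The arithmetic behind the example values can be checked with a short sketch. The function name and the rounding directions are assumptions chosen so that the results match the numbers shown above; the text itself does not say how the driver rounds. In particular, (0x21 - 0x17) / 26 is about 38.46%, so the displayed turbo_pct of 39 is reproduced here by rounding up, while min_perf_pct (about 24.24%) is rounded down.

```python
def pstate_sysfs_values(max_turbo_ratio, max_non_turbo_ratio, min_ratio):
    """Reproduce the example sysfs values from the three ratios.

    Hypothetical helper; rounding directions are assumptions that match
    the documented example, not a statement of the driver's code.
    """
    # num_pstates = max 1 core ratio - max efficiency ratio + 1
    num_pstates = max_turbo_ratio - min_ratio + 1

    # min_perf_pct = max efficiency ratio / max 1 core ratio, rounded down
    min_perf_pct = 100 * min_ratio // max_turbo_ratio

    # turbo_pct = (max 1 core ratio - max non turbo ratio) / num_pstates,
    # rounded up using integer ceiling division
    turbo_states = max_turbo_ratio - max_non_turbo_ratio
    turbo_pct = (100 * turbo_states + num_pstates - 1) // num_pstates

    return {
        "min_perf_pct": min_perf_pct,
        "num_pstates": num_pstates,
        "turbo_pct": turbo_pct,
    }


if __name__ == "__main__":
    # Ratios from the example: 0x21, 0x17 and 0x08.
    print(pstate_sysfs_values(0x21, 0x17, 0x08))
    # {'min_perf_pct': 24, 'num_pstates': 26, 'turbo_pct': 39}
```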
| 85 | +Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual |
| 86 | +Volume 3: System Programming Guide" to understand ratios. |
| 87 | + |
| 88 | +cpufreq sysfs for Intel P-State |
| 89 | + |
| 90 | +Since this driver registers with cpufreq, cpufreq sysfs is also presented. |
| 91 | +There are some important differences, which need to be considered. |
| 92 | + |
| 93 | +scaling_cur_freq: This displays the real frequency which was used during |
| 94 | +the last sample period instead of what is requested. Some other cpufreq drivers, |
| 95 | +like acpi-cpufreq, display what is requested (some changes are on the |
| 96 | +way to fix this for the acpi-cpufreq driver). The same is true for frequencies |
| 97 | +displayed in /proc/cpuinfo. |
| 98 | + |
| 99 | +scaling_governor: This displays the currently active policy. Since each CPU has a |
| 100 | +cpufreq sysfs, it is possible to set a scaling governor for each CPU. But this |
| 101 | +is not possible with Intel P-States, as there is one common policy for all |
| 102 | +CPUs. Here, the last requested policy will be applicable to all CPUs. It is |
| 103 | +suggested that one use the cpupower utility to change the policy for all CPUs at the |
| 104 | +same time. |
| 105 | + |
| 106 | +scaling_setspeed: This attribute can never be used with Intel P-State. |
| 107 | + |
| 108 | +scaling_max_freq/scaling_min_freq: This interface can be used similarly to |
| 109 | +the max_perf_pct/min_perf_pct of Intel P-State sysfs. However, since frequencies |
| 110 | +are converted to the nearest possible P-State, this is prone to rounding errors. |
| 111 | +This method is not the preferred way to limit performance. |
| 112 | + |
| 113 | +affected_cpus: Not used |
| 114 | +related_cpus: Not used |
| 115 | + |
49 | 116 | For contemporary Intel processors, the frequency is controlled by the
|
50 |
| -processor itself and the P-states exposed to software are related to |
| 117 | +processor itself and the P-State exposed to software is related to |
51 | 118 | performance levels. The idea that frequency can be set to a single
|
52 |
| -frequency is fiction for Intel Core processors. Even if the scaling |
53 |
| -driver selects a single P state the actual frequency the processor |
| 119 | +frequency is fictional for Intel Core processors. Even if the scaling |
| 120 | +driver selects a single P-State, the actual frequency the processor |
54 | 121 | will run at is selected by the processor itself.
|
55 | 122 |
|
56 |
| -For legacy mode debugfs files have also been added to allow tuning of |
57 |
| -the internal governor algorythm. These files are located at |
58 |
| -/sys/kernel/debug/pstate_snb/ These files are NOT present in HWP mode. |
| 123 | +Tuning Intel P-State driver |
| 124 | + |
| 125 | +When HWP mode is not used, debugfs files have also been added to allow the |
| 126 | +tuning of the internal governor algorithm. These files are located at |
| 127 | +/sys/kernel/debug/pstate_snb/. The algorithm uses a PID (Proportional |
| 128 | +Integral Derivative) controller. The PID tunable parameters are: |
59 | 129 |
|
60 | 130 | deadband
|
61 | 131 | d_gain_pct
|
62 | 132 | i_gain_pct
|
63 | 133 | p_gain_pct
|
64 | 134 | sample_rate_ms
|
65 | 135 | setpoint
|
| 136 | + |
| 137 | +To adjust these parameters, some understanding of the driver implementation is |
| 138 | +necessary. There are some tweaks described here, but be very careful. Adjusting |
| 139 | +them requires an expert-level understanding of the power and performance relationship. |
| 140 | +These parameters are only useful when the "powersave" policy is active. |
| 141 | + |
| 142 | +- To make the system more responsive to load changes, sample_rate_ms can |
| 143 | +be adjusted (current default is 10ms). |
| 144 | +- To make the system use higher performance, even if the load is lower, setpoint |
| 145 | +can be adjusted to a lower number. This will also lead to a shorter ramp-up time |
| 146 | +to reach the maximum P-State. |
| 147 | +If there are no derivative and integral coefficients, the next P-State will be |
| 148 | +equal to: |
| 149 | + current P-State - ((setpoint - current cpu load) * p_gain_pct) |
| 150 | + |
| 151 | +For example, if the current PID parameters are (which are the defaults for Core |
| 152 | +processors like SandyBridge): |
| 153 | + deadband = 0 |
| 154 | + d_gain_pct = 0 |
| 155 | + i_gain_pct = 0 |
| 156 | + p_gain_pct = 20 |
| 157 | + sample_rate_ms = 10 |
| 158 | + setpoint = 97 |
| 159 | + |
| 160 | +If the current P-State = 0x08 and current load = 100, this will result in the |
| 161 | +next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State |
| 162 | +goes up by only 1. If, during the next sample interval, the current load doesn't |
| 163 | +change and is still 100, then the P-State goes up by one again. This process will |
| 164 | +continue as long as the load is more than the setpoint until the maximum P-State |
| 165 | +is reached. |
| 166 | + |
| 167 | +For the same load at setpoint = 60, this will result in the next P-State |
| 168 | += 0x08 - ((60 - 100) * 0.2) = 16 |
| 169 | +So by changing the setpoint from 97 to 60, there is an increase of the |
| 170 | +next P-State from 9 to 16. This will make the processor execute at a higher |
| 171 | +P-State for the same CPU load. If the load continues to be more than the |
| 172 | +setpoint during the next sample intervals, then the P-State will go up again until the |
| 173 | +maximum P-State is reached. But the ramp-up time to reach the maximum P-State |
| 174 | +will be much shorter when the setpoint is 60 compared to 97. |
| 175 | + |
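The proportional-only formula and the two worked examples above can be sketched as follows. This is an illustration under stated assumptions, not the driver's implementation: the function names are hypothetical, the result is rounded with Python's `round()`, and the maximum P-State of 0x21 is borrowed from the earlier sysfs example.

```python
def next_pstate(current, load, setpoint, p_gain_pct, max_pstate=None):
    """Next P-State with no derivative or integral term.

    Implements: current - ((setpoint - load) * p_gain_pct / 100),
    rounded to the nearest integer. Rounding mode and the optional
    clamp to max_pstate are assumptions made for this sketch.
    """
    raw = current - (setpoint - load) * p_gain_pct / 100
    result = round(raw)
    if max_pstate is not None:
        result = min(result, max_pstate)
    return result


def samples_to_reach(max_pstate, start, load, setpoint, p_gain_pct):
    """Count the sample intervals needed to climb from start to max_pstate
    while the load stays constant."""
    pstate, samples = start, 0
    while pstate < max_pstate:
        new_pstate = next_pstate(pstate, load, setpoint, p_gain_pct, max_pstate)
        if new_pstate <= pstate:
            raise ValueError("P-State does not increase at this load")
        pstate = new_pstate
        samples += 1
    return samples


if __name__ == "__main__":
    print(next_pstate(0x08, load=100, setpoint=97, p_gain_pct=20))  # 9
    print(next_pstate(0x08, load=100, setpoint=60, p_gain_pct=20))  # 16
    print(samples_to_reach(0x21, 0x08, 100, 97, 20))  # 25
    print(samples_to_reach(0x21, 0x08, 100, 60, 20))  # 4
```

With sample_rate_ms = 10, these counts would correspond to roughly 250 ms versus 40 ms to reach the maximum P-State under a constant load of 100, which illustrates why a lower setpoint shortens the ramp-up time.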
| 176 | +Debugging Intel P-State driver |
| 177 | + |
| 178 | +Event tracing |
| 179 | +To debug P-State transitions, the Linux event tracing interface can be used. |
| 180 | +There are two specific events that can be enabled (provided the kernel |
| 181 | +configs related to event tracing are enabled). |
| 182 | + |
| 183 | +# cd /sys/kernel/debug/tracing/ |
| 184 | +# echo 1 > events/power/pstate_sample/enable |
| 185 | +# echo 1 > events/power/cpu_frequency/enable |
| 186 | +# cat trace |
| 187 | +gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 |
| 188 | + scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 |
| 189 | + freq=2474476 |
| 190 | +cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2 |
| 191 | + |
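Trace output like the sample above consists of key=value pairs, so it is easy to post-process. The sketch below uses a hypothetical helper, `parse_trace_fields`, that extracts every integer field from one trace record. As a sanity check, it also shows that in this particular sample 100 * aperf / mperf, rounded down, equals the reported core_busy; treating that ratio as the definition of core_busy is an assumption, not something stated in the text.

```python
import re

# Matches fields of the form name=12345 as they appear in the trace output.
_FIELD = re.compile(r"(\w+)=(\d+)")


def parse_trace_fields(record):
    """Return {field: int} for every key=value pair in a trace record.

    Hypothetical helper; the record may span several lines, as the
    wrapped sample in the text does.
    """
    return {key: int(value) for key, value in _FIELD.findall(record)}


SAMPLE = (
    "gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 "
    "scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 "
    "freq=2474476"
)

if __name__ == "__main__":
    fields = parse_trace_fields(SAMPLE)
    print(fields["from"], fields["to"], fields["freq"])
    # Ratio of aperf to mperf as a percentage, rounded down.
    print(100 * fields["aperf"] // fields["mperf"])
```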
| 192 | + |
| 193 | +Using ftrace |
| 194 | + |
| 195 | +If function-level tracing is required, the Linux ftrace interface can be used. |
| 196 | +For example, to check how often the function that sets a P-State is |
| 197 | +called, we can set the ftrace filter to intel_pstate_set_pstate. |
| 198 | + |
| 199 | +# cd /sys/kernel/debug/tracing/ |
| 200 | +# cat available_filter_functions | grep -i pstate |
| 201 | +intel_pstate_set_pstate |
| 202 | +intel_pstate_cpu_init |
| 203 | +... |
| 204 | + |
| 205 | +# echo intel_pstate_set_pstate > set_ftrace_filter |
| 206 | +# echo function > current_tracer |
| 207 | +# cat trace | head -15 |
| 208 | +# tracer: function |
| 209 | +# |
| 210 | +# entries-in-buffer/entries-written: 80/80 #P:4 |
| 211 | +# |
| 212 | +# _-----=> irqs-off |
| 213 | +# / _----=> need-resched |
| 214 | +# | / _---=> hardirq/softirq |
| 215 | +# || / _--=> preempt-depth |
| 216 | +# ||| / delay |
| 217 | +# TASK-PID CPU# |||| TIMESTAMP FUNCTION |
| 218 | +# | | | |||| | | |
| 219 | + Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func |
| 220 | + gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func |
| 221 | + gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func |
| 222 | + <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func |