Commit b366f97

Merge branch 'pm-cpufreq'
* pm-cpufreq: (30 commits)
  Documentation: cpufreq: intel_pstate: enhance documentation
  cpufreq-dt: fix handling regulator_get_voltage() result
  cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC
  cpufreq: mt8173: migrate to use operating-points-v2 bindings
  cpufreq: Simplify core code related to boost support
  cpufreq: acpi-cpufreq: Simplify boost-related code
  cpufreq: Make cpufreq_boost_supported() static
  blackfin-cpufreq: Mark cpu_set_cclk() as static
  blackfin-cpufreq: Change return type of cpu_set_cclk() to that of clk_set_rate()
  dt: cpufreq: st: Provide bindings for ST's CPUFreq implementation
  cpufreq: st: Provide runtime initialised driver for ST's platforms
  cpufreq: mt8173: Move resources allocation into ->probe()
  cpufreq: intel_pstate: Account for IO wait time
  cpufreq: intel_pstate: Account for non C0 time
  cpufreq: intel_pstate: Configurable algorithm to get target pstate
  cpufreq: mt8173: check return value of regulator_get_voltage() call
  cpufreq: mt8173: remove redundant regulator_get_voltage() call
  cpufreq: mt8173: add CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag
  cpufreq: qoriq: Register cooling device based on device tree
  cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency
  ...
2 parents 7f4a370 + a032d2d commit b366f97

21 files changed, +1014 -226 lines changed
Lines changed: 199 additions & 42 deletions
@@ -1,65 +1,222 @@
1-
Intel P-state driver
1+
Intel P-State driver
22
--------------------
33

4-
This driver provides an interface to control the P state selection for
5-
SandyBridge+ Intel processors. The driver can operate two different
6-
modes based on the processor model, legacy mode and Hardware P state (HWP)
7-
mode.
8-
9-
In legacy mode, the Intel P-state implements two internal governors,
10-
performance and powersave, that differ from the general cpufreq governors of
11-
the same name (the general cpufreq governors implement target(), whereas the
12-
internal Intel P-state governors implement setpolicy()). The internal
13-
performance governor sets the max_perf_pct and min_perf_pct to 100; that is,
14-
the governor selects the highest available P state to maximize the performance
15-
of the core. The internal powersave governor selects the appropriate P state
16-
based on the current load on the CPU.
17-
18-
In HWP mode P state selection is implemented in the processor
19-
itself. The driver provides the interfaces between the cpufreq core and
20-
the processor to control P state selection based on user preferences
21-
and reporting frequency to the cpufreq core. In this mode the
22-
internal Intel P-state governor code is disabled.
23-
24-
In addition to the interfaces provided by the cpufreq core for
25-
controlling frequency the driver provides sysfs files for
26-
controlling P state selection. These files have been added to
27-
/sys/devices/system/cpu/intel_pstate/
28-
29-
max_perf_pct: limits the maximum P state that will be requested by
30-
the driver stated as a percentage of the available performance. The
31-
available (P states) performance may be reduced by the no_turbo
4+
This driver provides an interface to control the P-State selection for the
5+
SandyBridge+ Intel processors.
6+
7+
The following document explains P-States:
8+
http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
9+
As stated in the document, P-State doesn’t exactly mean a frequency. However, for
10+
the sake of the relationship with cpufreq, P-State and frequency are used
11+
interchangeably.
12+
13+
Understanding the cpufreq core governors and policies is important before
14+
discussing more details about the Intel P-State driver. Based on what callbacks
15+
a cpufreq driver provides to the cpufreq core, it can support two types of
16+
drivers:
17+
- with target_index() callback: In this mode, the drivers using cpufreq core
18+
simply provide the minimum and maximum frequency limits and an additional
19+
interface target_index() to set the current frequency. The cpufreq subsystem
20+
has a number of scaling governors ("performance", "powersave", "ondemand",
21+
etc.). Depending on which governor is in use, cpufreq core will call for
22+
transitions to a specific frequency using target_index() callback.
23+
- setpolicy() callback: In this mode, drivers do not provide target_index()
24+
callback, so cpufreq core can't request a transition to a specific frequency.
25+
The driver provides minimum and maximum frequency limits and callbacks to set a
26+
policy. The policy in cpufreq sysfs is referred to as the "scaling governor".
27+
The cpufreq core can request the driver to operate in either of two policies:
28+
"performance" and "powersave". The driver decides which frequency to use based
29+
on the above policy selection considering minimum and maximum frequency limits.
30+
31+
The Intel P-State driver falls under the latter category, which implements the
32+
setpolicy() callback. This driver decides what P-State to use based on the
33+
requested policy from the cpufreq core. If the processor is capable of
34+
selecting its next P-State internally, then the driver will offload this
35+
responsibility to the processor (aka HWP: Hardware P-States). If not, the
36+
driver implements algorithms to select the next P-State.
37+
38+
Since these policies are implemented in the driver, they are not the same as the
39+
cpufreq scaling governors implementation, even if they have the same name in
40+
the cpufreq sysfs (scaling_governors). For example the "performance" policy is
41+
similar to cpufreq’s "performance" governor, but "powersave" is completely
42+
different than the cpufreq "powersave" governor. The strategy here is similar
43+
to cpufreq "ondemand", where the requested P-State is related to the system load.
44+
45+
Sysfs Interface
46+
47+
In addition to the frequency-controlling interfaces provided by the cpufreq
48+
core, the driver provides its own sysfs files to control the P-State selection.
49+
These files have been added to /sys/devices/system/cpu/intel_pstate/.
50+
Any changes made to these files are applicable to all CPUs (even in a
51+
multi-package system).
52+
53+
max_perf_pct: Limits the maximum P-State that will be requested by
54+
the driver, expressed as a percentage of the available performance. The
55+
available (P-State) performance may be reduced by the no_turbo
3256
setting described below.
3357

34-
min_perf_pct: limits the minimum P state that will be requested by
35-
the driver stated as a percentage of the max (non-turbo)
58+
min_perf_pct: Limits the minimum P-State that will be requested by
59+
the driver, expressed as a percentage of the max (non-turbo)
3660
performance level.
3761

38-
no_turbo: limits the driver to selecting P states below the turbo
62+
no_turbo: Limits the driver to selecting P-States below the turbo
3963
frequency range.
4064

41-
turbo_pct: displays the percentage of the total performance that
42-
is supported by hardware that is in the turbo range. This number
65+
turbo_pct: Displays the percentage of the total performance that
66+
is supported by hardware that is in the turbo range. This number
4367
is independent of whether turbo has been disabled or not.
4468

45-
num_pstates: displays the number of pstates that are supported
46-
by hardware. This number is independent of whether turbo has
69+
num_pstates: Displays the number of P-States that are supported
70+
by hardware. This number is independent of whether turbo has
4771
been disabled or not.
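
A minimal sketch of reading these attributes from user space, assuming a Python interpreter is available on the target. The helper name read_intel_pstate_attrs is hypothetical; the directory and attribute names are the ones listed above.

```python
import os

# Directory and attribute names as given in this document.
INTEL_PSTATE_DIR = "/sys/devices/system/cpu/intel_pstate"
ATTRS = ("max_perf_pct", "min_perf_pct", "no_turbo", "turbo_pct", "num_pstates")


def read_intel_pstate_attrs(base_dir=INTEL_PSTATE_DIR):
    """Return {attribute: int} for every attribute file readable in base_dir."""
    values = {}
    for name in ATTRS:
        path = os.path.join(base_dir, name)
        try:
            with open(path) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Missing, unreadable, or non-numeric attribute: skip it.
            continue
    return values


if __name__ == "__main__":
    for name, value in read_intel_pstate_attrs().items():
        print(f"{name}: {value}")
```

On a system where the driver is not loaded, the directory does not exist and the helper simply returns an empty dictionary.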
4872

73+
For example, if a system has these parameters:
74+
Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State)
75+
Max non turbo ratio: 0x17
76+
Minimum ratio : 0x08 (Here the ratio is called max efficiency ratio)
77+
78+
Sysfs will show :
79+
max_perf_pct:100, which corresponds to 1 core ratio
80+
min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio
81+
no_turbo:0, turbo is not disabled
82+
num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1)
83+
turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates
84+
85+
Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual
86+
Volume 3: System Programming Guide" to understand ratios.
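
The arithmetic in the example above can be checked with a short sketch. The function name expected_sysfs_values is hypothetical, and the rounding (floor for min_perf_pct, ceiling for turbo_pct) is an assumption chosen only because it reproduces the values shown; the driver's own fixed-point math may differ.

```python
def expected_sysfs_values(turbo_ratio, max_non_turbo_ratio, min_ratio):
    """Derive the sysfs values from the three ratios, per the formulas above."""
    num_pstates = turbo_ratio - min_ratio + 1
    # Floor division: 8 * 100 / 33 = 24.2 -> 24 (rounding is an assumption).
    min_perf_pct = min_ratio * 100 // turbo_ratio
    # Ceiling division: 10 * 100 / 26 = 38.5 -> 39 (rounding is an assumption).
    turbo_states = turbo_ratio - max_non_turbo_ratio
    turbo_pct = -(-turbo_states * 100 // num_pstates)
    return {
        "max_perf_pct": 100,  # upper limit is the max 1 core turbo ratio
        "min_perf_pct": min_perf_pct,
        "num_pstates": num_pstates,
        "turbo_pct": turbo_pct,
    }


if __name__ == "__main__":
    print(expected_sysfs_values(0x21, 0x17, 0x08))
```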
87+
88+
cpufreq sysfs for Intel P-State
89+
90+
Since this driver registers with cpufreq, cpufreq sysfs is also presented.
91+
There are some important differences, which need to be considered.
92+
93+
scaling_cur_freq: This displays the real frequency which was used during
94+
the last sample period instead of what is requested. Some other cpufreq drivers,
95+
like acpi-cpufreq, display what is requested (Some changes are on the
96+
way to fix this for acpi-cpufreq driver). The same is true for frequencies
97+
displayed at /proc/cpuinfo.
98+
99+
scaling_governor: This displays the currently active policy. Since each CPU has a
100+
cpufreq sysfs, it is possible to set a scaling governor to each CPU. But this
101+
is not possible with Intel P-States, as there is one common policy for all
102+
CPUs. Here, the last requested policy will be applicable to all CPUs. It is
103+
suggested that one use the cpupower utility to change policy to all CPUs at the
104+
same time.
105+
106+
scaling_setspeed: This attribute can never be used with Intel P-State.
107+
108+
scaling_max_freq/scaling_min_freq: This interface can be used similarly to
109+
the max_perf_pct/min_perf_pct of Intel P-State sysfs. However since frequencies
110+
are converted to the nearest possible P-State, this is prone to rounding errors.
111+
This method is not preferred to limit performance.
112+
113+
affected_cpus: Not used
114+
related_cpus: Not used
115+
49116
For contemporary Intel processors, the frequency is controlled by the
50-
processor itself and the P-states exposed to software are related to
117+
processor itself and the P-State exposed to software is related to
51118
performance levels. The idea that frequency can be set to a single
52-
frequency is fiction for Intel Core processors. Even if the scaling
53-
driver selects a single P state the actual frequency the processor
119+
frequency is fictional for Intel Core processors. Even if the scaling
120+
driver selects a single P-State, the actual frequency the processor
54121
will run at is selected by the processor itself.
55122

56-
For legacy mode debugfs files have also been added to allow tuning of
57-
the internal governor algorythm. These files are located at
58-
/sys/kernel/debug/pstate_snb/ These files are NOT present in HWP mode.
123+
Tuning Intel P-State driver
124+
125+
When HWP mode is not used, debugfs files have also been added to allow the
126+
tuning of the internal governor algorithm. These files are located at
127+
/sys/kernel/debug/pstate_snb/. The algorithm uses a PID (Proportional
128+
Integral Derivative) controller. The PID tunable parameters are:
59129

60130
deadband
61131
d_gain_pct
62132
i_gain_pct
63133
p_gain_pct
64134
sample_rate_ms
65135
setpoint
136+
137+
To adjust these parameters, some understanding of driver implementation is
138+
necessary. There are some tweaks described here, but be very careful. Adjusting
139+
them requires an expert-level understanding of the power and performance relationship.
140+
These parameters are only useful when the "powersave" policy is active.
141+
142+
- To make the system more responsive to load changes, sample_rate_ms can
143+
be adjusted (current default is 10ms).
144+
- To make the system use higher performance, even if the load is lower, setpoint
145+
can be adjusted to a lower number. This will also lead to faster ramp up time
146+
to reach the maximum P-State.
147+
If there are no derivative and integral coefficients, the next P-State will be
148+
equal to:
149+
current P-State - ((setpoint - current cpu load) * p_gain_pct)
150+
151+
For example, if the current PID parameters are (which are the defaults for Core
152+
processors like SandyBridge):
153+
deadband = 0
154+
d_gain_pct = 0
155+
i_gain_pct = 0
156+
p_gain_pct = 20
157+
sample_rate_ms = 10
158+
setpoint = 97
159+
160+
If the current P-State = 0x08 and current load = 100, this will result in the
161+
next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State
162+
goes up by only 1. If during the next sample interval the current load doesn't
162+
change and is still 100, then the P-State goes up by one again. This process will
164+
continue as long as the load is more than the setpoint until the maximum P-State
165+
is reached.
166+
167+
For the same load at setpoint = 60, this will result in the next P-State
168+
= 0x08 - ((60 - 100) * 0.2) = 16
169+
So by changing the setpoint from 97 to 60, there is an increase of the
170+
next P-State from 9 to 16. So this will make the processor execute at a higher
171+
P-State for the same CPU load. If the load continues to be more than the
172+
setpoint during the next sample intervals, the P-State will go up again until the
173+
maximum P-State is reached. But the ramp up time to reach the maximum P-State
174+
will be much faster when the setpoint is 60 compared to 97.
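
The two setpoint examples above can be reproduced with a short sketch. It assumes only the proportional term is active (as in the defaults listed above), rounds to the nearest integer as the text does for 8.6, and clamps to the ratios from the earlier sysfs example (0x08 to 0x21). The function names are hypothetical, and the driver's own fixed-point arithmetic may differ.

```python
import math


def next_pstate(current, load, setpoint, p_gain_pct=20,
                min_pstate=0x08, max_pstate=0x21):
    """current P-State - ((setpoint - load) * p_gain_pct), rounded and clamped."""
    step = (setpoint - load) * p_gain_pct / 100.0
    target = math.floor(current - step + 0.5)  # round to nearest (assumption)
    return max(min_pstate, min(max_pstate, target))


def samples_to_max(setpoint, load=100, start=0x08, max_pstate=0x21, limit=1000):
    """Count sample periods until max_pstate is reached, or None if never."""
    pstate = start
    for samples in range(1, limit + 1):
        pstate = next_pstate(pstate, load, setpoint, max_pstate=max_pstate)
        if pstate >= max_pstate:
            return samples
    return None


if __name__ == "__main__":
    for setpoint in (97, 60):
        print(setpoint, next_pstate(0x08, 100, setpoint), samples_to_max(setpoint))
```

In this simplified model, a constant load of 100 reaches the maximum P-State after 25 samples at setpoint 97 and after 4 samples at setpoint 60, which is 250 ms versus 40 ms at the default 10 ms sample rate.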
175+
176+
Debugging Intel P-State driver
177+
178+
Event tracing
179+
To debug P-State transitions, the Linux event tracing interface can be used.
180+
There are two specific events, which can be enabled (provided the kernel
181+
configs related to event tracing are enabled).
182+
183+
# cd /sys/kernel/debug/tracing/
184+
# echo 1 > events/power/pstate_sample/enable
185+
# echo 1 > events/power/cpu_frequency/enable
186+
# cat trace
187+
gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107
188+
scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618
189+
freq=2474476
190+
cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2
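
For post-processing, the key=value fields of a pstate_sample line can be pulled out with a few lines of Python. This is only a sketch: the function name parse_pstate_sample is hypothetical, and it assumes the sample is on a single line (the one above is wrapped for display).

```python
import re

# The pstate_sample line from the trace above, joined into one line.
SAMPLE = ("gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107 "
          "scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 "
          "freq=2474476")


def parse_pstate_sample(line):
    """Return the key=value fields of a pstate_sample line as ints, else None."""
    if "pstate_sample:" not in line:
        return None
    payload = line.split("pstate_sample:", 1)[1]
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", payload)}


if __name__ == "__main__":
    fields = parse_pstate_sample(SAMPLE)
    print(fields)
    # For this particular sample, 100 * aperf / mperf truncates to core_busy.
    print(100 * fields["aperf"] // fields["mperf"])
```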
191+
192+
193+
Using ftrace
194+
195+
If function level tracing is required, the Linux ftrace interface can be used.
196+
For example, if we want to check how often a function to set a P-State is
197+
called, we can set the ftrace filter to intel_pstate_set_pstate.
198+
199+
# cd /sys/kernel/debug/tracing/
200+
# cat available_filter_functions | grep -i pstate
201+
intel_pstate_set_pstate
202+
intel_pstate_cpu_init
203+
...
204+
205+
# echo intel_pstate_set_pstate > set_ftrace_filter
206+
# echo function > current_tracer
207+
# cat trace | head -15
208+
# tracer: function
209+
#
210+
# entries-in-buffer/entries-written: 80/80 #P:4
211+
#
212+
# _-----=> irqs-off
213+
# / _----=> need-resched
214+
# | / _---=> hardirq/softirq
215+
# || / _--=> preempt-depth
216+
# ||| / delay
217+
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
218+
# | | | |||| | |
219+
Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
220+
gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
221+
gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
222+
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func

Documentation/cpu-freq/pcc-cpufreq.txt

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -159,8 +159,8 @@ to be strictly associated with a P-state.
159159

160160
2.2 cpuinfo_transition_latency:
161161
-------------------------------
162-
The cpuinfo_transition_latency field is 0. The PCC specification does
163-
not include a field to expose this value currently.
162+
The cpuinfo_transition_latency field is CPUFREQ_ETERNAL. The PCC specification
163+
does not include a field to expose this value currently.
164164

165165
2.3 cpuinfo_cur_freq:
166166
---------------------

Documentation/devicetree/bindings/arm/cpus.txt

Lines changed: 17 additions & 0 deletions
@@ -242,6 +242,23 @@ nodes to be present and contain the properties described below.
242242
Definition: Specifies the syscon node controlling the cpu core
243243
power domains.
244244

245+
- dynamic-power-coefficient
246+
Usage: optional
247+
Value type: <prop-encoded-array>
248+
Definition: A u32 value that represents the running time dynamic
249+
power coefficient in units of mW/MHz/uVolt^2. The
250+
coefficient can either be calculated from power
251+
measurements or derived by analysis.
252+
253+
The dynamic power consumption of the CPU is
254+
proportional to the square of the Voltage (V) and
255+
the clock frequency (f). The coefficient is used to
256+
calculate the dynamic power as below -
257+
258+
Pdyn = dynamic-power-coefficient * V^2 * f
259+
260+
where voltage is in uV, frequency is in MHz.
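
A minimal sketch of the formula, using the units exactly as stated in the definition above. The function name and the input values are hypothetical; the point is only the scaling behaviour (linear in f, quadratic in V).

```python
def dynamic_power(coefficient, voltage_uv, freq_mhz):
    """Pdyn = dynamic-power-coefficient * V^2 * f, units as in the binding text."""
    return coefficient * voltage_uv ** 2 * freq_mhz


if __name__ == "__main__":
    base = dynamic_power(1, 1000000, 1000)  # hypothetical inputs
    # Doubling the frequency doubles Pdyn; doubling the voltage quadruples it.
    print(dynamic_power(1, 1000000, 2000) // base)
    print(dynamic_power(1, 2000000, 1000) // base)
```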
261+
245262
Example 1 (dual-cluster big.LITTLE system 32-bit):
246263

247264
cpus {
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
1+
Binding for ST's CPUFreq driver
2+
===============================
3+
4+
ST's CPUFreq driver attempts to read 'process' and 'version' attributes
5+
from the SoC, then supplies the OPP framework with 'prop' and 'supported
6+
hardware' information respectively. The framework is then able to read
7+
the DT and operate in the usual way.
8+
9+
For more information about the expected DT format [See: ../opp/opp.txt].
10+
11+
Frequency Scaling only
12+
----------------------
13+
14+
No vendor-specific driver is required for this.
15+
16+
Located in CPU's node:
17+
18+
- operating-points : [See: ../power/opp.txt]
19+
20+
Example [safe]
21+
--------------
22+
23+
cpus {
24+
cpu@0 {
25+
/* kHz uV */
26+
operating-points = <1500000 0
27+
1200000 0
28+
800000 0
29+
500000 0>;
30+
};
31+
};
32+
33+
Dynamic Voltage and Frequency Scaling (DVFS)
34+
--------------------------------------------
35+
36+
This requires the ST CPUFreq driver to supply 'process' and 'version' info.
37+
38+
Located in CPU's node:
39+
40+
- operating-points-v2 : [See: ../power/opp.txt]
41+
42+
Example [unsafe]
43+
----------------
44+
45+
cpus {
46+
cpu@0 {
47+
operating-points-v2 = <&cpu0_opp_table>;
48+
};
49+
};
50+
51+
cpu0_opp_table: opp_table {
52+
compatible = "operating-points-v2";
53+
54+
/* ############################################################### */
55+
/* # WARNING: Do not attempt to copy/replicate these nodes, # */
56+
/* # they are only to be supplied by the bootloader !!! # */
57+
/* ############################################################### */
58+
opp0 {
59+
/* Major Minor Substrate */
60+
/* 2 all all */
61+
opp-supported-hw = <0x00000004 0xffffffff 0xffffffff>;
62+
opp-hz = /bits/ 64 <1500000000>;
63+
clock-latency-ns = <10000000>;
64+
65+
opp-microvolt-pcode0 = <1200000>;
66+
opp-microvolt-pcode1 = <1200000>;
67+
opp-microvolt-pcode2 = <1200000>;
68+
opp-microvolt-pcode3 = <1200000>;
69+
opp-microvolt-pcode4 = <1170000>;
70+
opp-microvolt-pcode5 = <1140000>;
71+
opp-microvolt-pcode6 = <1100000>;
72+
opp-microvolt-pcode7 = <1070000>;
73+
};
74+
75+
opp1 {
76+
/* Major Minor Substrate */
77+
/* all all all */
78+
opp-supported-hw = <0xffffffff 0xffffffff 0xffffffff>;
79+
opp-hz = /bits/ 64 <1200000000>;
80+
clock-latency-ns = <10000000>;
81+
82+
opp-microvolt-pcode0 = <1110000>;
83+
opp-microvolt-pcode1 = <1150000>;
84+
opp-microvolt-pcode2 = <1100000>;
85+
opp-microvolt-pcode3 = <1080000>;
86+
opp-microvolt-pcode4 = <1040000>;
87+
opp-microvolt-pcode5 = <1020000>;
88+
opp-microvolt-pcode6 = <980000>;
89+
opp-microvolt-pcode7 = <930000>;
90+
};
91+
};
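
The selection implied by the example can be sketched in Python. This is an illustration under stated assumptions, not the driver's code: each opp-supported-hw cell is treated as a bitmask with bit N set for version N (inferred from the "Major 2" comment next to 0x00000004), an OPP is taken as usable only if all three cells match, and the 'pcodeN' suffix selects the opp-microvolt property. All names are hypothetical.

```python
ALL = 0xFFFFFFFF

# Values copied from the example above; only two pcodes are kept for brevity.
OPPS = [
    {"hz": 1500000000, "supported_hw": (0x00000004, ALL, ALL),
     "microvolt": {"pcode0": 1200000, "pcode7": 1070000}},
    {"hz": 1200000000, "supported_hw": (ALL, ALL, ALL),
     "microvolt": {"pcode0": 1110000, "pcode7": 930000}},
]


def opp_is_supported(supported_hw, major, minor, substrate):
    """True if every cell has the bit for the matching version set (assumption)."""
    versions = (major, minor, substrate)
    return all(mask & (1 << version) for mask, version in zip(supported_hw, versions))


def usable_opps(opps, major, minor, substrate, pcode):
    """Return (frequency in Hz, voltage in uV) for each OPP the hardware supports."""
    prop = f"pcode{pcode}"
    return [(opp["hz"], opp["microvolt"][prop]) for opp in opps
            if opp_is_supported(opp["supported_hw"], major, minor, substrate)]


if __name__ == "__main__":
    print(usable_opps(OPPS, major=2, minor=0, substrate=0, pcode=7))
    print(usable_opps(OPPS, major=1, minor=0, substrate=0, pcode=0))
```

Under these assumptions, a part with major version 2 gets both OPPs, while a part with major version 1 only gets the 1.2 GHz entry.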
