Commit 0e94682

surenbaghdasaryan authored and torvalds committed
psi: introduce psi monitor
Psi monitor aims to provide a low-latency short-term pressure detection
mechanism configurable by users. It allows users to monitor psi metrics
growth and trigger events whenever a metric raises above user-defined
threshold within user-defined time window.

Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10
times per tracking window. Min window size is 500ms, therefore the min
monitoring interval is 50ms. Max window size is 10s with monitoring
interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 8af0c18 commit 0e94682
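
Editor's note: the commit message ties the monitoring interval to the window size (10 updates per window, windows from 500ms to 10s). The arithmetic can be sketched in userspace C as below. The macro and function names are hypothetical and are not taken from the kernel sources; the snippet only restates the limits quoted in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Limits quoted in the commit message, expressed in microseconds. */
#define SKETCH_WIN_MIN_US	500000ULL	/* 500ms */
#define SKETCH_WIN_MAX_US	10000000ULL	/* 10s */
#define SKETCH_UPDATES_PER_WIN	10ULL		/* 10 times per tracking window */

/* Hypothetical helper: is this window size inside the accepted range? */
static bool sketch_window_valid(uint64_t window_us)
{
	return window_us >= SKETCH_WIN_MIN_US && window_us <= SKETCH_WIN_MAX_US;
}

/*
 * Hypothetical helper: monitoring interval implied by a window size,
 * or 0 when the window is outside the accepted range.
 */
static uint64_t sketch_poll_interval_us(uint64_t window_us)
{
	if (!sketch_window_valid(window_us))
		return 0;
	return window_us / SKETCH_UPDATES_PER_WIN;
}
```

With these definitions a 500ms window yields a 50ms interval and a 10s window yields a 1s interval, matching the figures in the message.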

5 files changed, +742 -20 lines changed

Documentation/accounting/psi.txt

Lines changed: 107 additions & 0 deletions
@@ -63,6 +63,110 @@ as well as medium and long term trends. The total absolute stall time
 spikes which wouldn't necessarily make a dent in the time averages,
 or to average trends over custom time frames.

+Monitoring for pressure thresholds
+==================================
+
+Users can register triggers and use poll() to be woken up when resource
+pressure exceeds certain thresholds.
+
+A trigger describes the maximum cumulative stall time over a specific
+time window, e.g. 100ms of total stall time within any 500ms window to
+generate a wakeup event.
+
+To register a trigger user has to open psi interface file under
+/proc/pressure/ representing the resource to be monitored and write the
+desired threshold and time window. The open file descriptor should be
+used to wait for trigger events using select(), poll() or epoll().
+The following format is used:
+
+<some|full> <stall amount in us> <time window in us>
+
+For example writing "some 150000 1000000" into /proc/pressure/memory
+would add 150ms threshold for partial memory stall measured within
+1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
+would add 50ms threshold for full io stall measured within 1sec time window.
+
+Triggers can be set on more than one psi metric and more than one trigger
+for the same psi metric can be specified. However for each trigger a separate
+file descriptor is required to be able to poll it separately from others,
+therefore for each trigger a separate open() syscall should be made even
+when opening the same psi interface file.
+
+Monitors activate only when system enters stall state for the monitored
+psi metric and deactivates upon exit from the stall state. While system is
+in the stall state psi signal growth is monitored at a rate of 10 times per
+tracking window.
+
+The kernel accepts window sizes ranging from 500ms to 10s, therefore min
+monitoring update interval is 50ms and max is 1s. Min limit is set to
+prevent overly frequent polling. Max limit is chosen as a high enough number
+after which monitors are most likely not needed and psi averages can be used
+instead.
+
+When activated, psi monitor stays active for at least the duration of one
+tracking window to avoid repeated activations/deactivations when system is
+bouncing in and out of the stall state.
+
+Notifications to the userspace are rate-limited to one per tracking window.
+
+The trigger will de-register when the file descriptor used to define the
+trigger is closed.
+
+Userspace monitor usage example
+===============================
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <poll.h>
+#include <string.h>
+#include <unistd.h>
+
+/*
+ * Monitor memory partial stall with 1s tracking window size
+ * and 150ms threshold.
+ */
+int main() {
+	const char trig[] = "some 150000 1000000";
+	struct pollfd fds;
+	int n;
+
+	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
+	if (fds.fd < 0) {
+		printf("/proc/pressure/memory open error: %s\n",
+			strerror(errno));
+		return 1;
+	}
+	fds.events = POLLPRI;
+
+	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
+		printf("/proc/pressure/memory write error: %s\n",
+			strerror(errno));
+		return 1;
+	}
+
+	printf("waiting for events...\n");
+	while (1) {
+		n = poll(&fds, 1, -1);
+		if (n < 0) {
+			printf("poll error: %s\n", strerror(errno));
+			return 1;
+		}
+		if (fds.revents & POLLERR) {
+			printf("got POLLERR, event source is gone\n");
+			return 0;
+		}
+		if (fds.revents & POLLPRI) {
+			printf("event triggered!\n");
+		} else {
+			printf("unknown event received: 0x%x\n", fds.revents);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 Cgroup2 interface
 =================

@@ -71,3 +175,6 @@ mounted, pressure stall information is also tracked for tasks grouped
 into cgroups. Each subdirectory in the cgroupfs mountpoint contains
 cpu.pressure, memory.pressure, and io.pressure files; the format is
 the same as the /proc/pressure/ files.
+
+Per-cgroup psi monitors can be specified and used the same way as
+system-wide ones.
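
Editor's note: the documentation above requires one open() per trigger and a trigger string of the form "<some|full> <stall amount in us> <time window in us>". A minimal userspace sketch of those two steps follows. Both helper names are hypothetical; only open(), write(), close() and snprintf() are real calls, and the NUL-terminated write mirrors the documentation's own example rather than any stated requirement.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: build "<some|full> <stall us> <window us>".
 * Returns 0 on success, -1 if 'kind' is neither "some" nor "full"
 * or if the buffer is too small.
 */
static int sketch_format_trigger(char *buf, size_t len, const char *kind,
				 unsigned long stall_us,
				 unsigned long window_us)
{
	int n;

	if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
		return -1;
	n = snprintf(buf, len, "%s %lu %lu", kind, stall_us, window_us);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: open a fresh descriptor for exactly one trigger
 * and register it. Returns the descriptor (to be polled for POLLPRI),
 * or -1 on failure.
 */
static int sketch_register_trigger(const char *path, const char *trig)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;
	/* Write the string including its NUL, as the example above does. */
	if (write(fd, trig, strlen(trig) + 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

To watch two thresholds on the same resource, a caller would invoke sketch_register_trigger() twice on /proc/pressure/memory and place both descriptors in one pollfd array; closing a descriptor de-registers its trigger.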

include/linux/psi.h

Lines changed: 8 additions & 0 deletions
@@ -4,6 +4,7 @@
 #include <linux/jump_label.h>
 #include <linux/psi_types.h>
 #include <linux/sched.h>
+#include <linux/poll.h>

 struct seq_file;
 struct css_set;
@@ -26,6 +27,13 @@ int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
 int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
 void cgroup_move_task(struct task_struct *p, struct css_set *to);
+
+struct psi_trigger *psi_trigger_create(struct psi_group *group,
+			char *buf, size_t nbytes, enum psi_res res);
+void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *t);
+
+__poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
+			poll_table *wait);
 #endif

 #else /* CONFIG_PSI */

include/linux/psi_types.h

Lines changed: 80 additions & 2 deletions
@@ -1,8 +1,11 @@
 #ifndef _LINUX_PSI_TYPES_H
 #define _LINUX_PSI_TYPES_H

+#include <linux/kthread.h>
 #include <linux/seqlock.h>
 #include <linux/types.h>
+#include <linux/kref.h>
+#include <linux/wait.h>

 #ifdef CONFIG_PSI

@@ -44,6 +47,12 @@ enum psi_states {
 	NR_PSI_STATES = 6,
 };

+enum psi_aggregators {
+	PSI_AVGS = 0,
+	PSI_POLL,
+	NR_PSI_AGGREGATORS,
+};
+
 struct psi_group_cpu {
 	/* 1st cacheline updated by the scheduler */

@@ -65,7 +74,55 @@ struct psi_group_cpu {
 	/* 2nd cacheline updated by the aggregator */

 	/* Delta detection against the sampling buckets */
-	u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
+	u32 times_prev[NR_PSI_AGGREGATORS][NR_PSI_STATES]
+			____cacheline_aligned_in_smp;
+};
+
+/* PSI growth tracking window */
+struct psi_window {
+	/* Window size in ns */
+	u64 size;
+
+	/* Start time of the current window in ns */
+	u64 start_time;
+
+	/* Value at the start of the window */
+	u64 start_value;
+
+	/* Value growth in the previous window */
+	u64 prev_growth;
+};
+
+struct psi_trigger {
+	/* PSI state being monitored by the trigger */
+	enum psi_states state;
+
+	/* User-spacified threshold in ns */
+	u64 threshold;
+
+	/* List node inside triggers list */
+	struct list_head node;
+
+	/* Backpointer needed during trigger destruction */
+	struct psi_group *group;
+
+	/* Wait queue for polling */
+	wait_queue_head_t event_wait;
+
+	/* Pending event flag */
+	int event;
+
+	/* Tracking window */
+	struct psi_window win;
+
+	/*
+	 * Time last event was generated. Used for rate-limiting
+	 * events to one per window
+	 */
+	u64 last_event_time;
+
+	/* Refcounting to prevent premature destruction */
+	struct kref refcount;
 };

 struct psi_group {
@@ -79,11 +136,32 @@ struct psi_group {
 	u64 avg_total[NR_PSI_STATES - 1];
 	u64 avg_last_update;
 	u64 avg_next_update;
+
+	/* Aggregator work control */
 	struct delayed_work avgs_work;

 	/* Total stall times and sampled pressure averages */
-	u64 total[NR_PSI_STATES - 1];
+	u64 total[NR_PSI_AGGREGATORS][NR_PSI_STATES - 1];
 	unsigned long avg[NR_PSI_STATES - 1][3];
+
+	/* Monitor work control */
+	atomic_t poll_scheduled;
+	struct kthread_worker __rcu *poll_kworker;
+	struct kthread_delayed_work poll_work;
+
+	/* Protects data used by the monitor */
+	struct mutex trigger_lock;
+
+	/* Configured polling triggers */
+	struct list_head triggers;
+	u32 nr_triggers[NR_PSI_STATES - 1];
+	u32 poll_states;
+	u64 poll_min_period;
+
+	/* Total stall times at the start of monitor activation */
+	u64 polling_total[NR_PSI_STATES - 1];
+	u64 polling_next_update;
+	u64 polling_until;
 };

 #else /* CONFIG_PSI */
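
Editor's note: the code that consumes struct psi_window and last_event_time is not shown on this page, so the following is only a userspace sketch of one way those fields could be used. It assumes that growth within a partially elapsed window is topped up by a linear share of prev_growth, and that an event is allowed once a full window has passed since the last one. All names prefixed with sketch_ are hypothetical.

```c
#include <stdint.h>

/* Userspace mirror of the struct psi_window fields shown in the diff. */
struct sketch_window {
	uint64_t size;		/* window size in ns */
	uint64_t start_time;	/* start of the current window in ns */
	uint64_t start_value;	/* stall total at the start of the window */
	uint64_t prev_growth;	/* growth seen in the previous window */
};

/* Begin a new window at 'now', remembering the growth just observed. */
static void sketch_window_reset(struct sketch_window *win, uint64_t now,
				uint64_t value, uint64_t prev_growth)
{
	win->start_time = now;
	win->start_value = value;
	win->prev_growth = prev_growth;
}

/*
 * Estimate stall growth over the last win->size ns. Inside a window the
 * measured growth is extended by the part of prev_growth that would
 * still fall inside the window if the previous growth had been linear
 * (an assumption of this sketch).
 */
static uint64_t sketch_window_update(struct sketch_window *win, uint64_t now,
				     uint64_t value)
{
	uint64_t elapsed = now - win->start_time;
	uint64_t growth = value - win->start_value;

	if (elapsed > win->size) {
		sketch_window_reset(win, now, value, growth);
	} else {
		uint64_t remaining = win->size - elapsed;

		growth += win->prev_growth * remaining / win->size;
	}
	return growth;
}

/* Rate limit: at most one event per window since last_event_time. */
static int sketch_event_allowed(uint64_t now, uint64_t last_event_time,
				uint64_t window_size)
{
	return now >= last_event_time + window_size;
}
```

A trigger would then fire when the estimated growth exceeds its threshold and sketch_event_allowed() returns nonzero, which corresponds to the "one notification per tracking window" rule in the commit message.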

kernel/cgroup/cgroup.c

Lines changed: 69 additions & 2 deletions
@@ -3550,7 +3550,65 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
 {
 	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
 }
-#endif
+
+static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
+					  size_t nbytes, enum psi_res res)
+{
+	struct psi_trigger *new;
+	struct cgroup *cgrp;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	cgroup_get(cgrp);
+	cgroup_kn_unlock(of->kn);
+
+	new = psi_trigger_create(&cgrp->psi, buf, nbytes, res);
+	if (IS_ERR(new)) {
+		cgroup_put(cgrp);
+		return PTR_ERR(new);
+	}
+
+	psi_trigger_replace(&of->priv, new);
+
+	cgroup_put(cgrp);
+
+	return nbytes;
+}
+
+static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes,
+					  loff_t off)
+{
+	return cgroup_pressure_write(of, buf, nbytes, PSI_IO);
+}
+
+static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes,
+					  loff_t off)
+{
+	return cgroup_pressure_write(of, buf, nbytes, PSI_MEM);
+}
+
+static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes,
+					  loff_t off)
+{
+	return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
+}
+
+static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
+					  poll_table *pt)
+{
+	return psi_trigger_poll(&of->priv, of->file, pt);
+}
+
+static void cgroup_pressure_release(struct kernfs_open_file *of)
+{
+	psi_trigger_replace(&of->priv, NULL);
+}
+#endif /* CONFIG_PSI */

 static int cgroup_freeze_show(struct seq_file *seq, void *v)
 {
@@ -4745,18 +4803,27 @@ static struct cftype cgroup_base_files[] = {
 		.name = "io.pressure",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = cgroup_io_pressure_show,
+		.write = cgroup_io_pressure_write,
+		.poll = cgroup_pressure_poll,
+		.release = cgroup_pressure_release,
 	},
 	{
 		.name = "memory.pressure",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = cgroup_memory_pressure_show,
+		.write = cgroup_memory_pressure_write,
+		.poll = cgroup_pressure_poll,
+		.release = cgroup_pressure_release,
 	},
 	{
 		.name = "cpu.pressure",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = cgroup_cpu_pressure_show,
+		.write = cgroup_cpu_pressure_write,
+		.poll = cgroup_pressure_poll,
+		.release = cgroup_pressure_release,
 	},
-#endif
+#endif /* CONFIG_PSI */
 	{ }	/* terminate */
 };
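
Editor's note: with the .write, .poll and .release handlers above, the per-cgroup io.pressure, memory.pressure and cpu.pressure files accept triggers in the same way as the /proc/pressure/ files. The sketch below only builds the path of such a file so it can be opened and written to like the system-wide one. The helper name is hypothetical, and the cgroup directory passed in (for example a directory under /sys/fs/cgroup) is an assumption about where cgroup2 is mounted. The CFTYPE_NOT_ON_ROOT flag in the table suggests these files do not exist in the root cgroup.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<cgroup_dir>/<res>.pressure" where res is
 * one of "cpu", "memory" or "io". Returns 0 on success, -1 on an
 * unknown resource or a too-small buffer.
 */
static int sketch_cgroup_pressure_path(char *buf, size_t len,
				       const char *cgroup_dir,
				       const char *res)
{
	int n;

	if (strcmp(res, "cpu") != 0 && strcmp(res, "memory") != 0 &&
	    strcmp(res, "io") != 0)
		return -1;
	n = snprintf(buf, len, "%s/%s.pressure", cgroup_dir, res);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting path can be opened with O_RDWR | O_NONBLOCK and given a trigger string exactly as in the documentation's /proc/pressure/memory example.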