Commit b8c45a0

ayalevin123 authored and davem330 committed
devlink: Add Documentation/networking/devlink-health.txt
This patch adds a new file documenting the devlink health mechanism.

Signed-off-by: Aya Levin <[email protected]>
Signed-off-by: Eran Ben Elisha <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
1 parent ce019fa commit b8c45a0

1 file changed: 86 additions, 0 deletions

@@ -0,0 +1,86 @@
+The health mechanism is targeted for Real Time Alerting, in order to know when
+something bad has happened to a PCI device:
+- Provide alert debug information
+- Self healing
+- If the problem needs vendor support, provide a way to gather all needed
+  debugging information.
+
+The main idea is to unify and centralize driver health reports in the
+generic devlink instance and allow the user to set different
+attributes of the health reporting and recovery procedures.
+
+The devlink health reporter:
+A device driver creates a "health reporter" for each error/health type.
+An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX error)
+or unknown (driver specific).
+For each registered health reporter, a driver can issue error/health reports
+asynchronously. All health report handling is done by devlink.
+A device driver can provide specific callbacks for each "health reporter", e.g.
+ - Recovery procedures
+ - Diagnostics and object dump procedures
+ - OOB (out-of-box) initial parameters
+Different parts of the driver can register different types of health reporters
+with different handlers.
+
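The reporter concept described above can be modeled in a few lines. The following is only a user-space Python sketch, not the kernel C API: the names `ReporterOps`, `HealthReporter`, `Devlink` and `health_reporter_create` are hypothetical and chosen for illustration. It shows one reporter per error/health type, each with its own optional callbacks, registered by different parts of a driver.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ReporterOps:
    """Optional driver callbacks for one reporter (hypothetical names)."""
    recover: Optional[Callable[[dict], bool]] = None   # recovery procedure
    diagnose: Optional[Callable[[], dict]] = None      # diagnostics
    dump: Optional[Callable[[dict], dict]] = None      # object dump


@dataclass
class HealthReporter:
    name: str              # error/health type, e.g. "tx" or "fw"
    ops: ReporterOps
    error_count: int = 0


class Devlink:
    """Toy stand-in for one devlink instance that owns the reporters."""

    def __init__(self) -> None:
        self.reporters: Dict[str, HealthReporter] = {}

    def health_reporter_create(self, name: str, ops: ReporterOps) -> HealthReporter:
        # One reporter per error/health type on this instance.
        if name in self.reporters:
            raise ValueError(f"reporter {name!r} already registered")
        reporter = HealthReporter(name, ops)
        self.reporters[name] = reporter
        return reporter


# Two parts of an imaginary driver register different reporter types,
# each with a different set of handlers.
devlink = Devlink()
tx = devlink.health_reporter_create(
    "tx", ReporterOps(recover=lambda ctx: True, diagnose=lambda: {"sq": "ok"}))
fw = devlink.health_reporter_create(
    "fw", ReporterOps(dump=lambda ctx: {"fw_trace": "..."}))
```

Leaving a callback unset models a reporter that does not support that operation, which is why every field in `ReporterOps` is optional.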
+Once an error is reported, devlink health will perform the following actions:
+  * A log is sent to the kernel trace events buffer
+  * Health status and statistics are updated for the reporter instance
+  * An object dump is taken and saved at the reporter instance (as long as
+    there is no other dump already stored)
+  * An auto recovery attempt is made, depending on:
+    - Auto-recovery configuration
+    - Grace period vs. time passed since the last recovery
+
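The four actions listed above can be sketched as a single report handler. This is a Python model under stated assumptions, not the kernel implementation: the class and attribute names are illustrative, the grace period is in seconds, the grace period is measured from the last recovery attempt, and the steps run in the order the text lists them (the real code may order or gate them differently). A fake clock is injected so the grace-period behavior can be shown without sleeping.

```python
import time
from typing import Callable, Optional


class Reporter:
    """Toy model of one reporter instance; all names are illustrative."""

    def __init__(self, recover: Callable[[], bool], dump: Callable[[], dict],
                 auto_recover: bool = True, grace_period: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.recover_cb, self.dump_cb = recover, dump
        self.auto_recover = auto_recover
        self.grace_period = grace_period      # seconds in this sketch
        self.clock = clock
        self.healthy = True
        self.error_count = 0
        self.recover_count = 0
        self.saved_dump: Optional[dict] = None
        self.last_recovery: Optional[float] = None
        self.trace: list = []                 # stands in for trace events

    def report(self, msg: str) -> bool:
        """Handle one error report; return True if the reporter ends healthy."""
        self.trace.append(msg)                # 1. log the event
        self.error_count += 1                 # 2. update status and statistics
        self.healthy = False
        if self.saved_dump is None:           # 3. save a dump only if none stored
            self.saved_dump = self.dump_cb()
        if not self.auto_recover:             # 4a. auto-recovery configuration
            return False
        now = self.clock()
        if (self.last_recovery is not None    # 4b. grace period check
                and now - self.last_recovery < self.grace_period):
            return False
        self.last_recovery = now
        if self.recover_cb():
            self.recover_count += 1
            self.healthy = True
        return self.healthy


# Usage with a fake clock: the second error falls inside the grace period.
fake_now = [0.0]
rep = Reporter(recover=lambda: True, dump=lambda: {"snapshot": 1},
               grace_period=10.0, clock=lambda: fake_now[0])
first = rep.report("tx timeout")       # recovered
fake_now[0] = 5.0
second = rep.report("tx timeout")      # skipped, grace period not over
fake_now[0] = 20.0
third = rep.report("tx timeout")       # recovered again
```

The grace period keeps a repeatedly failing device from being recovered in a tight loop; errors that arrive too soon are still logged and counted, but no recovery is attempted.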
+The user interface:
+The user can access/change each reporter's parameters and driver specific
+callbacks via devlink, e.g. per error type (per health reporter):
+ - Configure the reporter's generic parameters (e.g. disable/enable auto recovery)
+ - Invoke the recovery procedure
+ - Run diagnostics
+ - Object dump
+
+The devlink health interface (via netlink):
+DEVLINK_CMD_HEALTH_REPORTER_GET
+  Retrieves status and configuration info per DEV and reporter.
+DEVLINK_CMD_HEALTH_REPORTER_SET
+  Allows reporter-related configuration setting.
+DEVLINK_CMD_HEALTH_REPORTER_RECOVER
+  Triggers a reporter's recovery procedure.
+DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
+  Retrieves diagnostics data from a reporter on a device.
+DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
+  Retrieves the last stored dump. Devlink health
+  saves a single dump. If a dump is not already stored by devlink
+  for this reporter, devlink generates a new dump.
+  The dump output is defined by the reporter.
+DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
+  Clears the last saved dump file for the specified reporter.
+
+
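The command semantics above, in particular the single stored dump, can be illustrated with a small dispatcher. This is a Python sketch and does not speak netlink: only the command names come from the text, while `Reporter`, `handle` and the attribute names `auto_recover` and `grace_period` are assumptions made for the example.

```python
from itertools import count
from typing import Optional


class Reporter:
    """Toy reporter holding driver callbacks, config and one stored dump."""

    def __init__(self, recover, diagnose, dump) -> None:
        self.recover_cb, self.diagnose_cb, self.dump_cb = recover, diagnose, dump
        self.config = {"auto_recover": True, "grace_period": 0}
        self.state = "healthy"
        self.saved_dump: Optional[dict] = None


def handle(cmd: str, reporter: Reporter, attrs: Optional[dict] = None):
    """Dispatch one health command against a reporter."""
    if cmd == "DEVLINK_CMD_HEALTH_REPORTER_GET":
        return {"state": reporter.state, **reporter.config}
    if cmd == "DEVLINK_CMD_HEALTH_REPORTER_SET":
        unknown = set(attrs or {}) - set(reporter.config)
        if unknown:
            raise KeyError(f"unknown attributes: {sorted(unknown)}")
        reporter.config.update(attrs or {})
        return None
    if cmd == "DEVLINK_CMD_HEALTH_REPORTER_RECOVER":
        recovered = reporter.recover_cb()
        if recovered:
            reporter.state = "healthy"
        return recovered
    if cmd == "DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE":
        return reporter.diagnose_cb()
    if cmd == "DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET":
        if reporter.saved_dump is None:       # nothing stored: generate one
            reporter.saved_dump = reporter.dump_cb()
        return reporter.saved_dump            # otherwise return the stored dump
    if cmd == "DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR":
        reporter.saved_dump = None
        return None
    raise ValueError(f"unsupported command: {cmd}")


# Usage: each generated dump gets a new id, so caching is visible.
ids = count()
rep = Reporter(recover=lambda: True, diagnose=lambda: {"cq": "ok"},
               dump=lambda: {"id": next(ids)})
first = handle("DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET", rep)   # generated
again = handle("DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET", rep)   # same stored dump
handle("DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR", rep)
fresh = handle("DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET", rep)   # newly generated
```

Because only one dump is kept, a dump taken at error time survives later DUMP_GET calls until it is explicitly cleared.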
+                                                  netlink
+                                         +--------------------------+
+                                         |                          |
+                                         |            +             |
+                                         |            |             |
+                                         +--------------------------+
+                                                      |request for ops
+                                                      |(diagnose,
+     mlx5_core                             devlink    |recover,
+                                                      |dump)
+    +--------+                           +--------------------------+
+    |        |                           |    reporter|             |
+    |        |                           |  +---------v----------+  |
+    |        |   ops execution           |  |                    |  |
+    |    <----------------------------------+                    |  |
+    |        |                           |  |                    |  |
+    |        |                           |  + ^------------------+  |
+    |        |                           |    | request for ops     |
+    |        |                           |    | (recover, dump)     |
+    |        |                           |    |                     |
+    |        |                           |  +-+------------------+  |
+    |        |     health report         |  | health handler     |  |
+    |        +------------------------------>                    |  |
+    |        |                           |  +--------------------+  |
+    |        |     health reporter create|                          |
+    |        +--------------------------->                          |
+    +--------+                           +--------------------------+
