|
| 1 | +The health mechanism is targeted for Real Time Alerting, in order to know when |
| 2 | +something bad has happened to a PCI device. |
| 3 | +- Provide alert debug information |
| 4 | +- Self healing |
| 5 | +- If a problem needs vendor support, provide a way to gather all needed debugging |
| 6 | + information. |
| 7 | + |
| 8 | +The main idea is to unify and centralize driver health reports in the |
| 9 | +generic devlink instance and allow the user to set different |
| 10 | +attributes of the health reporting and recovery procedures. |
| 11 | + |
| 12 | +The devlink health reporter: |
| 13 | +Device driver creates a "health reporter" for each error/health type. |
| 14 | +An error/health type can be known/generic (e.g. pci error, fw error, rx/tx error) |
| 15 | +or unknown (driver specific). |
| 16 | +For each registered health reporter a driver can issue error/health reports |
| 17 | +asynchronously. All health report handling is done by devlink. |
| 18 | +Device driver can provide specific callbacks for each "health reporter", e.g. |
| 19 | + - Recovery procedures |
| 20 | + - Diagnostics and object dump procedures |
| 21 | + - Out-of-box (OOB) initial parameters |
| 22 | +Different parts of the driver can register different types of health reporters |
| 23 | +with different handlers. |
| 24 | + |
| 25 | +Once an error is reported, devlink health will do the following actions: |
| 26 | + * A log is sent to the kernel trace events buffer |
| 27 | + * Health status and statistics are updated for the reporter instance |
| 28 | + * An object dump is taken and saved at the reporter instance (as long as |
| 29 | + there is no other dump which is already stored) |
| 30 | + * An auto recovery attempt is made, depending on: |
| 31 | + - Auto-recovery configuration |
| 32 | + - Grace period vs. time passed since last recovery |
| 33 | + |
| 34 | +The user interface: |
| 35 | +User can access/change each reporter's parameters and driver specific callbacks |
| 36 | +via devlink, e.g. per error type (per health reporter): |
| 37 | + - Configure reporter's generic parameters (e.g. disable/enable auto recovery) |
| 38 | + - Invoke recovery procedure |
| 39 | + - Run diagnostics |
| 40 | + - Object dump |
| 41 | + |
| 42 | +The devlink health interface (via netlink): |
| 43 | +DEVLINK_CMD_HEALTH_REPORTER_GET |
| 44 | + Retrieves status and configuration info per DEV and reporter. |
| 45 | +DEVLINK_CMD_HEALTH_REPORTER_SET |
| 46 | + Allows reporter-related configuration setting. |
| 47 | +DEVLINK_CMD_HEALTH_REPORTER_RECOVER |
| 48 | + Triggers a reporter's recovery procedure. |
| 49 | +DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE |
| 50 | + Retrieves diagnostics data from a reporter on a device. |
| 51 | +DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET |
| 52 | + Retrieves the last stored dump. Devlink health |
| 53 | + saves a single dump. If a dump is not already stored by devlink |
| 54 | + for this reporter, devlink generates a new dump. |
| 55 | + Dump output is defined by the reporter. |
| 56 | +DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR |
| 57 | + Clears the last saved dump file for the specified reporter. |
| 58 | + |
| 59 | + |
| 60 | + netlink |
| 61 | + +--------------------------+ |
| 62 | + | | |
| 63 | + | + | |
| 64 | + | | | |
| 65 | + +--------------------------+ |
| 66 | + |request for ops |
| 67 | + |(diagnose, |
| 68 | + mlx5_core devlink |recover, |
| 69 | + |dump) |
| 70 | ++--------+ +--------------------------+ |
| 71 | +| | | reporter| | |
| 72 | +| | | +---------v----------+ | |
| 73 | +| | ops execution | | | | |
| 74 | +| <----------------------------------+ | | |
| 75 | +| | | | | | |
| 76 | +| | | + ^------------------+ | |
| 77 | +| | | | request for ops | |
| 78 | +| | | | (recover, dump) | |
| 79 | +| | | | | |
| 80 | +| | | +-+------------------+ | |
| 81 | +| | health report | | health handler | | |
| 82 | +| +-------------------------------> | | |
| 83 | +| | | +--------------------+ | |
| 84 | +| | health reporter create | | |
| 85 | +| +----------------------------> | |
| 86 | ++--------+ +--------------------------+ |
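
The report-handling steps listed in the document (update statistics, keep at most one stored dump, attempt auto recovery subject to configuration and grace period) can be sketched in plain C as follows. This is a minimal user-space model under stated assumptions, not the kernel implementation: `struct reporter`, its fields, and `handle_report()` are hypothetical names invented for illustration, and the trace-event step is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one health reporter; not the kernel's structure. */
struct reporter {
    bool auto_recover;        /* auto-recovery configuration */
    uint64_t grace_period_ms; /* minimum time between auto recoveries */
    uint64_t last_recover_ms; /* time of the last recovery */
    bool has_recovered;       /* true once any recovery has run */
    bool dump_stored;         /* true while a dump is saved */
    unsigned error_count;     /* statistics: reports received */
    unsigned recover_count;   /* statistics: recoveries performed */
};

/*
 * Handle one error report at time now_ms.
 * Returns true if an auto-recovery attempt was made.
 */
bool handle_report(struct reporter *r, uint64_t now_ms)
{
    assert(r != NULL);

    /* Update statistics for the reporter instance. */
    r->error_count++;

    /* Take a dump only if no other dump is already stored. */
    if (!r->dump_stored)
        r->dump_stored = true;

    /* Auto recovery depends on the configuration ... */
    if (!r->auto_recover)
        return false;

    /* ... and on the grace period vs. time since the last recovery. */
    if (r->has_recovered &&
        now_ms - r->last_recover_ms < r->grace_period_ms)
        return false;

    r->has_recovered = true;
    r->last_recover_ms = now_ms;
    r->recover_count++;
    return true;
}
```

In this model, a reporter with auto recovery enabled and a 500 ms grace period recovers on a first report, skips recovery for a second report 200 ms later, and recovers again once the grace period has elapsed. The stored dump from the first report is kept throughout, mirroring the single-dump behaviour described for DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET.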