Commit d1679b4

Ganesh Goudar authored and mpe committed
powerpc/eeh: Permanently disable the removed device
When a device is hot-removed on powernv, the hotplug driver clears the device's state. On pseries, however, if phyp removes a device after it reaches the error threshold, the kernel remains unaware and the device is never torn down. This prevents necessary remediation actions such as failover.

Permanently disable the device if the presence check fails. Also, eeh_dev_check_failure may treat the error as a false positive when the device has been hot-unplugged, because the get_state call returns EEH_STATE_NOT_SUPPORT, and the device state may then never be cleared. So log the event if the state has not moved to the permanent failure state.

Signed-off-by: Ganesh Goudar <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]
1 parent 57e6700 commit d1679b4

2 files changed

+21
-3
lines changed

arch/powerpc/kernel/eeh.c

Lines changed: 10 additions & 1 deletion
@@ -506,9 +506,18 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	 * We will punt with the following conditions: Failure to get
 	 * PE's state, EEH not support and Permanently unavailable
 	 * state, PE is in good state.
+	 *
+	 * On the pSeries, after reaching the threshold, get_state might
+	 * return EEH_STATE_NOT_SUPPORT. However, it's possible that the
+	 * device state remains uncleared if the device is not marked
+	 * pci_channel_io_perm_failure. Therefore, consider logging the
+	 * event to let device removal happen.
+	 *
 	 */
 	if ((ret < 0) ||
-	    (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
+	    (ret == EEH_STATE_NOT_SUPPORT &&
+	     dev->error_state == pci_channel_io_perm_failure) ||
+	    eeh_state_active(ret)) {
 		eeh_stats.false_positives++;
 		pe->false_positives++;
 		rc = 0;
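
The effect of the changed condition can be checked in isolation. The following is a minimal user-space sketch, not kernel code: the `model_*` names, the `MODEL_*` constants and their numeric values are all hypothetical stand-ins for `eeh_state_active()`, `EEH_STATE_NOT_SUPPORT` and `pci_channel_io_perm_failure`. It only models the boolean test shown in the hunk above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the kernel's real types and values differ. */
enum model_channel_state {
	MODEL_IO_NORMAL,
	MODEL_IO_FROZEN,
	MODEL_IO_PERM_FAILURE
};

#define MODEL_STATE_NOT_SUPPORT  (1 << 0)
#define MODEL_STATE_ACTIVE_BITS  ((1 << 1) | (1 << 2))

/* Assumed model of eeh_state_active(): all "active" bits are set. */
static bool model_state_active(int ret)
{
	return (ret & MODEL_STATE_ACTIVE_BITS) == MODEL_STATE_ACTIVE_BITS;
}

/* Condition before the commit: NOT_SUPPORT always counts as a false positive. */
static bool old_is_false_positive(int ret)
{
	return ret < 0 ||
	       ret == MODEL_STATE_NOT_SUPPORT ||
	       model_state_active(ret);
}

/*
 * Condition after the commit: NOT_SUPPORT counts as a false positive only
 * when the device is already marked as permanently failed.
 */
static bool new_is_false_positive(int ret, enum model_channel_state error_state)
{
	return ret < 0 ||
	       (ret == MODEL_STATE_NOT_SUPPORT &&
		error_state == MODEL_IO_PERM_FAILURE) ||
	       model_state_active(ret);
}
```

Under this model, a device that reports "not supported" but has not yet been marked permanently failed is no longer counted as a false positive, so the event goes on to be logged and handled.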

arch/powerpc/kernel/eeh_driver.c

Lines changed: 11 additions & 2 deletions
@@ -865,9 +865,18 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		devices++;
 
 	if (!devices) {
-		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
+		pr_warn("EEH: Frozen PHB#%x-PE#%x is empty!\n",
 			pe->phb->global_number, pe->addr);
-		goto out;	/* nothing to recover */
+		/*
+		 * The device is removed, tear down its state, on powernv
+		 * hotplug driver would take care of it but not on pseries,
+		 * permanently disable the card as it is hot removed.
+		 *
+		 * In the case of powernv, note that the removal of device
+		 * is covered by pci rescan lock, so no problem even if hotplug
+		 * driver attempts to remove the device.
+		 */
+		goto recover_failed;
 	}
 
 	/* Log the event */
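
The control-flow change in this hunk can be summarized with a small user-space sketch. It is a simplification under stated assumptions, not the kernel's implementation: the `model_*` names and the `recovery_ok` flag are hypothetical, and the whole recovery sequence is collapsed into a single boolean. The only behavior taken from the diff is that an empty PE used to jump to `out` and now jumps to `recover_failed`.

```c
#include <stdbool.h>

/* Hypothetical outcomes of handling one EEH event. */
enum model_outcome {
	MODEL_SKIPPED,        /* old "goto out": nothing is done          */
	MODEL_RECOVERED,      /* recovery succeeded                       */
	MODEL_PERM_DISABLED   /* "recover_failed": permanently disabled   */
};

/* Before the commit: an empty PE is skipped without any teardown. */
static enum model_outcome old_handle_event(int devices, bool recovery_ok)
{
	if (!devices)
		return MODEL_SKIPPED;
	return recovery_ok ? MODEL_RECOVERED : MODEL_PERM_DISABLED;
}

/* After the commit: an empty PE takes the same path as a failed recovery. */
static enum model_outcome new_handle_event(int devices, bool recovery_ok)
{
	if (!devices)
		return MODEL_PERM_DISABLED;
	return recovery_ok ? MODEL_RECOVERED : MODEL_PERM_DISABLED;
}
```

In this model the two versions differ only when `devices` is zero, which matches the intent stated in the commit message: a device that fails the presence check is permanently disabled rather than left in place.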
