Commit b104af5

oohal authored and mpe committed
powerpc/eeh: Check slot presence state in eeh_handle_normal_event()
When a device is surprise removed while undergoing IO we will probably get an EEH PE freeze due to MMIO timeouts and other errors. When a freeze is detected we send a recovery event to the EEH worker thread, which notifies drivers and performs recovery as needed.

In the event of a hot-remove we don't want recovery to occur since there isn't a device to recover. The recovery process is fairly long due to the number of wait states (required by PCIe), which causes problems when devices are removed and replaced (e.g. hot swapping of U.2 NVMe drives).

To determine if we need to skip the recovery process we can use the get_adapter_status() operation of the hotplug_slot to determine if the slot contains a device or not, and if the slot is empty we can skip recovery entirely.

One thing to note is that the slot being EEH frozen does not prevent the hotplug driver from working. We don't have the EEH recovery thread remove any of the devices since it's assumed that the hotplug driver will handle tearing down the slot state.

Signed-off-by: Oliver O'Halloran <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
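The presence check described above can be sketched in userspace as follows. This is only an illustration: every `mock_` type and function is a hypothetical stand-in, not the kernel's `pci_dev`, `pci_slot` or `hotplug_slot` definitions. The point it shows is that any case where the slot state cannot be determined is treated as "present", so recovery is only skipped when the hotplug driver positively reports an empty slot (or the device has already failed permanently).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum mock_channel_state { MOCK_IO_NORMAL, MOCK_IO_FROZEN, MOCK_IO_PERM_FAILURE };

struct mock_hotplug_slot;

struct mock_hotplug_ops {
	/* Returns 0 on success and stores non-zero in *value if the slot is occupied. */
	int (*get_adapter_status)(struct mock_hotplug_slot *slot, uint8_t *value);
};

struct mock_hotplug_slot {
	const struct mock_hotplug_ops *ops;
	uint8_t occupied;	/* fixture: what the slot reports */
	int query_error;	/* fixture: non-zero makes the query fail */
};

struct mock_slot {
	struct mock_hotplug_slot *hotplug;
};

struct mock_dev {
	enum mock_channel_state error_state;
	struct mock_slot *slot;
};

static int mock_get_adapter_status(struct mock_hotplug_slot *slot, uint8_t *value)
{
	if (slot->query_error)
		return slot->query_error;
	*value = slot->occupied;
	return 0;
}

static const struct mock_hotplug_ops mock_ops = {
	.get_adapter_status = mock_get_adapter_status,
};

/* Mirrors the decision order of the check added by this commit. */
static bool mock_slot_presence_check(const struct mock_dev *dev)
{
	uint8_t state = 0;

	if (!dev)
		return false;	/* no device object at all */
	if (dev->error_state == MOCK_IO_PERM_FAILURE)
		return false;	/* already given up on this device */
	if (!dev->slot || !dev->slot->hotplug)
		return true;	/* not hotpluggable: assume present */
	if (!dev->slot->hotplug->ops || !dev->slot->hotplug->ops->get_adapter_status)
		return true;	/* driver cannot tell us: assume present */
	if (dev->slot->hotplug->ops->get_adapter_status(dev->slot->hotplug, &state))
		return true;	/* query failed: assume present */

	return state != 0;
}
```

Calling `mock_slot_presence_check()` on a device whose slot reports `occupied = 0` returns false; setting `query_error` to a non-zero value makes the same call return true, which is the false positive the kernel-doc comment warns about.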
1 parent 38ddc01 commit b104af5

1 file changed: +60 -0 lines changed

arch/powerpc/kernel/eeh_driver.c

Lines changed: 60 additions & 0 deletions
@@ -27,6 +27,7 @@
 #include <linux/irq.h>
 #include <linux/module.h>
 #include <linux/pci.h>
+#include <linux/pci_hotplug.h>
 #include <asm/eeh.h>
 #include <asm/eeh_event.h>
 #include <asm/ppc-pci.h>
@@ -769,6 +770,46 @@ static void eeh_pe_cleanup(struct eeh_pe *pe)
 	}
 }
 
+/**
+ * eeh_slot_presence_check - Check if a device is still present in a slot
+ * @pdev: pci_dev to check
+ *
+ * This function may return a false positive if we can't determine the slot's
+ * presence state. This might happen for PCIe slots if the PE containing
+ * the upstream bridge is also frozen, or the bridge is part of the same PE
+ * as the device.
+ *
+ * This shouldn't happen often, but you might see it if you hotplug a PCIe
+ * switch.
+ */
+static bool eeh_slot_presence_check(struct pci_dev *pdev)
+{
+	const struct hotplug_slot_ops *ops;
+	struct pci_slot *slot;
+	u8 state;
+	int rc;
+
+	if (!pdev)
+		return false;
+
+	if (pdev->error_state == pci_channel_io_perm_failure)
+		return false;
+
+	slot = pdev->slot;
+	if (!slot || !slot->hotplug)
+		return true;
+
+	ops = slot->hotplug->ops;
+	if (!ops || !ops->get_adapter_status)
+		return true;
+
+	rc = ops->get_adapter_status(slot->hotplug, &state);
+	if (rc)
+		return true;
+
+	return !!state;
+}
+
 /**
  * eeh_handle_normal_event - Handle EEH events on a specific PE
  * @pe: EEH PE - which should not be used after we return, as it may
@@ -799,6 +840,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 	enum pci_ers_result result = PCI_ERS_RESULT_NONE;
 	struct eeh_rmv_data rmv_data =
 		{LIST_HEAD_INIT(rmv_data.removed_vf_list), 0};
+	int devices = 0;
 
 	bus = eeh_pe_bus_get(pe);
 	if (!bus) {
@@ -807,6 +849,23 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
+	/*
+	 * When devices are hot-removed we might get an EEH due to
+	 * a driver attempting to touch the MMIO space of a removed
+	 * device. In this case we don't have a device to recover
+	 * so suppress the event if we can't find any present devices.
+	 *
+	 * The hotplug driver should take care of tearing down the
+	 * device itself.
+	 */
+	eeh_for_each_pe(pe, tmp_pe)
+		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
+			if (eeh_slot_presence_check(edev->pdev))
+				devices++;
+
+	if (!devices)
+		goto out; /* nothing to recover */
+
 	eeh_pe_update_time_stamp(pe);
 	pe->freeze_count++;
 	if (pe->freeze_count > eeh_max_freezes) {
@@ -997,6 +1056,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		}
 	}
 
+out:
 	/*
 	 * Clean up any PEs without devices. While marked as EEH_PE_RECOVERYING
 	 * we don't want to modify the PE tree structure so we do it here.
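
The effect of the new early exit in eeh_handle_normal_event() can be illustrated with a small userspace model. It is a sketch under stated assumptions: `mock_pe`, `mock_count_present()` and `mock_should_recover()` are hypothetical names, each device is reduced to a single "present" flag, and the recursion stands in for the kernel's eeh_for_each_pe() / eeh_pe_for_each_dev() iteration over the frozen PE and its children.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PE: some devices plus child PEs. */
struct mock_pe {
	const bool *dev_present;		/* one presence flag per device */
	size_t ndevs;
	const struct mock_pe *const *children;	/* child PEs, may be NULL */
	size_t nchildren;
};

/* Count devices reported present in this PE and everything below it. */
static int mock_count_present(const struct mock_pe *pe)
{
	int count = 0;
	size_t i;

	for (i = 0; i < pe->ndevs; i++)
		if (pe->dev_present[i])
			count++;

	for (i = 0; i < pe->nchildren; i++)
		count += mock_count_present(pe->children[i]);

	return count;
}

/* Recovery is only worth running if at least one device is still there. */
static bool mock_should_recover(const struct mock_pe *root)
{
	return mock_count_present(root) > 0;
}
```

With every flag false, `mock_should_recover()` returns false, corresponding to the `goto out` path that skips recovery and goes straight to PE cleanup. A single present device anywhere in the tree is enough to run the normal recovery sequence.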
