You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Vladimir Oltean says:
====================
Fix rtnl_mutex deadlock with DPAA2 and SFP modules
This patch set deliberately targets net-next and lacks Fixes: tags due
to caution on my part.
While testing some SFP modules on the Solidrun Honeycomb LX2K platform,
I noticed that rebooting causes a deadlock:
============================================
WARNING: possible recursive locking detected
6.1.0-rc5-07010-ga9b9500ffaac-dirty #656 Not tainted
--------------------------------------------
systemd-shutdow/1 is trying to acquire lock:
ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30
but task is already holding lock:
ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(rtnl_mutex);
lock(rtnl_mutex);
*** DEADLOCK ***
May be due to missing lock nesting notation
6 locks held by systemd-shutdow/1:
#0: ffffa62db6863c70 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xd4/0x260
#1: ffff2f2b0176f100 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xf4/0x260
#2: ffff2f2b017be900 (&dev->mutex){....}-{4:4}, at: device_shutdown+0x104/0x260
#3: ffff2f2b017680f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260
#4: ffff2f2b0e1608f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260
#5: ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30
stack backtrace:
CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 6.1.0-rc5-07010-ga9b9500ffaac-dirty #656
Hardware name: SolidRun LX2160A Honeycomb (DT)
Call trace:
lock_acquire+0x68/0x84
__mutex_lock+0x98/0x460
mutex_lock_nested+0x2c/0x40
rtnl_lock+0x1c/0x30
sfp_bus_del_upstream+0x1c/0xac
phylink_destroy+0x1c/0x50
dpaa2_mac_disconnect+0x28/0x70
dpaa2_eth_remove+0x1dc/0x1f0
fsl_mc_driver_remove+0x24/0x60
device_remove+0x70/0x80
device_release_driver_internal+0x1f0/0x260
device_links_unbind_consumers+0xe0/0x110
device_release_driver_internal+0x138/0x260
device_release_driver+0x18/0x24
bus_remove_device+0x12c/0x13c
device_del+0x16c/0x424
fsl_mc_device_remove+0x28/0x40
__fsl_mc_device_remove+0x10/0x20
device_for_each_child+0x5c/0xac
dprc_remove+0x94/0xb4
fsl_mc_driver_remove+0x24/0x60
device_remove+0x70/0x80
device_release_driver_internal+0x1f0/0x260
device_release_driver+0x18/0x24
bus_remove_device+0x12c/0x13c
device_del+0x16c/0x424
fsl_mc_bus_remove+0x8c/0x10c
fsl_mc_bus_shutdown+0x10/0x20
platform_shutdown+0x24/0x3c
device_shutdown+0x15c/0x260
kernel_restart+0x40/0xa4
__do_sys_reboot+0x1e4/0x260
__arm64_sys_reboot+0x24/0x30
But fixing this appears to be not so simple. The patch set represents my
attempt to address it.
In short, the problem is that dpaa2_mac_connect() and dpaa2_mac_disconnect()
call 2 phylink functions in a row, one takes rtnl_lock() itself -
phylink_create(), and one which requires rtnl_lock() to be held by the
caller - phylink_fwnode_phy_connect(). The existing approach in the
drivers is too simple. We take rtnl_lock() when calling dpaa2_mac_connect(),
which is what results in the deadlock.
Fixing just that creates another problem. The drivers make use of
rtnl_lock() for serializing with other code paths too. I think I've
found all those code paths, and established other mechanisms for
serializing with them.
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
0 commit comments