Skip to content

Commit 050b19c

Browse files
committed
Reorder the BP loop to make manager persistence more reliable
The main loop of the background processor has this line: `peer_manager.process_events(); // Note that this may block on ChannelManager's locking` which does, indeed, sometimes block waiting on the `ChannelManager` to finish whatever its doing. Specifically, its the only place in the background processor loop that we block waiting on the `ChannelManager`, so if the `ChannelManager` is relatively busy, we may end up being blocked there most of the time. This should be fine, except today we had a user who's node was particularly slow in processing some channel updates, resulting in the background processor being blocked there (as expected). Then, when the channel updates were completed (and persisted) the next thing the background processor did was hand the user events to process, creating yet more channel updates. Ultimately, the users' node crashed before finishing the event processing. This left us with an updated monitor on disk and an outdated manager, and they lost the channel on startup. Here we simply move the above quoted line to after the normal event processing, ensuring the next thing we do after blocking on `ChannelManager` locks is persist the manager, prior to event handling.
1 parent d0f69f7 commit 050b19c

File tree

1 file changed

+13
-1
lines changed
  • lightning-background-processor/src

1 file changed

+13
-1
lines changed

lightning-background-processor/src/lib.rs

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -203,10 +203,22 @@ impl BackgroundProcessor {
203203
let mut have_pruned = false;
204204

205205
loop {
206-
peer_manager.process_events(); // Note that this may block on ChannelManager's locking
207206
channel_manager.process_pending_events(&event_handler);
208207
chain_monitor.process_pending_events(&event_handler);
209208

209+
// Note that the PeerManager::process_events may block on ChannelManager's locks,
210+
// hence it comes last here. When the ChannelManager finishes whatever it's doing,
211+
// we want to ensure we get into `persist_manager` as quickly as we can, especially
212+
// without running the normal event processing above and handing events to users.
213+
//
214+
// Specifically, on an *extremely* slow machine, we may see ChannelManager start
215+
// processing a message effectively at any point during this loop. In order to
216+
// minimize the time between such processing completing and persisting the updated
217+
// ChannelManager, we want to minimize methods blocking on a ChannelManager
218+
// generally, and as a fallback place such blocking only immediately before
219+
// persistence.
220+
peer_manager.process_events();
221+
210222
// We wait up to 100ms, but track how long it takes to detect being put to sleep,
211223
// see `await_start`'s use below.
212224
let await_start = Instant::now();

0 commit comments

Comments
 (0)