
Commit 0760f99

Disallow dual-sync-async persistence without restarting
In general, we don't expect users to persist `ChannelMonitor[Update]`s both synchronously and asynchronously for a single `ChannelManager` instance. If a user has implemented asynchronous persistence, they should generally always use it, as there is no advantage to occasionally persisting synchronously. Even so, in 920d96e we fixed some bugs related to such operation, and noted that "there isn't much cost to supporting it".

Sadly, that is not true. Specifically, the dual-sync-async persistence flow is ill-defined and difficult to define in a way that a user can realistically implement. Consider a `ChannelMonitorUpdate` which is persisted asynchronously, and, while it is still being persisted, a second `ChannelMonitorUpdate` which is created and persisted synchronously. The `ChannelManager` is then left with a single pending `ChannelMonitorUpdate` which is not the latest one. If we were to restart at this point, the latest durable copy of the `ChannelMonitor` would be the one without any updates, while the `ChannelManager` would have a pending `ChannelMonitorUpdate` for the first update only, not the second. The user would then have to handle the replayed first `ChannelMonitorUpdate`, find the second `ChannelMonitorUpdate` on disk, and somehow know to replay that one as well.

Further, we currently have a bug in handling this scenario: we complete all pending post-update actions when the second `ChannelMonitorUpdate` gets persisted synchronously, even though the first `ChannelMonitorUpdate` is still pending.

While we could rather trivially fix these issues, addressing the larger API question above is difficult, and as we don't anticipate this use-case being important, we simply disable it here. Note that we continue to support it internally, as some 39 tests rely on it. We do, however, remove test_sync_async_persist_doesnt_hang, which was added specifically to test this use-case, which we no longer support.

Issue highlighted by (changes to the) chanmon_consistency fuzz target (in the next commit).
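To make the interleaving concrete, here is a standalone sketch of the state the `ChannelManager` ends up tracking (plain Rust, not LDK API; the update IDs and the `Persistence` enum are illustrative):

	#[derive(Clone, Copy, Debug, PartialEq)]
	enum Persistence { Async, Sync }

	fn main() {
		// The monitor on disk is at update 0; two further updates are generated.
		// Update 1 is persisted asynchronously, update 2 synchronously.
		let updates = [(1u64, Persistence::Async), (2u64, Persistence::Sync)];

		// The manager only tracks updates whose persistence has not yet been
		// reported complete.
		let pending: Vec<u64> = updates
			.iter()
			.filter(|(_, style)| *style == Persistence::Async)
			.map(|(id, _)| *id)
			.collect();

		// Update 2 completed synchronously, so only update 1 remains pending --
		// even though update 2 was built on top of update 1. After a restart,
		// the latest durable monitor is still at update 0, and replaying the
		// pending update 1 alone cannot reconstruct the latest state.
		assert_eq!(pending, vec![1]);
	}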
1 parent 83e9e80 commit 0760f99

File tree

4 files changed (+76, -121 lines)

fuzz/src/chanmon_consistency.rs (44 additions, 34 deletions)
@@ -82,6 +82,8 @@ use bitcoin::secp256k1::{self, Message, PublicKey, Scalar, Secp256k1, SecretKey}
 
 use lightning::io::Cursor;
 use lightning::util::dyn_signer::DynSigner;
+
+use std::cell::RefCell;
 use std::cmp::{self, Ordering};
 use std::mem;
 use std::sync::atomic;
@@ -674,6 +676,9 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
 		}};
 	}
 
+	let default_mon_style = RefCell::new(ChannelMonitorUpdateStatus::Completed);
+	let mon_style = [default_mon_style.clone(), default_mon_style.clone(), default_mon_style];
+
 	macro_rules! reload_node {
 		($ser: expr, $node_id: expr, $old_monitors: expr, $keys_manager: expr, $fee_estimator: expr) => {{
 			let keys_manager = Arc::clone(&$keys_manager);
@@ -746,6 +751,7 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
 					Ok(ChannelMonitorUpdateStatus::Completed)
 				);
 			}
+			*chain_monitor.persister.update_ret.lock().unwrap() = *mon_style[$node_id].borrow();
 			res
 		}};
 	}
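The `RefCell` per node lets the chosen persistence style survive a `reload_node!`: the fresh persister re-applies the last recorded status instead of silently reverting to the default. A minimal sketch of that pattern (illustrative names, outside the fuzzer):

	use std::cell::RefCell;

	#[derive(Clone, Copy, Debug, PartialEq)]
	enum Status { Completed, InProgress }

	fn main() {
		// One cell per node, outliving any particular node instance.
		let style = RefCell::new(Status::Completed);

		// The fuzzer flips the style while the node runs...
		*style.borrow_mut() = Status::InProgress;

		// ...and on reload the new persister re-applies the recorded style,
		// so a reload never switches persistence modes mid-run.
		let restored = *style.borrow();
		assert_eq!(restored, Status::InProgress);
	}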
@@ -1393,28 +1399,22 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
 			// bit-twiddling mutations to have similar effects. This is probably overkill, but no
 			// harm in doing so.
 			0x00 => {
-				*monitor_a.persister.update_ret.lock().unwrap() =
-					ChannelMonitorUpdateStatus::InProgress
+				*mon_style[0].borrow_mut() = ChannelMonitorUpdateStatus::InProgress;
 			},
 			0x01 => {
-				*monitor_b.persister.update_ret.lock().unwrap() =
-					ChannelMonitorUpdateStatus::InProgress
+				*mon_style[1].borrow_mut() = ChannelMonitorUpdateStatus::InProgress;
 			},
 			0x02 => {
-				*monitor_c.persister.update_ret.lock().unwrap() =
-					ChannelMonitorUpdateStatus::InProgress
+				*mon_style[2].borrow_mut() = ChannelMonitorUpdateStatus::InProgress;
 			},
 			0x04 => {
-				*monitor_a.persister.update_ret.lock().unwrap() =
-					ChannelMonitorUpdateStatus::Completed
+				*mon_style[0].borrow_mut() = ChannelMonitorUpdateStatus::Completed;
 			},
 			0x05 => {
-				*monitor_b.persister.update_ret.lock().unwrap() =
-					ChannelMonitorUpdateStatus::Completed
+				*mon_style[1].borrow_mut() = ChannelMonitorUpdateStatus::Completed;
 			},
 			0x06 => {
-				*monitor_c.persister.update_ret.lock().unwrap() =
-					ChannelMonitorUpdateStatus::Completed
+				*mon_style[2].borrow_mut() = ChannelMonitorUpdateStatus::Completed;
 			},
 
 			0x08 => complete_all_monitor_updates(&monitor_a, &chan_1_id),
@@ -1722,21 +1722,8 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
 			0xff => {
 				// Test that no channel is in a stuck state where neither party can send funds even
 				// after we resolve all pending events.
-				// First make sure there are no pending monitor updates and further update
-				// operations complete.
-				*monitor_a.persister.update_ret.lock().unwrap() =
-					ChannelMonitorUpdateStatus::Completed;
-				*monitor_b.persister.update_ret.lock().unwrap() =
-					ChannelMonitorUpdateStatus::Completed;
-				*monitor_c.persister.update_ret.lock().unwrap() =
-					ChannelMonitorUpdateStatus::Completed;
-
-				complete_all_monitor_updates(&monitor_a, &chan_1_id);
-				complete_all_monitor_updates(&monitor_b, &chan_1_id);
-				complete_all_monitor_updates(&monitor_b, &chan_2_id);
-				complete_all_monitor_updates(&monitor_c, &chan_2_id);
-
-				// Next, make sure peers are all connected to each other
+
+				// First, make sure peers are all connected to each other
 				if chan_a_disconnected {
 					let init_1 = Init {
 						features: nodes[1].init_features(),
@@ -1769,42 +1756,65 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
 				}
 
 				macro_rules! process_all_events {
-					() => {
+					() => { {
+						let mut last_pass_no_updates = false;
 						for i in 0..std::usize::MAX {
 							if i == 100 {
 								panic!("It may take may iterations to settle the state, but it should not take forever");
 							}
+							// Next, make sure no monitor updates are pending
+							complete_all_monitor_updates(&monitor_a, &chan_1_id);
+							complete_all_monitor_updates(&monitor_b, &chan_1_id);
+							complete_all_monitor_updates(&monitor_b, &chan_2_id);
+							complete_all_monitor_updates(&monitor_c, &chan_2_id);
 							// Then, make sure any current forwards make their way to their destination
 							if process_msg_events!(0, false, ProcessMessages::AllMessages) {
+								last_pass_no_updates = false;
 								continue;
 							}
 							if process_msg_events!(1, false, ProcessMessages::AllMessages) {
+								last_pass_no_updates = false;
 								continue;
 							}
 							if process_msg_events!(2, false, ProcessMessages::AllMessages) {
+								last_pass_no_updates = false;
 								continue;
 							}
 							// ...making sure any pending PendingHTLCsForwardable events are handled and
 							// payments claimed.
 							if process_events!(0, false) {
+								last_pass_no_updates = false;
 								continue;
 							}
 							if process_events!(1, false) {
+								last_pass_no_updates = false;
 								continue;
 							}
 							if process_events!(2, false) {
+								last_pass_no_updates = false;
 								continue;
 							}
-							break;
+							if last_pass_no_updates {
+								// In some cases, we may generate a message to send in
+								// `process_msg_events`, but block sending until
+								// `complete_all_monitor_updates` gets called on the next
+								// iteration.
+								//
+								// Thus, we only exit if we manage two iterations with no messages
+								// or events to process.
+								break;
+							}
+							last_pass_no_updates = true;
 						}
-					};
+					} };
 				}
 
-				// At this point, we may be pending quiescence, so we'll process all messages to
-				// ensure we can complete its handshake. We'll then exit quiescence and process all
-				// messages again, to resolve any pending HTLCs (only irrevocably committed ones)
-				// before attempting to send more payments.
+				// We may be pending quiescence, so first process all messages to ensure we can
+				// complete the quiescence handshake.
 				process_all_events!();
+
+				// Then exit quiescence and process all messages again, to resolve any pending
+				// HTLCs (only irrevocably committed ones) before attempting to send more payments.
 				nodes[0].exit_quiescence(&nodes[1].get_our_node_id(), &chan_a_id).unwrap();
 				nodes[1].exit_quiescence(&nodes[0].get_our_node_id(), &chan_a_id).unwrap();
 				nodes[1].exit_quiescence(&nodes[2].get_our_node_id(), &chan_b_id).unwrap();
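The reworked `process_all_events!` loop only terminates after two consecutive passes with nothing to do, because completing monitor updates at the top of a pass can unblock messages generated in the previous one. A standalone sketch of that termination pattern (generic Rust; `run_one_pass` stands in for the fuzzer's `complete_all_monitor_updates`/`process_msg_events`/`process_events` calls):

	// Returns true if this pass did any work.
	fn run_one_pass(work_left: &mut u32) -> bool {
		if *work_left > 0 {
			*work_left -= 1;
			true
		} else {
			false
		}
	}

	fn main() {
		let mut work_left = 3u32;
		let mut last_pass_no_updates = false;
		for i in 0..usize::MAX {
			assert!(i < 100, "the state should settle in a bounded number of passes");
			if run_one_pass(&mut work_left) {
				last_pass_no_updates = false;
				continue;
			}
			// A message generated in one pass may be held until pending work
			// completes at the top of the next pass, so only exit after two
			// consecutive quiet passes.
			if last_pass_no_updates {
				break;
			}
			last_pass_no_updates = true;
		}
		assert_eq!(work_left, 0);
	}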

lightning/src/chain/mod.rs (11 additions, 0 deletions)
@@ -209,6 +209,12 @@ pub enum ChannelMonitorUpdateStatus {
 	///
 	/// This includes performing any `fsync()` calls required to ensure the update is guaranteed to
 	/// be available on restart even if the application crashes.
+	///
+	/// If you return this variant, you cannot later return [`InProgress`] from the same instance of
+	/// [`Persist`]/[`Watch`] without first restarting.
+	///
+	/// [`InProgress`]: ChannelMonitorUpdateStatus::InProgress
+	/// [`Persist`]: chainmonitor::Persist
 	Completed,
 	/// Indicates that the update will happen asynchronously in the background or that a transient
 	/// failure occurred which is being retried in the background and will eventually complete.
@@ -234,7 +240,12 @@
 	/// reliable, this feature is considered beta, and a handful of edge-cases remain. Until the
 	/// remaining cases are fixed, in rare cases, *using this feature may lead to funds loss*.
 	///
+	/// If you return this variant, you cannot later return [`Completed`] from the same instance of
+	/// [`Persist`]/[`Watch`] without first restarting.
+	///
 	/// [`InProgress`]: ChannelMonitorUpdateStatus::InProgress
+	/// [`Completed`]: ChannelMonitorUpdateStatus::Completed
+	/// [`Persist`]: chainmonitor::Persist
 	InProgress,
 	/// Indicates that an update has failed and will not complete at any point in the future.
 	///
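Under the documented rule, an implementation must commit to one mode per run. A sketch of what an always-asynchronous flow looks like, assuming the `ChainMonitor::channel_monitor_updated(channel_id, update_id)` completion call used by the removed test below (the `AsyncOnlyPersister` type, its queue, and `PersistJob` are hypothetical, not LDK API):

	use lightning::chain::ChannelMonitorUpdateStatus;
	use std::sync::mpsc;

	// Hypothetical job handed to a background persistence worker; it would
	// carry the channel id, update id, and serialized monitor bytes.
	struct PersistJob;

	// Hypothetical always-async persister: it returns InProgress for the life
	// of the process and never mixes in Completed.
	struct AsyncOnlyPersister {
		queue: mpsc::Sender<PersistJob>,
	}

	impl AsyncOnlyPersister {
		fn begin_persist(&self, job: PersistJob) -> ChannelMonitorUpdateStatus {
			self.queue.send(job).expect("persistence worker died");
			// The worker later signals durability via
			// ChainMonitor::channel_monitor_updated(channel_id, update_id).
			// Returning Completed from this same instance would now violate
			// the documented contract.
			ChannelMonitorUpdateStatus::InProgress
		}
	}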

lightning/src/ln/chanmon_update_fail_tests.rs (0 additions, 87 deletions)
@@ -3341,93 +3341,6 @@ fn test_durable_preimages_on_closed_channel() {
 	do_test_durable_preimages_on_closed_channel(false, false, false);
 }
 
-#[test]
-fn test_sync_async_persist_doesnt_hang() {
-	// Previously, we checked if a channel was a candidate for making forward progress based on if
-	// the `MonitorEvent::Completed` matched the channel's latest monitor update id. However, this
-	// could lead to a rare race when `ChannelMonitor`s were being persisted both synchronously and
-	// asynchronously leading to channel hangs.
-	//
-	// To hit this case, we need to generate a `MonitorEvent::Completed` prior to a new channel
-	// update, but which is only processed after the channel update.
-	let chanmon_cfgs = create_chanmon_cfgs(2);
-	let node_cfgs = create_node_cfgs(2, &chanmon_cfgs);
-	let node_chanmgrs = create_node_chanmgrs(2, &node_cfgs, &[None, None]);
-	let nodes = create_network(2, &node_cfgs, &node_chanmgrs);
-
-	let chan_id_ab = create_announced_chan_between_nodes(&nodes, 0, 1).2;
-
-	// Send two payments from A to B, then claim the first, marking the very last
-	// ChannelMonitorUpdate as InProgress...
-	let (payment_preimage_1, payment_hash_1, ..) = route_payment(&nodes[0], &[&nodes[1]], 1_000_000);
-	let (payment_preimage_2, payment_hash_2, ..) = route_payment(&nodes[0], &[&nodes[1]], 1_000_000);
-
-	nodes[1].node.claim_funds(payment_preimage_1);
-	check_added_monitors(&nodes[1], 1);
-	expect_payment_claimed!(nodes[1], payment_hash_1, 1_000_000);
-
-	let bs_updates = get_htlc_update_msgs(&nodes[1], &nodes[0].node.get_our_node_id());
-	nodes[0].node.handle_update_fulfill_htlc(nodes[1].node.get_our_node_id(), &bs_updates.update_fulfill_htlcs[0]);
-	expect_payment_sent(&nodes[0], payment_preimage_1, None, false, false);
-	nodes[0].node.handle_commitment_signed_batch_test(nodes[1].node.get_our_node_id(), &bs_updates.commitment_signed);
-	check_added_monitors(&nodes[0], 1);
-	let (as_raa, as_cs) = get_revoke_commit_msgs!(nodes[0], nodes[1].node.get_our_node_id());
-
-	nodes[1].node.handle_revoke_and_ack(nodes[0].node.get_our_node_id(), &as_raa);
-	check_added_monitors(&nodes[1], 1);
-	nodes[1].node.handle_commitment_signed_batch_test(nodes[0].node.get_our_node_id(), &as_cs);
-	check_added_monitors(&nodes[1], 1);
-
-	let bs_final_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());
-	chanmon_cfgs[0].persister.set_update_ret(ChannelMonitorUpdateStatus::InProgress);
-	nodes[0].node.handle_revoke_and_ack(nodes[1].node.get_our_node_id(), &bs_final_raa);
-	check_added_monitors(&nodes[0], 1);
-
-	// Immediately complete the monitor update, but before the ChannelManager has a chance to see
-	// the MonitorEvent::Completed, create a channel update by receiving a claim on the second
-	// payment.
-	let (_, ab_update_id) = nodes[0].chain_monitor.latest_monitor_update_id.lock().unwrap().get(&chan_id_ab).unwrap().clone();
-	nodes[0].chain_monitor.chain_monitor.channel_monitor_updated(chan_id_ab, ab_update_id).unwrap();
-
-	nodes[1].node.claim_funds(payment_preimage_2);
-	check_added_monitors(&nodes[1], 1);
-	expect_payment_claimed!(nodes[1], payment_hash_2, 1_000_000);
-
-	let bs_updates = get_htlc_update_msgs(&nodes[1], &nodes[0].node.get_our_node_id());
-	nodes[0].node.handle_update_fulfill_htlc(nodes[1].node.get_our_node_id(), &bs_updates.update_fulfill_htlcs[0]);
-	nodes[0].node.handle_commitment_signed_batch_test(nodes[1].node.get_our_node_id(), &bs_updates.commitment_signed);
-	check_added_monitors(&nodes[0], 1);
-
-	// At this point, we have completed an extra `ChannelMonitorUpdate` but the `ChannelManager`
-	// hasn't yet seen our `MonitorEvent::Completed`. When we call
-	// `get_and_clear_pending_msg_events` here, the `ChannelManager` finally sees that event and
-	// should return the channel to normal operation.
-	let (as_raa, as_cs) = get_revoke_commit_msgs!(nodes[0], nodes[1].node.get_our_node_id());
-
-	// Now that we've completed our test, process the events we have queued up (which we were not
-	// able to check until now as they would have caused the `ChannelManager` to look at the
-	// pending `MonitorEvent`s).
-	let pending_events = nodes[0].node.get_and_clear_pending_events();
-	assert_eq!(pending_events.len(), 2);
-	if let Event::PaymentPathSuccessful { ref payment_hash, ..} = pending_events[1] {
-		assert_eq!(payment_hash.unwrap(), payment_hash_1);
-	} else { panic!(); }
-	if let Event::PaymentSent { ref payment_hash, ..} = pending_events[0] {
-		assert_eq!(*payment_hash, payment_hash_2);
-	} else { panic!(); }
-
-	// Finally, complete the claiming of the second payment
-	nodes[1].node.handle_revoke_and_ack(nodes[0].node.get_our_node_id(), &as_raa);
-	check_added_monitors(&nodes[1], 1);
-	nodes[1].node.handle_commitment_signed_batch_test(nodes[0].node.get_our_node_id(), &as_cs);
-	check_added_monitors(&nodes[1], 1);
-
-	let bs_final_raa = get_event_msg!(nodes[1], MessageSendEvent::SendRevokeAndACK, nodes[0].node.get_our_node_id());
-	nodes[0].node.handle_revoke_and_ack(nodes[1].node.get_our_node_id(), &bs_final_raa);
-	check_added_monitors(&nodes[0], 1);
-	expect_payment_path_successful!(nodes[0]);
-}
-
 fn do_test_reload_mon_update_completion_actions(close_during_reload: bool) {
 	// Test that if a `ChannelMonitorUpdate` completes but a `ChannelManager` isn't serialized
 	// before restart we run the monitor update completion action on startup.

lightning/src/ln/channelmanager.rs (21 additions, 0 deletions)
@@ -2569,6 +2569,13 @@
 	#[cfg(any(test, feature = "_test_utils"))]
 	pub(super) per_peer_state: FairRwLock<HashMap<PublicKey, Mutex<PeerState<SP>>>>,
 
+	/// We only support using one of [`ChannelMonitorUpdateStatus::InProgress`] and
+	/// [`ChannelMonitorUpdateStatus::Completed`] without restarting. Because the API does not
+	/// otherwise directly enforce this, we enforce it in non-test builds here by storing which one
+	/// is in use.
+	#[cfg(not(any(test, feature = "_externalize_tests")))]
+	monitor_update_type: AtomicUsize,
+
 	/// The set of events which we need to give to the user to handle. In some cases an event may
 	/// require some further action after the user handles it (currently only blocking a monitor
 	/// update from being handed to the user to ensure the included changes to the channel state
@@ -3312,11 +3319,19 @@ macro_rules! handle_new_monitor_update {
 			panic!("{}", err_str);
 		},
 		ChannelMonitorUpdateStatus::InProgress => {
+			#[cfg(not(any(test, feature = "_externalize_tests")))]
+			if $self.monitor_update_type.swap(1, Ordering::Relaxed) == 2 {
+				panic!("Cannot use both ChannelMonitorUpdateStatus modes InProgress and Completed without restart");
+			}
 			log_debug!($logger, "ChannelMonitor update for {} in flight, holding messages until the update completes.",
 				$channel_id);
 			false
 		},
 		ChannelMonitorUpdateStatus::Completed => {
+			#[cfg(not(any(test, feature = "_externalize_tests")))]
+			if $self.monitor_update_type.swap(2, Ordering::Relaxed) == 1 {
+				panic!("Cannot use both ChannelMonitorUpdateStatus modes InProgress and Completed without restart");
+			}
 			$completed;
 			true
 		},
@@ -3577,6 +3592,9 @@
 
 			per_peer_state: FairRwLock::new(new_hash_map()),
 
+			#[cfg(not(any(test, feature = "_externalize_tests")))]
+			monitor_update_type: AtomicUsize::new(0),
+
 			pending_events: Mutex::new(VecDeque::new()),
 			pending_events_processor: AtomicBool::new(false),
 			pending_background_events: Mutex::new(Vec::new()),
@@ -14747,6 +14765,9 @@
 
 			per_peer_state: FairRwLock::new(per_peer_state),
 
+			#[cfg(not(any(test, feature = "_externalize_tests")))]
+			monitor_update_type: AtomicUsize::new(0),
+
 			pending_events: Mutex::new(pending_events_read),
 			pending_events_processor: AtomicBool::new(false),
 			pending_background_events: Mutex::new(pending_background_events),
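The enforcement itself is a three-state atomic latch: 0 means no update handled yet, 1 means InProgress was seen, 2 means Completed was seen. `swap` stores the mode just observed and returns the previously stored one, so each path can detect that the other mode was used earlier. A standalone sketch of the same pattern (illustrative, outside of LDK):

	use std::sync::atomic::{AtomicUsize, Ordering};

	const UNSET: usize = 0;
	const IN_PROGRESS: usize = 1;
	const COMPLETED: usize = 2;

	struct ModeLatch(AtomicUsize);

	impl ModeLatch {
		// swap() stores the mode we just saw and returns the previously
		// recorded one, so mixing the two modes is detected on either path.
		fn record(&self, seen: usize, forbidden_prior: usize) {
			if self.0.swap(seen, Ordering::Relaxed) == forbidden_prior {
				panic!("Cannot use both ChannelMonitorUpdateStatus modes InProgress and Completed without restart");
			}
		}
	}

	fn main() {
		let latch = ModeLatch(AtomicUsize::new(UNSET));
		latch.record(IN_PROGRESS, COMPLETED); // fine: prior state was UNSET
		latch.record(IN_PROGRESS, COMPLETED); // fine: same mode repeated
		latch.record(COMPLETED, IN_PROGRESS); // panics: the two modes were mixed
	}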
