-
Notifications
You must be signed in to change notification settings - Fork 412
Onion error handling #146 #243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Onion error handling #146 #243
Conversation
3212322
to
593804e
Compare
@@ -1445,6 +1445,12 @@ impl ChannelManager { | |||
if let Some(mut sources) = removed_source { | |||
for htlc_with_hash in sources.drain(..) { | |||
if channel_state.is_none() { channel_state = Some(self.channel_state.lock().unwrap()); } | |||
log_info!(self, "User rejected the incoming HTLC: {}", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe add more infos on the events sequence, "Got PaymentReceived event, ..." ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PaymentRecieved event will be logged when it is received. So it will be duplicative here when the user calls this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the wording is a little bit weird, though - the log is for the "user"'s benefit (ie client of the library) and you're referring to them as "user". Maybe just say "Incoming HTLC rejected" since thats not ambiguous.
if let Some(update) = channel_update { | ||
match &onion_error { | ||
&HTLCFailReason::ErrorPacket { ref err } => { | ||
#[cfg(test)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm does using conditional compilation speed-up so much the tests ? Really personal, but don't find that it eases code readability..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needed not to generate additional compile warning message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe #[allow(dead_code)] ? but okay
#[cfg(not(test))] | ||
failure_code:_, | ||
data:_ } => { | ||
// we get a fail_malformed_htlc from the first hop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"first hop" can we be sure it's the first one and not a previous hop there ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if my understanding is correct, fail_malformed_htlc is converted to fail_htlc by intermediate nodes. so if the origin node received fail_malformed it must be from the first hop in the route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh right "here's a special malformed failure variant for the case where the peer couldn't parse it: in this case the current node instead takes action, encrypting it into a update_fail_htlc for relaying."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also close the channel here, no? If our next-hop can't parse what we're sending it we almost certainly don't have a use for the channel anymore.
src/ln/channelmanager.rs
Outdated
} | ||
|
||
const BADONION: u16 = 0x8000; | ||
const PERM: u16 = 0x4000; | ||
const NODE: u16 = 0x2000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can constify all error codes, already quoted in #49 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah.. if we introduce constant error tables, it may as well to flatten out process_onion's error_code branching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool stuff, protocol requirements seem good, some nits.
Don't review tests (yet)
src/ln/channelmanager.rs
Outdated
// required_node_feature_mssing(PERM|NODE|3) | ||
error_code_ret = Some(error_code); | ||
|
||
// eiter from intermediate or final node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: *either
src/ln/channelmanager.rs
Outdated
is_permanent, | ||
}), !(error_code & PERM == PERM && is_from_final_node))); | ||
} | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why assign a HTLCFailChannelUpdate to res if you return it nowhere ? Shouldn't be return res there ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'return' returns from the callback and res is set to what should be returned from the outer function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry effectively yes, forgot to mental compile with callback!
src/ln/channelmanager.rs
Outdated
res = Some((Some(msgs::HTLCFailChannelUpdate::ChannelClosed { | ||
short_channel_id: route_hop.short_channel_id, | ||
is_permanent: true, | ||
}), false)); | ||
}), true)); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, return_node_fail_update ?
src/ln/channelmanager.rs
Outdated
short_channel_id: route.hops[next_route_hop_ix].short_channel_id, | ||
is_permanent: true, | ||
}), true)); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, return_node_fail_update ?
src/ln/channelmanager.rs
Outdated
incoming_htlc_msat > chan_update.contents.htlc_minimum_msat | ||
11 => { | ||
onion_failure_log!("amount_below_minimum", error_code, "htlc_msat", byte_utils::slice_to_be64(&err_packet.failuremsg[2..2+8])); | ||
amt_to_forward > chan_update.contents.htlc_minimum_msat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switch to amt_to_forwatd because it's error code for the outgoing link so incoming_htlc_msat wasn't accurate right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes.
.. from the processing node.
per spec
src/ln/channelmanager.rs
Outdated
chan_update.contents.flags & 0x01 == 0x01 | ||
14 => { // expiry_too_soon | ||
onion_failure_log!("expiry_too_soon", error_code); | ||
// alwayws valid? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*always
src/ln/channelmanager.rs
Outdated
} | ||
} | ||
} | ||
} | ||
} | ||
}).expect("Route that we sent via spontaneously grew invalid keys in the middle of it?"); | ||
res.unwrap_or((None, true)) | ||
} else { ((None, true)) } | ||
//res.unwrap_or((None, true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could go ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, all nit fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This is really important stuff to finish up. As mentioned in #219 I hadn't previously really reviewed process_onion_failure so I just started by looking through the full function, not really just the changes here. I didn't get to a full detailed review and just looked at that function.
I left some comments in-line on some changes that I think may be helpful, but at a high-level I find the layout here a bit confusing - the spec speaks in terms of actions to take in response to different bits being set in the top byte, whereas this is structured in terms of actions to take in response to individual error types. Worse, spec doesn't seem to indicate we're allowed to use non-top-byte data for anything other than "debugging purposes"
src/ln/channelmanager.rs
Outdated
|
||
// when a node has sent invalid code or malformed failure msg | ||
macro_rules! return_node_fail_update { | ||
($res:ident, $node_id:expr, $retry:expr) => {{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if you define this macro after the res var is let you can avoid passing in the $res argument.
src/ln/channelmanager.rs
Outdated
node_id: $node_id, | ||
is_permanent: true, | ||
}), $retry)); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This reads a bit confusingly - you're using this inside the callback so it just continues the keys_callback loop, but you define it outside.
src/ln/channelmanager.rs
Outdated
if error_code&0xff == 1 || error_code&0xff == 2 || error_code&0xff == 3 { | ||
match error_code { | ||
_c if _c == PERM|1 => onion_failure_log!("invalid_realm", error_code), | ||
_c if _c == NODE|2 => onion_failure_log!("temporary_node_failure", error_code), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given these are log_info! level, it may be useful to specify a bit more than the spec-given title - something like "node specified a generic catch-all node-level failure".
src/ln/channelmanager.rs
Outdated
_c if _c == NODE|2 => onion_failure_log!("temporary_node_failure", error_code), | ||
_c if _c == PERM|NODE|2 => onion_failure_log!("permanent_node_failure", error_code), | ||
_c if _c == PERM|NODE|3 => onion_failure_log!("required_node_feature_mssing", error_code), | ||
_ => return_node_fail_update!(res, route_hop.pubkey, !is_from_final_node), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given these events only happen in response to an outgoing payment (and only once per payment), I think we should make sure we have a log entry for every case here. May be worth just merging the return_node_fail_update and onion_failure_log macros.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further, in this particular case, I don't see why we can't continue on to the later processing? Even if we don't understand some future error code, if they set NODE|PERM we should give up routing to them.
src/ln/channelmanager.rs
Outdated
log_trace!(self, "{}({})", $error_code_textual, $error_code); | ||
}; | ||
( $error_code_textual: expr, $error_code: expr, $reported_name: expr, $reported_value: expr ) => {{ | ||
log_info!(self, "{}({:#x}) {}({})", $error_code_textual, $error_code, $reported_name, $reported_value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should bump these from info to warn or so. They indicate a user-generated send_payment failed to complete and will only happen in response to such an event and only once per event, so they clearly meet the DoS criteria.
src/ln/channelmanager.rs
Outdated
// failing incoming channel to the erring node | ||
short_channel_id: route_hop.short_channel_id, | ||
is_permanent, | ||
}), !(error_code & PERM == PERM && is_from_final_node))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You can use is_permanent in place of the explicit "error_code & PERM == PERM" here and below, no?
src/ln/channelmanager.rs
Outdated
_c if _c == PERM|15 => { onion_failure_log!("unknwon_payment_hash", error_code); false }, | ||
_c if _c == PERM|16 => { onion_failure_log!("incorrect_payment_amount", error_code); false }, | ||
17 => { onion_failure_log!("final_expiry_too_soon", error_code); true }, | ||
18 if err_packet.failuremsg.len() == 6 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't think we should be considering a type incorrect if they send us more bytes than we wanted - updates to the spec are expected to just append bytes where new data needs to be provided.
src/ln/channelmanager.rs
Outdated
is_permanent, | ||
}), !(error_code & PERM == PERM && is_from_final_node))); | ||
} | ||
return; | ||
} | ||
|
||
if is_from_final_node { | ||
let payment_retryable = match error_code { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spec seems to indicate we should just set payment_retryable based on PERM bit and use the rest just for logging.
src/ln/channelmanager.rs
Outdated
} | ||
} | ||
} | ||
} | ||
} | ||
}).expect("Route that we sent via spontaneously grew invalid keys in the middle of it?"); | ||
res.unwrap_or((None, true)) | ||
} else { ((None, true)) } | ||
let (channel_update, payment_retryable) = res.expect("should have been set!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a node simply garbles the onion data we'll panic at this expect - we have to just move on since in such a case (presumably retrying the payment, since its likely an intermediate node garbled it) as we have no ability to know what happened.
But the spec says, Also, some error codes(unknwon_payment_hash, final_expiry_too_soon, ...) from the final node does not require a route map update, but others do. These can be only distinguished by the lower byte also. Well, so I ended up (almost) error_code based action branches. If the spec update is imminent, I'll wait and see how to clean this mess-up. |
Well specifically I was referring to the "receiving error codes" subsection, which is the thing you're implementing here https://github.com/lightningnetwork/lightning-rfc/blob/master/04-onion-routing.md#receiving-failure-codes. Indeed, if some other node is misbehaving we can ignore their crap, but, generally, we should follow the section we're doing (the spec is written in subsections applying to the nodes which (a) generate a message and (b) receive a message, and technically you're just supposed to implement the receiving part on the receiving end without considering the other part. Now that's also a very antiquated way to consider specs and the whole "be liberal in what you accept, conservative in what you send" thing has been shown to be a generally terrible idea more times than you can count, but in this case trying to parse things based on flags instead of values, where possible, does mean a bit more future compatibility). I pointed to a few specific cases in my comments, above, though, indeed, this doesn't apply universally, as you point out. |
I'm trying to simplify the logic to more closely mirror "receiving error codes" subsection. In the spec, when an error is from the final node, there is no mentioning about updating network map. But, these four, can be from a final node. When PERM bit is set ( all above except temporary_...), I think the final node should be removed from the routing DB as well as failing payment permanently. Shouldn't this be clarified in the spec? Or I am missing something and it's correct that the spec should have no opinion about the above errors from the final node. |
- blockify error log macros - return node failure macros - make error_code branching tighter - fill more error logs - log onion hash with BADONION bit - fix bug reporting wrong short_channel_id for ChannleClosed - handle amount_below_minimum return error code for debugging/testing
fix htlc_fail_async_shutdown test making expect PaymentFailureNetworkUpdate
- user friendly warning message - helper for debug field reported by the erring node
More closely mirroring the spec where the process is based on mainly the top byte of the error code.
989058b
to
07c0345
Compare
Cleaned up error handling more based on the top byte of the error code with commented issues addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't review the tests in a ton of detail, but did test them a little bit and they look awesome, well done there. I find this version way more readable, just a few small easy fixes here and there and I think this is good to go.
|
||
let (description, title) = get_onion_error_description(error_code); | ||
if debug_field_size > 0 && err_packet.failuremsg.len() >= 2 + debug_field_size { | ||
log_warn!(self, "Onion Error[{}({:#x}) {}({})] {}", title, error_code, debug_field, DebugBytes(&err_packet.failuremsg[2..2+debug_field_size]), description); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not just use the macro instead of importing DebugBytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, isnt it failuremsg[4.. instead of 2?
@@ -1445,6 +1445,12 @@ impl ChannelManager { | |||
if let Some(mut sources) = removed_source { | |||
for htlc_with_hash in sources.drain(..) { | |||
if channel_state.is_none() { channel_state = Some(self.channel_state.lock().unwrap()); } | |||
log_info!(self, "User rejected the incoming HTLC: {}", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the wording is a little bit weird, though - the log is for the "user"'s benefit (ie client of the library) and you're referring to them as "user". Maybe just say "Incoming HTLC rejected" since thats not ambiguous.
#[cfg(not(test))] | ||
failure_code:_, | ||
data:_ } => { | ||
// we get a fail_malformed_htlc from the first hop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also close the channel here, no? If our next-hop can't parse what we're sending it we almost certainly don't have a use for the channel anymore.
failure_code:_, | ||
data:_ } => { | ||
// we get a fail_malformed_htlc from the first hop | ||
// TODO: not bother trying parsing onion for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this TODO could be clarified more - something like "we should check the failure_code and check that it isnt something that indicates temporary payment failure before we give up on the channel entirely".
8|9|10 => route.hops[next_route_hop_ix].short_channel_id, // outgoing (permanent_channel_failure/required_channel_feature_missing/unknown_next_peer) | ||
_ => route_hop.short_channel_id, // incoming channel (invalid_realm & BADONION) | ||
}, | ||
is_permanent: error_code & PERM == PERM, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is always true.
})}; | ||
} | ||
else if error_code & UPDATE == UPDATE { | ||
if err_packet.failuremsg.len() >= debug_field_size + 2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldnt this be >= +4?
} | ||
} | ||
|
||
res = Some((fail_channel_update, !(error_code & PERM == PERM && is_from_final_node))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we get a None on fail_channel_update and !payment_failed (ie the data provided by the err'ing node was unreadable by us), shouldn't we default to failing something? eg if we get an UPDATE flag but couldn't parse the update, if is_chan_update_invalid, etc we should probably fail the channel, or if we get no readable flags we can fail the node wholesale.
11 => amt_to_forward > chan_update.contents.htlc_minimum_msat, | ||
12 => { | ||
let new_fee = amt_to_forward.checked_mul(chan_update.contents.fee_proportional_millionths as u64).and_then(|prop_fee| { (prop_fee / 1000000).checked_add(chan_update.contents.fee_base_msat as u64) }); | ||
new_fee.is_none() || incoming_htlc_msat >= new_fee.unwrap() && incoming_htlc_msat >= amt_to_forward + new_fee.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think new_fee.is_none() is a valid update-causes failure, no? It indicates the node is setting their feerate to some mega-high value causing an overflow, which probably means they're temporarily disabling the channel by setting a high fee (or at least disabling high-value forwards).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why the second condition (incoming_htlc_msat >= new_fee.unwrap()), why not just amt_to_forward.checked_add(new_fee.unwrap()) and compare?
run_onion_failure_test_with_fail_intercept(_name, test_case, nodes, route, payment_hash, callback_msg, |_|{}, callback_node, expected_retryable, expected_error_code, expected_channel_update); | ||
} | ||
|
||
// test_case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to get a bit more of a description of exactly what this does and what the callbacks mean as its not super trivial to read.
Will take as #274. |
Add test for onion error handling and some fixes