1507 followups #1966
Conversation
lightning/src/ln/channelmanager.rs
Outdated
/// limit the negative effects on parallelism as much as possible.
///
/// See `ChannelManager` struct-level documentation for lock order requirements.
pending_peers_awaiting_removal: Mutex<HashSet<PublicKey>>,
Let me know if we're better off just taking the write lock of the per_peer_state after force closing a channel when we're already disconnected with no channels left :).
I don't think we need to do either - you call self.remove_peers_awaiting_removal in timer_tick_occurred immediately after we iterate over the per_peer_state set - instead of storing the set of nodes to remove explicitly in the ChannelManager, we can just calculate it on the fly in that same iteration. Then we can do the removal with a calculated set.
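A minimal sketch of that shape, with simplified stand-ins for the real ChannelManager types (the NodeId alias, the PeerState fields, and the surrounding timer logic are all assumptions here, not the actual rust-lightning code):

use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

// Simplified stand-ins for the real types (hypothetical).
type NodeId = [u8; 33];
struct PeerState { is_connected: bool, channel_count: usize }

// Compute the removal set during the same pass that does the per-peer timer
// work, then take the write lock once, only if something needs removing.
fn timer_tick_occurred(per_peer_state: &RwLock<HashMap<NodeId, Mutex<PeerState>>>) {
    let mut peers_to_remove: Vec<NodeId> = Vec::new();
    {
        let peers = per_peer_state.read().unwrap();
        for (node_id, peer_mutex) in peers.iter() {
            let peer = peer_mutex.lock().unwrap();
            // ... existing per-peer timer work would happen here ...
            if !peer.is_connected && peer.channel_count == 0 {
                peers_to_remove.push(*node_id);
            }
        }
    } // read lock dropped before taking the write lock
    if !peers_to_remove.is_empty() {
        let mut peers = per_peer_state.write().unwrap();
        for node_id in peers_to_remove {
            // NB: real code would need to re-check the condition here, since
            // a peer may have reconnected between dropping the read lock and
            // taking the write lock.
            peers.remove(&node_id);
        }
    }
}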
nodes[0].node.handle_accept_channel(&unkown_public_key, &accept_channel_msg);
Ideally I'd like to keep these message handling function calls with an unknown peer. Looking into whether there's a good way of expecting a panic with a given message here, without having to do a separate test for every message handling function call.
Oh, hmm, actually yeah, those panics are going to be annoying. Let's wrap them in #[cfg(all(not(test), not(fuzzing)))], as the fuzzer will also get mad about it.
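A self-contained sketch of the gating pattern being suggested; later snippets in this thread use debug_assert!(false) inside the cfg block, so that shape is shown here (the function, error type, and message are hypothetical):

use std::collections::HashMap;

type PublicKey = [u8; 33];
struct PeerState;
#[derive(Debug)]
struct NoSuchPeerError;

fn peer_state_or_err<'a>(
    per_peer_state: &'a HashMap<PublicKey, PeerState>, id: &PublicKey,
) -> Result<&'a PeerState, NoSuchPeerError> {
    per_peer_state.get(id).ok_or_else(|| {
        // Hard failure only outside test/fuzz builds, so the test suite and
        // fuzzer can exercise the unknown-peer path without aborting.
        #[cfg(all(not(test), not(fuzzing)))]
        {
            debug_assert!(false, "messages from unknown peers should have been rejected earlier");
        }
        NoSuchPeerError
    })
}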
Ah snap, will fix the CI check failures.
Force-pushed from fa84f75 to 834ca24.
Rebased on main to fix merge conflict + CI error fix.
Basically LGTM, one real comment but feel free to squash.
lightning/src/ln/channelmanager.rs
Outdated
@@ -2233,7 +2228,8 @@ where
 /// public, and thus should be called whenever the result is going to be passed out in a
 /// [`MessageSendEvent::BroadcastChannelUpdate`] event.
 ///
-/// May be called with peer_state already locked!
+/// Note that this function may be called without the `peer_state` corresponding to the passed
I'm not sure how that's possible if we have a reference to the Channel.
In internal_closing_signed this is called after the channel has been removed from the peer_state and the lock has been dropped, i.e. the call site has full ownership of the channel. Though I agree that this comment is probably just confusing and not very helpful, so I'll just drop it altogether 😁!
Ah, okay, I'd be happy to see the comment remain, too, just with an explanation of what it means :)
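A self-contained sketch of the call shape being described (the types and names here are stand-ins, not the real rust-lightning API): the channel is taken out of the map under the lock, the lock is released, and the helper then operates on the owned channel with no peer_state lock held.

use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical simplified Channel / PeerState.
struct Channel { channel_id: [u8; 32] }
struct PeerState { channel_by_id: HashMap<[u8; 32], Channel> }

fn get_channel_update_for_broadcast(chan: &Channel) -> [u8; 32] { chan.channel_id }

fn closing_signed_shape(peer_state_mutex: &Mutex<PeerState>, channel_id: [u8; 32]) {
    // Remove the channel while holding the lock...
    let chan = {
        let mut peer_state = peer_state_mutex.lock().unwrap();
        peer_state.channel_by_id.remove(&channel_id)
    }; // ...and the lock is dropped here.
    if let Some(chan) = chan {
        // Full ownership of `chan`; no peer_state lock is held at this point.
        let _update = get_channel_update_for_broadcast(&chan);
    }
}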
Looks good after Matt's feedback!
Any update here @ViktorTigerstrom?
Thanks for the reviews @TheBlueMatt & @valentinewallace! I've had a really hectic week, but will address the review as soon as I can later tonight or tomorrow. Sorry for the delay.
No huge rush, just making sure it didn't slip off your radar :)
Force-pushed from 834ca24 to 07078a5.
Codecov Report: Base 90.89% // Head 87.24% // Decreases project coverage by -3.65%.

@@            Coverage Diff             @@
##             main    #1966      +/-   ##
==========================================
- Coverage   90.89%   87.24%    -3.65%
==========================================
  Files          99      100       +1
  Lines       53026    44110    -8916
  Branches    53026    44110    -8916
==========================================
- Hits        48197    38484    -9713
- Misses       4829     5626     +797

View full report at Codecov.
Updated the PR with the latest feedback and squashed the previous fixups. A general question about removing the disconnected peers with no channels through timer_tick_occurred though: if we have a connected peer with no channels, and we shut down the node and then restart it, and timer_tick_occurred is called on startup before we connect to the node again, timer_tick_occurred will drop the node, as we initialize the peer_state as disconnected in ChannelManager::read. Is that an issue?
A general question about removing the disconnected peers with no channels through timer_tick_occurred though.
If we have a connected peer with no channels, and we shut down the node and then restart it:
If timer_tick_occurred is called on startup before we connect to the node again, timer_tick_occurred will drop the node as we initialize the peer_state as disconnected in ChannelManager::read. Is that an issue?
No, there's no reason to keep those peers. In fact, there's no reason to write those peers out to the manager to begin with - the manager should be written as if all peers had disconnected immediately prior to the write. That doesn't have to happen now, but we should consider doing it. If you do do it here, let's go ahead and make the "does this peer merit disconnect" criteria a utility function, as it will get further changes in #1897.
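A sketch of the suggested utility function and the write-side behavior (the names and fields are hypothetical, and the real criteria may grow with #1897):

// Hypothetical simplified PeerState.
struct PeerState { is_connected: bool, channel_count: usize }

// Single place encoding "does this peer merit removal": disconnected and no
// channels left. Reusable by timer_tick_occurred and the serialization logic.
fn peer_should_be_removed(peer: &PeerState) -> bool {
    !peer.is_connected && peer.channel_count == 0
}

// On serialization, pretend every peer just disconnected: any peer that would
// then merit removal is simply not written out.
fn peers_worth_writing<'a>(
    peers: impl Iterator<Item = (&'a [u8; 33], &'a PeerState)>,
) -> Vec<&'a [u8; 33]> {
    peers
        .filter(|(_, p)| {
            let as_if_disconnected =
                PeerState { is_connected: false, channel_count: p.channel_count };
            !peer_should_be_removed(&as_if_disconnected)
        })
        .map(|(id, _)| id)
        .collect()
}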
Needs rebase, sadly.
Force-pushed from 07078a5 to 05eca8a.
Force-pushed from 05eca8a to 7cd5798.
Addressed feedback, rebased on main and cherry-picked df8c1cf.
Feel free to squash as you go, I think.
)
).unwrap_or(None);

if let Some(hash_map::Entry::Occupied(mut chan)) = peer_state_opt.as_mut().map(|peer_state| peer_state.channel_by_id.entry(chan_id))
nit: this doesn't need entry, get should suffice.
I opted not to change this to get and kept it as an entry, as we're calling chan.remove_entry(); when drop is set to true further down below. If it's preferred though, I'll try refactoring this function a bit to use get instead. Let me know if you'd like me to do so :)!
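A toy illustration of the trade-off (using String values rather than real channels): an occupied entry supports both in-place mutation and removal without a second hashmap lookup, which is what get_mut() plus remove() would cost.

use std::collections::HashMap;
use std::collections::hash_map::Entry;

fn close_if_needed(channels: &mut HashMap<u64, String>, chan_id: u64, drop_chan: bool) {
    if let Entry::Occupied(mut chan) = channels.entry(chan_id) {
        chan.get_mut().push_str(" (closing)"); // mutate through the entry
        if drop_chan {
            // With get_mut() this would need a second channels.remove() lookup.
            chan.remove_entry();
        }
    }
}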
lightning/src/ln/channelmanager.rs
Outdated
@@ -5683,7 +5768,8 @@ where
 /// Note that while `MessageSendEvent`s are strictly ordered per-peer, the peer order for
 /// the chunks of `MessageSendEvent`s for different peers is random. I.e. if the array contains
 /// `MessageSendEvent`s for both `node_a` and `node_b`, the `MessageSendEvent`s for `node_a`
-/// will randomly be placed first or last in the returned array.
+/// will randomly be placed first or last in the returned array. Note that order will not be
+/// randomized with the `no-std` feature enabled.
I mean this is... kinda true? We end up using ahash's compile-time randomness, I think, which is still random, just not randomized across calls? I'm not sure we need to bother noting this.
Ok I see your point. Dropped the comment! The reason I included this in the first place though, is that I'm a bit worried that people will write tests that validate MessageSendEvents in specific orders without using the remove_first_msg_event_to_node function, and then be really confused why the tests pass with the no-std feature set, but then randomly fail with the std feature set. Maybe this isn't really the correct place to address that worry though.
Force-pushed from 7cd5798 to 3067120.
Please squash the fixup commit.
lightning/src/ln/channelmanager.rs
Outdated
.ok_or_else(|| {
	#[cfg(all(not(test), not(fuzzing)))]
	{
		debug_assert!(false);
why are we introducing this debug_assert?
It's implicitly testing that the PeerManager correctly serializes connect/message-handle/disconnect events.
lightning/src/ln/channelmanager.rs
Outdated
let mut peer_state_lock = peer_state_mutex_opt.unwrap().lock().unwrap();
let peer_state_mutex = per_peer_state.get(counterparty_node_id)
	.ok_or_else(|| {
		#[cfg(all(not(test), not(fuzzing)))]
Honestly though, let's remove the cfg flag on it and just unconditionally debug_assert!(false). We can just remove the test_api_calls_with_unkown_counterparty_node test.
Removed with the latest push! I kept the API function tests in test_api_calls_with_unkown_counterparty_node, but let me know if you'd prefer me to drop the entire test :)!
@@ -926,7 +926,8 @@ fn do_test_monitor_update_fail_raa(test_ignore_second_cs: bool) {
 	expect_pending_htlcs_forwardable_and_htlc_handling_failed!(nodes[1], vec![HTLCDestination::NextHopChannel { node_id: Some(nodes[2].node.get_our_node_id()), channel_id: chan_2.2 }]);
 	check_added_monitors!(nodes[1], 1);

-	let mut events_3 = nodes[1].node.get_and_clear_pending_msg_events();
+	let mut events_3_vec = nodes[1].node.get_and_clear_pending_msg_events();
+	let events_3 = &mut events_3_vec;
rather than making a mutable reference here and using two variables, could you explicitly insert the &mut prefix in the individual call sites?
Unfortunately our Rust 1.41.1 and 1.45.2 builds consider that as also taking a mutable reference borrow of the node itself and not only of the list returned by get_and_clear_pending_msg_events, and therefore fail the build. I found that super confusing as well 😅. That was the CI failure that I got earlier in this PR. #1966 (comment)
hm, that is quite curious. But if that's how it is, I guess it'll have to be that way. In which case, could you add a comment and use slightly more descriptive variable names? Perhaps the first can be events, and the second one mutable_events_reference or something like that?
I'm confused, the following patch builds just fine for me on 1.41.1:
diff --git a/lightning/src/ln/chanmon_update_fail_tests.rs b/lightning/src/ln/chanmon_update_fail_tests.rs
index 1f7643fd6..e1ea256a6 100644
--- a/lightning/src/ln/chanmon_update_fail_tests.rs
+++ b/lightning/src/ln/chanmon_update_fail_tests.rs
@@ -926,8 +926,7 @@ fn do_test_monitor_update_fail_raa(test_ignore_second_cs: bool) {
expect_pending_htlcs_forwardable_and_htlc_handling_failed!(nodes[1], vec![HTLCDestination::NextHopChannel { node_id: Some(nodes[2].node.get_our_node_id()), channel_id: chan_2.2 }]);
check_added_monitors!(nodes[1], 1);
- let mut events_3_vec = nodes[1].node.get_and_clear_pending_msg_events();
- let events_3 = &mut events_3_vec;
+ let mut events_3 = nodes[1].node.get_and_clear_pending_msg_events();
if test_ignore_second_cs {
assert_eq!(events_3.len(), 3);
} else {
@@ -936,7 +935,7 @@ fn do_test_monitor_update_fail_raa(test_ignore_second_cs: bool) {
// Note that the ordering of the events for different nodes is non-prescriptive, though the
// ordering of the two events that both go to nodes[2] have to stay in the same order.
- let nodes_0_event = remove_first_msg_event_to_node(&nodes[0].node.get_our_node_id(), events_3);
+ let nodes_0_event = remove_first_msg_event_to_node(&nodes[0].node.get_our_node_id(), &mut events_3);
let messages_a = match nodes_0_event {
MessageSendEvent::UpdateHTLCs { node_id, mut updates } => {
assert_eq!(node_id, nodes[0].node.get_our_node_id());
@@ -950,12 +949,12 @@ fn do_test_monitor_update_fail_raa(test_ignore_second_cs: bool) {
_ => panic!("Unexpected event type!"),
};
- let nodes_2_event = remove_first_msg_event_to_node(&nodes[2].node.get_our_node_id(), events_3);
+ let nodes_2_event = remove_first_msg_event_to_node(&nodes[2].node.get_our_node_id(), &mut events_3);
let send_event_b = SendEvent::from_event(nodes_2_event);
assert_eq!(send_event_b.node_id, nodes[2].node.get_our_node_id());
let raa = if test_ignore_second_cs {
- let nodes_2_event = remove_first_msg_event_to_node(&nodes[2].node.get_our_node_id(), events_3);
+ let nodes_2_event = remove_first_msg_event_to_node(&nodes[2].node.get_our_node_id(), &mut events_3);
match nodes_2_event {
MessageSendEvent::SendRevokeAndACK { node_id, msg } => {
assert_eq!(node_id, nodes[2].node.get_our_node_id());
Ah, that should work, thanks! I was doing it the other way around, which caused the issue for 1.41.1 and 1.45.2 but worked for later builds, i.e.:
let events_3 = &mut nodes[1].node.get_and_clear_pending_msg_events();
let nodes_0_event = remove_first_msg_event_to_node(&nodes[0].node.get_our_node_id(), events_3);
My bad, sorry! Updated this with the latest push :)
Force-pushed from 3067120 to e1f1c45.
Force-pushed from e1f1c45 to 2f21d56.
Addressed the feedback about mutable reference borrows :)! #1966 (comment)
Nit: in the commit message, it should be spelled
lgtm, thanks very much for addressing the comments!
Avoid doing the same hashmap lookups twice in a row when it's not needed. Refactor `claim_funds_from_hop` in order to enable this change.
Updates multiple instances of the `ChannelManager` docs related to the previous change that moved the storage of the channels to the `per_peer_state`. This docs update corrects some grammar errors and incorrect information, as well as clarifies documentation that was confusing.
When a peer disconnects but still has channels, the peer's `peer_state` entry in the `per_peer_state` is not removed by the `peer_disconnected` function. If the channels with that peer are later closed while still being disconnected (i.e. force closed), we therefore need to remove the peer from `per_peer_state` separately. To remove the peers separately, we push such peers to a separate HashSet that holds peers awaiting removal, and remove the peers on a timer to limit the negative effects on parallelism as much as possible.
Force-pushed from 2f21d56 to 7f6f90d.
Thanks! Corrected some spelling/grammar mistakes in the commit messages.
Closes #1945.
This PR addresses the remaining feedback raised in #1507 by @TheBlueMatt and @valentinewallace. In general, this PR refactors many aspects of #1507 that could have been done in a more suitable way, as well as fixes a memory leak.
Let me know if there are more aspects of #1507 that need fixing :).