1507 followups #1966

Conversation

ViktorTigerstrom
Contributor

Closes #1945.

This PR addresses the remaining feedback raised in #1507 by @TheBlueMatt and @valentinewallace. In general, it refactors several aspects of #1507 that could have been done in a more suitable way, and also fixes a memory leak.

Let me know if there are more aspects of #1507 that need fixing :).

/// limit the negative effects on parallelism as much as possible.
///
/// See `ChannelManager` struct-level documentation for lock order requirements.
pending_peers_awaiting_removal: Mutex<HashSet<PublicKey>>,
Contributor Author

Let me know if we're better off just taking the write lock of the per_peer_state after force-closing a channel when we're already disconnected with no channels left :).

Collaborator

I don't think we need to do either - you call self.remove_peers_awaiting_removal in timer_tick_occurred immediately after we iterate over the per_peer_state set. Instead of storing the set of nodes to remove explicitly in the ChannelManager, we can just calculate it on the fly in that same iteration and then do the removal with the calculated set.
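
A minimal standalone sketch of the on-the-fly approach suggested above (field names and types are illustrative stand-ins, not ChannelManager's actual ones): removal candidates are collected during the per-peer iteration that timer_tick_occurred already performs, and the write lock is only taken afterwards if anything needs to be dropped.

use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

struct PeerState { channel_count: usize, is_connected: bool }

struct Manager { per_peer_state: RwLock<HashMap<[u8; 33], Mutex<PeerState>>> }

impl Manager {
    fn timer_tick_occurred(&self) {
        // Collect candidates during the existing read-lock iteration instead of
        // tracking them in a separate `pending_peers_awaiting_removal` set.
        let mut peers_to_remove = Vec::new();
        {
            let per_peer_state = self.per_peer_state.read().unwrap();
            for (node_id, peer_mutex) in per_peer_state.iter() {
                let peer = peer_mutex.lock().unwrap();
                // ...existing per-peer timer work would run here...
                if peer.channel_count == 0 && !peer.is_connected {
                    peers_to_remove.push(*node_id);
                }
            }
        }
        if peers_to_remove.is_empty() { return; }
        // Take the write lock once, re-checking each candidate under it.
        let mut per_peer_state = self.per_peer_state.write().unwrap();
        for node_id in peers_to_remove {
            let still_removable = per_peer_state.get(&node_id).map_or(true, |peer_mutex| {
                let peer = peer_mutex.lock().unwrap();
                peer.channel_count == 0 && !peer.is_connected
            });
            if still_removable { per_peer_state.remove(&node_id); }
        }
    }
}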


nodes[0].node.handle_accept_channel(&unkown_public_key, &accept_channel_msg);
Contributor Author

Ideally I'd like to keep these message-handling function calls with an unknown peer. I'm looking into whether there's a good way of expecting a panic with a given message here, without having to write a separate test for every message-handling function call.
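
For reference, one generic way to check that several calls panic with an expected message from within a single test is std::panic::catch_unwind plus downcasting the panic payload. A minimal sketch (the helper is hypothetical and not part of this PR; non-UnwindSafe test state would need the AssertUnwindSafe wrapper shown):

use std::panic::{self, AssertUnwindSafe};

/// Hypothetical helper: run `f`, which is expected to panic, and assert that
/// the panic message contains `expected`, so several panicking calls can be
/// checked from one test instead of one #[should_panic] test per call.
fn assert_panics_with<F: FnOnce()>(f: F, expected: &str) {
    let payload = panic::catch_unwind(AssertUnwindSafe(f))
        .expect_err("closure was expected to panic");
    let msg = payload.downcast_ref::<String>().map(String::as_str)
        .or_else(|| payload.downcast_ref::<&str>().copied())
        .unwrap_or("<non-string panic payload>");
    assert!(msg.contains(expected), "unexpected panic message: {}", msg);
}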

Collaborator

Oh, hmm, actually yeah, those panics are going to be annoying. Let's wrap them in #[cfg(all(not(test), not(fuzzing)))], as the fuzzer will also get mad about it.

@ViktorTigerstrom
Contributor Author

Ah snap, will fix the CI check failures.

@TheBlueMatt TheBlueMatt added this to the 0.0.114 milestone Jan 19, 2023
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-01-store-channels-per-peer-followups branch from fa84f75 to 834ca24 Compare January 20, 2023 00:24
@ViktorTigerstrom
Contributor Author

Rebased on main to fix the merge conflict, plus the CI error fix.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

Basically LGTM, one real comment but feel free to squash.

@@ -2233,7 +2228,8 @@ where
/// public, and thus should be called whenever the result is going to be passed out in a
/// [`MessageSendEvent::BroadcastChannelUpdate`] event.
///
/// May be called with peer_state already locked!
/// Note that this function may be called without the `peer_state` corresponding to the passed
Collaborator

I'm not sure how that's possible if we have a reference to the Channel.

Contributor Author

In internal_closing_signed this is called after the channel has been removed from the peer_state and the lock has been dropped, i.e. the call site has full ownership of the channel.
Though I agree that this comment is probably just confusing and not very helpful, so I'll just drop it altogether 😁!

Collaborator

Ah, okay, I'd be happy to see the comment remain, too, just with an explanation of what it means :)
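
For readers following along, a condensed standalone sketch of the situation being described (types and the broadcast helper are simplified stand-ins for the real ChannelManager code): the channel is removed while the peer's lock is held, both locks are dropped, and only then is the update helper called on the now fully owned channel.

use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

struct Channel;
struct PeerState { channel_by_id: HashMap<[u8; 32], Channel> }

// Stand-in for the helper discussed above (the real one builds a
// MessageSendEvent::BroadcastChannelUpdate).
fn get_channel_update_for_broadcast(_chan: &Channel) {}

fn close_and_broadcast(
    per_peer_state: &RwLock<HashMap<[u8; 33], Mutex<PeerState>>>,
    counterparty: &[u8; 33],
    chan_id: [u8; 32],
) {
    let chan = {
        let peers = per_peer_state.read().unwrap();
        let mut peer_state = peers.get(counterparty).unwrap().lock().unwrap();
        // Take the channel out of the map while the peer's lock is held...
        peer_state.channel_by_id.remove(&chan_id)
    }; // ...both locks are dropped here.
    if let Some(chan) = chan {
        // No `peer_state` lock is held at this point - this is the case the
        // doc comment above was trying (confusingly) to describe.
        get_channel_update_for_broadcast(&chan);
    }
}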

@valentinewallace
Contributor

Looks good after Matt's feedback!

@TheBlueMatt
Collaborator

Any update here, @ViktorTigerstrom?

@ViktorTigerstrom
Contributor Author

Thanks for the reviews @TheBlueMatt & @valentinewallace! I've had a really hectic week, but will address the review as soon as I can later tonight or tomorrow. Sorry for the delay.

@TheBlueMatt
Collaborator

No huge rush, just making sure it didn't slip off your radar :)

@codecov-commenter

codecov-commenter commented Feb 2, 2023

Codecov Report

Base: 90.89% // Head: 87.24% // Decreases project coverage by -3.65% ⚠️

Coverage data is based on head (2f21d56) compared to base (760ab65).
Patch coverage: 92.85% of modified lines in pull request are covered.

❗ Current head 2f21d56 differs from pull request most recent head 7f6f90d. Consider uploading reports for the commit 7f6f90d to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1966      +/-   ##
==========================================
- Coverage   90.89%   87.24%   -3.65%     
==========================================
  Files          99      100       +1     
  Lines       53026    44110    -8916     
  Branches    53026    44110    -8916     
==========================================
- Hits        48197    38484    -9713     
- Misses       4829     5626     +797     
Impacted Files Coverage Δ
lightning/src/ln/channelmanager.rs 85.85% <91.78%> (-1.41%) ⬇️
lightning/src/ln/chanmon_update_fail_tests.rs 97.31% <100.00%> (-0.29%) ⬇️
lightning/src/ln/functional_test_utils.rs 88.53% <100.00%> (-2.95%) ⬇️
lightning/src/ln/functional_tests.rs 97.20% <100.00%> (+0.13%) ⬆️
lightning/src/ln/payment_tests.rs 96.51% <100.00%> (-0.74%) ⬇️
lightning/src/ln/reload_tests.rs 93.35% <100.00%> (-2.22%) ⬇️
lightning/src/ln/reorg_tests.rs 96.35% <100.00%> (-3.65%) ⬇️
lightning/src/chain/chaininterface.rs 41.17% <0.00%> (-54.48%) ⬇️
lightning/src/offers/payer.rs 50.00% <0.00%> (-50.00%) ⬇️
lightning/src/util/errors.rs 32.14% <0.00%> (-49.34%) ⬇️
... and 98 more

☔ View full report at Codecov.

@ViktorTigerstrom
Contributor Author

ViktorTigerstrom commented Feb 2, 2023

Updated the PR with the latest feedback and squashed the previous fixups.

A general question about removing the disconnected peers with no channels through timer_tick_occurred though.
If we have a connected peer with no channels, and we shut down the node and then restart it:
If timer_tick_occurred is called on startup before we connect to the node again, timer_tick_occurred will drop the peer, as we initialize the peer_state as disconnected in ChannelManager::read. Is that an issue?

Collaborator

@TheBlueMatt TheBlueMatt left a comment

A general question about removing the disconnected peers with no channels through timer_tick_occurred though.
If we have a connected peer with no channels, and we shut down the node and then restart it:
If timer_tick_occurred is called on startup before we connect to the node again, timer_tick_occurred will drop the peer, as we initialize the peer_state as disconnected in ChannelManager::read. Is that an issue?

No, there's no reason to keep those peers. In fact, there's no reason to write those peers out to the manager to begin with - the manager should be written as if all peers had disconnected immediately prior to the write. That doesn't have to happen now, but we should consider doing it. If you do do it here, let's go ahead and make the "does this peer merit disconnect" criteria a utility function, as it will get further changes in #1897.
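
A tiny sketch of the suggested utility function (the name and fields are hypothetical; per the note above, the real criteria will get further changes in #1897): centralizing the "no channels and not connected" check so that both the timer-based pruning and, eventually, serialization can share it.

struct PeerState { channel_count: usize, is_connected: bool }

/// Hypothetical helper for the "does this peer merit being kept around" check:
/// a disconnected peer with no channels has no reason to exist. Serialization
/// goes one step further and treats every peer as already disconnected, so it
/// would only keep peers that still have channels.
fn peer_should_be_retained(peer: &PeerState) -> bool {
    peer.channel_count != 0 || peer.is_connected
}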

@TheBlueMatt
Collaborator

Needs rebase, sadly.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-01-store-channels-per-peer-followups branch from 07078a5 to 05eca8a Compare February 4, 2023 18:47
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-01-store-channels-per-peer-followups branch from 05eca8a to 7cd5798 Compare February 4, 2023 18:53
@ViktorTigerstrom
Contributor Author

Addressed the feedback, rebased on main, and cherry-picked df8c1cf.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

Feel free to squash as you go, I think.

)
).unwrap_or(None);

if let Some(hash_map::Entry::Occupied(mut chan)) = peer_state_opt.as_mut().map(|peer_state| peer_state.channel_by_id.entry(chan_id))
Collaborator

nit: this doesn't need entry, get should suffice.

Contributor Author

I opted not to change this to get and kept it as an entry, as we're calling chan.remove_entry(); when drop is set to true further down. If it's preferred, though, I'll try refactoring this function a bit to use get instead. Let me know if you'd like me to do so :)!
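
The trade-off being discussed, in miniature (map and value types simplified): the Entry handle lets the drop path call remove_entry without a second lookup, whereas switching to get would require a separate keyed remove later.

use std::collections::hash_map::{Entry, HashMap};

fn handle(channel_by_id: &mut HashMap<u64, String>, chan_id: u64, drop_chan: bool) {
    // `entry` keeps a handle into the map, so the later removal needs no re-lookup:
    if let Entry::Occupied(mut chan) = channel_by_id.entry(chan_id) {
        chan.get_mut().push_str(" (updated)");
        if drop_chan {
            chan.remove_entry(); // no second hashmap lookup
        }
    }
    // With `get_mut` alone, the drop path would need `channel_by_id.remove(&chan_id)`,
    // i.e. a second lookup - which is why the entry form was kept here.
}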

@@ -5683,7 +5768,8 @@ where
/// Note that that while `MessageSendEvent`s are strictly ordered per-peer, the peer order for
/// the chunks of `MessageSendEvent`s for different peers is random. I.e. if the array contains
/// `MessageSendEvent`s for both `node_a` and `node_b`, the `MessageSendEvent`s for `node_a`
/// will randomly be placed first or last in the returned array.
/// will randomly be placed first or last in the returned array. Note that order will not be
/// randomized with the `no-std` feature enabled.
Collaborator

I mean this is...kinda true? We end up using ahash's compile-time randomness, I think, which is still random, just not randomized across calls? I'm not sure we need to bother noting this.

Contributor Author

@ViktorTigerstrom ViktorTigerstrom Feb 8, 2023

OK, I see your point. Dropped the comment!
The reason I included this in the first place, though, is that I'm a bit worried that people will write tests that validate MessageSendEvents in specific orders without using the remove_first_msg_event_to_node function, and then be really confused about why the tests pass with the no-std feature set but randomly fail with the std feature set. Maybe this isn't really the correct place to address that worry, though.
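
For anyone hitting the ordering concern raised here, a minimal standalone model of the remove_first_msg_event_to_node idea (the event type below is a toy, not LDK's MessageSendEvent): tests should select events by destination node rather than assert on a fixed index, so they keep passing regardless of how the per-peer chunks happen to be ordered.

#[derive(Debug, PartialEq)]
struct MsgEvent { node_id: u8, payload: &'static str }

// Toy version of LDK's test utility: pull out the first event destined for the
// given node, removing it from the vector.
fn remove_first_msg_event_to_node(node_id: u8, events: &mut Vec<MsgEvent>) -> MsgEvent {
    let pos = events.iter().position(|e| e.node_id == node_id)
        .expect("no event for the requested node");
    events.remove(pos)
}

fn main() {
    // The chunk order for different peers may differ between runs (or between
    // std and no-std builds)...
    let mut events = vec![
        MsgEvent { node_id: 2, payload: "commitment_signed" },
        MsgEvent { node_id: 0, payload: "update_add_htlc" },
    ];
    // ...but selecting by destination is order-independent.
    let to_node_0 = remove_first_msg_event_to_node(0, &mut events);
    assert_eq!(to_node_0.payload, "update_add_htlc");
    assert_eq!(events.len(), 1);
}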

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-01-store-channels-per-peer-followups branch from 7cd5798 to 3067120 Compare February 8, 2023 00:52
@TheBlueMatt
Collaborator

Please squash the fixup commit.

.ok_or_else(|| {
#[cfg(all(not(test), not(fuzzing)))]
{
debug_assert!(false);
Contributor

why are we introducing this debug_assert?

Collaborator

It's implicitly testing that the PeerManager correctly serializes connect/message-handle/disconnect events.

let mut peer_state_lock = peer_state_mutex_opt.unwrap().lock().unwrap();
let peer_state_mutex = per_peer_state.get(counterparty_node_id)
.ok_or_else(|| {
#[cfg(all(not(test), not(fuzzing)))]
Collaborator

Honestly, though, let's remove the cfg flag on it and just unconditionally debug_assert!(false). We can just remove the test_api_calls_with_unkown_counterparty_node test.

Contributor Author

Removed with the latest push! I kept the API function tests in test_api_calls_with_unkown_counterparty_node, but let me know if you'd prefer me to drop the entire test :)!
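
A minimal standalone sketch of the unconditional form requested above (types simplified; the real code returns an LDK error type rather than a String): any debug build, tests included, will now catch a message being handled for an unknown peer, while release builds just surface the error.

use std::collections::HashMap;

fn peer_state_for<'a>(
    per_peer_state: &'a HashMap<String, u32>,
    counterparty_node_id: &str,
) -> Result<&'a u32, String> {
    per_peer_state.get(counterparty_node_id).ok_or_else(|| {
        // Unconditional, no cfg gate: reaching this for a message handler is a
        // programming error, so fail loudly in any build with debug assertions.
        debug_assert!(false, "No peer state for {}", counterparty_node_id);
        format!("Can't find a peer matching the passed counterparty node_id {}", counterparty_node_id)
    })
}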

@@ -926,7 +926,8 @@ fn do_test_monitor_update_fail_raa(test_ignore_second_cs: bool) {
expect_pending_htlcs_forwardable_and_htlc_handling_failed!(nodes[1], vec![HTLCDestination::NextHopChannel { node_id: Some(nodes[2].node.get_our_node_id()), channel_id: chan_2.2 }]);
check_added_monitors!(nodes[1], 1);

let mut events_3 = nodes[1].node.get_and_clear_pending_msg_events();
let mut events_3_vec = nodes[1].node.get_and_clear_pending_msg_events();
let events_3 = &mut events_3_vec;
Contributor

rather than making a mutable reference here and using two variables, could you explicitly insert the &mut prefix at the individual call sites?

Contributor Author

Unfortunately, our Rust 1.41.1 and 1.45.2 builds consider that to also take a mutable borrow of the node itself, and not only of the list returned by get_and_clear_pending_msg_events, and therefore fail the build. I found that super confusing as well 😅. That was the CI failure I got earlier in this PR: #1966 (comment)

Contributor

hm, that is quite curious. But if that's how it is, I guess it'll have to be that way. In which case, could you add a comment and use slightly more descriptive variable names? Perhaps the first can be events, and the second one mutable_events_reference or something like that?

Collaborator

I'm confused, the following patch builds just fine for me on 1.41.1:

diff --git a/lightning/src/ln/chanmon_update_fail_tests.rs b/lightning/src/ln/chanmon_update_fail_tests.rs
index 1f7643fd6..e1ea256a6 100644
--- a/lightning/src/ln/chanmon_update_fail_tests.rs
+++ b/lightning/src/ln/chanmon_update_fail_tests.rs
@@ -926,8 +926,7 @@ fn do_test_monitor_update_fail_raa(test_ignore_second_cs: bool) {
        expect_pending_htlcs_forwardable_and_htlc_handling_failed!(nodes[1], vec![HTLCDestination::NextHopChannel { node_id: Some(nodes[2].node.get_our_node_id()), channel_id: chan_2.2 }]);
        check_added_monitors!(nodes[1], 1);
 
-       let mut events_3_vec = nodes[1].node.get_and_clear_pending_msg_events();
-       let events_3 = &mut events_3_vec;
+       let mut events_3 = nodes[1].node.get_and_clear_pending_msg_events();
        if test_ignore_second_cs {
                assert_eq!(events_3.len(), 3);
        } else {
@@ -936,7 +935,7 @@ fn do_test_monitor_update_fail_raa(test_ignore_second_cs: bool) {
 
        // Note that the ordering of the events for different nodes is non-prescriptive, though the
        // ordering of the two events that both go to nodes[2] have to stay in the same order.
-       let nodes_0_event = remove_first_msg_event_to_node(&nodes[0].node.get_our_node_id(), events_3);
+       let nodes_0_event = remove_first_msg_event_to_node(&nodes[0].node.get_our_node_id(), &mut events_3);
        let messages_a = match nodes_0_event {
                MessageSendEvent::UpdateHTLCs { node_id, mut updates } => {
                        assert_eq!(node_id, nodes[0].node.get_our_node_id());
@@ -950,12 +949,12 @@ fn do_test_monitor_update_fail_raa(test_ignore_second_cs: bool) {
                _ => panic!("Unexpected event type!"),
        };
 
-       let nodes_2_event = remove_first_msg_event_to_node(&nodes[2].node.get_our_node_id(), events_3);
+       let nodes_2_event = remove_first_msg_event_to_node(&nodes[2].node.get_our_node_id(), &mut events_3);
        let send_event_b = SendEvent::from_event(nodes_2_event);
        assert_eq!(send_event_b.node_id, nodes[2].node.get_our_node_id());
 
        let raa = if test_ignore_second_cs {
-               let nodes_2_event = remove_first_msg_event_to_node(&nodes[2].node.get_our_node_id(), events_3);
+               let nodes_2_event = remove_first_msg_event_to_node(&nodes[2].node.get_our_node_id(), &mut events_3);
                match nodes_2_event {
                        MessageSendEvent::SendRevokeAndACK { node_id, msg } => {
                                assert_eq!(node_id, nodes[2].node.get_our_node_id());

Contributor Author

@ViktorTigerstrom ViktorTigerstrom Feb 14, 2023

Ah, that should work, thanks! I was doing it the other way around, which caused the issue for 1.41.1 and 1.45.2 but worked for later builds, i.e.:

let events_3 = &mut nodes[1].node.get_and_clear_pending_msg_events();
let nodes_0_event = remove_first_msg_event_to_node(&nodes[0].node.get_our_node_id(), events_3);

My bad, sorry! Updated this with the latest push :)

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-01-store-channels-per-peer-followups branch from 3067120 to e1f1c45 Compare February 13, 2023 09:42
@ViktorTigerstrom
Contributor Author

Thanks a lot for the review @ariard & @arik-so :)!
Addressed the latest feedback and squashed all fixups with the latest commit!

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-01-store-channels-per-peer-followups branch from e1f1c45 to 2f21d56 Compare February 14, 2023 00:01
@ViktorTigerstrom
Contributor Author

ViktorTigerstrom commented Feb 14, 2023

Addressed the feedback about mutable reference borrows :)! #1966 (comment)

@arik-so
Contributor

arik-so commented Feb 14, 2023

Nit: in the commit message, it should be spelled existence and not existance, but ¯\_(ツ)_/¯

arik-so previously approved these changes Feb 14, 2023
Contributor

@arik-so arik-so left a comment

lgtm, thanks very much for addressing the comments!

Avoid doing the same hashmap lookups twice in a row when it's not
needed. Refactor `claim_funds_from_hop` in order to enable this change.

Updates multiple instances of the `ChannelManager` docs related to the
previous change that moved the storage of the channels to the
`per_peer_state`. This docs update corrects some grammar errors and
incorrect information, as well as clarifies documentation that was
confusing.

When a peer disconnects but still has channels, the peer's `peer_state`
entry in the `per_peer_state` is not removed by the `peer_disconnected`
function. If the channels with that peer are later closed while it is
still disconnected (i.e. force closed), we therefore need to remove the
peer from the `per_peer_state` separately.

To remove the peers separately, we push such peers to a separate HashSet
that holds peers awaiting removal, and remove the peers on a timer to
limit the negative effects on parallelism as much as possible.

@ViktorTigerstrom
Contributor Author

Nit: in the commit message, it should be spelled existence and not existance, but ¯\_(ツ)_/¯

Thanks! Corrected some spelling/grammar mistakes in the commit messages.

@TheBlueMatt TheBlueMatt merged commit 33720b0 into lightningdevkit:main Feb 14, 2023