connectd: fix accidental handling of old reconnections. #5256
Conversation
ACK 3fab422 Reviewed w/ @endothermicdev
Looks OK to merge, just a style nit: the checks in our repo look for …
	pr = peer_reconnected_htable_get(&daemon->reconnected, id);
	if (pr) {
		status_broken("Reconnected AGAIN");
		peer_reconnected_htable_del(&daemon->reconnected, pr);
If the previous reconnection already had the destructor installed, would it not delete this twice? Or is deleting from the htable idempotent?
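As an aside, here is a minimal, self-contained sketch of the pattern this question is about. It is not the actual ccan htable or connectd code, and all names are hypothetical stand-ins: it only shows why, if removal does nothing when the entry is already gone, an explicit delete followed by the destructor's delete is harmless.

```c
/* Hypothetical sketch -- NOT the real ccan htable or connectd code.
 * The entry's "destructor" also removes it from the table, so an explicit
 * removal followed by freeing the entry means the removal runs twice.
 * If delete is a no-op for an absent entry, the double removal is safe. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct reconnected {               /* stand-in for peer_reconnected */
	int id;
	struct reconnected *next;
};

static struct reconnected *table;  /* stand-in for the htable */

/* Idempotent delete: returns false (and changes nothing) if not present. */
static bool table_del(struct reconnected *r)
{
	for (struct reconnected **p = &table; *p; p = &(*p)->next) {
		if (*p == r) {
			*p = r->next;
			return true;
		}
	}
	return false;
}

static void table_add(struct reconnected *r)
{
	r->next = table;
	table = r;
}

/* Plays the role of the tal destructor that removes the entry on free. */
static void destroy_reconnected(struct reconnected *r)
{
	table_del(r);              /* second delete: a no-op if already gone */
	free(r);
}

int main(void)
{
	struct reconnected *pr = malloc(sizeof(*pr));
	pr->id = 1;
	table_add(pr);

	/* Mirrors the quoted snippet: explicit delete before replacing... */
	assert(table_del(pr) == true);
	/* ...then freeing triggers the destructor's delete, which finds
	 * nothing left to remove. */
	destroy_reconnected(pr);

	printf("double removal was harmless\n");
	return 0;
}
```

Whether the generated htable delete in the real code actually behaves this way is exactly what the question asks; the sketch only shows why idempotent deletion would make the explicit del plus destructor combination safe.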
We had multiple reports of channels being unilaterally closed because it seemed like the peer was sending old revocation numbers. Turns out, it was actually old reestablish messages!

When we have a reconnection, we would put the new connection aside, and tell lightningd to close the current connection: when it did, we would restart processing of the initial reconnection.

However, we could end up with *multiple* "reconnecting" connections, while waiting for an existing connection to close. Though the connections were long gone, there could still be messages queued (particularly the channel_reestablish message, which comes early on).

Eventually, a normal reconnection would cause us to process one of these reconnecting connections, and channeld would see the (perhaps very old!) messages, and get confused.

(I have a test which triggers this, but it also hangs the connect command, due to other issues we will fix in the next release...)

Fixes: ElementsProject#5240
Fixes: ElementsProject#5235
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
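To illustrate the behaviour described in that commit message, here is a deliberately simplified, hypothetical sketch (not the real connectd code): keep at most one pending reconnection per peer, and let a newer reconnection evict the stale one, so its queued channel_reestablish messages are never replayed later.

```c
/* Simplified, hypothetical sketch of the idea above -- not the actual
 * connectd code.  A new reconnection evicts any stale pending one,
 * discarding its queued messages instead of replaying them later. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct pending_reconnect {
	char peer_id[67];        /* hex node id; size is illustrative only */
	char **queued_msgs;      /* messages received before the handover */
	size_t n_msgs;
};

static void free_pending(struct pending_reconnect *pr)
{
	if (!pr)
		return;
	for (size_t i = 0; i < pr->n_msgs; i++)
		free(pr->queued_msgs[i]);
	free(pr->queued_msgs);
	free(pr);
}

/* One slot here for brevity; the point is that a newer reconnection
 * replaces the stale one rather than leaving it queued. */
static struct pending_reconnect *pending;

static void note_reconnection(const char *peer_id)
{
	if (pending) {
		/* A reconnection is already parked: it is now stale, so
		 * drop it, queued messages and all. */
		free_pending(pending);
	}
	pending = calloc(1, sizeof(*pending));
	snprintf(pending->peer_id, sizeof(pending->peer_id), "%s", peer_id);
}

int main(void)
{
	note_reconnection("02aabb...");   /* first reconnection is parked */
	note_reconnection("02aabb...");   /* second one evicts the stale entry */
	printf("only the newest reconnection is kept\n");
	free_pending(pending);
	return 0;
}
```

In the real daemon these pending entries live in an htable keyed by peer id (as the snippet quoted earlier in the review shows); the sketch collapses that to a single slot purely to keep the eviction idea visible.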
I have a test which reproduces this, too, and it's been seen in the wild. It seems we can add a subd as we're closing, which causes this assert to trigger.

Fixes: ElementsProject#5254
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Force-pushed from 3fab422 to b817e0a (Compare)
Removed the accidentally-left BROKEN log, and added another assert removal to fix another report.
ACK b817e0a
### Fixed

- connectd: make sure we don't keep stale reconnections around. ([#5256])
- connectd: fix assert which we could trigger. ([#5256])
Suggested change: `- connectd: fix assert which we could trigger. ([#5256])` → `- connectd: fix assert which we could trigger. ([#5254])`

This should be #5254?
No, these refer to the PR, not the bug itself...
Ah, dammit! Sorry, I just assumed the wrong thing.