Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: start htlcswitch early in the pipeline #6214

Merged
merged 3 commits into from
Feb 24, 2022

Conversation

yyforyongyu
Copy link
Member

After a recent change, a consistent failure has been showing in the test revoked_uncooperative_close_retribution_zero_value_remote_output from the build with backend=bitcoind dbbackend=postgres, lnd node was stuck at starting channel arbitrator. Here's what could happen.

Suppose a node starts with a channel breached with a state StateContractClosed,

  1. chain arbitrator starts -> channel arbitrator starts
  2. when starting channel arbitrator, we advance the channel state
  3. in that advanceState, when we begin with StateContractClosed, it will send msg to htlcswitch via c.cfg.DeliverResolutionMsg
  4. htlcswitch hasn't started yet, causing the above function never returns, and the node will be stuck

This PR fixes it by starting htlcswitch early. By checking the dependencies of htlcswitch.Start(), it makes a few db queries and must be started after Notifier since it needs to subscribe new blocks. There's also a callsite on s.chanRouter via htlcswitch.Start() -> s.ForwardPackets -> s.cfg.FetchLastChannelUpdate -> s.chanRouter.GetChannelByID. However I think this doesn't require s.chanRouter to be started since it's just querying the db.

To end the starting dependecy issue, I think we need to rely on checking states from disk and not from other subservers.

Copy link
Collaborator

@Crypt-iQ Crypt-iQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure we can also fix #2998 at the same time by handling StateContractClosed explicitly here:

switch c.state {
case StateDefault:
fallthrough
case StateBroadcastCommit:
fallthrough
case StateCommitmentBroadcasted:

edit: can confirm the above suggestion with this pr fixes the resolutionmsg issue. this is because if the proper trigger is set, the HtlcFailNowActions get recalculated (usually they are not for the chainTrigger), so even if lnd goes down during the hand-off, the hand-off will occur again on start-up and the switch can handle duplicates

server.go Outdated Show resolved Hide resolved
@Crypt-iQ
Copy link
Collaborator

Crypt-iQ commented Feb 1, 2022

My previous comment is a little wrong: this won't fully fix #2998 but my suggestion does make it better. The switch doesn't persist them, so if they are held in a mailbox after the hand-off and lnd crashes, the switch won't get them again. The hand-off could be changed so that they're fully handled before returning but out of scope here.

@yyforyongyu
Copy link
Member Author

Pretty sure we can also fix #2998 at the same time by handling StateContractClosed explicitly here:

You mean we just fallthrough here for StateContractClosed?

@Crypt-iQ
Copy link
Collaborator

Crypt-iQ commented Feb 7, 2022

Pretty sure we can also fix #2998 at the same time by handling StateContractClosed explicitly here:

You mean we just fallthrough here for StateContractClosed?

I think it can be left to another pr that fixes the issue i mentioned

Copy link
Collaborator

@Crypt-iQ Crypt-iQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: third commit message could be changed, but code is good

server.go Outdated Show resolved Hide resolved
@lightninglabs-deploy
Copy link

@Roasbeef: review reminder

@Roasbeef Roasbeef added this to the v0.15.0 milestone Feb 24, 2022
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👑

@Roasbeef Roasbeef enabled auto-merge February 24, 2022 23:31
@Roasbeef Roasbeef disabled auto-merge February 24, 2022 23:32
@Roasbeef Roasbeef merged commit 10fba3d into lightningnetwork:master Feb 24, 2022
@yyforyongyu yyforyongyu deleted the server-start-order branch February 25, 2022 05:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants