[reliable payments] router payment state machine #2761
I really dig this approach! With the state machine, I find it much easier to follow than the prior attempts at a solution to this problem. I've completed an initial pass so far, and the main question in my mind is the size of the overlap between this new state machine and the existing control tower in the switch. At one point the control tower addressed a need within the codebase, but it seems like this new state machine can eventually subsume the responsibilities of the control tower.
I am, as always, worried about the persistence and the work and inflexibility it may cause us in the future. Tried to mainly analyze this part of the PR.
Why don't you merge the switch PR first, btw, and work bottom-up? Or will this PR alone already give us properly working resumption of payments?
channeldb/payments.go
Outdated
binary.BigEndian.PutUint64(paymentIDBytes, paymentID)
// CompletePayment overwrites the OutgoingPayment stored in the DB for the
// corresponding payment hash with the completed one.
func (db *DB) CompletePayment(preimage lntypes.Preimage,
Looks unused
Definitely nicer without that intermediate persistent state.
Main comments:
- Code structure in router
- Consolidation of payment related stores
nice refactor of the router/switch interactions, this should make the whole payment flow much tighter!
one question i have with new state introduced in the control tower, will we just treat all payments that are currently grounded as started but never attempted? an alternative would be to rename grounded payments as failed, and introduce a new state for initiated. not sure which is better atm
routing/router_test.go
Outdated
return nil
}

ctx.router.cfg.GetPaymentResult = func(paymentID uint64) (
many of these GetPaymentResult funcs seem similar, any way we can generate the closures w/ less code duplication?
migrateOutgoingPayments moves the OutgoingPayments into a new bucket format where they all reside in a top-level bucket indexed by the payment hash. In this sub-bucket we store information relevant to this payment, such as the payment status. To prevent the router from resending payments that have the status InFlight (pre-migration payments cannot be resumed), we delete those statuses, so only Completed payments remain in the new bucket structure.
Since we have performed a migration, the db should be in a consistent state, and we can remove the non-strict option.
This commit gives a new responsibility to the control tower, letting it populate the payment bucket structure as the payment goes through its different stages. The payment will transition states Grounded->InFlight->Success/Failed, where the CreationInfo/AttemptInfo/Preimage must be set accordingly. This will be the main driver for the router state machine.
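The transition order named in this commit message can be sketched as a small lookup table. This is a minimal illustration that encodes only the Grounded->InFlight->Success/Failed path; paymentStatus, allowedTransitions and transition are hypothetical names, and the real control tower may permit or forbid transitions that are not mentioned here.

```go
package main

import "fmt"

// paymentStatus is a hypothetical type; the status names follow the commit
// message, not necessarily the identifiers used in channeldb.
type paymentStatus string

const (
	statusGrounded paymentStatus = "Grounded"
	statusInFlight paymentStatus = "InFlight"
	statusSuccess  paymentStatus = "Success"
	statusFailed   paymentStatus = "Failed"
)

// allowedTransitions lists only the transitions named in the commit message.
// Success and Failed have no entries, so they act as terminal states here.
var allowedTransitions = map[paymentStatus][]paymentStatus{
	statusGrounded: {statusInFlight},
	statusInFlight: {statusSuccess, statusFailed},
}

// transition returns the new status, or the unchanged status and an error
// when the requested move is not in the table.
func transition(from, to paymentStatus) (paymentStatus, error) {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("invalid transition %s -> %s", from, to)
}

func main() {
	status := statusGrounded

	// The last step tries to leave a terminal state and is rejected.
	for _, next := range []paymentStatus{statusInFlight, statusSuccess, statusInFlight} {
		var err error
		status, err = transition(status, next)
		fmt.Println(status, err)
	}
}
```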
This encapsulates all state needed to resume a payment from any point of the payment flow, and that must be shared between the different stages of the execution. This is done to prepare for breaking the send loop into smaller parts, and being able to resume the payment from any point from persistent state.
This commit makes the router use the ControlTower to drive the payment life cycle state machine, to keep track of active payments across restarts. This lets the router resume payments on startup, such that their final results can be handled and stored when ready.
On startup the router will fetch the in-flight payments from the control tower, and resume their execution.
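The startup behaviour can be sketched roughly as follows, using the rule from the PR description: the stored paymentID is used to ask the Switch whether the HTLC is still active; if it is, the router waits for the result, and if not, the attempt can safely be retried. The interfaces and names below (controlTower, htlcSwitch, FetchInFlightPaymentIDs, IsAttemptActive, planResume) are assumptions made for illustration and are not lnd's actual API.

```go
package main

import "fmt"

// resumeAction describes what the router does with a payment on startup.
type resumeAction string

const (
	actionWaitForResult resumeAction = "wait for result"
	actionRetry         resumeAction = "retry attempt"
)

// controlTower is a hypothetical view of the persistent payment store.
type controlTower interface {
	FetchInFlightPaymentIDs() []uint64
}

// htlcSwitch is a hypothetical view of the switch's attempt lookup.
type htlcSwitch interface {
	IsAttemptActive(paymentID uint64) bool
}

// planResume decides, for every in-flight payment, whether to wait for the
// pending HTLC result or to retry because the switch no longer knows it.
func planResume(ct controlTower, sw htlcSwitch) map[uint64]resumeAction {
	plan := make(map[uint64]resumeAction)
	for _, id := range ct.FetchInFlightPaymentIDs() {
		if sw.IsAttemptActive(id) {
			plan[id] = actionWaitForResult
		} else {
			plan[id] = actionRetry
		}
	}
	return plan
}

// fakeTower and fakeSwitch are in-memory stand-ins for the example.
type fakeTower []uint64

func (f fakeTower) FetchInFlightPaymentIDs() []uint64 { return f }

type fakeSwitch map[uint64]bool

func (f fakeSwitch) IsAttemptActive(id uint64) bool { return f[id] }

func main() {
	// Payment 1 is still known to the switch, payment 2 is not.
	plan := planResume(fakeTower{1, 2}, fakeSwitch{1: true})
	fmt.Println(1, plan[1])
	fmt.Println(2, plan[2])
}
```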
TestRouterPaymentStateMachine tests that the router interacts as expected with the ControlTower during a payment lifecycle, such that payment attempts are not sent twice to the switch, and results are handled after a restart.
And unexport deprecated code.
TestPaymentControlDeleteNonInFlight checks that calling DeletePayments only deletes payments from the database that are not in-flight.
This PR introduces a persistent state machine to the ChannelRouter's payment flow, ensuring we can handle payment results that are received after a restart. It also paves the way for adding cancellation of payments, and resuming payment sessions across restarts.
Problem
lnd currently runs into problems if it is restarted while an HTLC is in flight on the network. The primary reason for this is the way the router hands the HTLC to the switch. The router persists no information about the payment, so when a result eventually comes back, we risk that the information needed to properly populate the OutgoingPayment in the database is lost.
Solution
With the paymentStateMachine introduced in this PR, we store two pieces of key information:
- paymentID. This is used to query the Switch whether the HTLC is still active. If not active, we know that we can safely retry the payment attempt. If active, the router will wait for the result to be available, and store the result to the DB.
- route. This is added to the DB together with the preimage when the payment succeeds.
Note
The Switch does not currently persist the pending payment attempts across restarts. This will be added in a follow-up PR.
Replaces #2475
Builds on