-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add quiescence negotiation #2680
Conversation
319e9c4
to
1afa132
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is going in the right direction, it's ready to be cleaned up to only contain the quiescence steps and then we'll be able to do a more thorough review.
Note that you're not currently targeting the master
branch, but rather an obsolete splice branch (splices-bast-pm
), make sure to rebase on master
for the "official" PR.
I think that the way we should handle that first PR without impacting the splice prototype is that whenever CMD_SPLICE
or SpliceInit
is requested, we should look at our local feature bits: if quiescence
is unable, we go through the quiescence steps, otherwise we skip it (and use isIdle
without htlcs).
eclair-core/src/main/scala/fr/acinq/eclair/channel/Commitments.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/ChannelData.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/wire/protocol/LightningMessageCodecs.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/wire/protocol/LightningMessageTypes.scala
Outdated
Show resolved
Hide resolved
I moved the (in-progress) changes to splice to accommodate commitment txs with htlcs to splice-with-htlcs pending this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, this is a very subtle part of the codebase but you managed to figure it out on your own, with most of it working correctly!
I've made a few small changes in this branch, have a look at each commit separately to see if it makes sense and then feel free to cherry-pick them before making additional changes (it makes the overall diff with master
smaller).
There are some test scenarios that I'd like to add that are currently missing:
- remote sends
stfu
but isn't quiescent -> verify we disconnect when we reach thesplice_init
step (test the case where remote is initiator and the non-initiator case) - concurrent
stfu
(we should have a test for the deadlock that exists currently in the implementation - see comment below) - a received HTLC times out while we're waiting for quiescence, and we have the preimage in our pending commands -> verify that we force-close and extract the preimage to publish HTLC-success
- quiescence timeout reached in the middle of the splicing part of the protocol: don't test every sub-state, that would be overkill, but it's worth testing the following two cases:
- timeout while in
SpliceInProgress
- timeout while in
SpliceAborted
when we don't receive our peer'stx_abort
(I'm quite afraid of bugs where peers won't correctly echotx_abort
and the channel will be stuck, but it's now fixed by the quiescence timeout that will force a reconnection)
- timeout while in
eclair-core/src/main/scala/fr/acinq/eclair/channel/fsm/Channel.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/wire/protocol/LightningMessageTypes.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/fsm/Channel.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/fsm/Channel.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/fsm/Channel.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/test/scala/fr/acinq/eclair/channel/states/e/NormalQuiescentStateSpec.scala
Outdated
Show resolved
Hide resolved
Also, this needs a rebase ;) |
The behavior of Is this the desired behavior? It's more consistent with how |
I've added the tests you recommend:
In 1fc1c9c I detect this immediately when
There should not be a deadlock (see comment above), but still a good idea to add a test for concurrent
In d250f39 I created test "htlc timeout during quiescence negotiation with pending preimage" but it did not behave as I expected. There is something going on that I don't understand related to the commitment mechanism. Any idea why the
I added timeout tests for these states in d250f39. Question: should I test for the disconnect in both timeout tests? Is it enough to just check for |
Thanks! It was more complicated then I initially thought (as always). My notes might help someone in the future understand it too, even if they are a bit out of date now. |
- only use quiescence if both nodes signal the feature - reduce timeout to 1 min - use message type 2 for `stfu` - Use feature bits 34/35 for option_quiesce a per spec
It's not a prototype, it matches the official (in-progress) specification.
Use a boolean for the `initiator` field.
Add more traits to the `SpliceStatus` hierarchy, which lets us simplify some of the pattern matching used in the `Channel` actor. Send a `warning` and disconnect if we receive a forbidden message while we're splicing, which lets us gracefully deal with buggy peers without losing channels.
Being quiescent for too long is dangerous, because HTLCs may timeout without giving the opportunity to our peer to send us a preimage. Splicing operations shouldn't take long to complete, so we include that in the quiescence timeout, and disconnect if the splice wasn't completed in time. This makes sure we replay pending commands and avoid getting the channel stuck if the interactive-tx isn't making progress.
The spec says we must not send duplicate `stfu`, but doesn't have any requirement on the receiver. If we receive a duplicate `stfu`, it's perfectly fine to just ignore it: the protocol will either correctly complete or will be canceled by the quiescence timeout.
This function was called in only one place, so it's not avoiding code duplication. It could make sense to isolate the logic, but in the case of event handlers in the highly critical channel FSM, we usually like to inline transition and state changes to make it easier to review the code linearly (no need to jump back and forth between function calls). I also renamed `isIdle` to `isQuiescent` to remove confusion.
In channel event handlers, we try to leave all the logic that updates the current state or data directly in the event handler instead of delegating it to helper functions, otherwise it's a bit hard to see all the state transitions that can happen when receiving a given message. I changed the `handleNewSplice` function to be a pure function instead that just creates the `splice_init` message (or returns an error). I also refactored some nits.
- now returns a warning (and disconnects) instead of force closing
- initiator still remains in the negotiating state until they receive stfu from the non-initiator - also, simplify tests for forbidden messages
- fail htlc tests are sufficient
Rebased and ready for review! |
Yes, I think this is ok, because:
|
eclair-core/src/main/scala/fr/acinq/eclair/channel/ChannelData.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/ChannelData.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/ChannelData.scala
Outdated
Show resolved
Hide resolved
eclair-core/src/main/scala/fr/acinq/eclair/channel/fsm/Channel.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me! I still need to review the tests, but otherwise I couldn't find any issue.
eclair-core/src/main/scala/fr/acinq/eclair/channel/fsm/Channel.scala
Outdated
Show resolved
Hide resolved
Codecov Report
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. @@ Coverage Diff @@
## master #2680 +/- ##
==========================================
- Coverage 85.96% 85.91% -0.06%
==========================================
Files 215 215
Lines 17731 17813 +82
Branches 765 794 +29
==========================================
+ Hits 15243 15304 +61
- Misses 2488 2509 +21
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added several commits on top of your PR in https://github.com/ACINQ/eclair/tree/quiescence-bast, can you have a look at them and cherry-pick?
I'm mostly simplifying some tests, adding others, and moving stuff around to keep them grouped by "feature".
Regarding the force-close test, it was behaving correctly: when we detect that we need to force-close, we immediately generate the corresponding transactions and broadcast them. At that point, we only look at our local data, so we don't notice that we have the preimage. Then when transitioning to the CLOSING
state (look for the corresponding onTransition
), we check the pending commands DB and replay the CMD_FULFILL_HTLC
, which makes us generate a new set of force-closing transactions that include the HTLC-success transaction.
eclair-core/src/test/scala/fr/acinq/eclair/channel/states/e/NormalSplicesStateSpec.scala
Show resolved
Hide resolved
This commit contains almost no functional changes, we just group together and reorder tests, remove unused code and remove the default value for `sendInitialStfu` (which makes tests matrixes easier to read).
We were missing a test for the case where we request quiescence when we still have pending changes to apply.
There are two scenarios to test: - one of our outgoing HTLC times out - one of our incoming HTLC is close to timing out and we have the preimage
Thanks! I've cherry-picked those changes. I only added f51710e to check that one side receives a command failure during concurrent splices. I also made the change you suggested to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one more step towards a full splicing implementation! 🚀
This draft is to get feedback on the overall design; if it looks like the right direction I will move the changes needed to enable a splice while quiescent to another PR. This one should only include what is necessary for quiescence negotiation up to the point of starting the splice.