Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add quiescence negotiation #2680

Merged
merged 25 commits into from
Jul 27, 2023
Merged

Add quiescence negotiation #2680

merged 25 commits into from
Jul 27, 2023

Conversation

remyers
Copy link
Contributor

@remyers remyers commented Jun 1, 2023

This draft is to get feedback on the overall design; if it looks like the right direction I will move the changes needed to enable a splice while quiescent to another PR. This one should only include what is necessary for quiescence negotiation up to the point of starting the splice.

@remyers remyers requested a review from t-bast June 1, 2023 08:44
@remyers remyers force-pushed the quiescence branch 2 times, most recently from 319e9c4 to 1afa132 Compare June 1, 2023 08:54
Copy link
Member

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is going in the right direction, it's ready to be cleaned up to only contain the quiescence steps and then we'll be able to do a more thorough review.

Note that you're not currently targeting the master branch, but rather an obsolete splice branch (splices-bast-pm), make sure to rebase on master for the "official" PR.

I think that the way we should handle that first PR without impacting the splice prototype is that whenever CMD_SPLICE or SpliceInit is requested, we should look at our local feature bits: if quiescence is unable, we go through the quiescence steps, otherwise we skip it (and use isIdle without htlcs).

@remyers remyers changed the base branch from splices-bast-pm to master June 15, 2023 10:06
@remyers
Copy link
Contributor Author

remyers commented Jun 15, 2023

I moved the (in-progress) changes to splice to accommodate commitment txs with htlcs to splice-with-htlcs pending this PR.

Copy link
Member

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, this is a very subtle part of the codebase but you managed to figure it out on your own, with most of it working correctly!

I've made a few small changes in this branch, have a look at each commit separately to see if it makes sense and then feel free to cherry-pick them before making additional changes (it makes the overall diff with master smaller).

There are some test scenarios that I'd like to add that are currently missing:

  • remote sends stfu but isn't quiescent -> verify we disconnect when we reach the splice_init step (test the case where remote is initiator and the non-initiator case)
  • concurrent stfu (we should have a test for the deadlock that exists currently in the implementation - see comment below)
  • a received HTLC times out while we're waiting for quiescence, and we have the preimage in our pending commands -> verify that we force-close and extract the preimage to publish HTLC-success
  • quiescence timeout reached in the middle of the splicing part of the protocol: don't test every sub-state, that would be overkill, but it's worth testing the following two cases:
    • timeout while in SpliceInProgress
    • timeout while in SpliceAborted when we don't receive our peer's tx_abort (I'm quite afraid of bugs where peers won't correctly echo tx_abort and the channel will be stuck, but it's now fixed by the quiescence timeout that will force a reconnection)

@t-bast
Copy link
Member

t-bast commented Jun 22, 2023

Also, this needs a rebase ;)

@remyers
Copy link
Contributor Author

remyers commented Jul 24, 2023

The behavior of splice-init without quiescence has changed and is now reflected in the test change in d250f39 when there is a race between the splice-init from Alice and an add-htlc from Bob. Before the add-htlc would have gone through and the splice would have been aborted. This situation now triggers a disconnect from Alice and the add-htlc does not go through.

Is this the desired behavior? It's more consistent with how splice-init with quiescence works, but could lead to unnecessary disconnects and failed add-htlc when quiescence is not negotiated.

@remyers
Copy link
Contributor Author

remyers commented Jul 24, 2023

I've added the tests you recommend:

  • remote sends stfu but isn't quiescent -> verify we disconnect when we reach the splice_init step (test the case where remote is initiator and the non-initiator case)

In 1fc1c9c I detect this immediately when stfu is received from a remote that is not quiescent; I also added tests for both initiator/non-initiator.

  • concurrent stfu (we should have a test for the deadlock that exists currently in the implementation - see comment below)

There should not be a deadlock (see comment above), but still a good idea to add a test for concurrent stfu in 7b40fce.

  • a received HTLC times out while we're waiting for quiescence, and we have the preimage in our pending commands -> verify that we force-close and extract the preimage to publish HTLC-success

In d250f39 I created test "htlc timeout during quiescence negotiation with pending preimage" but it did not behave as I expected. There is something going on that I don't understand related to the commitment mechanism. Any idea why the htlcSuccess tx is not published when the initialCommittx is first published?

  • quiescence timeout reached in the middle of the splicing part of the protocol: don't test every sub-state, that would be overkill, but it's worth testing the following two cases:
    • timeout while in SpliceInProgress
    • timeout while in SpliceAborted when we don't receive our peer's tx_abort (I'm quite afraid of bugs where peers won't correctly echo tx_abort and the channel will be stuck, but it's now fixed by the quiescence timeout that will force a reconnection)

I added timeout tests for these states in d250f39.

Question: should I test for the disconnect in both timeout tests? Is it enough to just check for InvalidSpliceNotQuiescent in one of the two tests?

@remyers remyers marked this pull request as ready for review July 24, 2023 11:09
@remyers
Copy link
Contributor Author

remyers commented Jul 24, 2023

Great work, this is a very subtle part of the codebase but you managed to figure it out on your own, with most of it working correctly!

Thanks! It was more complicated then I initially thought (as always). My notes might help someone in the future understand it too, even if they are a bit out of date now.

remyers and others added 16 commits July 24, 2023 14:51
 - only use quiescence if both nodes signal the feature
 - reduce timeout to 1 min
 - use message type 2 for `stfu`
 - Use feature bits 34/35 for option_quiesce a per spec
It's not a prototype, it matches the official (in-progress) specification.
Use a boolean for the `initiator` field.
Add more traits to the `SpliceStatus` hierarchy, which lets us simplify
some of the pattern matching used in the `Channel` actor.

Send a `warning` and disconnect if we receive a forbidden message while
we're splicing, which lets us gracefully deal with buggy peers without
losing channels.
Being quiescent for too long is dangerous, because HTLCs may timeout
without giving the opportunity to our peer to send us a preimage.

Splicing operations shouldn't take long to complete, so we include that
in the quiescence timeout, and disconnect if the splice wasn't completed
in time. This makes sure we replay pending commands and avoid getting
the channel stuck if the interactive-tx isn't making progress.
The spec says we must not send duplicate `stfu`, but doesn't have any
requirement on the receiver. If we receive a duplicate `stfu`, it's
perfectly fine to just ignore it: the protocol will either correctly
complete or will be canceled by the quiescence timeout.
This function was called in only one place, so it's not avoiding code
duplication. It could make sense to isolate the logic, but in the case
of event handlers in the highly critical channel FSM, we usually like to
inline transition and state changes to make it easier to review the code
linearly (no need to jump back and forth between function calls).

I also renamed `isIdle` to `isQuiescent` to remove confusion.
In channel event handlers, we try to leave all the logic that updates the
current state or data directly in the event handler instead of delegating
it to helper functions, otherwise it's a bit hard to see all the state
transitions that can happen when receiving a given message. I changed the
`handleNewSplice` function to be a pure function instead that just creates
the `splice_init` message (or returns an error).

I also refactored some nits.
 - now returns a warning (and disconnects) instead of force closing
 - initiator still remains in the negotiating state until they receive stfu from the non-initiator
 - also, simplify tests for forbidden messages
 - fail htlc tests are sufficient
@remyers
Copy link
Contributor Author

remyers commented Jul 24, 2023

Also, this needs a rebase ;)

Rebased and ready for review!

@t-bast
Copy link
Member

t-bast commented Jul 26, 2023

The behavior of splice-init without quiescence has changed and is now reflected in the test change in d250f39 when there is a race between the splice-init from Alice and an add-htlc from Bob. Before the add-htlc would have gone through and the splice would have been aborted. This situation now triggers a disconnect from Alice and the add-htlc does not go through.

Is this the desired behavior? It's more consistent with how splice-init with quiescence works, but could lead to unnecessary disconnects and failed add-htlc when quiescence is not negotiated.

Yes, I think this is ok, because:

  • this is an edge case that should be rare
  • we'll want to start using quiescence and deprecate the non-quiescent splices whenever possible
  • this is still providing a way out of the edge case that doesn't create a force-close

Copy link
Member

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me! I still need to review the tests, but otherwise I couldn't find any issue.

@codecov-commenter
Copy link

codecov-commenter commented Jul 26, 2023

Codecov Report

Merging #2680 (f51710e) into master (3e43611) will decrease coverage by 0.06%.
The diff coverage is 81.34%.

❗ Current head f51710e differs from pull request most recent head 7082c4a. Consider uploading reports for the commit 7082c4a to get more accurate results

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##           master    #2680      +/-   ##
==========================================
- Coverage   85.96%   85.91%   -0.06%     
==========================================
  Files         215      215              
  Lines       17731    17813      +82     
  Branches      765      794      +29     
==========================================
+ Hits        15243    15304      +61     
- Misses       2488     2509      +21     
Files Changed Coverage Δ
...in/scala/fr/acinq/eclair/channel/ChannelData.scala 100.00% <ø> (ø)
...acinq/eclair/channel/fsm/DualFundingHandlers.scala 85.71% <ø> (+0.93%) ⬆️
...q/eclair/wire/protocol/LightningMessageTypes.scala 98.73% <ø> (ø)
...in/scala/fr/acinq/eclair/channel/fsm/Channel.scala 85.66% <79.33%> (-0.33%) ⬇️
...core/src/main/scala/fr/acinq/eclair/Features.scala 100.00% <100.00%> (ø)
...re/src/main/scala/fr/acinq/eclair/NodeParams.scala 93.29% <100.00%> (+0.02%) ⬆️
...in/scala/fr/acinq/eclair/channel/Commitments.scala 96.78% <100.00%> (+0.01%) ⬆️
.../eclair/wire/protocol/LightningMessageCodecs.scala 99.39% <100.00%> (+0.01%) ⬆️

... and 5 files with indirect coverage changes

Copy link
Member

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added several commits on top of your PR in https://github.com/ACINQ/eclair/tree/quiescence-bast, can you have a look at them and cherry-pick?

I'm mostly simplifying some tests, adding others, and moving stuff around to keep them grouped by "feature".

Regarding the force-close test, it was behaving correctly: when we detect that we need to force-close, we immediately generate the corresponding transactions and broadcast them. At that point, we only look at our local data, so we don't notice that we have the preimage. Then when transitioning to the CLOSING state (look for the corresponding onTransition), we check the pending commands DB and replay the CMD_FULFILL_HTLC, which makes us generate a new set of force-closing transactions that include the HTLC-success transaction.

t-bast and others added 7 commits July 27, 2023 09:43
This commit contains almost no functional changes, we just group together
and reorder tests, remove unused code and remove the default value for
`sendInitialStfu` (which makes tests matrixes easier to read).
We were missing a test for the case where we request quiescence when we
still have pending changes to apply.
There are two scenarios to test:

- one of our outgoing HTLC times out
- one of our incoming HTLC is close to timing out and we have the preimage
@remyers
Copy link
Contributor Author

remyers commented Jul 27, 2023

I've added several commits on top of your PR in https://github.com/ACINQ/eclair/tree/quiescence-bast, can you have a look at them and cherry-pick?

Thanks! I've cherry-picked those changes. I only added f51710e to check that one side receives a command failure during concurrent splices.

I also made the change you suggested to NormalSplicesStateSpec to confirm an htlc was not added: 7082c4a.

Copy link
Member

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one more step towards a full splicing implementation! 🚀

@remyers remyers merged commit 47e0b83 into ACINQ:master Jul 27, 2023
@remyers remyers deleted the quiescence branch July 27, 2023 09:43
@remyers remyers restored the quiescence branch July 27, 2023 09:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants