Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BOLT 2: quiescence protocol (feature 34/35) option_quiesce #869

Merged
merged 3 commits into from
Jun 17, 2024

Conversation

rustyrussell
Copy link
Collaborator

A simple protocol for upgrades, splices and other things which are easier when no updates are in-flight.

02-peer-protocol.md Show resolved Hide resolved
02-peer-protocol.md Show resolved Hide resolved
rustyrussell added a commit to rustyrussell/lightning that referenced this pull request May 14, 2021
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell force-pushed the quiescence-protocol branch 2 times, most recently from 99c0499 to b96218b Compare May 24, 2021 05:21
@rustyrussell
Copy link
Collaborator Author

Added the concept of "initiator" which makes splicing far simpler.

cdecker pushed a commit to rustyrussell/lightning that referenced this pull request May 25, 2021
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
cdecker pushed a commit to rustyrussell/lightning that referenced this pull request May 26, 2021
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
rustyrussell added a commit to rustyrussell/lightning that referenced this pull request May 31, 2021
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
rustyrussell added a commit to rustyrussell/lightning that referenced this pull request May 31, 2021
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
niftynei pushed a commit to ElementsProject/lightning that referenced this pull request Jun 1, 2021
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell
Copy link
Collaborator Author

Trivial rebase on top of #880

@Crypt-iQ
Copy link
Contributor

If stfu is type 2 (even), then it needs feature-bits according to bolt-01:

  • "MUST NOT send an evenly-typed message not listed here without prior negotiation"

@rustyrussell
Copy link
Collaborator Author

If stfu is type 2 (even), then it needs feature-bits according to bolt-01:

* "MUST NOT send an evenly-typed message not listed here without prior negotiation"

Excellent point! The purpose of STFU is to support other things, but it's worth having its own feature bit IMHO anyway.

@rustyrussell rustyrussell changed the title BOLT 2: quiescence protocol. BOLT 2: quiescence protocol (feature 34/35) option_quiesce Sep 30, 2021
Copy link
Contributor

@lightning-developer lightning-developer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK.

I think the wording around pending can be confusing and the cases if peers ignore the protocol could be elaborated a bit.

Also as a feedback: I initially thought that this could be generalized to the case of closing the channel but there we actually do want to send more updates to resolve / settle outstanding htlcs. I am writing this to demonstrate how easy it was to get confused by the notation.

The sender of `stfu`:
- MUST NOT send `stfu` unless `option_quiesce` is negotiated.
- MUST NOT send `stfu` if any of the sender's htlc additions, htlc removals
or fee updates are pending for either peer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way how I initially understood the word pending was that there are HTLCs in flight which have not been settled or removed. Yet after going through the text I think pending means that the statemachine is not at a point where revoke_and_ack has been sent.

Maybe we could elaborate this a bit and make it more concrete?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This refers to the state machine, where changes are referred to as pending.

Concretely, this is not allowed if you have sent update_* messages, without a commitment_signed. Is that clearer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that would have been much clearer. But maybe not necessary at that point of the document if it is clear within BOLT2 that the word pending always refers to the state machine and not to in flight HTLCs. If not I suggest to introduce such a clarification the first time the word pending is being used in BOLT 2. Also as said I eventually understood the meaning of pending so the issue seems to be not that big.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having trouble understanding what "pending" means here.

In this flow, is the add considered pending?:

--update_add_htlc-->
--commit_sig------->

If so, is it only considered non-pending at the end of this flow?:

--update_add_htlc-->
--commit_sig------->
<--revoke_and_ack--
<--commit_sig------
--revoke_and_ack-->

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that "pending" is confusing here. I think it would be more clear to write:

  • MUST NOT send stfu if any of the sender's htlc additions, htlc removals
    or fee updates have not been counter signed or failed by our peer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that "pending" is confusing here. I think it would be more clear to write:

* MUST NOT send `stfu` if any of the sender's htlc additions, htlc removals
  or fee updates have not been counter signed or failed by our peer.

I agree. I think it's still vague.

- MUST reply with `stfu` once it can do so.

Upon disconnection:
- the channel is no longer considered quiescent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that the only condition under which a channel can get out of being considered quiescent?

I guess future protocols that need a quiescent channel end with stopping the channel to be considered quiescent. But maybe this could also be done with a separate message or an additional explicit field in the stfu that resets the request to be quiescent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. For now we only use this for channel upgrade: other changes may have different needs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably missed something but why is a disconnect (of the connection) necessary for a channel upgrade? Do the peers after upgrade need to execute the routines in connection reestablishment?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether we add a new message that explicitly ends quiescence as suggested by @ProofOfKeags or leave it to be described in each downstream protocol, I think we need to say you can exit quiescence without sending a disconnect.

I do not think all downstream protocols should be forced to trigger a disconnect to leave quiescence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is addressed.

- MUST NOT send `stfu` unless `option_quiesce` is negotiated.
- MUST NOT send `stfu` if any of the sender's htlc additions, htlc removals
or fee updates are pending for either peer.
- MUST NOT send `stfu` twice.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did not specify what happens if a receiver receives two stfu messages? I guess the second one SHOULD be ignored or should there be more drastic measures?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's undefined; since the sender shouldn't do that, I'd probably reply with a warning and disconnect.

- MUST set `initiator` to 1
- MUST set `channel_id` to the id of the channel to quiesce.
- MUST now consider the channel to be quiescing.
- MUST NOT send an update message after `stfu`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again what happens if the sender does?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, undefined. I would warning and disconnect, but that's not a requirement.

@t-bast
Copy link
Collaborator

t-bast commented Mar 1, 2024

You have to anyway...
So then how do you handle disconnects then? I'd imagine that the volatile state gets rolled back, no?

You're right! We would probably use the same behavior as what happens if the other peer disconnects while we're quiescent and haven't completed the inner protocol. In the splicing case, it means that if we've reached a state where we cannot abort anymore, we finalize the signatures exchange when reconnecting or we force-close (if our peer is buggy/malicious).

That means that this explicit quiescence termination step is still some additional code, but it reuses existing behavior, so it may not be as bad as I thought.

I agree with you that it is conceptually cleaner to have a way to explicitly end quiescence. I'm less strongly against this than I was in my previous comment, it's probably at least worth prototyping an implementation to see how much code it would add before deciding whether we want this or not.

Would it be best if I made a PR that targeted @rustyrussell 's branch?

Yes, that sounds good 👍. I'd also be very interested in hearing @rustyrussell's thoughts on this unquiesce message, that would help tilt the balance in either direction.

Copy link

@remyers remyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My primary suggestion is to replacing pending and acknowledge with language that is more precise.

I think we also need to add something about ending quiescence when a downstream protocol succeeds and does not disconnect.

Comment on lines +499 to +1436
easiest on channels where both commitment transactions match, and no
pending updates are in flight. We define a protocol to quiesce the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the ambiguity with 'pending' meaning that an htlc is waiting for a pre-image vs. 'pending' meaning the update has not been mutually committed to by both sides, how about this instead:

"and all proposed updates are fully committed to by both parties."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do this

The sender of `stfu`:
- MUST NOT send `stfu` unless `option_quiesce` is negotiated.
- MUST NOT send `stfu` if any of the sender's htlc additions, htlc removals
or fee updates are pending for either peer.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that "pending" is confusing here. I think it would be more clear to write:

  • MUST NOT send stfu if any of the sender's htlc additions, htlc removals
    or fee updates have not been counter signed or failed by our peer.


If both sides send `stfu` simultaneously, they will both set
`initiator` to `1`, in which case the "initiator" is arbitrarily
considered to be the channel funder (the sender of `open_channel`).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we also mention open_channel2 here for completeness after rebasing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we should still add this.

### Rationale

The normal use would be to cease sending updates, then wait for all
the current updates to be acknowledged by both peers, then start
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose we replace "acknowledged" with "counter signed or rejected", or with whatever term we use in the "MUST NOT send stfu" section above to replace the ambiguous use of "pending".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

- MUST reply with `stfu` once it can do so.

Upon disconnection:
- the channel is no longer considered quiescent.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether we add a new message that explicitly ends quiescence as suggested by @ProofOfKeags or leave it to be described in each downstream protocol, I think we need to say you can exit quiescence without sending a disconnect.

I do not think all downstream protocols should be forced to trigger a disconnect to leave quiescence.

@ProofOfKeags
Copy link
Contributor

ProofOfKeags commented Apr 10, 2024

I have now come to believe that if we want this to be a standalone protocol gadget, we actually need an explicit resume message if we want to be able to test these gadgets in isolation.

At the moment I have been able to complete testing that we defer any update inducing state transitions while we are in an stfu state. To complete the testing we have to ensure that the updates that would have normally taken place while we were quiesced are run. However, there is a question where if we have no downstream protocol that can signify the termination of quiescence, we will stay in that state indefinitely. We can, of course, disconnect and test that things start up that way, however we would feasibly want to also do some live testing to ensure that once a channel is no longer quiesced, we can resume traffic and never find ourselves in an inconsistent state.

As it stands right now, without a go_on message or equivalent, we cannot complete testing, which tells me that the protocol is incomplete without it. So the time has come. I will be temporarily suspending implementation/testing work to put together a complete spec for the new message like I said I would do a few weeks back.

EDIT: spec update proposal

@remyers
Copy link

remyers commented Apr 11, 2024

I have now come to believe that if we want this to be a standalone protocol gadget, we actually need an explicit resume message if we want to be able to test these gadgets in isolation.

From our experience, tests for quiescence ended by disconnect are very valuable. A go_on message would not capture the edge cases of a disconnect. When you combine quiescence with splicing, then you would want to end quiescence when splicing succeeds or disconnect on failure. Both situations may have subtle edge cases to handle that go_on would not capture.

I think for the purpose of treating quiescence as a stand alone gadget for testing, disconnect is sufficient. However, there could be a place for go_on as an internal place-holder for testing in lieu of testing with a specific downstream protocol.

@ProofOfKeags
Copy link
Contributor

From our experience, tests for quiescence ended by disconnect are very valuable.

I am not suggesting this as a substitution. I think we need to test both. However, live resume has some different subtleties than resume via reconnect, which is why I was hoping to be able to test both situations. At the moment I'm limited to disconnect only in lieu of a complete implementation of splicing or dyncomms.

- otherwise:
- SHOULD NOT send any more update messages.
- MUST reply with `stfu` once it can do so.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should add, "must now consider the channel as quiescent here as well" ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't do that because quiescence is only achieved once both sides have sent it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah both sides have sent it, if a peer receives stfu and has sent one before, the peer must now consider the channel as quiescent (This is already stated in the spec, line 527)

The otherwise branch(L528-530), where the peer receives but has not sent one and so replies, I think it should also be stated that peer must now consider the channel as quiescent as well, just as in the other scenario.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's redundant as that case is already covered by the requirements on the sender

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think it is. I could not find anywhere in the sender perspective where it is stated that the channel should now be considered as quiescent when stfu is sent. Rather, the channel is considered as quiescing in that case, line 522. I think when stfu is sent and received that is when the channel is considered quiescent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I will add this to the sender requirements.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to add it when the sender has received a reply so it might be more appropriate here, but maybe there is a better way to structure it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Is it quiescent only when you've send AND received the message? Or only when you send it?

@ProofOfKeags
Copy link
Contributor

The effort to add an explicit resume message is being abandoned after the discussion taking place here.

The core observation is that it increases the number of situations where we can result in "half-committed" state. Protocol state is considered half-committed when it can no longer be aborted (usually due to the transmission of a signature or secret) but has not yet reached a state where both parties have done so.

Currently we resolve all half-committed state issues through the channel_reestablish mechanism which is performed only at reconnection and we don't find a compelling reason at the moment to do it at any other time. Any explicit resume message will result in having to solve this problem. Further, while aesthetically displeasing, the layer violation is entirely manageable by making it explicit that a quiescence session is usable once only and that any downstream protocol that assumes quiescence must define its terminal states at which point quiescence terminates as well.

I will open up a new PR that clarifies these more subtle requirements and link it back here.

@carlaKC
Copy link
Contributor

carlaKC commented May 8, 2024

Weighing in late here, but is it worth adding an advisory note about disconnecting to exit quiescence if you need to resolve a HTLC to prevent a force close? Given that we don't settle our in-flight HTLCs, seems to me that a badly timed quiescence with some almost-expired HTLCs could be unfortunate unless you're paying attention to what's in your quiesced channel.

@ProofOfKeags
Copy link
Contributor

ProofOfKeags commented May 8, 2024

Weighing in late here, but is it worth adding an advisory note about disconnecting to exit quiescence if you need to resolve a HTLC to prevent a force close? Given that we don't settle our in-flight HTLCs, seems to me that a badly timed quiescence with some almost-expired HTLCs could be unfortunate unless you're paying attention to what's in your quiesced channel.

Yes this is an important observation.

I think we do need to add an advisory note recommending that nodes disconnect if quiescence lasts "too long" at their discretion.

Nominally a quiescence session should not take more than a handful of seconds, as any negotiation that takes place on a quiesced channel should be able to complete imminently. Given that all of the HTLCs are block bound this should be improbable.

One way to combat the edge case though would be that if a channel is quiesced and a new block comes in (which would be what usually triggers any HTLC timeout action), then we can use a new block as a cue to cycle the connection on quiesced channels. I'll need to think a bit more about the consequences of this but I think it would be simpler to implement than a quiescence controller that was fully "htlc aware"

@t-bast
Copy link
Collaborator

t-bast commented May 13, 2024

Agreed, on our side we have a plain duration-based timeout: once quiescence is initiated, we start a 90 seconds timer (that value can be overridden by the node operator) that is tied to the quiescence session (and include the quiescent action, e.g. a splice). If that timer ticks before we get out of quiescence, we disconnect.

@remyers
Copy link

remyers commented Jun 3, 2024

Using a version of Eclair modified for testing quiescence at commits 87b141f and 2fcdc4a I performed interop testing of quiescence with the LND quiescence branch ProofOfKeags:feature/stfu at commit 9225d29.

I confirmed the following basic scenarios played out as expected:

  1. LND sends stfu and Eclair sends add_update_htlc before responding with stfu
    Both nodes cross-sign the update
    Eclair times out waiting for splice_init, sends a warning and disconnects
    After reconnect the nodes settle the pending payment

  2. same as 1, but Eclair also sends update_fee after responding with stfu
    LND sends a warning and disconnects
    After reconnect the nodes settle the pending payment

  3. LND sends add_update_htlc and Eclair sends stfu before sending revoke_and_ack + commit_sig
    LND sends revoke_and_ack before sending stfu

From these tests I feel comfortable about the interoperability between LND and Eclair. I think we should be good to merge this PR after addressing clean-up comments.

rustyrussell and others added 2 commits June 11, 2024 17:14
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>




Header from folded patch 'bolt_2__set_an_initiator_in_quiescence.patch':

BOLT #2: Set an initiator in quiescence.

This is especially useful for protocols such as splicing; for
simplified commitment transactions, there is already an implied
initiator at each point, so having the negotiation at splicing
time would be redundant.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>



Header from folded patch 'option_quiesce__feature_to_support_stfu_method.patch':

option_quiesce: feature to support stfu method.

In practice, sftu is useless unless you have something (e.g. channel_upgrade)
which uses it, but adding a feature is best practice IMHO.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell
Copy link
Collaborator Author

Rebased, merged rustyrussell#15

@rustyrussell rustyrussell force-pushed the quiescence-protocol branch from a68d7b0 to b474755 Compare June 11, 2024 07:49
Copy link
Collaborator

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK b474755, I believe this is ready to go 🚀

We may want to add a paragraph to clarify #869 (comment), or that can be a follow-up PR.

…ith HTLCs pending.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell
Copy link
Collaborator Author

Added timeout requirement, as agreed.

@rustyrussell rustyrussell merged commit c562d91 into lightning:master Jun 17, 2024
Comment on lines +1468 to +1469
Both nodes:
- MUST disconnect after 60 seconds of quiescence if the HTLCs are pending.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this has already been merged but I do not believe this is a MUST. This is up to node policy and can be initiated by either side. Especially considering that time is both relative and difficult in distributed systems this ought to be a SHOULD

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "pending" here? I believe it is a different definition than the one above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants