This repository has been archived by the owner on Feb 23, 2022. It is now read-only.

PBTS: should synchrony parameters be adaptive? #371

Closed
cason opened this issue Nov 26, 2021 · 24 comments · Fixed by tendermint/tendermint#7739

Comments

@cason

cason commented Nov 26, 2021

In order to ensure that blocks have timestamps that reflect the actual real time at which a block was proposed, PBTS adopts two synchrony parameters for validating proposal times:

  • PRECISION bounds the difference between values read from the clocks of different processes at the same instant of real time. In terms of validating proposals, it is meant to prevent Byzantine validators from assigning times in the future to proposals.

  • MSGDELAY bounds the time it takes for a Proposal message to be delivered to all (correct) validators. In terms of validating proposals, it is meant to prevent Byzantine validators from assigning times in the past to proposals.

While these parameters are meant to prevent Byzantine proposers from assigning arbitrary times to proposed values, a mistaken choice of values for them can represent a liveness issue. This is particularly relevant for the MSGDELAY parameter, which is expected to be a conservative estimate of the maximum delay for broadcasting a Proposal. If a Tendermint network adopts a value for the MSGDELAY parameter that is likely to be violated in real executions, correct validators may reject proposals from correct proposers because they are deemed untimely. More specifically, a correct validator will assume that the proposer assigned a timestamp in the past to the proposal, a malicious behavior, while the proposer was actually correct but the Proposal took longer than expected to be received.
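
For concreteness, here is a rough sketch of the check these two parameters induce, following the PBTS specification (Go, with illustrative names; a sketch of the rule, not the actual implementation):

```go
package pbts

import "time"

// isTimely sketches the PBTS timeliness check. proposalTime is the
// timestamp the proposer assigned to the proposal; receiveTime is the
// instant at which this validator received the Proposal message, read
// from its local clock.
func isTimely(proposalTime, receiveTime time.Time, precision, msgDelay time.Duration) bool {
	// PRECISION bounds proposal times in the future: the proposal time
	// may be at most PRECISION ahead of the local receive time.
	lowerBound := proposalTime.Add(-precision)
	// MSGDELAY (plus PRECISION) bounds proposal times in the past: the
	// Proposal must be received within MSGDELAY of its timestamp.
	upperBound := proposalTime.Add(msgDelay + precision)
	return !receiveTime.Before(lowerBound) && !receiveTime.After(upperBound)
}
```

The failure mode described above is the second bound: if the real network delay exceeds MSGDELAY, receiveTime lands after upperBound and a correct proposer's proposal is rejected.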

An alternative to circumvent this scenario, i.e., bad choices for the synchrony parameters, is to relax the assumptions as rounds of consensus fail, so that eventually the timeliness requirements will not prevent a correct validator from accepting a value proposed (and timestamped) by a correct proposer.

A reference for how to relax timing parameters is already present in Tendermint, as the duration of the timeouts is an increasing function of the round number. As rounds fail, and new rounds are required, the timeouts become longer, so that eventually they stop expiring prematurely. The same approach could be adopted for the synchrony requirements of PBTS, but we would need to specify how the bounds are relaxed as the round number increases.
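
For reference, the existing mechanism is essentially linear; a sketch with illustrative names (the actual knobs are the timeout_propose and timeout_propose_delta configuration values):

```go
package pbts

import "time"

// proposeTimeout grows linearly with the round number: a base duration
// plus a fixed per-round increment, so repeatedly failing rounds get
// progressively longer timeouts.
func proposeTimeout(round int32, base, delta time.Duration) time.Duration {
	return base + time.Duration(round)*delta
}
```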

Another alternative, mentioned in discussions, is to restrict the adoption of timing requirements for validating proposal times to a number of rounds, and then drop them. This would represent a conceptually similar approach, but instead of slowly relaxing the requirements, they would be "fully" relaxed (dropped) at some point.

@williambanfield
Contributor

williambanfield commented Nov 29, 2021

My first response is to strongly agree that there should be a way to adapt the synchrony parameters over time. This is effectively a hard requirement: without it, misconfigured consensus parameters cannot easily be changed. I.e., if you set the MSGDELAY too low, how do you create a timely proposal for changing the MSGDELAY?

If I'm reading correctly, there are two possible proposals here:

  1. Drop the timely check after a series of rounds, perhaps configurable by the network.
  2. Increase the MSGDELAY bound by some IncreaseΔ (either linear or exponential).

While proposal (1) was partially my idea, I'm now more in favor of (2). (2) allows for the liveness needed to update potentially misconfigured values while not completely losing synchrony in bad rounds. It also allows you to calculate how large the synchrony bound was for each height by using the MSGDELAY and Δ values along with the round number at which the block was produced. This has a few nice properties. First, networks can tune their MSGDELAY using this information to make the network produce blocks faster by enlarging the MSGDELAY. Second, external clients can also calculate how 'skewed' the network's clock may be, which may be useful for calculations using time.
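
For illustration, the two shapes these schedules could take, sketched in Go with hypothetical names:

```go
package pbts

import "time"

// linearMsgDelay adds a fixed increment per round: MSGDELAY + round*Δ.
func linearMsgDelay(msgDelay, delta time.Duration, round int32) time.Duration {
	return msgDelay + time.Duration(round)*delta
}

// exponentialMsgDelay doubles the bound every round: MSGDELAY * 2^round.
// A real implementation would cap the exponent to avoid overflow.
func exponentialMsgDelay(msgDelay time.Duration, round int32) time.Duration {
	return msgDelay * (time.Duration(1) << uint(round))
}
```

Since a block's commit records the round at which it was decided, an external client can recompute the effective bound for any height from the consensus parameters alone.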

@josef-widder
Contributor

In principle there would also be a third option:

  3. After a series of rounds, increase the MSGDELAY bound by some IncreaseΔ (either linear or exponential).

We should rule out proposal (1), as in such instances no relation between block time and real time (or a reference time) can be ensured. (2) ensures that there is a relation; the potential difference just becomes larger.

(3) might be the most complex one to implement. But it would allow keeping the tight synchronization, e.g., in the case where the proposer of round 0 has crashed. At the same time it will ensure liveness after some rounds.

@cason
Author

cason commented Dec 3, 2021

While I think that using adaptive synchrony parameters means giving up on synchrony, I see the need for a way to prevent a system from halting due to bad choices of parameters.

After sleeping on the matter, I tend to be more aligned with option (3) proposed by Josef. Like option (1), it has no impact on the regular operation of the protocol, as the original (static) timely predicate will be adopted in practice. But, unlike option (1), we don't fully drop all timing bounds (and guarantees) in case of bad choices of parameters.

For me, two questions remain open:

  1. When, i.e., at which round, should we start relaxing the MSGDELAY parameter? Is it worth allowing this to also be a parameter?
  2. Once started, how progressively should this relaxation occur?

For the first question, we might consider the number of failed rounds that indicates that MSGDELAY is actually the reason for the failure of the previous rounds. Should it be something related to the number n of validators, or should it be fixed?

For the second question, the possibilities are really open. We could just double MSGDELAY after reaching this round limit, then double it again after some other higher-numbered round (maybe after the same number of rounds as in the first case). I mention this mechanism because I don't think that relaxing the requirements at each new round is a good approach; again, rounds tend to fail for other reasons.

In terms of implementation, I don't think it would be complicated, and it can be precisely defined. I assume something like: if r > round_limit, then add to the upper bound for v.time some value that is a function of MSGDELAY and r.
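
Sketched in Go, with round_limit and the relaxation schedule as hypothetical placeholders:

```go
package pbts

import "time"

// timelyUpperBound returns the latest acceptable receive time for a
// proposal carrying timestamp proposalTime at round r. Up to round_limit
// the static bound applies; beyond it, a term that is a function of
// MSGDELAY and r widens the bound.
func timelyUpperBound(proposalTime time.Time, r, roundLimit int32, precision, msgDelay time.Duration) time.Time {
	bound := proposalTime.Add(msgDelay + precision)
	if r > roundLimit {
		// Placeholder schedule: one extra MSGDELAY per round past the
		// limit. The actual function is exactly what this issue is discussing.
		bound = bound.Add(time.Duration(r-roundLimit) * msgDelay)
	}
	return bound
}
```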

@josef-widder
Contributor

OK, then let's introduce 'round_limit' and double the factor applied to 'MSGDELAY' in the timely function every 'round_limit' rounds.
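
Sketched with hypothetical names, that rule would be:

```go
package pbts

import "time"

// effectiveMsgDelay doubles MSGDELAY once every roundLimit rounds:
// MSGDELAY * 2^(r / roundLimit), using integer division. E.g., with
// roundLimit = 10, rounds 0-9 use MSGDELAY, rounds 10-19 use
// 2*MSGDELAY, rounds 20-29 use 4*MSGDELAY, and so on.
func effectiveMsgDelay(msgDelay time.Duration, r, roundLimit int32) time.Duration {
	return msgDelay * (time.Duration(1) << uint(r/roundLimit))
}
```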

I guess 'round_limit' should be the same at all validators, so we might need to make it a consensus parameter? @williambanfield, what do you think?

@cason
Author

cason commented Dec 7, 2021

Maybe we can take the opportunity to adopt clearer, more understandable names for these parameters. A first proposal:

  • CLOCK_PRECISION
  • PROPOSAL_DELAY (though, to be more precise, it would be a latency)
  • A better name for 'round_limit'

@williambanfield
Contributor

I think that the idea of adding a 'round limit' value at which the MSGDELAY is doubled is generally a good one. I'm not sure that we should actually give this to validators/networks as a tunable parameter. More tunable parameters mean more confusion for developers and chains, and it's fairly unclear how, as an application developer or validator on a chain, this parameter will actually affect the chain. I think it would be fine for us to pick a reasonable value for this, or do it in terms of the size of the validator set.

A possible option would be to do it after 1/3+ of the validator set has had a chance to propose. This way, there must have been some correct process that attempted to propose but was not able to. Thoughts?

@cason
Author

cason commented Dec 7, 2021

Not really a suggestion, but this value or parameter must be known by light clients and by anyone planning to use block times for something. Associating it with the number of validators seems reasonable, but this also means that its value will be dynamic...

@cason
Author

cason commented Dec 7, 2021

A point for having this round limit as a parameter is to consider other uses it can have (possibly in the future), other than relaxing the synchrony requirements.

@cmwaters
Contributor

cmwaters commented Dec 8, 2021

I think it would be fine for us to pick a reasonable value for this, or do it in terms of the size of the validator set.

I do think that this makes sense and I like the idea of doubling it after 1/3+ of the validating power have proposed untimely headers.

Not really a suggestion, but this value or parameter must be known by light clients or anyone planning to use block times for something

At the moment the light client never reads the consensus params (which it definitely could do), neither to understand the evidence age (and derive the trusting period from it) nor to understand the timing params (which it uses as part of verification). They are currently all set manually by the operator and need to be updated each time these parameters change. We could definitely consider something more dynamic in the future.

A point for having this round limit as a parameter is to consider other uses it can have (possibly in the future), other than relaxing synchronous requirements.

I think there's probably some validity to what you're saying, but I'm more of the opinion that we shouldn't be using future uses that we currently don't have an idea about as justification for adding parameters. Adding fields is generally a lot easier than removing them, so it's often better to wait until there are actual solid use cases before introducing a field.

@cason
Author

cason commented Dec 9, 2021

doubling it after 1/3+ of the validating power have proposed untimely headers

I don't think that this proposal is possible, because it requires knowledge of why a round didn't succeed. And this, to my understanding, will be an internal derivation that will be hard to "export". Of course, we can approximate that 1/3+ by considering n/3 rounds, where n is the current number of active validators.

@josef-widder
Contributor

If adding it as a parameter adds too much complexity, can we just fix round_limit to 10?

@cason
Author

cason commented Dec 13, 2021

If adding it as a parameter adds too much complexity, can we just fix round_limit to 10?

While I agree with this solution, I still have to ask: why 10?

@williambanfield
Contributor

I don't think that this proposal is possible, because it requires knowledge of why a round didn't succeed. And this, to my understanding, will be an internal derivation that will be hard to "export". Of course, we can approximate that 1/3+ by considering n/3 rounds, where n is the current number of active validators.

To clarify here, I don't think we need to actually know why the consensus round failed in order to use this, since it's just an estimate. Picking a value like 10 is fine with me as well. This value just exists to ensure liveness in the case of misconfiguration, not to be theoretically perfect.

cason added a commit that referenced this issue Dec 13, 2021
@josef-widder
Contributor

Not sure about 10 ;-) From a theoretical viewpoint, liveness should be ensured for all values of round_limit, no?

We could also try to actually find a good approximation for +1/3, e.g., assuming the current voting powers, how long it takes to complete a "round-robin" that includes all validators as proposers, and perhaps try to understand what that means if the number of validators goes from 150 to 200?

@ancazamfir
Contributor

iiuc, we want to start increasing MSGDELAY after round_limit rounds.
If the chain has a good MSGDELAY value then we should rarely see this mechanism kicking in (probably only in periods of high network latency).
If the chain was misconfigured with a very low MSGDELAY, multiple rounds and the adaptive MSGDELAY will be seen frequently. In this case, do we agree that the long-term solution is a chain upgrade with better parameters? And that the adaptive mechanism is a diagnostic and temporary solution?
Is there any user feedback we offer that could help understand the different cases? (Maybe we should start taking advantage of the consensus events and offer some tooling around them?)
I know I asked this before, but from my experience chains tend to configure pretty conservative values. We frequently see, for example, IBC clients configured with 10sec-2hr clock drift. imho it is more likely that MSGDELAY will be many times larger than it should be. What are the implications in this case?

@cason
Author

cason commented Dec 15, 2021

If the chain has a good MSGDELAY value then we should rarely see this mechanism kicking in (probably only in periods of high network latency).

It should never kick in. MSGDELAY was conceived as a synchrony parameter, whose violation means that the system does not adhere to the model.

In this case, do we agree that the long-term solution is a chain upgrade with better parameters? And that the adaptive mechanism is a diagnostic and temporary solution?

Definitely. The proposed approach is a workaround for tolerating bad consensus parameters. The point that @williambanfield has made from the beginning is that it will not be possible to update the MSGDELAY parameter if we are not able to accept blocks due to a previous bad choice for this parameter.

@cason
Author

cason commented Dec 15, 2021

Is there any user feedback we offer that could help understand the different cases?

This is a very good point, which we should address. My opinion in this regard is that a bad MSGDELAY should put the system into clearly impractical operation, with successive heights taking tens of rounds to commit. The operator should, at this point, understand what is going on and why the system became so slow.

This was one of the reasons why I prefer the approach of not making this parameter truly adaptive, i.e., not updating it often, after each failed round, so that when it is relaxed (after round_limit rounds) it is noticeable by the user.

cason added a commit that referenced this issue Dec 15, 2021
cason pushed a commit that referenced this issue Dec 15, 2021
* PBTS: second version of system model

* PBTS: new model referred in algorithm spec

* PBTS: removed model discussion from algorithm spec

* PBTS: corrections on the system model

* PBTS: a pretty complex problem statement

* PBTS: minor fixes on the problem specification

* PBTS: liveness part of problem specification

* PBTS: link updated, outdated note on sysmodel_v1

* Update spec/consensus/proposer-based-timestamp/pbts-algorithm_002_draft.md

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>

* Apply William's suggestions from code review

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>

* PBTS: new discussion and definition for accuracy

* Apply Josef's suggestion from code review

Co-authored-by: Josef Widder <44643235+josef-widder@users.noreply.github.com>

* PBTS: some tags added to sysmodel

* PBTS: motivation and link to Issue #371

* PBTS: fixing lint error

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Josef Widder <44643235+josef-widder@users.noreply.github.com>
@williambanfield
Contributor

williambanfield commented Dec 16, 2021

I came up with a simple prototype that uses the amount of voting power seen on the network to determine the synchrony bound for the block timestamp. This prototype can be seen here:

tendermint/tendermint@017d45f

This would add the requirement that light clients wishing to verify the synchrony of a particular block be able to determine the validator 'priorities' in a trusted way at a given height. These can be queried via the /validators RPC endpoint, but I'm not sure if we consider this trusted for light clients?

The light client would then run the proposer selection algorithm starting at round 0 up to the round at which the block was decided, keeping track of the total amount of voting power seen. Every time the light client observes that 1/3+ of the voting power has proposed, it would add an additional MESSAGEDELAY to the synchrony bound that it assumes for the block.

This mechanism adds some complexity to light clients over the hardcoded alternative, but has the upside of being more capable of adapting to the size of the network. I think that if we don't mind adding this complexity to light clients, this is a somewhat cleaner alternative.
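
For illustration, a self-contained sketch of the light-client side of this computation; the types only mimic the shape of Tendermint's proposer-priority rotation and are not the prototype's actual code:

```go
package lightsketch

// validator is a minimal stand-in for a validator-set entry.
type validator struct {
	votingPower int64
	priority    int64
}

// nextProposer mimics the rotation: every validator gains priority
// proportional to its power, and the highest-priority validator proposes
// and pays back the total power. (Real code also breaks ties; this sketch
// assumes a non-empty set.)
func nextProposer(vals []*validator) *validator {
	var proposer *validator
	var total int64
	for _, v := range vals {
		v.priority += v.votingPower
		total += v.votingPower
		if proposer == nil || v.priority > proposer.priority {
			proposer = v
		}
	}
	proposer.priority -= total
	return proposer
}

// thirdsSeen replays proposer selection from round 0 up to the decision
// round, counting how many times a cumulative 1/3+ of the total voting
// power has had a chance to propose. The block's assumed synchrony bound
// then grows by one MESSAGEDELAY per increment of the returned counter.
func thirdsSeen(vals []*validator, decisionRound int32) int64 {
	var total, seen, thirds int64
	for _, v := range vals {
		total += v.votingPower
	}
	for r := int32(0); r <= decisionRound; r++ {
		seen += nextProposer(vals).votingPower // counts repeat proposers; a sketch only
		if 3*seen > total {
			thirds++
			seen = 0
		}
	}
	return thirds
}
```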

@cason
Author

cason commented Dec 20, 2021

Nice work. I have to say that this solution looks surprisingly simple, in the sense that I assumed it would require more lines and more complex tricks.

From a specification viewpoint, however, I would prefer to alter the TimingParams field provided to the method, since it seems to be passed by value, instead of making this computation (MESSAGEDELAY *= k) inside the method. This would allow generating a log message when we pass the round limit and are therefore relaxing the timing parameters. It would also make clear to whoever is reading the code why we are, in some circumstances, relaxing the timing parameters, as in the proposed implementation we would have thirdsSeen == 0 in most regular cases.

In other words, I would make this change more visible to the code reader; a less clever solution, but a clearer one.

@cason
Author

cason commented Dec 20, 2021

From the light client's point of view, my first question is: how much does it care about the MESSAGEDELAY parameter used?

In regular operation (one or a few rounds per height) we don't need all this complexity. But if the round at which a block was decided is not zero (or is above some limit), the client should ask itself why, and then use the method to retrieve how many times MESSAGEDELAY was relaxed.

Again, I think this solution is great, but I would make it clearly a non-regular case, something inside an if clause, in order to make this explicit, to possibly generate some log message, and to allow the client to react to it, maybe skipping this block as it is not as reliable as a block decided in round zero (or one, or two).
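
Something along these lines, sketched with illustrative names, keeping the regular case untouched and putting the relaxation behind an explicit, logged branch:

```go
package pbts

import (
	"log"
	"time"
)

// effectiveDelay keeps the static parameter on the regular path and makes
// the relaxation an explicit branch, so both the code reader and an
// operator watching the logs see when, and why, the bound was widened.
func effectiveDelay(msgDelay time.Duration, round, roundLimit int32) time.Duration {
	if round <= roundLimit {
		return msgDelay // the regular case: static synchrony parameters
	}
	relaxed := msgDelay * (time.Duration(1) << uint(round/roundLimit))
	log.Printf("relaxing MSGDELAY for timely check: round=%d effective=%s", round, relaxed)
	return relaxed
}
```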

@josef-widder
Contributor

I think what we should do now is to precisely specify what we can expect from a block, e.g., if the commit is for round 1 the time is within x of real time, and if the commit is for round 100, the time is within y. Given that trusting_period << unbonding_period, we don't need to change anything now, as the difference between y and x may be in the order of seconds, while the periods differ in the order of days.

However, as a next step, once we have the precise properties of the block time with respect to the round number, we can plan to add complexity to the light client and/or to the analysis of the light client's security.

@williambanfield
Contributor

Nice work. I have to say that this solution looks surprisingly simple, in the sense that I assumed it would require more lines and more complex tricks.

I was worried about this as well, which is part of why I did the prototype: I wanted to see if it was really that bad. Although, the complexity is somewhat masked by the fact that the proposer-selection algorithm is already implemented here.

This would allow generating a log message when we pass the round limit and are therefore relaxing the timing parameters.

Yeah, we should definitely log this to the user and make it clear when it changes. The updated IsTimely method was just to quickly determine if this was possible, but I'm very open to making it clearer.

I would make it clearly a non-regular case, something inside an if clause, in order to make this explicit, to possibly generate some log message, and to allow the client to react to it, maybe skipping this block as it is not as reliable as a block decided in round zero (or one, or two).

Agreed, this is not the regular case and we can make that obvious to users and readers of the code. I am waiting on feedback from 1-2 application developers, but I still believe that this 'thirdsSeen' mechanism is a good way forward.

@williambanfield
Contributor

williambanfield commented Jan 12, 2022

After further consideration, I think that my solution will not work for the following reason:
My solution relies on the ProposerPriority field of the validator set. This field is returned in the response from the /validators RPC endpoint. This priority value is needed to determine how much voting power of the network has proposed by some round, so a light client could fetch it from an RPC endpoint. However, this field is not hashed into the validator set hash that is included in the block, and is therefore not actually verifiable. Light clients could thus fetch this value but could not actually trust it.

Since this solution is proving to be even more difficult than I had anticipated, I would advocate against using it, and instead for using a simple hardcoded value. To build confidence in this decision, I spoke with @ValarDragon from Osmosis, who similarly doesn't see much issue with the hardcoded value. I also spoke with @liamsi, who is still investigating but seems to be leaning in a similar direction. Let's hardcode 10 rounds before relaxation. We can reconsider during the next release if needed.

mergify bot pushed a commit to tendermint/tendermint that referenced this issue Feb 2, 2022
This change adds logic to double the message delay bound after every 10 rounds. Alternatives to this somewhat magic number were discussed. Specifically, whether or not to make '10' modifiable as a parameter was discussed. Since this behavior only exists to ensure liveness in the case that these values were poorly chosen to begin with, a method to configure this value was not created. Chains that notice many 'untimely' rounds per the [relevant metric](#7709) are expected to take action to increase the configured message delay to more accurately match the conditions of the network. 

closes: tendermint/spec#371
@williambanfield
Contributor

Implemented in tendermint/tendermint#7739, closing this issue.

evan-forbes pushed a commit to celestiaorg/celestia-core that referenced this issue Feb 2, 2022