Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(release) - log an error if the blockchain voted for version higher than the node's #11928

Closed
wacban opened this issue Aug 13, 2024 · 8 comments · Fixed by #12289
Closed
Labels
C-good-first-issue Category: issues that are self-contained and easy for newcomers to work on.

Comments

@wacban
Copy link
Contributor

wacban commented Aug 13, 2024

During the binary release the protocol goes through a voting where it agrees to upgrade to the next protocol version in the next next epoch. For more details please see https://github.com/near/nearcore/blob/master/docs/practices/protocol_upgrade.md.

Often times some nodes are not upgraded. Those nodes crash as soon as the protocol upgrade happens. This is undesired for multiple reasons:

  • The blockchain loses some of its capacity.
  • The validator will miss blocks and chunks, lose rewards and be kicked out, losing even more rewards.
  • The crashing nodes introduce noise during the sensitive period of protocol upgrade.

The goal of this issue is to print an error message and/or otherwise signal to the node operator that the blockchain will upgrade to a version higher than the version supported by the node.

@wacban wacban added the C-good-first-issue Category: issues that are self-contained and easy for newcomers to work on. label Aug 13, 2024
@eagr
Copy link
Contributor

eagr commented Oct 22, 2024

New to NEAR, see if I'm getting this right. :)

Since voting happens in the last block every epoch, at the beginning of each epoch E, we could check protocol_version(E+1). If the result is greater than the protocol version of the client, ideally we should signal the node operator right away. About logging the error, once or repeatedly, if repeatedly, how often?

@wacban
Copy link
Contributor Author

wacban commented Oct 22, 2024

Hi @eagr! Welcome to near!

I think you got it right but just to rephrase

  • last block of E-1 - voting for the protocol_version(E+1)
  • in E+0 we can check protocol_version(E+1) <= PROTOCOL_VERSION and if not log an error

For reference this doc should be somewhat up to date:
https://github.com/near/nearcore/blob/master/docs/practices/protocol_upgrade.md

ideally we should signal the node operator right away

Yes! In a perfect world they would upgrade their node in time for E+1. Sometimes it will still be too late but often they will be able to operate without a break.

About logging the error, once or repeatedly, if repeatedly, how often?

Good question, I think we can go all in and log it on every block in E. That would also likely be the simplest.

@eagr
Copy link
Contributor

eagr commented Oct 22, 2024

last block of E+1 - voting for the protocol_version(E+1)

I think you meant to write "last block of E-1" no? If so, yeah that sums up my understanding perfectly.

Good question, I think we can go all in and log it on every block in E. That would also likely be the simplest.

I was thinking something like once every epoch, considering version V+1 must be compatible with V and each release would take a week or two, so there would be plenty of heads-up during that period no? But obviously I don't have full picture here, so I'll trust you judgement. ;)

@wacban
Copy link
Contributor Author

wacban commented Oct 22, 2024

I think you meant to write "last block of E-1" no? If so, yeah that sums up my understanding perfectly.

Yeah, sorry, fixing it..

I was thinking something like once every epoch, considering version V+1 must be compatible with V and each release would take a week or two, so there would be plenty of heads-up during that period no? But obviously I don't have full picture here, so I'll trust you judgement. ;)

The node can only notice that it's behind once the voting begins. If voting starts in E-1, that gives at best 1.5 epochs (on average) heads up. To make the implementation simple we can just wait for the voting to finish in the last block of E-1. The node would then only have this one epoch, E, to upgrade. In that case I think it's fair to error on every block. Others may have other opinions of course :) Logging error on every block also happens to be the simplest to implement so I think it's a good starting point.

We could take it a step further and have nodes announce their new protocol version even before the official voting. This would make it significantly more complex, to the point where it's not worth the effort.

@eagr
Copy link
Contributor

eagr commented Oct 23, 2024

The node can only notice that it's behind once the voting begins.

I was thinking we should issue the warning until we have the voting result. Now that I think of it, it's only a matter of time that the network will reach consensus (otherwise there’s a much bigger problem to worry about, like an attack or a rebellion), so yeah it'd be nice if we could give an even earlier heads-up. But for now, I’ll just go with the simple way.

Technically, this is not an error, isn’t a warn loud enough?

@wacban
Copy link
Contributor Author

wacban commented Oct 23, 2024

Technically, this is not an error, isn’t a warn loud enough?

I'm fine with either, I think error has a higher chance of attracting attention and triggering automation. If the node operator does not act promptly their validator will be kicked out and lose money so error may just be justified.

@eagr
Copy link
Contributor

eagr commented Oct 23, 2024

Any pointers for other issues you see fit for me? Let's keep the ball rolling ;)

github-merge-queue bot pushed a commit that referenced this issue Oct 24, 2024
closes #11928 

do we need testing for this? @wacban
@wacban
Copy link
Contributor Author

wacban commented Oct 24, 2024

Sure, let me tag you on one :)

github-merge-queue bot pushed a commit that referenced this issue Oct 24, 2024
closes #11928 

do we need testing for this? @wacban
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-good-first-issue Category: issues that are self-contained and easy for newcomers to work on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants