
Add upgrade module #4233

Merged

Conversation

@aaronc (Member) commented Apr 29, 2019

This is a reopening of PR #3979, which was closed when the develop branch was removed. See that PR for previous discussion.

Upgrading live chains has been previously discussed in #1079, and there is a WIP spec in #2116. Neither of these provides an actual implementation of how to coordinate a live chain upgrade at the software level. My understanding and experience with Tendermint chains is that, without a software coordination mechanism, validators can easily get into an inconsistent state, because they all need to be stopped at precisely the same point in the state machine cycle.

This PR provides a module for performing live chain upgrades that has been developed for Regen Ledger and tested against our testnets. It may or may not be what the Cosmos SDK wants, but I'm sharing it here in case it is...

This module attempts to take a minimalist approach to coordinating a live chain upgrade and can be integrated with any governance mechanism. Here are a few of its features (a wiring sketch follows the list):

  • allows upgrades to be scheduled at a future block height or after a future block time
  • crashes the blockchain state machine in BeginBlock when an upgrade is required and doesn't allow it to restart until new software with the expected upgrade is started
  • provides a hook for performing state migrations once upgraded software is started
  • allows for custom "crash" behavior that could be used to trigger automatic installation of the new software
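To make the wiring concrete, here is a minimal sketch of the app-side integration, following the pattern in the package docs; the handler name is made up, and exact signatures may differ from the final merged code:

```go
package app

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	"github.com/cosmos/cosmos-sdk/x/upgrade"
)

// registerUpgradeHandler wires a migration hook into the upgrade keeper.
// The handler runs exactly once, in the first BeginBlock executed by the
// new binary. If a plan named "v2-upgrade" comes due and no handler with
// that name is registered, the module panics in BeginBlock and the node
// halts until upgraded software is started.
func registerUpgradeHandler(k upgrade.Keeper) {
	k.SetUpgradeHandler("v2-upgrade", func(ctx sdk.Context, plan upgrade.Plan) {
		// perform state migrations here, e.g. rewrite store entries
		// whose encoding changed between versions
	})
}
```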

This PR doesn't currently include any integration with the Cosmos gov module, but that could be easily done if this upgrade method works for Cosmos hub.

  • Linked to github-issue with discussion and accepted design OR link to spec that describes this work (linked to issues; the specification is described in the docs, which are live here: https://godoc.org/github.com/regen-network/regen-ledger/x/upgrade).
  • Wrote tests
  • Updated relevant documentation (docs/), including thorough Go package docs
  • Added a relevant changelog entry: sdkch add [section] [stanza] [message]
  • Re-reviewed Files changed in the GitHub PR explorer

For Admin Use:

  • Added appropriate labels to PR (ex. wip, ready-for-review, docs)
  • Reviewers Assigned
  • Squashed all commits, uses message "Merge pull request #XYZ: [title]" (coding standards)


@codecov (bot) commented Apr 29, 2019

Codecov Report

Merging #4233 into master will decrease coverage by 1%.
The diff coverage is 62.91%.

@@            Coverage Diff             @@
##           master    #4233      +/-   ##
==========================================
- Coverage   54.61%   53.61%   -1.01%     
==========================================
  Files         299      298       -1     
  Lines       18177    17897     -280     
==========================================
- Hits         9928     9596     -332     
- Misses       7464     7550      +86     
+ Partials      785      751      -34

@aaronc (Member, Author) commented May 6, 2019

@rigelrozanski After our chat last week, I wanted to mention that the approach here is really agnostic to whether it's governance or validator signaling that makes the upgrade happen. I think that really comes down to a larger governance discussion. All this module does is coordinate chain halts and restarts based on whoever calls ScheduleUpgrade. So ScheduleUpgrade could be called by the gov module, by a "validator readiness" module once an 80% threshold is reached, or maybe even by some combination of the two.
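Whatever mechanism wins that discussion, the call it makes is the same. A minimal sketch, assuming the Keeper and Plan types from this PR (the caller and error handling are illustrative):

```go
package app

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	"github.com/cosmos/cosmos-sdk/x/upgrade"
)

// scheduleAtHeight schedules the coordinated halt at a fixed block
// height. Either Height or Time can be set on the plan, and any module
// (gov, a validator-readiness module, etc.) can make this call.
func scheduleAtHeight(ctx sdk.Context, k upgrade.Keeper) error {
	return k.ScheduleUpgrade(ctx, upgrade.Plan{
		Name:   "v2-upgrade", // must match a handler registered by the new binary
		Height: 100000,       // halt once this height is reached
	})
}
```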

@ethanfrey (Contributor) left a comment:

Nice idea and a good start on providing migration/upgrade tooling.
I would love more comments and thoughts from the cosmos-sdk team, as "kill the blockchain, dump state, and start a new chain" doesn't seem like a viable long-term solution.

I wonder what happens to all the exchanges next time this happens....

The app must then integrate the upgrade keeper with its governance module as appropriate. The governance module
should call ScheduleUpgrade to schedule an upgrade and ClearUpgradePlan to cancel a pending upgrade.

Performing Upgrades
A contributor commented:

This is an interesting idea, and wonderful documentation: both these package-level comments and all the comments on types.

For halting, this works well. However, I think we need a more complete strategy for upgrades. I will expand on this on the issue. But yeah, this seems a nice first step.

A member commented:

Does this file need to get updated with the changes that have been made?

A collaborator commented:

we need to eventually transition or duplicate part of this godoc to a new docs/ guideline for how to upgrade a live chain. cc: @gamarin2 @hschoenburg


upgradeTime := keeper.plan.Time
upgradeHeight := keeper.plan.Height
if (!upgradeTime.IsZero() && !blockTime.Before(upgradeTime)) || upgradeHeight <= blockHeight {
A contributor commented:

Nice logic. I think the feature switch needs to be more integrated in the rest of the framework.

@ethanfrey (Contributor) commented:

I think this is a great beginning to the upgrade plans for cosmos-sdk.

What this does provide is a way to halt the chain in the future, at a predefined time/height schedule (smart), and a way to restart the chain once the new software is deployed. However, if I understand properly, it still requires: chain halts, validators deploy new software, chain restarts (as soon as >2/3 are live on the new code).

Also, if I want to replay the chain in order to reproduce it, I will have to run v1 until it halts, then replace it with v2, then start up again.

TL;DR: looks great. The key addition would be some switch so that, when syncing a chain later, v2 of the software can detect the upgrade point and run with v1 semantics until then.

I think a nice addition here would be something like "feature switches" (which may be possible with the current infrastructure, using the onUpgrade callback, but I am proposing more tooling here). For example, go-ethereum has a lookup of block heights at which certain features are enabled. These are then checked at points in the code to e.g. enable replay protection or not.

Rather than storing this in config files, one could store this information on-chain. This means v2 will query for "cip 21 enabled", for example, and decide whether to run in v1 or v2 mode. Then, when replaying a chain, v2 will run in v1 mode until the halt point. Since the handler is already registered, it will immediately execute onUpgrade (setting the feature switch to on) and continue the rest of the blocks using the new v2 functionality.
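A sketch of what that on-chain lookup could look like in an SDK module; this is hypothetical tooling, not part of this PR, and the store key layout is made up for illustration:

```go
package feature

import (
	"encoding/binary"

	sdk "github.com/cosmos/cosmos-sdk/types"
)

// Keeper stores, per feature name, the block height at which the
// feature becomes active. v2 code consults this at the relevant code
// paths, so a replaying node runs with v1 semantics below the switch
// height and v2 semantics at or above it.
type Keeper struct {
	storeKey sdk.StoreKey
}

// Enabled reports whether the named feature (e.g. "cip-21") is active
// at the current block height. The value is stored as a big-endian
// uint64 enable-height under "feature/<name>".
func (k Keeper) Enabled(ctx sdk.Context, name string) bool {
	bz := ctx.KVStore(k.storeKey).Get([]byte("feature/" + name))
	if len(bz) != 8 {
		return false // never enabled on this chain (or corrupt entry)
	}
	return ctx.BlockHeight() >= int64(binary.BigEndian.Uint64(bz))
}
```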

Then we also need a way to see whether users are sending v1 or v2 messages, and whether models are v1 or v2. We recently added migrations support in weave, and one side-effect was adding a Metadata/Schema attribute to top-level Message and Model structs (which are the ones we first get on deserialization). We then check this against which versions are enabled, possibly do on-the-fly upgrades, or simply reject wrong versions.

This code has not been tried out on a live chain yet, and likely has many points for improvement. Glad to get feedback and cross-pollination on such techniques.

@rigelrozanski (Contributor) commented May 10, 2019

brain dump from concepts discussed during an sdk core-dev call on chain upgrades:

  • folks seem agreeable to the idea of direct state updates as we've previously discussed
  • this aligns well with some of the ideas proposed in this PR for init-chain upgrades
  • an important point raised by the team is that we need the capability to halt both tendermint and the abci-application before restarting; each version of the application will often ship with a new version of both for a while.
    • due to this ^ we discussed that it would be positive to introduce a new middleware responsible for determining which versions of both the application and tendermint should be running, and between which blocks
    • version and block-range information could be loaded from a config; for dynamic upgrades the sdk could write to this config directly
      • this may have negative security implications; we ought to consider alternatives
    • this middleware would exist between the tendermint-abci and the application
    • this middleware would effectively require the power to shut down and restart tendermint and application instances, potentially by running these in containers.

CC @sunnya97 @jackzampolin @jaekwon @alexanderbez

@aaronc (Member, Author) commented May 15, 2019

@ethanfrey

> Also, if I want to replay the chain in order to reproduce it, I will have to run v1 until it halts, then replace it with v2, then start up again.

I have been working on a smooth solution to this using NixOS, which may be an option for nodes willing to run it. I could see something similar being done in a docker-based deployment scenario, or even with some sort of binary manager that downloads the correct binaries at upgrade points. Validators, however, might have more complicated setups that wouldn't be covered by these approaches. What I would suggest as a general principle is that, at a minimum, we create an easy replay recipe for casual users who want to run a full node. The replay path for validators may or may not be as simple.

> I think a nice addition here would be something like "feature switches" (which may be possible with the current infrastructure, using the onUpgrade callback, but I am proposing more tooling here). For example, go-ethereum has a lookup of block heights at which certain features are enabled. These are then checked at points in the code to e.g. enable replay protection or not.

I think the biggest challenge with doing a feature switch type approach is that it places quite a bit of burden on engineers to correctly code the feature switches. I actually think it would be good if projects were coded with that sort of discipline, but it might not be too realistic near term. Using some sort of binary switching (like I'm doing with NixOS) would make things a bit easier so that the exact same binary would be replayed at each phase of the upgrade.

@aaronc (Member, Author) commented May 15, 2019

@rigelrozanski

>   • an important point raised by the team is that we need the capability to halt both tendermint and the abci-application before restarting; each version of the application will often ship with a new version of both for a while.

Why does panicking in an ABCI handler like BeginBlock not halt Tendermint as well? Won't Tendermint's state machine fail to advance without some sort of non-error response from the ABCI app?

@rigelrozanski (Contributor) commented:

You're correct, I believe it should. What are you thinking? I'll also mention a point I forgot earlier: we need the capability for the validator set to hard fork the hub without governance approval under a broken-governance or last-resort scenario. Kind of like having manual controls for the blockchain, if need be.

@ethanfrey (Contributor) commented:

> I think the biggest challenge with doing a feature switch type approach is that it places quite a bit of burden on engineers to correctly code the feature switches. I actually think it would be good if projects were coded with that sort of discipline, but it might not be too realistic near term. Using some sort of binary switching (like I'm doing with NixOS) would make things a bit easier so that the exact same binary would be replayed at each phase of the upgrade.

I agree, and the overhead of testing multiple code paths and transitions is rather high.

I love the idea of triggering an OS-level (NixOS / docker / etc.) switch of the binary at some point. Like we register binaries with tags chain v1.0, chain v1.1 (maybe even self-built), and then the app (or better, some higher-level supervisor) can swap them out on some trigger from the chain.

Great idea

@aaronc (Member, Author) commented May 20, 2019

@rigelrozanski

> You're correct, I believe it should. What are you thinking?

Well, just that panicking in the ABCI app may be all that's needed to stop both the app and Tendermint. There might not be any special changes required at the Tendermint level.
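To illustrate, the halt boils down to something like this: a condensed paraphrase of this PR's BeginBlock logic, written inside the module's package using its Keeper and Plan types (not the literal diff):

```go
// BeginBlocker runs at the start of every block. When a scheduled plan
// comes due and this binary has no handler registered under the plan's
// name, the panic propagates out of the ABCI BeginBlock call, stopping
// the app and, with it, the Tendermint process it serves.
func (k Keeper) BeginBlocker(ctx sdk.Context) {
	plan, found := k.GetUpgradePlan(ctx)
	if !found || !plan.ShouldExecute(ctx) {
		return
	}
	handler, ok := k.upgradeHandlers[plan.Name]
	if !ok {
		panic(fmt.Sprintf("UPGRADE REQUIRED: %q at height %d",
			plan.Name, ctx.BlockHeight()))
	}
	handler(ctx, plan)      // run migrations exactly once
	k.ClearUpgradePlan(ctx) // then resume normal block processing
}
```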

> I'll also mention a point I forgot earlier: we need the capability for the validator set to hard fork the hub without governance approval under a broken-governance or last-resort scenario. Kind of like having manual controls for the blockchain, if need be.

Does Tendermint have some sort of "backdoor" that allows one to set the expected validator set outside of the ABCI process? Would that maybe be the main functionality needed to support a hard fork?

Another similar scenario that's occurred to me: what if some nondeterminism causes a consensus failure when there is no bad behavior, just bad code? I think a similar hard-fork-like fix would be needed, but in this case you might need to delete the last block because it causes a consensus failure on the ABCI side. The ability to import blocks only up to a certain height would support this. Although maybe it's not needed, because the failure is only on the ABCI app side, and the consensus failure can probably be fixed with an app upgrade without having to delete the failing Tendermint block.

@ethanfrey (Contributor) commented:

I second @aaronc here

Seems like it is doable, even if this upgrade path only covers 95% of cases (not a DAO-hack state revert or fending off ETC-fork craziness). For such extreme cases, some custom upgrade coordination would be needed, but that should be the exception, not the rule.

How are you going to adjust the inflation rate calculation logic gracefully? State dump and chain restart? I think this proposal would work there and provide a much nicer experience.

@aaronc (Member, Author) commented May 20, 2019

> Seems like it is doable, even if this upgrade path only covers 95% of cases (not a DAO-hack state revert or fending off ETC-fork craziness). For such extreme cases, some custom upgrade coordination would be needed, but that should be the exception, not the rule.

Yes, this PR is for the happy path. I agree there should be some mechanism to support the unhappy path, but let's make that a separate issue.

> How are you going to adjust the inflation rate calculation logic gracefully? State dump and chain restart? I think this proposal would work there and provide a much nicer experience.

@ethanfrey I'm not really familiar with how this happens. My hope has been to avoid state dumps because that approach causes transaction history to be lost (not viable for our use case).

@ethanfrey (Contributor) commented:

> @ethanfrey I'm not really familiar with how this happens. My hope has been to avoid state dumps because that approach causes transaction history to be lost (not viable for our use case).

Sorry for my unclear comment. I was asking how any changes to the cosmos hub can be made in its current state. I think this proposal would provide a way of gracefully upgrading binaries at the proper points, and thus not require a state dump/chain restart, as was done in the last hub upgrade. So far that seems to be the only existing path, and I encourage the core dev team to support useful tools that cover 90+% of upgrades, rather than dismiss them over some possible edge cases where they would not work (which would have to fall back to the current, extreme upgrade path anyway).

Basically, I am asking... @rigelrozanski, why has this proposal been frozen for many weeks without any real feedback (except a braindump saying you are generally cool with this line of thought)? If there is a serious design or code error here, please point it out. If not, it would be great to have a path forward on this.

(I have also been the victim of my PRs hanging for months with little to no feedback, and I think this doesn't encourage open source contributions outside of the core team. If you (cosmos/icf/all in bits) want open source contributions from the community, it would be good to give a bit more feedback, direction, and support to initiatives such as this one. I think some healthy feedback here could help evolve a very nice solution with input from all parties.)

@aaronc (Member, Author) commented May 28, 2019

After discussing a bit with @zmanian, one thing that is clear to me now that wasn't before is that it will be a while before the Tendermint block structure is stable. So while it may be possible to restructure state smoothly without creating a new chain, rewriting Tendermint blocks is impossible because signatures would be invalid.

So, this upgrade approach could still be useful for cases where an upgrade doesn't involve any breaking changes on the Tendermint side.

We are planning to test this with a public https://github.com/regen-network/regen-ledger testnet, hopefully as soon as next week.

An alternate idea proposed by @AFDudley for maintaining the continuity of transactions, even when a new chain needs to start from height 0, is to include some reference to the block hash and chain-id of the previous chain in the genesis file of the new chain. Then some sort of transaction indexer could build up a continuous transaction history.

But again, it doesn't sound like this issue negates the usefulness of this "happy path" upgrade support in cases where it will work.

@aaronc force-pushed the regen-network/upgrade-module branch 2 times, most recently from 9a2b150 to 0889aa6 on June 3, 2019
@alexanderbez mentioned this pull request Jun 5, 2019
@aaronc (Member, Author) commented Jun 18, 2019

We discussed the plans for doing a test of this upgrade module with Regen Network's testnet in our community meeting today. The planned timing is as follows:

  • have code ready for our community meeting next Tuesday and submit a governance proposal to do the upgrade with a pre-defined commit hash and upgrade time (not block height, but actual consensus time)
  • have 4 days to vote on the gov proposal
  • do the upgrade at the pre-determined time, possibly Mon, July 1st

We also discussed governance deciding on a predetermined time vs. upgrade signalling as proposed here: #1079 (comment). It was brought up that the downside of the signalling approach is that it effectively forces validators to race to get the upgrade done, and can produce anxiety: you can't predict how quickly others will upgrade, and if you are in the last third you could get slashed for being slow. So a predetermined time or block height decided in the governance process seemed preferable to those present, because it gives a sense of predictability and allows for planning.

In Berlin, I chatted briefly with @ebuchman about Tendermint stability. While there are some important changes coming, it seems there is the possibility and willingness to do this in such a way that "happy path" upgrades would still be possible. We could make that process easier to manage by running the Prototool breaking-change checker on the Tendermint block .proto definitions.

@aaronc mentioned this pull request Jul 2, 2019
@alexanderbez mentioned this pull request Jul 5, 2019
@aaronc (Member, Author) commented Jul 16, 2019

Please note that this PR will soon depend on #4724 in order to perform store migrations that can't be done within the ABCI methods (because the store won't even load without these migrations). #4724 handles cases where KVStores are deleted or renamed, as happened with v0.36.0.

Also, as a follow-up to our discussion last week @sunnya97, I want to point out that a managing process that downloads new binaries would work well on top of this upgrade module approach. At a very basic level, the managing process could watch the stdout of the gaiad/xrnd daemon and look for UPGRADE REQUIRED messages, which indicate the upgrade points where a new binary is needed.
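A bare-bones sketch of such a supervisor, assuming the daemon prints a line containing "UPGRADE REQUIRED" at the halt point (the binary name and message format here are assumptions, not a spec):

```go
package main

import (
	"bufio"
	"log"
	"os"
	"os/exec"
	"strings"
)

// Run the node and scan its stdout; when the upgrade marker appears,
// a real supervisor would install the new binary and restart the daemon.
func main() {
	cmd := exec.Command("gaiad", "start")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		line := scanner.Text()
		os.Stdout.WriteString(line + "\n") // pass output through
		if strings.Contains(line, "UPGRADE REQUIRED") {
			log.Println("upgrade point reached; install new binary and restart")
			break
		}
	}
	cmd.Wait() // the daemon exits after panicking at the upgrade height
}
```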

@fedekunze added the WIP label Jul 19, 2019

// QuerierKey is used to handle abci_query requests
QuerierKey = ModuleName

A contributor commented:
File is not gofmt-ed with -s (from gofmt)

upgradeHandlers map[string]types.UpgradeHandler
}


A contributor commented:

File is not gofmt-ed with -s (from gofmt)


@ethanfrey mentioned this pull request Nov 5, 2019
@ethanfrey (Contributor) commented:

@bez All issues should be addressed now, along with a few more detected while integrating with gaia.

Also, please check out cosmos/gaia#184 and run through the demo upgrade procedure (you will want a machine where ~/.gaiad and ~/.gaiacli don't contain anything valuable). It should give you much more confidence that this works well in production.

@alexanderbez mentioned this pull request Nov 8, 2019
@alexanderbez (Contributor) left a comment:

ACK

@alexanderbez merged commit d81d461 into cosmos:master on Nov 8, 2019
@fedekunze mentioned this pull request Nov 13, 2019
@ryanchristo deleted the regen-network/upgrade-module branch December 12, 2022