
Add upgrade module #3979

Closed

Conversation

@aaronc (Member) commented Mar 26, 2019

Upgrading live chains has previously been discussed in #1079 and there is a WIP spec in #2116. Neither of these provides an actual implementation of how to coordinate a live chain upgrade at the software level. My understanding and experience with Tendermint chains is that without a software coordination mechanism, validators can easily get into an inconsistent state, because they all need to be stopped at precisely the same point in the state machine cycle.

This PR provides a module for performing live chain upgrades that has been developed for Regen Ledger and tested against our testnets. It may or may not be what Cosmos SDK wants, but just sharing it in case it is...

This module attempts to take a minimalist approach to coordinating a live chain upgrade and can be integrated with any governance mechanism. Here are a few of its features (a sketch of the resulting BeginBlock flow follows this list):

  • allows upgrades to be scheduled at a future block height or after a future block time
  • crashes the blockchain state machine in BeginBlock when an upgrade is required and doesn't allow it to restart until new software with the expected upgrade is started
  • provides a hook for performing state migrations once upgraded software is started
  • allows for custom "crash" behavior that could be used to trigger automatic installation of the new software
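
For illustration, here is a minimal sketch of that BeginBlock flow in Go. Only GetUpgradePlan, SetUpgradeHandler, and the Plan fields (Name, Height, Time) appear in this thread; the handlers map, ClearUpgradePlan, and the exact due-check are assumptions about the implementation, not its actual API.

	// Sketch only: handler lookup and plan clearing use assumed names.
	func BeginBlocker(k Keeper, ctx sdk.Context) {
		plan, havePlan := k.GetUpgradePlan(ctx)
		if !havePlan {
			return
		}
		// A plan is due once its scheduled height or block time is reached.
		due := (plan.Height > 0 && ctx.BlockHeight() >= plan.Height) ||
			(!plan.Time.IsZero() && !ctx.BlockHeader().Time.Before(plan.Time))
		if !due {
			return
		}
		if handler, ok := k.handlers[plan.Name]; ok {
			// Upgraded binary: run the registered migrations, then clear
			// the plan so the chain resumes normal operation.
			handler(ctx, plan)
			k.ClearUpgradePlan(ctx)
			return
		}
		// Old binary: crash the state machine; it cannot restart until
		// software registering a handler named plan.Name is running.
		panic(fmt.Sprintf("UPGRADE %q NEEDED at height %d", plan.Name, plan.Height))
	}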

This PR doesn't currently include any integration with the Cosmos gov module, but that could easily be done if this upgrade method works for the Cosmos Hub.

  • Linked to github-issue with discussion and accepted design OR link to spec that describes this work (linked to issues, the specification is described in the docs which are live here: https://godoc.org/github.com/regen-network/regen-ledger/x/upgrade).
  • Wrote tests
  • Updated relevant documentation (docs/) - included through Go package docs
  • Added a relevant changelog entry: sdkch add [section] [stanza] [message]
  • rereviewed Files changed in the github PR explorer

For Admin Use:

  • Added appropriate labels to PR (ex. wip, ready-for-review, docs)
  • Reviewers Assigned
  • Squashed all commits, uses message "Merge pull request #XYZ: [title]" (coding standards)

codecov bot commented Mar 26, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@055d219).
The diff coverage is 100%.

@@            Coverage Diff             @@
##             develop    #3979   +/-   ##
==========================================
  Coverage           ?   60.03%           
==========================================
  Files              ?      215           
  Lines              ?    15248           
  Branches           ?        0           
==========================================
  Hits               ?     9154           
  Misses             ?     5450           
  Partials           ?      644

@jackzampolin (Member) commented Mar 26, 2019

@aaronc Please take a look at this implementation of another type of gov proposal. Might make sense to rebase this work there: #3880

@aaronc (Member, Author) commented Mar 26, 2019

@jackzampolin so this doesn't really integrate with the gov module at all currently. I was just showing how it could be integrated if so desired.

// GetQueryCmd creates a query sub-command for the upgrade module using cmdName as the name of the sub-command.
func GetQueryCmd(cmdName string, storeName string, cdc *codec.Codec) *cobra.Command {
	return &cobra.Command{
		Use: cmdName,
A Member commented:
maybe we call this list or show? the command would be gaiacli q upgrade show

@aaronc (Member, Author) commented Mar 27, 2019:

So in Regen Ledger it's just xrncli query upgrade-plan with no separate sub-command, but happy to add one if that makes sense. show would work

A Member commented:
Well here on the SDK we use the module client interface to export and automatically register CLI functionality. See gov for an example: https://github.com/cosmos/cosmos-sdk/blob/develop/x/gov/client/module_client.go#L17

@aaronc (Member, Author) commented:
I was going to do that initially, but I'm not exporting any tx command for this module because it should be handled by gov. I guess clients could filter nil values here: https://github.com/cosmos/cosmos-sdk/blob/develop/cmd/gaia/cmd/gaiacli/main.go#L150
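
For reference, here is a sketch of what that module client could look like, following the gov pattern linked above; the layout mirrors x/gov/client/module_client.go, the nil GetTxCmd is exactly the filtering question raised here, and all names are illustrative:

	type ModuleClient struct {
		storeKey string
		cdc      *codec.Codec
	}

	func NewModuleClient(storeKey string, cdc *codec.Codec) ModuleClient {
		return ModuleClient{storeKey, cdc}
	}

	// GetQueryCmd exports the upgrade query command under a fixed name.
	func (mc ModuleClient) GetQueryCmd() *cobra.Command {
		return upgradecli.GetQueryCmd("upgrade", mc.storeKey, mc.cdc)
	}

	// GetTxCmd returns nil because upgrades are scheduled through gov,
	// so callers registering commands would need to filter out nil here.
	func (mc ModuleClient) GetTxCmd() *cobra.Command {
		return nil
	}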

blockTime := ctx.BlockHeader().Time
blockHeight := ctx.BlockHeight()

plan, havePlan := keeper.GetUpgradePlan(ctx)
@aaronc (Member, Author) commented:
I was originally caching this result on the Keeper to avoid having to decode from the store each block, but not sure that's best practice. Is there a standard way to cache frequently read values from the store in memory?

A Member commented:

I think there are a couple of modules that maintain their own caches, but there is no standard way this is done currently. It's a TODO.
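
For what it's worth, a sketch of the keeper-level cache approach (the pointer receivers match the commit note later in this thread about passing the keeper around as a pointer; planKey and the amino calls are illustrative, and a cache like this is not rolled back if a block's writes are discarded):

	type Keeper struct {
		storeKey  sdk.StoreKey
		cdc       *codec.Codec
		planCache *Plan // nil means not loaded or recently invalidated
	}

	func (k *Keeper) GetUpgradePlan(ctx sdk.Context) (Plan, bool) {
		if k.planCache != nil {
			return *k.planCache, true
		}
		bz := ctx.KVStore(k.storeKey).Get(planKey)
		if bz == nil {
			return Plan{}, false
		}
		var plan Plan
		k.cdc.MustUnmarshalBinaryBare(bz, &plan)
		k.planCache = &plan // decode once, reuse on subsequent blocks
		return plan, true
	}

	func (k *Keeper) ScheduleUpgrade(ctx sdk.Context, plan Plan) {
		ctx.KVStore(k.storeKey).Set(planKey, k.cdc.MustMarshalBinaryBare(plan))
		k.planCache = nil // every write invalidates the cache
	}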

@aaronc force-pushed the regen-network/upgrade-module branch from 6a8efff to 7fc01b1 on March 28, 2019
aaronc added a commit to regen-network/regen-ledger that referenced this pull request Mar 28, 2019
@rigelrozanski (Contributor) commented:

interesting ideas, I'll dive into this some more - p.s. test_cover is failing

@aaronc (Member, Author) commented Mar 29, 2019

p.s. test_cover is failing

Noticed that, @rigelrozanski. Appears to be some floating point issue unrelated to any code this PR touches:

REQUEST GET http://localhost:37119/staking/delegators/cosmos1pmt8hl36aeup925dktv4dnmjyku02vl8d9v7cl/delegations
E[2019-03-29|02:19:56.347] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-03-29|02:19:56.378] Stopped accept routine, as transport is closed module=p2p numPeers=0
--- FAIL: TestBonding (4.78s)
    require.go:157: 
        	Error Trace:	lcd_test.go:537
        	Error:      	Not equal: 
        	            	expected: 30000000.000000000000000000
        	            	actual  : 29990322.580645161290322581
        	            	
        	            	Diff:
        	Test:       	TestBonding
E[2019-03-29|02:19:57.467] Couldn't connect to any seeds                module=p2p 
LADDR tcp://0.0.0.0:44005

@rigelrozanski (Contributor) commented:

@aaronc that bug is due to other non-determinism within the lcd test (floats are not used). That bug has since been fixed on develop.

Because a cache is being kept on the keeper itself currently, it must
get passed around as a pointer so that the cache remains consistent.
@aaronc force-pushed the regen-network/upgrade-module branch from e7d1e85 to cbcec49 on April 4, 2019
@aaronc (Member, Author) commented Apr 4, 2019

Okay, rebased this against develop and now the tests pass.

I am trying to do a few tests of this version of the module on a testnet before moving this out of draft state by the way. I had successfully tested a previous version of this module on an older testnet, but I've been running into some issues with my current testnet that are slowing me down...

@rigelrozanski (Contributor) commented:

Cool - yeah, I think this feature requires further conversation (over a call) with @jaekwon on the line. There are many design considerations for this type of feature to be on mainnet and we obviously don't want to mess this up.

… keeper

Upon testing, halting the chain by methods other than a panic (such as os.Exit) actually causes issues between nodes: instead of processing their own doShutdowners to trigger upgrades, the remaining nodes just hang trying to connect once enough nodes have exited. The onUpgrader allows nodes to trigger some process before the panic, and willUpgrader allows nodes to prepare for upgrades before they actually need to be applied. Still needs more testing.
@aaronc (Member, Author) commented Apr 18, 2019

In the last community dev call @jackzampolin requested that I do a little write-up on how Regen Ledger is using this upgrade module including the devops side of things.

So first of all, our chain is using SetWillUpgrader and SetOnUpgrader to run scripts in the config/ dir if they have been set up on the node. So when an upgrade is scheduled, if there is a file config/prepare-upgrade, the node will run that in a separate goroutine. When an upgrade is required, if there is a file config/do-upgrade, that will get run in the background. This lets the node operator use the config dir to define the exact upgrade behavior.
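
A sketch of that wiring: only the SetWillUpgrader/SetOnUpgrader names come from this thread, while their signatures, the App type, and runScriptIfPresent are assumptions (the env vars match the ones described further down):

	// registerUpgradeScripts hooks the config/ scripts into the keeper.
	func registerUpgradeScripts(app *App, cfgDir string) {
		app.upgradeKeeper.SetWillUpgrader(func(plan upgrade.Plan) {
			runScriptIfPresent(filepath.Join(cfgDir, "prepare-upgrade"), plan)
		})
		app.upgradeKeeper.SetOnUpgrader(func(plan upgrade.Plan) {
			runScriptIfPresent(filepath.Join(cfgDir, "do-upgrade"), plan)
		})
	}

	func runScriptIfPresent(script string, plan upgrade.Plan) {
		if _, err := os.Stat(script); err != nil {
			return // no script configured on this node
		}
		cmd := exec.Command(script)
		cmd.Env = append(os.Environ(),
			"UPGRADE_NAME="+plan.Name,
			"UPGRADE_INFO="+plan.Info,
		)
		go func() {
			// Run in the background so consensus is never blocked.
			if err := cmd.Run(); err != nil {
				log.Printf("upgrade script %s failed: %v", script, err)
			}
		}()
	}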

Now, in addition to that, we are defining a default upgrade process for nodes using NixOS. For those of you who aren't familiar, NixOS is a "purely functional Linux distribution". What this means is that given the same set of configuration files, the exact same system configuration/Linux build should be generated. This allows for an easy deterministic upgrade process and also easy rollbacks. All parts of the system configuration are identified by hashes. So if you have a git repo that points to your configuration files, you can pass NixOS a specific git commit and have it predictably build that configuration. So basically our prepare-upgrade and do-upgrade scripts take commit hashes for the regen-ledger repo (which have been stored in the Info field of upgrade.Plan) and build the configuration pointed to at that hash. We also intend to pass hashes that point to the specific version of NixOS that nodes should upgrade to, so that everything is very predictable.

One thing this takes care of is the issue new nodes run into when they start to sync with the existing network. If the app's state machine changes substantially in an upgrade (which is to be expected), somebody who wants to spin up a new node and replay state starting from zero can't just do that with the latest binary. That latest binary will do things that didn't happen with the initial binary and produce a different app state. So you need some sort of "meta-process" that applies upgrades on the new node just as they were applied on a node that had been running since genesis and was upgraded sequentially. I'm sure there are other ways of doing this, but this NixOS approach should theoretically be able to handle it without problems. Basically, if you want to create a new node, you use the "genesis" NixOS config files, start your node, and then this upgrade process will automatically apply all of the system config changes since genesis.

@rigelrozanski (Contributor) commented:

Conclusions based on a phone conversation with @aaronc.

The final product of this intended design would have:

  • A new software upgrade proposal type which would specify the block height for the chain to halt at and provide information as to the intended upgrade binary hash
  • The creation of a module ("chain-upgrader") which would be responsible for upgrading legacy types within the genesis file and creating a new genesis file compatible with the upgraded software version
    • this would be executed during init-chain in the new binary
  • The option for a script to be executed at the chain-halt (unique to each validator) to aid a smooth upgrade process.

CC @alexanderbez @jaekwon

@aaronc (Member, Author) commented Apr 18, 2019

  • The creation of a module ("chain-upgrader") which would be responsible for upgrading legacy types within the genesis file and creating a new genesis file compatible with the upgraded software version

    • this would be executed during init-chain in the new binary

@rigelrozanski So the way this currently works in this PR is a bit different. There's no need to export a genesis file and create a new genesis file. Am I missing something for why this is needed or are we just imagining different procedures?

Maybe it would be helpful if I write out the sequence of operations I've been envisioning and which is currently implemented:

  1. the dev community works on a software upgrade, let's call it gaia-3 for example, and in their updated software they add a function which performs any necessary state migrations and registers it with the upgrade module like this:
	app.upgradeKeeper.SetUpgradeHandler("gaia-3", func(ctx sdk.Context, plan upgrade.Plan) {
		// perform state migrations
	})
  2. once the new version is ready, somebody submits an upgrade proposal:
gaiacli tx gov submit-upgrade-proposal --name gaia-3 --height 1234567 --info '{"commit":"abcdef12345678"}' --from abc
  3. if the community approves the proposal, validators prepare the upgrade on their nodes. If they have an automated script that can handle this upgrade, they can put it in config/do-upgrade
  4. when height 1234567 is reached, the current binary will halt and stop functioning
  5. validators either manually perform the upgrade, or if they've set up a script in config/do-upgrade, that gets called with the env vars UPGRADE_NAME=gaia-3 and UPGRADE_INFO='{"commit":"abcdef12345678"}'
  6. once the new binary that contains the gaia-3 handler is started, this handler gets called right away in BeginBlock and any needed state migrations get applied

@alexanderbez (Contributor) commented Apr 20, 2019

Interesting. I wonder how this fits into the model of param change proposals. The gist from that is, governance now has an internal router for proposals. Each proposal has a Content which gets executed by any module's handler. I imagine the upgrader module would implement such a Handler, but I'm still not quite getting what this handler would do.

@rigelrozanski (Contributor) commented Apr 21, 2019

@aaronc the current thinking behind upgrades is to have all types held in state upgraded by dumping state to genesis JSON, making changes, then loading from the next height... however, if I understand you correctly, rather than dumping JSON, the legacy upgrade handler (which would need to be executed at initialization of the second binary) could simply load the old state and then overwrite all state elements that require a type change, aka do the conversion without a full dump - I like this... is this what's being proposed?

@aaronc (Member, Author) commented Apr 22, 2019

I imagine the upgrader module would implement such a Handler, but I'm still not quite getting what this handler would do.

@alexanderbez so if governance approved an upgrade, the Handler would call upgradeKeeper.ScheduleUpgrade. You would also probably want a cancel upgrade proposal.
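
Concretely, a sketch of such a Handler under the router model from #3880 might look like the following; the proposal types, the Handler/Content signatures, and ScheduleUpgrade returning an error are all illustrative rather than this PR's actual API:

	type SoftwareUpgradeProposal struct {
		Title       string
		Description string
		Plan        upgrade.Plan
	}

	type CancelSoftwareUpgradeProposal struct {
		Title       string
		Description string
	}

	// NewUpgradeProposalHandler routes approved proposals to the keeper.
	func NewUpgradeProposalHandler(k upgrade.Keeper) gov.Handler {
		return func(ctx sdk.Context, content gov.Content) sdk.Error {
			switch c := content.(type) {
			case SoftwareUpgradeProposal:
				// Governance approved: schedule the halt at the plan's
				// height or time.
				return k.ScheduleUpgrade(ctx, c.Plan)
			case CancelSoftwareUpgradeProposal:
				// The cancel-upgrade proposal mentioned above.
				k.ClearUpgradePlan(ctx)
				return nil
			default:
				return sdk.ErrUnknownRequest("unrecognized upgrade proposal type")
			}
		}
	}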

@aaronc (Member, Author) commented Apr 22, 2019

however, if I understand you correctly, rather than dumping JSON, the legacy upgrade handler (which would need to be executed at initialization of the second binary) could simply load the old state and then overwrite all state elements that require a type change, aka do the conversion without a full dump - I like this... is this what's being proposed?

@rigelrozanski Exactly. You could of course use this halt behavior to dump state too. But if you have the ability to coordinate halts and also to coordinate a "migration callback", then I don't see any reason why you can't just continue with the same state and transaction history from data/ with the new binary. It's been working for me so far!
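
To make the no-dump migration concrete, a handler along these lines could rewrite legacy values in place when the new binary starts; LegacyFoo, NewFoo, fooStoreKey, and fooPrefix are hypothetical stand-ins for whatever types actually change:

	app.upgradeKeeper.SetUpgradeHandler("gaia-3", func(ctx sdk.Context, plan upgrade.Plan) {
		store := ctx.KVStore(fooStoreKey)
		it := sdk.KVStorePrefixIterator(store, fooPrefix)
		// Collect rewrites first so the store isn't mutated mid-iteration.
		type kvPair struct{ key, val []byte }
		var updates []kvPair
		for ; it.Valid(); it.Next() {
			var old LegacyFoo // hypothetical legacy type
			cdc.MustUnmarshalBinaryBare(it.Value(), &old)
			migrated := NewFoo{Name: old.Name, Owner: old.Owner} // hypothetical new type
			updates = append(updates, kvPair{it.Key(), cdc.MustMarshalBinaryBare(migrated)})
		}
		it.Close()
		for _, u := range updates {
			store.Set(u.key, u.val)
		}
	})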

@ebuchman closed this Apr 26, 2019
@aaronc mentioned this pull request Apr 29, 2019
Beardev118 pushed a commit to RegenNetwork/regen-ledger that referenced this pull request Oct 3, 2023