
Add upgrade module #3979

Closed

Conversation

@aaronc (Member) commented Mar 26, 2019

Upgrading live chains has previously been discussed in #1079 and there is a WIP spec in #2116. Neither of these provides an actual implementation of how to coordinate a live chain upgrade at the software level. My understanding and experience with Tendermint chains is that without a software coordination mechanism, validators can easily get into an inconsistent state, because they all need to be stopped at precisely the same point in the state machine cycle.

This PR provides a module for performing live chain upgrades that has been developed for Regen Ledger and tested against our testnets. It may or may not be what Cosmos SDK wants, but just sharing it in case it is...

This module attempts to take a minimalist approach to coordinating a live chain upgrade and can be integrated with any governance mechanism. Here are a few of its features (a sketch of the resulting BeginBlock flow follows this list):

  • allows upgrades to be scheduled at a future block height or after a future block time
  • crashes the blockchain state machine in BeginBlock when an upgrade is required and doesn't allow it to restart until new software with the expected upgrade is started
  • provides a hook for performing state migrations once upgraded software is started
  • allows for custom "crash" behavior that could be used to trigger automatic installation of the new software
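
For illustration, here is a minimal sketch of that BeginBlock flow in Go. Only GetUpgradePlan, SetUpgradeHandler, and the Plan fields (Name, Height, Time) appear in this thread; the handlers map, ClearUpgradePlan, and the exact due-check are assumptions about the implementation, not its actual API.

	// Sketch only: handler lookup and plan clearing use assumed names.
	func BeginBlocker(k Keeper, ctx sdk.Context) {
		plan, havePlan := k.GetUpgradePlan(ctx)
		if !havePlan {
			return
		}
		// A plan is due once its scheduled height or block time is reached.
		due := (plan.Height > 0 && ctx.BlockHeight() >= plan.Height) ||
			(!plan.Time.IsZero() && !ctx.BlockHeader().Time.Before(plan.Time))
		if !due {
			return
		}
		if handler, ok := k.handlers[plan.Name]; ok {
			// Upgraded binary: run the registered migrations, then clear
			// the plan so the chain resumes normal operation.
			handler(ctx, plan)
			k.ClearUpgradePlan(ctx)
			return
		}
		// Old binary: crash the state machine; it cannot restart until
		// software registering a handler named plan.Name is running.
		panic(fmt.Sprintf("UPGRADE %q NEEDED at height %d", plan.Name, plan.Height))
	}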

This PR doesn't currently include any integration with the Cosmos gov module, but that could easily be done if this upgrade method works for the Cosmos Hub.

  • Linked to github-issue with discussion and accepted design OR link to spec that describes this work (linked to issues, the specification is described in the docs which are live here: https://godoc.org/github.com/regen-network/regen-ledger/x/upgrade).
  • Wrote tests
  • Updated relevant documentation (docs/) - included through Go package docs
  • Added a relevant changelog entry: sdkch add [section] [stanza] [message]
  • rereviewed Files changed in the github PR explorer

For Admin Use:

  • Added appropriate labels to PR (ex. wip, ready-for-review, docs)
  • Reviewers Assigned
  • Squashed all commits, uses message "Merge pull request #XYZ: [title]" (coding standards)

codecov bot commented Mar 26, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@055d219).
The diff coverage is 100%.

@@            Coverage Diff             @@
##             develop    #3979   +/-   ##
==========================================
  Coverage           ?   60.03%           
==========================================
  Files              ?      215           
  Lines              ?    15248           
  Branches           ?        0           
==========================================
  Hits               ?     9154           
  Misses             ?     5450           
  Partials           ?      644

@jackzampolin (Member) commented Mar 26, 2019

@aaronc Please take a look at this implementation of another type of gov proposal. Might make sense to rebase this work there: #3880

@aaronc (Member, Author) commented Mar 26, 2019

@jackzampolin so this doesn't really integrate with the gov module at all currently. I was just showing how it could be integrated if so desired.

// GetQueryCmd creates a query sub-command for the upgrade module using cmdName as the name of the sub-command.
func GetQueryCmd(cmdName string, storeName string, cdc *codec.Codec) *cobra.Command {
	return &cobra.Command{
		Use: cmdName,
A Member commented:
maybe we call this list or show? the command would be gaiacli q upgrade show

@aaronc (Member, Author) commented Mar 27, 2019:

So in Regen Ledger it's just xrncli query upgrade-plan with no separate sub-command, but happy to add one if that makes sense. show would work

A Member commented:
Well here on the SDK we use the module client interface to export and automatically register CLI functionality. See gov for an example: https://github.com/cosmos/cosmos-sdk/blob/develop/x/gov/client/module_client.go#L17

@aaronc (Member, Author) commented:
I was going to do that initially, but I'm not exporting any tx command for this module because it should be handled by gov. I guess clients could filter nil values here: https://github.com/cosmos/cosmos-sdk/blob/develop/cmd/gaia/cmd/gaiacli/main.go#L150
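
For reference, here is a sketch of what that module client could look like, following the gov pattern linked above; the layout mirrors x/gov/client/module_client.go, the nil GetTxCmd is exactly the filtering question raised here, and all names are illustrative:

	type ModuleClient struct {
		storeKey string
		cdc      *codec.Codec
	}

	func NewModuleClient(storeKey string, cdc *codec.Codec) ModuleClient {
		return ModuleClient{storeKey, cdc}
	}

	// GetQueryCmd exports the upgrade query command under a fixed name.
	func (mc ModuleClient) GetQueryCmd() *cobra.Command {
		return upgradecli.GetQueryCmd("upgrade", mc.storeKey, mc.cdc)
	}

	// GetTxCmd returns nil because upgrades are scheduled through gov,
	// so callers registering commands would need to filter out nil here.
	func (mc ModuleClient) GetTxCmd() *cobra.Command {
		return nil
	}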

blockTime := ctx.BlockHeader().Time
blockHeight := ctx.BlockHeight()

plan, havePlan := keeper.GetUpgradePlan(ctx)
@aaronc (Member, Author) commented:
I was originally caching this result on the Keeper to avoid having to decode from the store each block, but not sure that's best practice. Is there a standard way to cache frequently read values from the store in memory?

A Member commented:

I think there are a couple of modules that maintain their own caches, but there is no standard way this is done currently. It's a TODO.
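
For what it's worth, a sketch of the keeper-level cache approach (the pointer receivers match the commit note later in this thread about passing the keeper around as a pointer; planKey and the amino calls are illustrative, and a cache like this is not rolled back if a block's writes are discarded):

	type Keeper struct {
		storeKey  sdk.StoreKey
		cdc       *codec.Codec
		planCache *Plan // nil means not loaded or recently invalidated
	}

	func (k *Keeper) GetUpgradePlan(ctx sdk.Context) (Plan, bool) {
		if k.planCache != nil {
			return *k.planCache, true
		}
		bz := ctx.KVStore(k.storeKey).Get(planKey)
		if bz == nil {
			return Plan{}, false
		}
		var plan Plan
		k.cdc.MustUnmarshalBinaryBare(bz, &plan)
		k.planCache = &plan // decode once, reuse on subsequent blocks
		return plan, true
	}

	func (k *Keeper) ScheduleUpgrade(ctx sdk.Context, plan Plan) {
		ctx.KVStore(k.storeKey).Set(planKey, k.cdc.MustMarshalBinaryBare(plan))
		k.planCache = nil // every write invalidates the cache
	}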

@aaronc force-pushed the regen-network/upgrade-module branch from 6a8efff to 7fc01b1 on March 28, 2019
aaronc added a commit to regen-network/regen-ledger that referenced this pull request Mar 28, 2019
@rigelrozanski (Contributor) commented:

interesting ideas, I'll dive into this some more - p.s. test_cover is failing

@aaronc (Member, Author) commented Mar 29, 2019

p.s. test_cover is failing

Noticed that, @rigelrozanski. Appears to be some floating point issue unrelated to any code this PR touches:

REQUEST GET http://localhost:37119/staking/delegators/cosmos1pmt8hl36aeup925dktv4dnmjyku02vl8d9v7cl/delegations
E[2019-03-29|02:19:56.347] Not stopping BlockPool -- have not been started yet module=blockchain impl=BlockPool
E[2019-03-29|02:19:56.378] Stopped accept routine, as transport is closed module=p2p numPeers=0
--- FAIL: TestBonding (4.78s)
    require.go:157: 
        	Error Trace:	lcd_test.go:537
        	Error:      	Not equal: 
        	            	expected: 30000000.000000000000000000
        	            	actual  : 29990322.580645161290322581
        	            	
        	            	Diff:
        	Test:       	TestBonding
E[2019-03-29|02:19:57.467] Couldn't connect to any seeds                module=p2p 
LADDR tcp://0.0.0.0:44005

@rigelrozanski (Contributor) commented:

@aaronc that bug is due to other non-determinism within the lcd test (floats are not used). That bug has since been fixed on develop.

Because a cache is being kept on the keeper itself currently, it must
get passed around as a pointer so that the cache remains consistent.
@aaronc force-pushed the regen-network/upgrade-module branch from e7d1e85 to cbcec49 on April 4, 2019
@aaronc (Member, Author) commented Apr 4, 2019

Okay, rebased this against develop and now the tests pass.

I am trying to do a few tests of this version of the module on a testnet before moving this out of draft state by the way. I had successfully tested a previous version of this module on an older testnet, but I've been running into some issues with my current testnet that are slowing me down...

@rigelrozanski (Contributor) commented:

Cool - yeah, I think this feature requires further conversation (over a call) with @jaekwon on the line. There are many design considerations for this type of feature to be on mainnet and we obviously don't want to mess this up.

… keeper

Upon testing, halting the chain by methods other than a panic (such as os.Exit) actually causes issues between nodes: instead of processing their own doShutdowners to trigger upgrades, the remaining nodes just hang trying to connect once enough nodes have exited. The onUpgrader allows nodes to trigger some process before the panic, and willUpgrader allows nodes to prepare for upgrades before they actually need to be applied. Still needs more testing.
@aaronc (Member, Author) commented Apr 18, 2019

In the last community dev call @jackzampolin requested that I do a little write-up on how Regen Ledger is using this upgrade module including the devops side of things.

So first of all, our chain is using SetWillUpgrader and SetOnUpgrader to run scripts in the config/ dir if they have been set up on the node. So when an upgrade is scheduled, if there is a file config/prepare-upgrade, the node will run that in a separate goroutine. When an upgrade is required, if there is a file config/do-upgrade, that will get run in the background. This lets the node operator use the config dir to define the exact upgrade behavior.
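
A sketch of that wiring: only the SetWillUpgrader/SetOnUpgrader names come from this thread, while their signatures, the App type, and runScriptIfPresent are assumptions (the env vars match the ones described further down):

	// registerUpgradeScripts hooks the config/ scripts into the keeper.
	func registerUpgradeScripts(app *App, cfgDir string) {
		app.upgradeKeeper.SetWillUpgrader(func(plan upgrade.Plan) {
			runScriptIfPresent(filepath.Join(cfgDir, "prepare-upgrade"), plan)
		})
		app.upgradeKeeper.SetOnUpgrader(func(plan upgrade.Plan) {
			runScriptIfPresent(filepath.Join(cfgDir, "do-upgrade"), plan)
		})
	}

	func runScriptIfPresent(script string, plan upgrade.Plan) {
		if _, err := os.Stat(script); err != nil {
			return // no script configured on this node
		}
		cmd := exec.Command(script)
		cmd.Env = append(os.Environ(),
			"UPGRADE_NAME="+plan.Name,
			"UPGRADE_INFO="+plan.Info,
		)
		go func() {
			// Run in the background so consensus is never blocked.
			if err := cmd.Run(); err != nil {
				log.Printf("upgrade script %s failed: %v", script, err)
			}
		}()
	}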

Now, in addition to that, we are defining a default upgrade process for nodes using NixOS. For those of you who aren't familiar, NixOS is a "purely functional Linux distribution". What this means is that given the same set of configuration files, the exact same system configuration/Linux build should be generated. This allows for an easy deterministic upgrade process and also easy rollbacks. All parts of the system configuration are identified by hashes. So if you have a git repo that points to your configuration files, you can pass NixOS a specific git commit and have it predictably build that configuration. So basically our prepare-upgrade and do-upgrade scripts take commit hashes for the regen-ledger repo (which have been stored in the Info field of upgrade.Plan) and build the configuration pointed to at that hash. We also intend to pass hashes that point to the specific version of NixOS that nodes should upgrade to, so that everything is very predictable.

One thing this takes care of is the issue new nodes run into when they start to sync with the existing network. If the app's state machine changes substantially in an upgrade (which is to be expected), somebody who wants to spin up a new node and replay state starting from zero can't just do that with the latest binary. That latest binary will do things that didn't happen with the initial binary and produce a different app state. So you need some sort of "meta-process" that applies upgrades on the new node just as they were applied on a node that had been running since genesis and was upgraded sequentially. I'm sure there are other ways of doing this, but this NixOS approach should theoretically be able to handle it without problems. Basically, if you want to create a new node, you use the "genesis" NixOS config files, start your node, and then this upgrade process will automatically apply all of the system config changes since genesis.

@rigelrozanski (Contributor) commented:

Conclusions based on a phone conversation with @aaronc.

The final product of this intended design would have:

  • A new software upgrade proposal type which would specify the block height for the chain to halt at and provide information as to the intended upgrade binary hash
  • The creation of a module ("chain-upgrader") which would be responsible for upgrading legacy types within the genesis file and creating a new genesis file compatible with the upgraded software version
    • this would be executed during init-chain in the new binary
  • The option for a script to be executed at the chain-halt (unique to each validator) to aid a smooth upgrade process.

CC @alexanderbez @jaekwon

@aaronc (Member, Author) commented Apr 18, 2019

  • The creation of a module ("chain-upgrader") which would be responsible for upgrading legacy types within the genesis file and creating a new genesis file compatible with the upgraded software version

    • this would be executed during init-chain in the new binary

@rigelrozanski So the way this currently works in this PR is a bit different. There's no need to export a genesis file and create a new genesis file. Am I missing something for why this is needed or are we just imagining different procedures?

Maybe it would be helpful if I write out the sequence of operations I've been envisioning and which is currently implemented:

  1. the dev community works on a software upgrade, let's call it gaia-3 for example, and in their updated software they add a function which performs any necessary state migrations and registers it with the upgrade module like this:
	app.upgradeKeeper.SetUpgradeHandler("gaia-3", func(ctx sdk.Context, plan upgrade.Plan) {
		// perform state migrations
	})
  2. once the new version is ready, somebody submits an upgrade proposal:
gaiacli tx gov submit-upgrade-proposal --name gaia-3 --height 1234567 --info '{"commit":"abcdef12345678"}' --from abc
  3. if the community approves the proposal, validators prepare the upgrade on their nodes. If they have an automated script that can handle this upgrade, they can put it in config/do-upgrade
  4. when height 1234567 is reached, the current binary will halt and stop functioning
  5. validators either manually perform the upgrade, or if they've set up a script in config/do-upgrade, that gets called with the env vars UPGRADE_NAME=gaia-3 and UPGRADE_INFO='{"commit":"abcdef12345678"}'
  6. once the new binary that contains the gaia-3 handler is started, this handler gets called right away in BeginBlock and any needed state migrations get applied

@alexanderbez (Contributor) commented Apr 20, 2019

Interesting. I wonder how this fits into the model of param change proposals. The gist from that is, governance now has an internal router for proposals. Each proposal has a Content which gets executed by any module's handler. I imagine the upgrader module would implement such a Handler, but I'm still not quite getting what this handler would do.

@rigelrozanski (Contributor) commented Apr 21, 2019

@aaronc the current thinking behind upgrades is to have all types held in state upgraded by dumping state to genesis JSON, making changes, then loading from the next height... however, if I understand you correctly, rather than dumping JSON, the legacy upgrade handler (which would need to be executed at initialization of the second binary) could simply load the old state and then overwrite all state elements that require a type change, aka do the conversion without a full dump - I like this... is this what's being proposed?

@aaronc (Member, Author) commented Apr 22, 2019

I imagine the upgrader module would implement such a Handler, but I'm still not quite getting what this handler would do.

@alexanderbez so if governance approved an upgrade, the Handler would call upgradeKeeper.ScheduleUpgrade. You would also probably want a cancel upgrade proposal.
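
Concretely, a sketch of such a Handler under the router model from #3880 might look like the following; the proposal types, the Handler/Content signatures, and ScheduleUpgrade returning an error are all illustrative rather than this PR's actual API:

	type SoftwareUpgradeProposal struct {
		Title       string
		Description string
		Plan        upgrade.Plan
	}

	type CancelSoftwareUpgradeProposal struct {
		Title       string
		Description string
	}

	// NewUpgradeProposalHandler routes approved proposals to the keeper.
	func NewUpgradeProposalHandler(k upgrade.Keeper) gov.Handler {
		return func(ctx sdk.Context, content gov.Content) sdk.Error {
			switch c := content.(type) {
			case SoftwareUpgradeProposal:
				// Governance approved: schedule the halt at the plan's
				// height or time.
				return k.ScheduleUpgrade(ctx, c.Plan)
			case CancelSoftwareUpgradeProposal:
				// The cancel-upgrade proposal mentioned above.
				k.ClearUpgradePlan(ctx)
				return nil
			default:
				return sdk.ErrUnknownRequest("unrecognized upgrade proposal type")
			}
		}
	}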

@aaronc (Member, Author) commented Apr 22, 2019

however, if I understand you correctly, rather than dumping JSON, the legacy upgrade handler (which would need to be executed at initialization of the second binary) could simply load the old state and then overwrite all state elements that require a type change, aka do the conversion without a full dump - I like this... is this what's being proposed?

@rigelrozanski Exactly. You could of course use this halt behavior to dump state too. But if you have the ability to coordinate halts and also to coordinate a "migration callback", then I don't see any reason why you can't just continue with the same state and transaction history from data/ with the new binary. It's been working for me so far!
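
To make the no-dump migration concrete, a handler along these lines could rewrite legacy values in place when the new binary starts; LegacyFoo, NewFoo, fooStoreKey, and fooPrefix are hypothetical stand-ins for whatever types actually change:

	app.upgradeKeeper.SetUpgradeHandler("gaia-3", func(ctx sdk.Context, plan upgrade.Plan) {
		store := ctx.KVStore(fooStoreKey)
		it := sdk.KVStorePrefixIterator(store, fooPrefix)
		// Collect rewrites first so the store isn't mutated mid-iteration.
		type kvPair struct{ key, val []byte }
		var updates []kvPair
		for ; it.Valid(); it.Next() {
			var old LegacyFoo // hypothetical legacy type
			cdc.MustUnmarshalBinaryBare(it.Value(), &old)
			migrated := NewFoo{Name: old.Name, Owner: old.Owner} // hypothetical new type
			updates = append(updates, kvPair{it.Key(), cdc.MustMarshalBinaryBare(migrated)})
		}
		it.Close()
		for _, u := range updates {
			store.Set(u.key, u.val)
		}
	})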

@ebuchman closed this Apr 26, 2019
@aaronc mentioned this pull request Apr 29, 2019
Beardev118 pushed a commit to RegenNetwork/regen-ledger that referenced this pull request Oct 3, 2023