
Support multi-store upgrades that need to happen outside of BeginBlock #7

Closed
aaronc opened this issue Oct 24, 2019 · 41 comments · Fixed by cosmos/cosmos-sdk#5500

@aaronc
Member

aaronc commented Oct 24, 2019

Context

The upgrade handler for the upgrade module is able to handle migrations that can happen within BeginBlock. Certain root multi-store migrations need to happen outside of BeginBlock - specifically renaming or deleting store keys. The basic support for doing this was added in cosmos#4724. This functionality needs to be integrated into the upgrade module.

Acceptance Criteria

Given that multi-store StoreUpgrades are needed and an upgrade is happening
When the new binary starts
Then the StoreUpgrades will be performed at the correct upgrade height before the ABCI app starts

Proposed Design

  • In the upgrade.Keeper BeginBlock method, write a file $DAEMON_HOME/data/upgrade-needed.json to disk at panic time with the upgrade plan serialized in it, with the actual upgrade height written in the file in the case of time-based upgrades (see the sketch after this list).
  • Change the upgrade.Keeper SetUpgradeHandler method to:
 	SetUpgradeHandler(name string, upgradeHandler types.UpgradeHandler, storeUpgrades []storetypes.StoreUpgrades)

so that whenever store key renames/deletions are needed, they can be registered with the upgrade handler.

  • Add a hook to the upgrade module that configures BaseApp and the multi-store before starting the ABCI app. The multi-store upgrades should only be performed when there is an upgrade-info.json present and the version of the store matches the upgrade height in the json file. This will prevent store upgrades from happening too early.
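To make the first bullet concrete, here is a minimal sketch of writing that file, assuming a hypothetical UpgradeInfo struct that holds just the plan name and the resolved upgrade height (the struct, field names, and file layout are illustrative, not final design):

package upgrade

import (
	"encoding/json"
	"io/ioutil"
	"path/filepath"
)

// UpgradeInfo is a hypothetical shape for the data dumped to disk when the
// old binary halts: the plan name plus the concrete height at which the
// upgrade applies (resolved from block time for time-based upgrades).
type UpgradeInfo struct {
	Name   string `json:"name"`
	Height int64  `json:"height"`
}

// writeUpgradeInfo serializes the info to <daemonHome>/data/upgrade-needed.json
// just before the old binary panics at the upgrade height.
func writeUpgradeInfo(daemonHome string, info UpgradeInfo) error {
	bz, err := json.Marshal(info)
	if err != nil {
		return err
	}
	return ioutil.WriteFile(filepath.Join(daemonHome, "data", "upgrade-needed.json"), bz, 0644)
}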

Notes

  • the current design of BaseApp.UpgradeableStoreLoader likely doesn't do what is required, as this would require that the store upgrades are written to disk outside of the binary, presumably by the upgrade module. The actual behavior would likely be that the new binary contains a handler for the desired StoreUpgrades as mentioned above. Likely some hook between BaseApp and the upgrade.Keeper is needed.
@ethanfrey

I think the problem is the store upgrades should be written when halting the old binary, but are not known until producing the new binary. The upgrade handler is only present in the new binary and is a suboptimal place to store it.

I would propose adding a field to the SoftwareUpgradeProposal to store the storetypes.StoreUpgrades, and then when the old app hits the halt height (and prints the log message and info to stdout), it will also write the upgrade-needed.json file to disk

@aaronc
Member Author

aaronc commented Oct 28, 2019

Given that we have multi-week voting periods, we do not want to enforce the requirement that the binary and store upgrades are all known in advance of voting.

The new binary should have full knowledge of what is needed to perform the upgrade, and the upgrade plan should require zero knowledge of the store upgrades, although that info can be specified when available. Otherwise, social consensus can be used to coordinate the new binary.

I have edited the proposed design above with a clearer specification. Please see the updated text above.

@ethanfrey

The new app cannot even mount the store until the store upgrades have been executed. You cannot even read the version of the store safely without digging into root multi-store internals. The simple load returns an error (it panicked until my PR).

You have to do this before the new binary touches the store, meaning it would do it blind.

@aaronc
Member Author

aaronc commented Oct 28, 2019 via email

@aaronc
Member Author

aaronc commented Nov 6, 2019

@ethanfrey I am not clear whether or not the StoreUpgrades applied in loadVersion are applied on disk before or after calling Commit. My initial read was that the upgrades get applied transactionally in the ABCI Commit method. But I also have not dug enough into the details to know if I'm wrong or not.

So I would say there are two options:

  • if it is true that the store renames and deletions only get applied when Commit happens, then my proposed approach should work
  • if store renames and deletions happen outside of a transaction and are irreversible, then a backup of the data dir should happen before the upgrade and aborting the upgrades should restore the backup

Either way, I think it would be possible to use this signature:

 	SetUpgradeHandler(name string, upgradeHandler types.UpgradeHandler, storeUpgrades []storetypes.StoreUpgrades)

which allows the new binary to be the sole source of truth about what happens in an upgrade

In terms of implementation, I think we could simply test whether or not applying upgrades is transactional, and if it isn't, we just need to create or document a data/ dir backup procedure (which is not a bad idea in either case).

@ethanfrey

You are correct that loadVersion does not call Commit and this should not be written to disk until the end of the block. I just read through the code.

It does have significant gas cost - we should double-check the infinite gas meter is set for these operations.
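As a rough sketch of what that could mean, assuming the SDK's sdk.NewInfiniteGasMeter and leaving the exact wiring into the upgrade BeginBlocker open:

package upgrade

import sdk "github.com/cosmos/cosmos-sdk/types"

// withInfiniteGas returns a context whose tx and block gas meters never run
// out, so a large store migration cannot fail on gas. Whether and where this
// is wired into the upgrade BeginBlocker is an assumption, not settled design.
func withInfiniteGas(ctx sdk.Context) sdk.Context {
	return ctx.
		WithGasMeter(sdk.NewInfiniteGasMeter()).
		WithBlockGasMeter(sdk.NewInfiniteGasMeter())
}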

The proposed change to SetUpgradeHandler makes an odd binding between the upgrade module and baseapp, which should not (and maybe cannot) exist. A straightforward solution would be to make two calls in the app constructor.

The first to call upgradeKeeper.SetUpgradeHandler(...) as now.

And a second call to set up the migrations (anywhere in the same init function, after baseapp is created, but before the store is loaded):

storeMigrations := StoreLoaderWithUpgrade(&storetypes.StoreUpgrades{
	Renamed: []storetypes.StoreRename{{
		OldKey: "foo",
		NewKey: "bar",
	}},
	Deleted: []string{"not_needed"},
})
app.SetStoreLoader(storeMigrations)

I understand your concern about relying on a json file on disk that is not strongly tied to the binary itself. This approach would tie it to a binary. And the upgrade BeginBlocker will panic (avoiding committing the migration) if this binary is launched before the plan is active, due to the SetUpgradeHandler call.

@ethanfrey

One issue with this approach is that it may run multiple times. I make a binary v2 that contains a new handler and a new migration.

I then launch it after the planned upgrade time and it does the proper thing. Awesome. A week later, I restart the binary and it will then attempt to re-run the storeMigrations. This can only be resolved by the binary somehow knowing if it ran them before - e.g. only doing it in BeginBlock, having some other flag in the db (which leaves a chicken-and-egg problem), or leaving info on disk outside of the whole db (which brings up its own problems).

Your original design was to dump the json file when the old binary panicked and to use it when loading, deleting it after use. This is implemented and would work unless the user messes with those files manually.

I would note that the following statements of yours leave your intention less than clear:

Either way, I think it would be possible to use this signature:
SetUpgradeHandler(name string, upgradeHandler types.UpgradeHandler, storeUpgrades []storetypes.StoreUpgrades)
which allows the new binary to be the sole source of truth about what happens in an upgrade

In the upgrade.Keeper BeginBlock method, write a file $DAEMON_HOME/data/upgrade-needed.json to disk at panic time with the upgrade plan serialized in it, with the actual upgrade height written in the file in the case of time-based upgrades

The multi-store upgrades should only be performed when there is an upgrade-info.json present and the version of the store matches the upgrade height in the json file. This will prevent store upgrades from happening too early.

My understanding is that the upgrade-info.json is now just a flag to control the execution of the migrations, and the upgrades themselves are in the binary. Is that correct?

So we could try to read the file and extract a height, something like:

if height, err := readUpgradeInfo(homeDir); err == nil {
	storeMigrations := StoreLoaderWithHeightLimitedUpgrade(height, &storetypes.StoreUpgrades{ /* ... */ })
	app.SetStoreLoader(storeMigrations)
}

and then define a StoreLoaderWithHeightLimitedUpgrade loader that will check if height == ms.CommitInfo().Height, and then either behave as StoreLoaderWithUpgrade or DefaultStoreLoader.

Is that a correct understanding? It does still have a file on disk, but the actual migrations are now defined in the binary rather than in the upgrade plan, which is more secure and testable.
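For reference, a minimal sketch of the readUpgradeInfo helper used in the snippet above, assuming the old binary dumps a small JSON file with the plan name and height (file location and field names are assumptions):

package app

import (
	"encoding/json"
	"io/ioutil"
	"path/filepath"
)

// upgradeInfo mirrors what the old binary is expected to dump to disk on halt;
// only the height is needed here to gate the store migrations.
type upgradeInfo struct {
	Name   string `json:"name"`
	Height int64  `json:"height"`
}

// readUpgradeInfo loads <homeDir>/data/upgrade-info.json and returns the
// upgrade height, or an error if the file is absent or malformed.
func readUpgradeInfo(homeDir string) (int64, error) {
	bz, err := ioutil.ReadFile(filepath.Join(homeDir, "data", "upgrade-info.json"))
	if err != nil {
		return 0, err
	}
	var info upgradeInfo
	if err := json.Unmarshal(bz, &info); err != nil {
		return 0, err
	}
	return info.Height, nil
}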

@aaronc
Member Author

aaronc commented Nov 7, 2019

My understanding is that the upgrade-info.json is now just a flag to control the execution of the migrations, and the upgrades themselves are in the binary. Is that correct?

So we could try to read the file and extract a height, something like:

if height, err := readUpgradeInfo(homeDir); err == nil {
	storeMigrations := StoreLoaderWithHeightLimitedUpgrade(height, &storetypes.StoreUpgrades{ /* ... */ })
	app.SetStoreLoader(storeMigrations)
}

and then define a StoreLoaderWithHeightLimitedUpgrade loader that will check if height == ms.CommitInfo().Height, and then either behave as StoreLoaderWithUpgrade or DefaultStoreLoader.

Is that a correct understanding? It does still have a file on disk, but the actual migrations are now defined in the binary rather than in the upgrade plan, which is more secure and testable.

Yes, that's what I was thinking

@anilcse
Collaborator

anilcse commented Nov 20, 2019

@aaronc I think we need a conclusion here to implement the solution. My understanding based on the discussion is:

  1. The old binary/software upgrade proposal will/should not have any information about store migrations; only the new binary does. The old binary dumps the data to upgrade-info.json before panicking.

  2. The proposed solution is having two calls, as @ethanfrey suggested:

The proposed change to SetUpgradeHandler makes an odd binding between the upgrade module and baseapp, which should not (and maybe cannot) exist. A straightforward solution would be to make two calls in the app constructor.

The first to call upgradeKeeper.SetUpgradeHandler(...) as now.

And a second call to set up the migrations (anywhere in the same init function, after baseapp is created, but before the store is loaded):

storeMigrations := StoreLoaderWithUpgrade(&storetypes.StoreUpgrades{
	Renamed: []storetypes.StoreRename{{
		OldKey: "foo",
		NewKey: "bar",
	}},
	Deleted: []string{"not_needed"},
})
app.SetStoreLoader(storeMigrations)

I understand your concern about relying on a json file on disk that is not strongly tied to the binary itself. This approach would tie it to a binary. And the upgrade BeginBlocker will panic (avoiding committing the migration) if this binary is launched before the plan is active, due to the SetUpgradeHandler call.

  3. We don't need to create a data/ dir backup as applying upgrades is transactional.

  4. We should only consider the upgrade height for StoreMigrations and shouldn't do anything if it doesn't match. Since the upgrade height will not match the current height after the upgrade, restarting the binary will not re-run StoreMigrations. With this, even if a user deletes upgrade-info.json later, it shouldn't be a problem. If there's no upgrade plan, there will be no StoreMigrations.

Please confirm if this is what you are thinking and update the proposed design accordingly.

@aaronc
Member Author

aaronc commented Nov 21, 2019

Thanks for this write-up @anilcse. I agree with 1, 3, and 4. I'm not sure I really understand 2. Would the solution in 2 mean that we don't rely on the upgrade-info.json file on disk? I'd like to see this proposed solution fleshed out a little bit more. Maybe that's something for @ethanfrey to comment on.

@ethanfrey

This is correct. Let me explain 2 a bit more.

On panic, the old binary will dump a file to disk with the upgrade name and height.

On start of new binary, in the app constructor, we (1) SetUpgradeHandler and (2) set a StoreLoader to be used to possibly migrate the root store on load.

This StoreLoader will be called every time the new binary is launched, and we want to ensure it only happens once. We do have access to query the CommitInfo of the root multi-store before executing the StoreLoader, so we can read the height of the on-disk store.

My proposal is then to code this UpgradeMigrationStoreLoader (or whatever you want to name it) such that it has the logic in v2 to do the v1->v2 migration (as Aaron suggested). But to ensure it is only run once, it checks for the existence of upgrade-info.json on disk and that the height in this file is the same as the current height. Only then will it run; otherwise it will behave as the default loader.

Are we all clear here?

Minor point:
When it matches, we can choose to delete the upgrade-info.json file. This would make it clearer that the upgrade is no longer needed. However, it is non-transactional (the rm is not reverted if the upgrade handler fails, while all db changes are reverted). So this might cause some issues if the upgrade handler fails and we want to try again (a very unlikely case). Both cases seem safe to me; @aaronc, if you have an opinion, please let us know.

Other point:
I assume that a v3 binary only has an upgrade path from v2->v3. If it is supposed to store both v1->v2 and v2->v3 migrations, then this design might get more complicated. I think that case is not supported and makes no real sense, but I want confirmation on this point from @aaronc as well.

@anilcse
Collaborator

anilcse commented Nov 21, 2019

I think we shouldn't delete upgrade-info.json during/after the upgrade attempt, as that is non-transactional. Maybe we can remove it when the user re-runs the binary (v2) later: the binary checks upgrade-info.json for the upgrade height details, and it won't match if the upgrade was successful. At that point we can delete upgrade-info.json, as it is of no use after a successful upgrade.

@anilcse
Collaborator

anilcse commented Nov 21, 2019

Other point:
I assume that a v3 binary only has an upgrade path from v2->v3. If it is supposed to store both v1->v2 and v2->v3 migrations, then this design might get more complicated. I think that case is not supported and makes no real sense, but I want confirmation on this point from @aaronc as well.

I couldn't think of a case where the new binary requires information about multiple upgrades. So maybe we can eliminate this case for now. @aaronc what do you think?

@ethanfrey

I think we shouldn't delete upgrade-info.json during/after the upgrade attempt, as that is non-transactional. Maybe we can remove it when the user re-runs the binary (v2) later: the binary checks upgrade-info.json for the upgrade height details, and it won't match if the upgrade was successful. At that point we can delete upgrade-info.json, as it is of no use after a successful upgrade.

Sure, good idea.

If present and the height matches current block height: Apply the store migration as defined in the (new) binary.

If present and the height is in the past: Delete the file, and do default load.

If present and height is in the future, or not present: Do default load.
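A sketch of that three-way decision as a store loader, reusing the readUpgradeInfo helper sketched earlier; the loader name here is made up, and the baseapp StoreLoader/DefaultStoreLoader names follow the earlier snippets in this thread:

package app

import (
	"os"
	"path/filepath"

	"github.com/cosmos/cosmos-sdk/baseapp"
	storetypes "github.com/cosmos/cosmos-sdk/store/types"
	sdk "github.com/cosmos/cosmos-sdk/types"
)

// upgradeOrDefaultStoreLoader applies the in-binary StoreUpgrades only when
// upgrade-info.json is present and its height equals the last committed
// version; a stale file is deleted; otherwise it just does the default load.
func upgradeOrDefaultStoreLoader(homeDir string, upgrades *storetypes.StoreUpgrades) baseapp.StoreLoader {
	return func(ms sdk.CommitMultiStore) error {
		height, err := readUpgradeInfo(homeDir)
		if err != nil {
			// Not present (or unreadable): default load.
			return baseapp.DefaultStoreLoader(ms)
		}
		current := ms.LastCommitID().Version
		switch {
		case height == current:
			// Height matches: perform the renames/deletions defined in the binary.
			return ms.LoadLatestVersionAndUpgrade(upgrades)
		case height < current:
			// Height is in the past: delete the stale file and default load.
			_ = os.Remove(filepath.Join(homeDir, "data", "upgrade-info.json"))
			return baseapp.DefaultStoreLoader(ms)
		default:
			// Height is in the future: default load.
			return baseapp.DefaultStoreLoader(ms)
		}
	}
}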

@sahith-narahari

@aaronc @ethanfrey how about we initialise upgrade-info.json when the binary runs for the first time, and only update it when a proposal is passed (height and store keys are added) and when the plan is executed (height and store keys are removed)? Just to make it more in accordance with the existing app.toml. I think this will be better than creating and deleting the entire file based on upgrade plans.

@ethanfrey

So, if I understand correctly, you would ensure the upgrade-info.json is an existing but more or less empty file at init. After that, you update the file in the upgrade handler when a store migration is planned. Then, on startup:

If present and the height matches current block height: Apply the store migration as defined in the (new) binary.

Otherwise (height != current block height): Do default load

@sahith-narahari

Yes, on point. The values of height and storeUpgrades will be empty when initialised.

@ethanfrey

Sounds good to me. Simpler is better

@sahith-narahari

@aaronc do you see any issues with this approach?

@anilcse
Collaborator

anilcse commented Dec 26, 2019

So we could try to read the file and extract a height, something like:

if height, err := readUpgradeInfo(homeDir); err == nil {
	storeMigrations := StoreLoaderWithHeightLimitedUpgrade(height, &storetypes.StoreUpgrades{ /* ... */ })
	app.SetStoreLoader(storeMigrations)
}

and then define a StoreLoaderWithHeightLimitedUpgrade loader that will check if height == ms.CommitInfo().Height, and then either behave as StoreLoaderWithUpgrade or DefaultStoreLoader.

@ethanfrey is there a way to get currentHeight() inside StoreLoaderWithHeightLimitedUpgrade? I don't think we can access it directly as the stores are not loaded yet. Can we depend on priv-validator-state.json?

@anilcse
Collaborator

anilcse commented Dec 26, 2019

An alternate solution I could think of is to write the info to the same upgrade-info.json on a successful upgrade. Something like upgradeStatus: "success". Then we can check whether the status is present before we consider migrating stores when restarting the binary.

But there will be a problem if the user deletes upgrade-info.json manually and restarts the new binary. We can avoid it by considering the Height parameter we are dumping to upgrade-info.json before the old binary panics. If the Height is not present, we shouldn't consider the store migrations.

@aaronc @ethanfrey Any issues with this approach?

@ethanfrey

@anilcse quick answer from my phone...

I'm pretty sure that when running the loader you have access to the root multi-store. There is a method like LatestCommit(), which returns the height and hash of the last committed block. I use that for checking the height.

@anilcse
Collaborator

anilcse commented Dec 26, 2019

Oh okay. Thanks @ethanfrey. I hope ms.LastCommitID().Version refers to the Height.

@anilcse
Collaborator

anilcse commented Dec 26, 2019

@ethanfrey, an idea popped up while working on the implementation.
Maybe it's not the best way to rely on upgrade-info.json, as it is created when the old binary panics and there's no way to know whether users have messed with it.
Instead, we can have both the height and StoreUpgrades information come from the new binary itself (i.e., SetStoreLoader).

In that case, we would just need to change

func StoreLoaderWithUpgrade(upgrades *storetypes.StoreUpgrades) StoreLoader {
	return func(ms sdk.CommitMultiStore) error {
		return ms.LoadLatestVersionAndUpgrade(upgrades)
	}
}

to

func StoreLoaderWithUpgrade(upgradeInfo *storetypes.UpgradeInfo) StoreLoader {
	return func(ms sdk.CommitMultiStore) error {
		if upgradeInfo.Height == ms.LastCommitID().Version {
			return ms.LoadLatestVersionAndUpgrade(&upgradeInfo.StoreUpgrades)
		}
		return DefaultStoreLoader(ms)
	}
}

// Where
type UpgradeInfo struct {
	Height        int64         `json:"height"`
	StoreUpgrades StoreUpgrades `json:"store_upgrades"`
}
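For example, wiring this into the app constructor might look like the fragment below (a sketch only; the hard-coded height is illustrative, and how the new binary learns that height for time-based upgrades is still open):

// Fragment of the app constructor (sketch): the UpgradeInfo comes from the
// new binary itself rather than from a file on disk.
upgradeInfo := &storetypes.UpgradeInfo{
	Height: 400000, // hypothetical upgrade height known to the new binary
	StoreUpgrades: storetypes.StoreUpgrades{
		Renamed: []storetypes.StoreRename{{OldKey: "foo", NewKey: "bar"}},
		Deleted: []string{"not_needed"},
	},
}
app.SetStoreLoader(StoreLoaderWithUpgrade(upgradeInfo))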

What do you think?

@anilcse
Collaborator

anilcse commented Dec 27, 2019

Oops, this doesn't work for time-based upgrades. Missed it totally

@ethanfrey

Oh okay. Thanks @ethanfrey. I hope ms.LastCommitID().Version refers to the Height.

Yes it does.

@anilcse
Collaborator

anilcse commented Dec 28, 2019

@ethanfrey is there a way to get $DAEMON_HOME? Can we do the following?

Have SetRootDir and GetRootDir in server/utils.go

// package-level variable holding the node's home directory
var rootDir string

func SetRootDir(dir string) {
	rootDir = dir
}

func GetRootDir() string {
	return rootDir
}

And, setting the rootDir from server/start.go#startInProcess()

cfg := ctx.Config
home := cfg.RootDir

SetRootDir(home)

@ethanfrey

I don't think you need $DAEMON_HOME; that is only for cosmosd. What you likely want is the home directory for gaiad, xrnd, etc. This can be read from the config. Note that actually changing this value in a command is not recommended, as it is used in a "PreRun" step to load various info, and changing it later will not have the desired effect (and would be rather undefined).

I would simply read it from the config as needed. This can be done via viper.GetString("home"). In fact this is the exact approach the core sdk team uses: https://github.com/cosmos/cosmos-sdk/blob/master/server/start.go#L92

@anilcse
Collaborator

anilcse commented Dec 31, 2019

We can use viper.GetString("home"), but as mentioned in the skip-upgrade PR review, viper should only sit in the CLI. Accessing viper inside a handler or keeper may not look good. We actually added this and removed it later while thinking of an alternative.

@ethanfrey

My only idea then is to do the same as in the skip-upgrade PR, where the list of heights is parsed in the CLI and passed into the keeper on construction. We can pass in home as an argument in a constructor, right?
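A sketch of that, as a fragment: the home directory is read once in the CLI layer and threaded into the keeper constructor. The extra homeDir parameter on upgrade.NewKeeper is an assumption here, mirroring how skipUpgradeHeights is passed in:

// CLI/server layer, where viper access is acceptable (as discussed above):
homeDir := viper.GetString("home")

// App constructor: pass the home dir into the keeper so the keeper itself
// never imports viper (the homeDir parameter is hypothetical):
app.UpgradeKeeper = upgrade.NewKeeper(
	skipUpgradeHeights, keys[upgrade.StoreKey], app.cdc, homeDir,
)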

@anilcse
Collaborator

anilcse commented Jan 2, 2020

I think this should not be handled/set at the keeper. $HOME should be accessible throughout the app. In later stages as well, if other modules require this path, they shouldn't have to change the keeper implementation. I still feel server/utils.go is the right place to store/get $HOME.

@anilcse
Collaborator

anilcse commented Jan 4, 2020

Looks like we thought too much about what happens to store migrations on restarting the binary. I think it is straightforward and no extra work is needed.
Scenario:
Let's say there are planned storeUpgrades (renames & deletes) and the upgrade was successful.

Case - 1
User never restarts the binary, so nothing invokes store migrations again

Case - 2
User tries to restart the binary for some reason. Now, the binary initiates the storeMigrations again. But since the migrations have already taken place, it can't find the old records any more. Let's say bank is renamed to banker; now the binary looks for bank again and fails because it will not be present. Same for deletes too.

In either case, the height doesn't matter. @aaronc @ethanfrey, does that work?

@sahith-narahari

@aaronc @ethanfrey I have a question regarding storeUpgrades: in a scenario where a chain breaks a few blocks after the upgrade, don't we need a way to revert the changed stores? Very unlikely to happen on Cosmos, but let's say there's some bug in transaction x which may not be known until it's executed on the new chain (after 10-20 blocks maybe).

@sahith-narahari

The most probable decision will be to revert the chain to running the old binary, and I don't see our upgrade module handling this scenario. I would like to know your take on this.

@anilcse
Collaborator

anilcse commented Jan 6, 2020

As discussed, it might be tough to handle this rollback directly from the upgrade module. One workaround is supplying StoreUpgrades from new_key to old_key for the old binary, so it will rename all the new_keys back to old_keys.

With this, if v2 renamed a store from x to y and failed at block 10, then v1 should be regenerated with the StoreUpgrades info in it. So v1 will rename the store from y back to x.

P.S.: We cannot roll back deletes with this.
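For illustration, the reverse rename baked into a patched v1 binary could look like the sketch below, using the StoreLoaderWithUpgrade form from earlier in this thread (key names are from the example above; deleted stores cannot be restored this way):

// Patched v1 binary (sketch): undo v2's x -> y rename by registering the
// reverse rename before the store is loaded.
revert := StoreLoaderWithUpgrade(&storetypes.StoreUpgrades{
	Renamed: []storetypes.StoreRename{{
		OldKey: "y", // key introduced by v2
		NewKey: "x", // original v1 key
	}},
})
app.SetStoreLoader(revert)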

@ethanfrey

it can't find the old records any more. Let's say bank is renamed to banker; now the binary looks for bank again and fails because it will not be present. Same for deletes too.

This is exactly the situation we want to avoid. If we manually restart the binary after an upgrade and this causes an error/panic, then we have a problem. The whole design around keeping this height is to allow the migration to be applied only once, and if it errors, then it really is an error and we should abort the binary and force operator intervention.

@ethanfrey

As discussed, it might be tough to handle this rollback directly from the upgrade module. One workaround is supplying StoreUpgrades from new_key to old_key for the old binary, so it will rename all the new_keys back to old_keys.

All the reads and writes to the database are done on the deliverStore (I am 95% sure, but please double-check). This store is a temporary cache on top of the state, and writes/deletes to it do not touch disk directly. Nothing changes on disk until commit is invoked.

Thus a panic or other abort that doesn't allow us to reach the commit handler will effectively roll back the state, because it was an uncommitted change.

@anilcse
Collaborator

anilcse commented Jan 7, 2020

As discussed, it might be tough to handle this rollback directly from the upgrade module. One workaround is supplying StoreUpgrades from new_key to old_key for the old binary, so it will rename all the new_keys back to old_keys.

All the reads and writes to the database are done on the deliverStore (I am 95% sure, but please double-check). This store is a temporary cache on top of the state, and writes/deletes to it do not touch disk directly. Nothing changes on disk until commit is invoked.

Thus a panic or other abort that doesn't allow us to reach the commit handler will effectively roll back the state, because it was an uncommitted change.

Yes. Even in that case, we cannot roll back if some tx got invoked and failed at the 10th block after the upgrade. We might be able to roll back the 10th block, but it will eventually fail on every other tx (of the same msg type). It cannot work properly.

@anilcse
Collaborator

anilcse commented Jan 7, 2020

Lemme describe the situation a bit more.

  • Let's say some upgrade happened at height 400,000.
  • Let's assume there is some issue with the withdraw-rewards tx but it wasn't caught in testing beforehand.
  • Someone tries to execute a withdraw-rewards tx, and it fails due to some bug in the store migrations.
  • The binary cannot handle withdraw-rewards, so either the upgrade has to be reverted or the bug fixed and another upgrade done. The second option might take time, so a quick solution could be running the old binary. But that cannot work because of the store migrations.

A workaround, as mentioned before, is to generate a new version of the old binary with storeMigrations info in it to revert the store migrations.

@ethanfrey

Yes. Even in that case, we cannot roll back if some tx got invoked and failed at the 10th block after the upgrade. We might be able to roll back the 10th block, but it will eventually fail on every other tx (of the same msg type). It cannot work properly.

You are right, there is no path there. We assumed that a failed migration would error in the first block. Some minor inconsistency would cause issues if we need to deal with it a long time later.

For this case, reverting to the old binary is not possible with any upgrade mechanism. The only solution I know of (and which was proposed in the Cosmos Validator slacks for their mainnet upgrade) is: if an issue is detected within e.g. 100 blocks, then roll back to the pre-upgrade state and continue with the old binary from there.

If we want to add this ability for in-place upgrades, we should export a genesis file before performing any upgrade - with or without store migrations.

@anilcse
Collaborator

anilcse commented Jan 7, 2020

For this case, reverting to the old binary is not possible with any upgrade mechanism. The only solution I know of (and which was proposed in the Cosmos Validator slacks for their mainnet upgrade) is: if an issue is detected within e.g. 100 blocks, then roll back to the pre-upgrade state and continue with the old binary from there.

That would cause even bigger issues, I think. What happens to all the transactions that took place after the upgrade? I doubt the pre-upgrade state is any different from a state exported before the upgrade.
