Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chain reaches consensus failure on binary rebuild and restart #543

Closed
4 tasks
adamewozniak opened this issue Feb 13, 2022 · 29 comments · Fixed by #1018
Closed
4 tasks

chain reaches consensus failure on binary rebuild and restart #543

adamewozniak opened this issue Feb 13, 2022 · 29 comments · Fixed by #1018
Labels
C:x/leverage T:Bug A regression or bug in the Umee codebase.

Comments

@adamewozniak
Copy link
Collaborator

adamewozniak commented Feb 13, 2022

Summary of Bug

When running the beta version of the chain locally, rebuilding the binary in beta mode, stopping the chain, and then restarting, it reaches consensus failure via an error from the x/leverage module :

ERR CONSENSUS FAILURE!!! err="-0.000002092846270928 years: negative time elapsed since last interest time" module=consensus ...

Not sure if this is a major issue, but still worth jotting down

Version

Please provide the output of the following commands:

  • $ umeed version
    price-feeder/v0.1.0-c66f8961
  • $ go version
    go version go1.17.3 darwin/arm64
  • $ uname -a
    Darwin Adams-MacBook-Pro-2.local 21.3.0 Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000 arm64

Steps to Reproduce

Steps to reproduce the behavior:

  1. Launch beta umee network
UMEE_ENABLE_BETA=true starport chain serve -c ./starport.ci.beta.yml -v --reset-once
  1. Rebuild binary with beta support
UMEE_ENABLE_BETA=true make install
  1. Stop & Restart umee network
UMEE_ENABLE_BETA=true starport chain serve -c ./starport.ci.beta.yml -v

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@alexanderbez
Copy link
Contributor

Interesting. @toteki does anything stick out here?

@alexanderbez alexanderbez added C:x/leverage T:Bug A regression or bug in the Umee codebase. labels Feb 14, 2022
@alexanderbez
Copy link
Contributor

@adamewozniak when you rebuilt, did you change any code?

@toteki
Copy link
Member

toteki commented Feb 14, 2022

We were talking about it. and the only way for this to happen is for the LastInterestTime in state to be greater than ctx.BlockTime().Unix() - it comes down to the behavior of BlockTime and EndBlocker (during which AccrueAllInterest sets LastInterestTime to ctx.BlockTime().Unix() after accruing interest using the difference of the two.)

The error message posted, by the way, comes out to be a 3 minute time reversal, longer than a block time, and might be consistent with the time the chain was stopped before relaunch. Still can't figure out why BlockTime would reverse though.

@adamewozniak
Copy link
Collaborator Author

@adamewozniak when you rebuilt, did you change any code?

Nope, you should be able to repro this just using those console commands & no change

@RafilxTenfen
Copy link
Contributor

@adamewozniak This still happen? I couldn't reproduce

@adamewozniak
Copy link
Collaborator Author

@RafilxTenfen I'm now able to repo it by running starport, making a change to force starport to rebuild, and then restarting starport. Let me know if you can repro that way, same issue :

CONSENSUS FAILURE!!! err="-0.000000697615423642 years: negative time elapsed since last interest time" 

cc : @toteki

@toteki
Copy link
Member

toteki commented Mar 27, 2022

I suspect this is just blockTime behaving weirdly when starport restarts, but we should see if we run into it during our chain upgrade tests. If that happens, it might be worth treating unexpected elapsed times differently.

@adamewozniak
Copy link
Collaborator Author

adamewozniak commented Mar 27, 2022

@toteki I think we should be worried about upgrading v2 -> v3, since the chain would be required to stop and restart similarly to this. Our upgrade tests should include v2.0.0 -> v2.0.1 (when we cut it), I think that would replicate this better - but agreed, it's probably just blocktime behaving weirdly.

Edit: Looks like it also only happens once, e.g., if you change the codeblock out to this :

	// calculate time elapsed since last interest accrual (measured in years for APR math)
	yearsElapsed := sdk.NewDec(currentTime - prevInterestTime).QuoInt64(types.SecondsPerYear)
	if yearsElapsed.IsNegative() {
		// Because this action is not caused by a message, logging and
		// events are here instead of msg_server.go
		k.Logger(ctx).Debug(
			"negative time elapsed",
			"block_height", fmt.Sprintf("%d", ctx.BlockHeight()),
			"unix_time", fmt.Sprintf("%d", currentTime),
			"yearsElapsed", yearsElapsed.String(),
		)
		return nil
	}

You get the "negative time elapsed" message, and then it continues as normal

@toteki
Copy link
Member

toteki commented Apr 1, 2022

We could do that - but negative time elapsed is a panic because if it happens for any amount of time (worst case: if block time were ever zero for some reason) then unjustified interest (worst case: 50 years worth) will accrue instantly on the next block with no error message.

I want to be sure that it's not a starport-only problem before considering removing the current safety. Has anyone run into the bug without starport during this week's gov upgrade testing?

@adamewozniak
Copy link
Collaborator Author

I don't believe this has been seen in a while. Please confirm / close, either @dreamcodez, @RafilxTenfen, or @toteki

@toteki
Copy link
Member

toteki commented May 25, 2022

Been a while - I suspect it's because our local testing has evolved beyond relaunching with starport (e.g. noob scripts and multinode now).

That would mean that blocktime moving backwards was just a starport edge case we don't have a problem.

@toteki toteki closed this as completed May 25, 2022
@adamewozniak
Copy link
Collaborator Author

Reopening this since @RafilxTenfen has seen it while trying to do a fork of our testnet

@adamewozniak adamewozniak reopened this Jun 5, 2022
@adamewozniak
Copy link
Collaborator Author

Closing this back up, @RafilxTenfen was able to get the fork past this just by updating the genesis file :^)

@RafilxTenfen
Copy link
Contributor

Hey, I was able to reproduce this with an single node
Here are the steps:

  • go to main on umee or git checkout 2efa932069272c9fcbfd8ab80e02b7141f2f393b
  • build the binary make build
  • copy the new binary to cp ./build/umeed ../umee-infra/noob/umeed-builds/umee-main
  • go to this branch on umee-infra
  • enter the noob folder
  • run the script ./leverage-halt.sh umeed-builds/umeed-main
  • Look at the logs in another shell tail -f ./data/umeetest-1.n0.log

Expected error
image

@RafilxTenfen RafilxTenfen reopened this Jun 8, 2022
@RafilxTenfen RafilxTenfen changed the title Beta-enabled chain reaches consensus failure on binary rebuild and restart chain reaches consensus failure on binary rebuild and restart Jun 10, 2022
@RafilxTenfen
Copy link
Contributor

So, this is basically happening, because when we export a genesis it saves a json as the genesis state, this genesis state has the time at which the genesis was generated ex.:
"genesis_time":"2022-06-10T15:09:14.61567517Z"
When loading the tendermint state it uses the genesis file when there is no state from *.db and then it sets the context LastBlockTime with the genesis time
LastBlockTime: genDoc.GenesisTime
and then the context header time which is used by our leverage.keeper AccurateAllInterest is filled with the time from the GenesisTime which is before than our last_interest_time saved in the genesis state

@RafilxTenfen
Copy link
Contributor

In my opinion, tendermint should save the last block time in the state as well...
I will test the same thing with gaia to see if their code will produce a block.time < previousblock.time

One idea to solve this would be for us to use block heigh difference instead of time, do you see a way to use this instead of the time lapsed? @toteki

@toteki
Copy link
Member

toteki commented Jun 10, 2022

Possible approaches:

Option 0: Remove the error and allow negative time to elapse. Bad because it would allow double-interest over the same time period.
Option 1: Fix BlockTime behavior on export. We would need to open an issue on cosmos sdk probably (we can do that anyway) but shouldn't rely on it for our fix.
Option 2: Use block height. This would lead to inaccuracies each block (6.5sec being treated as 5sec on average) and reduce yields. Also interest would fail to accrue over long periods (e.g. 7 days) where chain is down.
Option 3: Modify AccrueAllInterest to treat "negative time elapsed" as "zero time elapsed", aborting the function without modifying state until an EndBlock where BlockTime > LastInterestTime.

@RafilxTenfen
Copy link
Contributor

Possible approaches:

Option 0: Remove the error and allow negative time to elapse. Bad because it would allow double-interest over the same time period. Option 1: Fix BlockTime behavior on export. We would need to open an issue on cosmos sdk probably (we can do that anyway) but shouldn't rely on it for our fix. Option 2: Use block height. This would lead to inaccuracies each block (6.5sec being treated as 5sec on average) and reduce yields. Also interest would fail to accrue over long periods (e.g. 7 days) where chain is down. Option 3: Modify AccrueAllInterest to treat "negative time elapsed" as "zero time elapsed", aborting the function without modifying state until an EndBlock where BlockTime > LastInterestTime.

I liked the Option 3 ;D

@RafilxTenfen
Copy link
Contributor

I was able to reproduce with gaia v702
image

@RafilxTenfen
Copy link
Contributor

Related to #1533

@robert-zaremba
Copy link
Member

and then the context header time which is used by our leverage.keeper AccurateAllInterest is filled with the time from the GenesisTime which is before than our last_interest_time saved in the genesis state

So our genesis is not consistent. Instead of adding an exception logic in AccurateAllInterest we should fix the genesis export:

  • I don't see any issue with BlockTime - this is correct. It's the last block time.
  • When exporting the genesis, we should correctly set the LastBlockTime to the last a value from the state.
  • When importing genesis we should see if LastBlockTime is defined in genesis. Only if it's not defined we fallback to genDoc.GenesisTime

@RafilxTenfen
Copy link
Contributor

RafilxTenfen commented Jun 15, 2022

and then the context header time which is used by our leverage.keeper AccurateAllInterest is filled with the time from the GenesisTime which is before than our last_interest_time saved in the genesis state

So our genesis is not consistent. Instead of adding an exception logic in AccurateAllInterest we should fix the genesis export:

  • I don't see any issue with BlockTime - this is correct. It's the last block time.
  • When exporting the genesis, we should correctly set the LastBlockTime to the last a value from the state.
  • When importing genesis we should see if LastBlockTime is defined in genesis. Only if it's not defined we fallback to genDoc.GenesisTime

But the exporting / importing stuffs from blocks is done in tendermint/cosmos-sdk layer, right?
and at the moment we don't save the LastBlockTime, by the way, I think that saving this information in the exported genesis would solve this ;D

@brianosaurus
Copy link
Contributor

I'm unsure that will do it. We're having consensus failures in other places. Can you post the stack trace and error message? As text (not an image)? I would like to see it.

@robert-zaremba
Copy link
Member

But the exporting / importing stuffs from blocks is done in tendermint/cosmos-sdk layer, right?

No. We control how to handle import / export. https://github.com/umee-network/umee/blob/main/x/leverage/genesis.go

@toteki
Copy link
Member

toteki commented Jun 15, 2022

But the exporting / importing stuffs from blocks is done in tendermint/cosmos-sdk layer, right?

No. We control how to handle import / export. https://github.com/umee-network/umee/blob/main/x/leverage/genesis.go

The inaccurate time isn't in leverage export - it's coming from BlockTime() itself going backwards. That's also why Rafael was able to reproduce with gaiad

@robert-zaremba
Copy link
Member

oh, I thought we have our own LastBlockTime in leverage.

@RafilxTenfen
Copy link
Contributor

RafilxTenfen commented Jun 15, 2022

I'm unsure that will do it. We're having consensus failures in other places. Can you post the stack trace and error message? As text (not an image)? I would like to see it.

On gaiad is not going to have an consensus failure, but the next block after exported genesis happen will have the genesis block time, you can reproduce this

You can reproduce the consensus failure, running theses steps

The error as text

@robert-zaremba
Copy link
Member

I see, I didn't look at cosmos/gaia#1533

In that case I agree: let's:

  • change log to Warn
  • add todo to solve it once the bug will be resolved by Tendermint.

@RafilxTenfen
Copy link
Contributor

I see, I didn't look at cosmos/gaia#1533

In that case I agree: let's:

  • change log to Warn
  • add todo to solve it once the bug will be resolved by Tendermint.

#1018 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C:x/leverage T:Bug A regression or bug in the Umee codebase.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants