
[Bug]: [Cosmovisor] Upgrade is applied immediately when using the --upgrade-height flag #19227

Closed
Tracked by #20067
dasanchez opened this issue Jan 24, 2024 · 19 comments · Fixed by #20585
Labels: C:Cosmovisor (Issues and PR related to Cosmovisor), T:Bug

@dasanchez

dasanchez commented Jan 24, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Cosmovisor restarts with the new binary immediately after running add-upgrade with the --upgrade-height flag.

Cosmos SDK Version

Cosmovisor v1.5.0

How to reproduce?

  1. Start a Gaia v13 chain with Cosmovisor v1.5.0
  2. Download a Gaia v14 binary to gaiad
  3. Run cosmovisor add-upgrade v14 gaiad --upgrade-height 300 at any point before reaching block 300.
  4. Observe that the node restarts right away with the downloaded binary.

I used this script to run steps 2 and 3:

#!/bin/bash

export DAEMON_NAME=gaiad
export DAEMON_HOME=/home/gaia/.gaia
export DAEMON_DATA_BACKUP_DIR=/home/gaia/backup
height=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.last_block_height')
version=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.version')
echo "The node is running $version at height $height."
wget https://github.com/cosmos/gaia/releases/download/v14.0.0-rc0/gaiad-v14.0.0-rc0-linux-amd64 -q -O gaiad
chmod +x gaiad
command="cosmovisor add-upgrade v14 gaiad --upgrade-height 300"
echo "$command"
$command
echo "Waiting ten seconds..."
sleep 10
height=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.last_block_height')
version=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.version')
echo "The node is running $version at height $height."


@julienrbrt
Member

julienrbrt commented Jan 24, 2024

Can you dump the upgrade info here too? And the SDK versions as well.

@github-project-automation github-project-automation bot moved this to 👀 To Do in Cosmos-SDK Jan 24, 2024
@dasanchez
Author

dasanchez commented Jan 24, 2024

@julienrbrt
Upgrade info:

cat ~/.gaia/cosmovisor/upgrades/v14/upgrade-info.json 
{"name":"v14","time":"0001-01-01T00:00:00Z","height":300}

Pre-upgrade cosmos_sdk_version: v0.45.16 (Gaia v13.0.2)
Post-upgrade cosmos_sdk_version: v0.45.16 (Gaia v14.0.0-rc0)
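
For context, the gate that add-upgrade relies on is conceptually simple: read the target height out of upgrade-info.json and only switch binaries once the node has reached it. A minimal Go sketch of that intended check (hypothetical names, not the actual cosmovisor code):

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// upgradeInfo mirrors the upgrade-info.json written by add-upgrade.
type upgradeInfo struct {
	Name   string `json:"name"`
	Height int64  `json:"height"`
}

// shouldUpgrade reads the target height from upgrade-info.json and
// reports whether the node has reached it yet.
func shouldUpgrade(path string, currentHeight int64) (bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	var info upgradeInfo
	if err := json.Unmarshal(data, &info); err != nil {
		return false, err
	}
	return currentHeight >= info.Height, nil
}

func main() {
	// With the upgrade-info.json above (height 300) and the node at
	// height 8, this should print "false": the switch must wait.
	ok, err := shouldUpgrade(os.ExpandEnv("$HOME/.gaia/cosmovisor/upgrades/v14/upgrade-info.json"), 8)
	fmt.Println(ok, err)
}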

@julienrbrt
Member

Thanks. One last question: what does gaiad status return just before add-upgrade?

@julienrbrt julienrbrt self-assigned this Jan 24, 2024
@dasanchez
Author

dasanchez commented Jan 24, 2024

@julienrbrt gaiad status returns this before running add-upgrade:

{"NodeInfo":{"protocol_version":{"p2p":"8","block":"11","app":"0"},"id":"af3e918c3e2533ceb3be8c48c3068bf6682fb67c","listen_addr":"tcp://0.0.0.0:26656","network":"testnet","version":"0.34.29","channels":"40202122233038606100","moniker":"cosmos-node","other":{"tx_index":"on","rpc_address":"tcp://0.0.0.0:26657"}},"SyncInfo":{"latest_block_hash":"760D16D9D591C769184A2A4F659AB730C80494BD84633FC0451EFABE524307F4","latest_app_hash":"67F695FF4B37D5D12F9A64B83F32679DE289F3AC3A62427CC2EEFC84B61C1634","latest_block_height":"8","latest_block_time":"2024-01-24T19:12:39.769313361Z","earliest_block_hash":"067188F2ADF5931924FD7562506F7DF457C2329F2A1AF00BB777E12FEA273A29","earliest_app_hash":"E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855","earliest_block_height":"1","earliest_block_time":"2024-01-24T17:08:52.990568831Z","catching_up":false},"ValidatorInfo":{"Address":"51A4AFE4B661E141691029CABF023E3C976FF28E","PubKey":{"type":"tendermint/PubKeyEd25519","value":"kkk4KLWvqb8AKcJJcO5VpBvbOqfds7XqqZ6HRxzZPz0="},"VotingPower":"8000"}}

@julienrbrt
Member

julienrbrt commented Jan 24, 2024

Weird, then it should be parsed properly: https://github.com/cosmos/cosmos-sdk/blob/main/tools/cosmovisor/scanner.go#L167

@julienrbrt julienrbrt moved this from 👀 To Do to ✍ In Progress in Cosmos-SDK Jan 29, 2024
@julienrbrt julienrbrt mentioned this issue Apr 17, 2024
@fmorency
Contributor

I also got bitten by that issue!

@julienrbrt julienrbrt added the C:Cosmovisor Issues and PR related to Cosmovisor label May 30, 2024
@julienrbrt
Member

I also got bitten by that issue!

:/ I am re-prioritizing cosmovisor work. I'll re-investigate this.

@steven-varga

I am observing the same behaviour with the dydxv4 Cosmos chain upgrade scheduled for Jul 25th 2024, 07:50:07+00:00 UTC (in 10 hours). Does anyone happen to know whether a downgrade is an option?

@BillyJunID

I have encountered this issue with the Warden testnet as well.
Commands used:

DAEMON_NAME=wardend
DAEMON_HOME=/home/warden/.warden
UPGRADE_HEIGHT=1534500

rm -rf ${HOME}/upgrade
mkdir ${HOME}/upgrade
cd ${HOME}/upgrade
wget https://github.com/warden-protocol/wardenprotocol/releases/download/v0.4.0/wardend_Linux_x86_64.zip
unzip wardend_Linux_x86_64.zip
rm wardend_Linux_x86_64.zip CHANGELOG.md LICENSE README.md
chmod +x ${HOME}/upgrade/wardend
cosmovisor add-upgrade "v040" ${HOME}/upgrade/wardend --upgrade-height ${UPGRADE_HEIGHT}

After executing cosmovisor add-upgrade, the node immediately restarts and uses the new binary, resulting in a crash.

@freak12techno
Contributor

I faced this yesterday on the Neutron emergency upgrade, on 2 nodes. It might be because I used the --force flag, as there was already an upgrade-info.json file there.

@julienrbrt can we get this prioritised, please? This is quite dangerous.

@Bosco-2019

Seeing this same issue on the Cosmos Provider testnet. It seems to be random: I have two identical nodes, both used with Horcrux; one worked fine, the other tried to apply the upgrade immediately.

@fish2plain

Experiencing the same issue on the current theta and provider testnets.

@julienrbrt
Member

Looking at this again right now!

@pserdiuk-everstake

pserdiuk-everstake commented Aug 8, 2024

@julienrbrt I see a potential problem here, in scanner.go

	result, err := exec.Command(fw.currentBin, "status").Output() //nolint:gosec // we want to execute the status command
	if err != nil {
		return 0, err
	}
	// file exist but too early in height
	currentHeight, _ := fw.checkHeight()
	if currentHeight != 0 && currentHeight < info.Height {
		return false
	}
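
To make the hazard concrete: checkHeight's error is discarded in the excerpt above, so any failure to run the binary or to parse its output leaves currentHeight at 0, and the currentHeight != 0 guard then falls straight through to "upgrade now". A self-contained sketch of that suspected failure mode (hypothetical checkHeight, not the real implementation):

package main

import (
	"errors"
	"fmt"
)

// checkHeight stands in for the real call; here it always fails,
// e.g. because the status command wrote to stderr or printed JSON
// in an unexpected shape.
func checkHeight() (int64, error) {
	return 0, errors.New("could not parse status output")
}

func main() {
	const upgradeHeight int64 = 300

	// Mirrors the excerpt above: the error is discarded.
	currentHeight, _ := checkHeight()

	// With currentHeight == 0 the guard is skipped entirely, so the
	// watcher behaves as if the target height had been reached.
	if currentHeight != 0 && currentHeight < upgradeHeight {
		fmt.Println("too early, waiting")
		return
	}
	fmt.Println("upgrade applied immediately") // this branch runs
}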

Avoid launching the node binary like that as much as possible, especially without any command-line options. A more thorough way to get data from the running node is to parse config/config.toml once, find its current RPC endpoint, and send an RPC request each time you need to get the current block height. E.g. some of my shell code:

# Function to get the local HTTP RPC endpoint from the node's configuration
# TODO: This is currently limited to Neutron. Cosmos SDK exposes a different "config" command (`config get config rpc.laddr`), will need to change this function to recognize that
get_local_rpc_endpoint() { "$NODE_BINARY" config --home "$NODE_DIR" | jq -er .node | sed s/tcp/http/; }

# Function to send a call to the HTTP RPC endpoint
call_rpc_endpoint() { curl -s --connect-timeout 5 --max-time 20 --retry 5 --retry-delay 0 --retry-max-time 5 "$1/$2" | jq -er .result; }

# Function to get the latest block height via HTTP RPC
get_latest_block_height() { call_rpc_endpoint "$1" "status?" | jq -er .sync_info.latest_block_height; }

# Function to check if we have proper access to the local node
validate_local_node() {
  LOCAL_RPC_ENDPOINT=""
  # Check if the node's binary is readable and executable and if the node's directory exists and is readable
  if [ -d "$NODE_DIR" ] && [ -r "$NODE_DIR" ] && [ -x "$NODE_BINARY" ] && [ -r "$NODE_BINARY" ]; then
    LOCAL_RPC_ENDPOINT=$(get_local_rpc_endpoint); [ -z "$LOCAL_RPC_ENDPOINT" ] && print_error_and_exit "node_error"
    return 0;
  else return 1; fi
}

# Main function to compare heights between the local node and the remote RPC
compare_heights() {
  local_height=$(get_latest_block_height "$LOCAL_RPC_ENDPOINT")
  remote_height=$(get_latest_block_height "$REMOTE_RPC_ENDPOINT")
  { [ -z "$local_height" ] || [ -z "$remote_height" ]; } && print_error_and_exit "curl_error"
  delta=$((remote_height - local_height))
  [ "$delta" -le "$NODE_LAG" ] && \
  { printf "Node %s is not lagging, delta with %s is %s, local height is %s, remote height is %s\n" \
  "$LOCAL_RPC_ENDPOINT" "$REMOTE_RPC_ENDPOINT" "$delta" "$local_height" "$remote_height"; exit 0; } || \
  { printf "Node %s is lagging, delta with %s is %s, local height is %, remote height is %s!\n" \
  "$LOCAL_RPC_ENDPOINT" "$REMOTE_RPC_ENDPOINT" "$delta" "$local_height" "$remote_height"; exit 1; }
}

This is significantly faster because it avoids launching another executable, even if only for a moment (I'm still doing that here, but you could use something like a mandatory DAEMON_RPC_ENDPOINT variable instead). It is also compatible with nodes that run the RPC server on a non-default port, which gaiad status won't pick up by itself without --node. If there are multiple nodes on one machine, the current code can end up requesting the status of the node running on the default RPC port and then using that node's latest_block_height to decide whether the upgrade should be applied immediately to a different node.

It also avoids possible side effects related to the outputs and exit codes of the status command. For example, in some earlier versions of the Cosmos SDK I noticed that this command printed its output to stderr, not stdout, as did the main logger. Finally, it could resolve your TODO problem related to test binaries :)

Of course, I don't know if either of these situations is the culprit behind this specific problem, but refactoring this part of the code to use an RPC call is a good idea anyway and will aid in any further debugging.
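
For comparison, the RPC-first approach argued for above could look roughly like this in Go. This is a sketch under the assumption that the node exposes the standard CometBFT /status endpoint; in practice the endpoint would come from config.toml or a DAEMON_RPC_ENDPOINT-style variable rather than being hardcoded:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// rpcStatus matches the CometBFT JSON-RPC /status response
// (lowercase sync_info, height encoded as a string).
type rpcStatus struct {
	Result struct {
		SyncInfo struct {
			LatestBlockHeight string `json:"latest_block_height"`
		} `json:"sync_info"`
	} `json:"result"`
}

// heightViaRPC asks the node's RPC endpoint for its current height
// instead of shelling out to the node binary.
func heightViaRPC(endpoint string) (int64, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(endpoint + "/status")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var s rpcStatus
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return 0, err
	}
	return strconv.ParseInt(s.Result.SyncInfo.LatestBlockHeight, 10, 64)
}

func main() {
	h, err := heightViaRPC("http://localhost:26657")
	fmt.Println(h, err)
}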

@julienrbrt julienrbrt linked a pull request Aug 9, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from 🤸‍♂️ In Progress to 🥳 Done in Cosmos-SDK Aug 12, 2024
@freak12techno
Contributor

@julienrbrt when can we expect cosmovisor v1.6.0 which I assume will include that fix?

@julienrbrt
Member

@julienrbrt when can we expect cosmovisor v1.6.0 which I assume will include that fix?

Hey, now: https://github.com/cosmos/cosmos-sdk/releases/tag/tools%2Fcosmovisor%2Fv1.6.0

@Bosco-2019

@julienrbrt unfortunately this has not solved the problem. I just tried this on neutron manual upgrade.

@julienrbrt
Member

@julienrbrt unfortunately this has not solved the problem. I just tried this on neutron manual upgrade.

Hi,
Could you post more info (cosmovisor version and reproduction steps)?
During my tests it was working fine.

@Bosco-2019

Bosco-2019 commented Aug 22, 2024

Hi, Could you post more info (cosmovisor version and reproduction steps)? During my tests it was working fine.

cosmovisor version: v1.6.0

cosmovisor add-upgrade v4.2.1 ~/go/bin/neutrond --force --upgrade-height 13711950

EDIT: Scratch that, @julienrbrt, I see where I messed up. While I did upgrade cosmovisor, I didn't restart the process before running the add-upgrade command, so I was actually still on v1.5.0 when I ran it. I have just tested again with a fake upgrade using the same binaries, and it did indeed stop the node at the correct height to apply the upgrade.

Thank you.

@tac0turtle tac0turtle removed this from Cosmos-SDK Sep 4, 2024