
Gaiad stuck on round number blocks #1958

Closed
5 tasks
Tracked by #2617
simonecid opened this issue Dec 13, 2022 · 11 comments
Labels
scope: comet-bft, type: bug (Issues that need priority attention -- something isn't working)

Comments

simonecid commented Dec 13, 2022

Summary of Bug

Gaiad got stuck on block 12800000 about a month ago, and again on block 13200000. Prior to getting stuck, the log reported Dialing peer failure errors.
After restarting gaiad, the log showed it still stuck on:

executed block height=12800000 module=consensus num_invalid_txs=0 num_valid_txs=1
executed block height=13200000 module=consensus num_invalid_txs=0 num_valid_txs=1

The only way to restore node functionality was to wipe its storage and sync it from an existing snapshot. During the catch-up process, the node had no problem processing the block on which it had previously been stuck.
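
For reference, the recovery looked roughly like the sketch below. This is an illustrative outline rather than the exact commands we ran: the snapshot URL is a placeholder, gaiad is assumed to run as a systemd unit named gaiad, and on older gaiad versions the reset subcommand is gaiad unsafe-reset-all rather than gaiad tendermint unsafe-reset-all.

# Sketch: stop the node, wipe its state, restore a snapshot, restart
systemctl stop gaiad
gaiad tendermint unsafe-reset-all --home /var/lib/atom
curl -L https://example.com/cosmoshub-pruned.tar.lz4 | lz4 -d | tar -xf - -C /var/lib/atom  # placeholder snapshot URL
systemctl start gaiad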

Version

gaiad v7.1.0, v8.0.0, v9.0.0

Steps to Reproduce

Start gaiad from an existing snapshot:

gaiad start --home /var/lib/atom --x-crisis-skip-assert-invariants

Configuration

Default configuration with custom pruning and quicksync:

pruning = "custom"

# These are applied if and only if the pruning strategy is custom.
pruning-keep-recent = "400000" # 390k blocks per month
pruning-keep-every = "0" 
pruning-interval = "100000" # cleans at a week interval
min-retain-blocks = 400000
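
As a rough sanity check on those comments, assuming an average block time of about 6.5 seconds (an approximation, not a measured value):

# ~30 days of blocks at ~6.5 s each: 400000 keep-recent is roughly a month,
# and a 100000-block pruning-interval fires roughly weekly.
echo "blocks per month ≈ $(( 30 * 86400 * 10 / 65 ))"  # prints ≈ 398769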

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
  • Is a spike necessary to map out how the issue should be approached?
@mpoke mpoke added this to Cosmos Hub Jan 20, 2023
@github-project-automation github-project-automation bot moved this to 🩹 Triage in Cosmos Hub Jan 20, 2023
@mpoke mpoke added the type: bug (Issues that need priority attention -- something isn't working) label Jan 20, 2023
mpoke (Contributor) commented Mar 9, 2023

@simonecid Could you please let us know if you are still experiencing the same issue with Gaia v8.0.1?

@mpoke mpoke moved this from 🩹 Triage to 🛑 Blocked in Cosmos Hub Mar 9, 2023
simonecid (Author)

@mpoke thank you for your answer.
I currently run v8.0.0 and I had the issue on this version, too. I generally have this problem once per month, so I will update to v9.0.0 on Wednesday with the chain upgrade and will let you know if gaiad gets stuck in the next 30 days.

simonecid (Author)

Hello @mpoke, unfortunately my gaiad v9.0.0 instance got stuck on block 14600000 and I had to reinitialise it from a snapshot. So, sadly, the issue persists on v9.0.0.

@mpoke mpoke moved this from 🛑 Blocked to 📥 Todo in Cosmos Hub Apr 2, 2023
@mmulji-ic mmulji-ic moved this from 📥 Todo to 🏗 In progress in Cosmos Hub Apr 4, 2023
mmulji-ic (Contributor)

Hi @adizere, could you also take a look at this issue?

@mmulji-ic mmulji-ic moved this from 🏗 In progress to 🛑 Blocked in Cosmos Hub Apr 4, 2023

adizere commented Jun 5, 2023

@simonecid do you use this node to serve (g)RPC queries?

simonecid (Author)

Hello @adizere, yes, we serve gRPC queries.

adizere commented Jun 5, 2023

I see, then you're possibly experiencing the same issue as Osmosis nodes did here: cometbft/cometbft#815. This is a subtle problem caused by the locking design of both the SDK/app and Comet ABCI. On the Comet side, we will improve the locking granularity (cometbft/cometbft#88). On the SDK/app side, better scoping of how (g)RPC queries are handled is needed and is being investigated; I'm not sure this is tracked anywhere.

To confirm that you're indeed experiencing the same problem as above, look for this line in the logs:

5:04AM INF Timed out dur=1000 height=9527566 module=consensus round=0 step=5

Please get back to us if you're able to confirm this log line appears when the node gets stuck.
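
If it helps, one quick way to scan for that pattern, assuming gaiad runs under systemd and logs to the journal (adjust to wherever your logs are actually written):

journalctl -u gaiad --no-pager | grep "Timed out" | grep "module=consensus" | tail -n 50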

simonecid (Author) commented Jun 5, 2023

Hello @adizere, thank you for the prompt reply.

I have a partial copy of the logs from just before an incident in which the node got stuck on block 13800000.
The earliest block recorded in the log is 13796462, and I can see:

7:11AM INF Timed out dur=3033.461354 height=13796462 module=consensus round=0 step=1

Then, for every block after that, I can see the Timed out error until we fetched block 13800000. By that point, our logs started to fill up with:

2:22PM INF Dialing peer address={"id":"8707282f51ebfba828c08a7316ca84ed5667a0f5","ip":"74.118.142.175","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"05024a29b6fb85197a3ed876e69faaea63b74c1b","ip":"35.201.248.155","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"fe0f79db9664f07bf770b5eb22693af530111014","ip":"3.238.104.92","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"793d2f2f855c5837d53a3f28a309c1bfa0774dea","ip":"5.9.230.130","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"fb7222250e62c55cc69dec6bc8aa8807f859215c","ip":"23.81.180.228","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"ab7bb8cc561f8ec01905c5f60786a53c7f5f02ad","ip":"18.209.62.136","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"5809988aea556dc842fb3a9c7b6cd1010f3b75d9","ip":"54.236.4.166","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"ccb18e6b2b46ed778394e05c0ef8c5062caf0afc","ip":"35.215.38.153","port":26656} module=p2p

So that fits with the cometbft issue you linked.

I did not have this log at the time I opened this issue; I will attach it to this comment.
I presume that the immediate mitigation is to not use gRPC until the problem is solved?

Thank you,
Simone

error_log_cosmoshub_clean.log

adizere commented Jun 5, 2023

Thank you also!

I presume that the immediate mitigation is to not use gRPC until the problem is solved?

Correct. To rephrase: the preliminary mitigation is to rate-limit (at minimum) or disable entirely (at most) g/RPC access to this node, at least for a couple of days, and then assess whether the problem still appears. That would be a great first step!
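
As a concrete, hedged illustration of the "disable entirely" option: the query servers can be switched off in app.toml roughly as below. The exact sections and defaults depend on the gaia/SDK version, and rate-limiting at a reverse proxy in front of the node is a gentler alternative.

# app.toml (sketch): turn off the query endpoints while assessing the issue
[api]
enable = false

[grpc]
enable = false

[grpc-web]
enable = false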

mmulji-ic (Contributor)


@simonecid, I hope the above mitigation helps. We've released v10; do let us know whether you still encounter this problem.

mpoke (Contributor) commented Sep 11, 2023

Closing this issue due to lack of activity. @simonecid please let us know if the problem persists.

@mpoke mpoke closed this as completed Sep 11, 2023
@github-project-automation github-project-automation bot moved this from 🛑 F3: OnHold to 👍 F4: Assessment in Cosmos Hub Sep 11, 2023
@mpoke mpoke moved this from 👍 F4: Assessment to ✅ Done in Cosmos Hub Sep 22, 2023