
Gaiad stuck on round number blocks #1958

Closed
5 tasks
Tracked by #2617
simonecid opened this issue Dec 13, 2022 · 11 comments
Labels
scope: comet-bft, type: bug (Issues that need priority attention -- something isn't working)

Comments

simonecid commented Dec 13, 2022

Summary of Bug

Gaiad got stuck on block 12800000 about a month ago, and again on block 13200000. Prior to getting stuck, the log reported Dialing peer failure errors.
After restarting gaiad, the log showed it still stuck on:

executed block height=12800000 module=consensus num_invalid_txs=0 num_valid_txs=1
executed block height=13200000 module=consensus num_invalid_txs=0 num_valid_txs=1

The only way to restore node functionality was to wipe its storage and sync it from an existing snapshot. During the catch-up process, the node had no problem processing the block on which it had previously been stuck.
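
For reference, the recovery looked roughly like the sketch below. This is an illustrative outline rather than the exact commands we ran: the snapshot URL is a placeholder, gaiad is assumed to run as a systemd unit named gaiad, and on older gaiad versions the reset subcommand is gaiad unsafe-reset-all rather than gaiad tendermint unsafe-reset-all.

# Sketch: stop the node, wipe its state, restore a snapshot, restart
systemctl stop gaiad
gaiad tendermint unsafe-reset-all --home /var/lib/atom
curl -L https://example.com/cosmoshub-pruned.tar.lz4 | lz4 -d | tar -xf - -C /var/lib/atom  # placeholder snapshot URL
systemctl start gaiad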

Version

gaiad v7.1.0, v8.0.0, v9.0.0

Steps to Reproduce

Start gaiad from an existing snapshot:

gaiad start --home /var/lib/atom --x-crisis-skip-assert-invariants

Configuration

Default configuration with custom pruning and quicksync:

pruning = "custom"

# These are applied if and only if the pruning strategy is custom.
pruning-keep-recent = "400000" # 390k blocks per month
pruning-keep-every = "0" 
pruning-interval = "100000" # cleans at a week interval
min-retain-blocks = 400000
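
As a rough sanity check on those comments, assuming an average block time of about 6.5 seconds (an approximation, not a measured value):

# ~30 days of blocks at ~6.5 s each: 400000 keep-recent is roughly a month,
# and a 100000-block pruning-interval fires roughly weekly.
echo "blocks per month ≈ $(( 30 * 86400 * 10 / 65 ))"  # prints ≈ 398769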

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
  • Is a spike necessary to map out how the issue should be approached?
@mpoke mpoke added this to Cosmos Hub Jan 20, 2023
@github-project-automation github-project-automation bot moved this to 🩹 Triage in Cosmos Hub Jan 20, 2023
@mpoke mpoke added the type: bug (Issues that need priority attention -- something isn't working) label Jan 20, 2023
mpoke (Contributor) commented Mar 9, 2023

@simonecid Could you please let us know if you are still experiencing the same issue with Gaia v8.0.1?

@mpoke mpoke moved this from 🩹 Triage to 🛑 Blocked in Cosmos Hub Mar 9, 2023
simonecid (Author)

@mpoke thank you for your answer.
I currently run v8.0.0 and I had the issue on this version, too. I generally have this problem once per month, so I will update to v9.0.0 on Wednesday with the chain upgrade and will let you know if gaiad gets stuck in the next 30 days.

simonecid (Author)

Hello @mpoke, unfortunately my gaiad v9.0.0 instance got stuck on block 14600000 and I had to reinitialise it from a snapshot. So, sadly, the issue persists on v9.0.0.

@mpoke mpoke moved this from 🛑 Blocked to 📥 Todo in Cosmos Hub Apr 2, 2023
@mmulji-ic mmulji-ic moved this from 📥 Todo to 🏗 In progress in Cosmos Hub Apr 4, 2023
mmulji-ic (Contributor)

Hi @adizere, could you also take a look at this issue?

@mmulji-ic mmulji-ic moved this from 🏗 In progress to 🛑 Blocked in Cosmos Hub Apr 4, 2023

adizere commented Jun 5, 2023

@simonecid do you use this node to serve (g)RPC queries?

simonecid (Author)

Hello @adizere, yes, we serve gRPC queries.

adizere commented Jun 5, 2023

I see, then you're possibly experiencing the same issue as Osmosis nodes did here: cometbft/cometbft#815. This is a subtle problem caused by the locking design of both the SDK/app and Comet ABCI. On the Comet side, we will improve the locking granularity (cometbft/cometbft#88). On the SDK/app side, better scoping of how (g)RPC queries are handled is needed and is being investigated; I'm not sure this is tracked anywhere.

To confirm that you're indeed experiencing the same problem as above, look for this line in the logs:

5:04AM INF Timed out dur=1000 height=9527566 module=consensus round=0 step=5

Please get back to us if you're able to confirm this log line appears when the node gets stuck.
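
If it helps, one quick way to scan for that pattern, assuming gaiad runs under systemd and logs to the journal (adjust to wherever your logs are actually written):

journalctl -u gaiad --no-pager | grep "Timed out" | grep "module=consensus" | tail -n 50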

simonecid (Author) commented Jun 5, 2023

Hello @adizere, thank you for the prompt reply.

I have a partial copy of the logs from just before an incident in which the node got stuck on block 13800000.
The earliest block recorded in the log is 13796462, and I can see:

7:11AM INF Timed out dur=3033.461354 height=13796462 module=consensus round=0 step=1

Then, for every block after that, I can see the Timed out error until we fetched block 13800000. By that point, our logs started to fill up with:

2:22PM INF Dialing peer address={"id":"8707282f51ebfba828c08a7316ca84ed5667a0f5","ip":"74.118.142.175","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"05024a29b6fb85197a3ed876e69faaea63b74c1b","ip":"35.201.248.155","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"fe0f79db9664f07bf770b5eb22693af530111014","ip":"3.238.104.92","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"793d2f2f855c5837d53a3f28a309c1bfa0774dea","ip":"5.9.230.130","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"fb7222250e62c55cc69dec6bc8aa8807f859215c","ip":"23.81.180.228","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"ab7bb8cc561f8ec01905c5f60786a53c7f5f02ad","ip":"18.209.62.136","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"5809988aea556dc842fb3a9c7b6cd1010f3b75d9","ip":"54.236.4.166","port":26656} module=p2p
2:22PM INF Dialing peer address={"id":"ccb18e6b2b46ed778394e05c0ef8c5062caf0afc","ip":"35.215.38.153","port":26656} module=p2p

So that fits with the cometbft issue you linked.

I did not have this log at the time I opened this issue; I will attach it to this comment.
I presume that the immediate mitigation is to not use gRPC until the problem is solved?

Thank you,
Simone

error_log_cosmoshub_clean.log

adizere commented Jun 5, 2023

Thank you also!

I presume that the immediate mitigation is to not use gRPC until the problem is solved?

Correct. To rephrase: the preliminary mitigation is to rate-limit (at minimum) or disable entirely (at most) g/RPC access to this node, at least for a couple of days, and then assess whether the problem still appears. That would be a great first step!
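
As a concrete, hedged illustration of the "disable entirely" option: the query servers can be switched off in app.toml roughly as below. The exact sections and defaults depend on the gaia/SDK version, and rate-limiting at a reverse proxy in front of the node is a gentler alternative.

# app.toml (sketch): turn off the query endpoints while assessing the issue
[api]
enable = false

[grpc]
enable = false

[grpc-web]
enable = false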

mmulji-ic (Contributor)


@simonecid, I hope the above mitigation helps. We've released v10; do let us know whether you still encounter this problem.

mpoke (Contributor) commented Sep 11, 2023

Closing this issue due to lack of activity. @simonecid please let us know if the problem persists.

@mpoke mpoke closed this as completed Sep 11, 2023
@github-project-automation github-project-automation bot moved this from 🛑 F3: OnHold to 👍 F4: Assessment in Cosmos Hub Sep 11, 2023
@mpoke mpoke moved this from 👍 F4: Assessment to ✅ Done in Cosmos Hub Sep 22, 2023