Consensus failure on cosmos DB pruning #8354

Open
mhofman opened this issue Sep 19, 2023 · 1 comment

Comments

@mhofman
Member

mhofman commented Sep 19, 2023

Describe the bug

A validator reported experiencing a consensus failure with the following error:

ERR CONSENSUS FAILURE!!! err="unable to delete version 11686000 with 1 active readers" module=consensus stack="goroutine 7581"

Tracing the error message shows that it stems from the cosmos IAVL pruning logic: https://github.com/cosmos/iavl/blob/v0.17.3/nodedb.go#L203
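To make the failure mode concrete, here is a minimal sketch of a reader-count guard of the kind the error message implies. It is not the iavl code: the `versionGuard` type and its `Acquire`, `Release`, and `Delete` methods are hypothetical names chosen for illustration, and only the error text is taken from the log above.

```go
package main

import (
	"fmt"
	"sync"
)

// versionGuard is a hypothetical stand-in for per-version reader bookkeeping
// in a versioned store. It is NOT the iavl nodeDB type.
type versionGuard struct {
	mu      sync.Mutex
	readers map[int64]uint32 // version -> number of open readers
}

func newVersionGuard() *versionGuard {
	return &versionGuard{readers: make(map[int64]uint32)}
}

// Acquire registers one reader on a version (for example an export or a query).
func (g *versionGuard) Acquire(version int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.readers[version]++
}

// Release drops one reader on a version, if any is registered.
func (g *versionGuard) Release(version int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.readers[version] > 0 {
		g.readers[version]--
	}
}

// Delete refuses to prune a version that still has open readers.
func (g *versionGuard) Delete(version int64) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if n := g.readers[version]; n > 0 {
		return fmt.Errorf("unable to delete version %d with %d active readers", version, n)
	}
	return nil
}

func main() {
	g := newVersionGuard()
	g.Acquire(11686000)
	fmt.Println(g.Delete(11686000)) // an error while the reader is open
	g.Release(11686000)
	fmt.Println(g.Delete(11686000)) // <nil> once the reader is released
}
```

In this model the deletion only fails if something still holds a reader on the version at the moment pruning reaches it, which is why the question of who holds that reader matters for the rest of the report.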

Our cosmos-sdk uses v0.17.3 of that package, while the latest cosmos-sdk (v0.47) has bumped the dependency to v0.20.0. However, searching the iavl repo for changes and issues doesn't reveal any change to the logic for deleting a version that has existing readers.

There is a known issue regarding mismatching snapshot-interval and keep-interval configs in cosmos, but 1) it is supposed to be mitigated in our version of cosmos-sdk, and 2) the validator claims the node is not creating state-sync snapshots.

The config relating to pruning shared by the validator:

pruning = "custom"
pruning-keep-recent = "100"
pruning-keep-every = "0"
pruning-interval = "10"
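As a rough illustration of what these settings imply, the sketch below models when a given version becomes eligible for pruning. This is a simplified model and not the cosmos-sdk implementation: the `pruneConfig` type and its methods are hypothetical, and the exact off-by-one boundaries in the real code may differ.

```go
package main

import "fmt"

// pruneConfig mirrors the three settings quoted above (hypothetical type).
type pruneConfig struct {
	KeepRecent int64 // pruning-keep-recent
	KeepEvery  int64 // pruning-keep-every (0 disables periodic retention)
	Interval   int64 // pruning-interval
}

// prunable reports whether version is eligible for deletion once the chain
// has committed height latest, under this simplified model.
func (c pruneConfig) prunable(version, latest int64) bool {
	if version > latest-c.KeepRecent {
		return false // still inside the keep-recent window
	}
	if c.KeepEvery > 0 && version%c.KeepEvery == 0 {
		return false // retained as a periodic version
	}
	return true
}

// firstPruneHeight returns the first height at which a pruning pass would
// consider version, or false if the version is retained by KeepEvery.
func (c pruneConfig) firstPruneHeight(version int64) (int64, bool) {
	if c.KeepEvery > 0 && version%c.KeepEvery == 0 {
		return 0, false
	}
	h := version + c.KeepRecent // first height with version outside the window
	if c.Interval > 0 && h%c.Interval != 0 {
		h += c.Interval - h%c.Interval // round up to the next pruning pass
	}
	return h, true
}

func main() {
	cfg := pruneConfig{KeepRecent: 100, KeepEvery: 0, Interval: 10}
	h, ok := cfg.firstPruneHeight(11686000)
	fmt.Println(h, ok)
	fmt.Println(cfg.prunable(11686000, 11686050))
}
```

Under this model, version 11686000 is not touched until roughly 100 blocks after it was committed, so a reader would have to stay open on it for about that long to collide with pruning.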

A keep-recent of 100 should leave enough room for any potential state-sync snapshot of the cosmos DB to be performed. The full snapshot may not be complete after 100 blocks, since our snapshots usually take about 150 blocks. However, the multistore is snapshotted first, and the multistore read is closed before reaching the swingset extension, which is where all the time is spent. See https://github.com/agoric-labs/cosmos-sdk/blob/v0.45.11-alpha.agoric.3/snapshots/manager.go#L176-L186
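The timing argument can be written down as a small check. This is only a sketch of the reasoning: `conflictPossible` is a hypothetical helper, and the 5-block figure for the multistore read is an assumed placeholder, since the report does not state how long that phase takes.

```go
package main

import "fmt"

// conflictPossible reports whether a reader opened on a version could still
// be open when that version leaves the keep-recent window (hypothetical helper).
func conflictPossible(readerOpenBlocks, keepRecent int64) bool {
	return readerOpenBlocks >= keepRecent
}

func main() {
	const keepRecent = 100         // pruning-keep-recent from the shared config
	const fullSnapshotBlocks = 150 // approximate full snapshot duration from the report
	const multistoreReadBlocks = 5 // ASSUMED value, not stated in the report

	// If the multistore reader were held for the whole snapshot, pruning could collide.
	fmt.Println(conflictPossible(fullSnapshotBlocks, keepRecent))
	// Since the reader is closed after the multistore phase, it should not.
	fmt.Println(conflictPossible(multistoreReadBlocks, keepRecent))
}
```

The point of the comparison is that only the duration of the multistore read matters for the reader count, not the duration of the whole snapshot.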

A somewhat related cosmos-sdk issue regarding the "everything" pruning strategy doesn't seem applicable, since the keep-recent config is set to 100.

Expected behavior

No crash on pruning

Platform Environment

agoric-upgrade-11 on mainnet

@mhofman mhofman added bug Something isn't working agoric-cosmos labels Sep 19, 2023
@JimLarson JimLarson self-assigned this Sep 20, 2023
@JimLarson
Contributor

Original thought: Underlying Cosmos issue - need to confirm. If it's broken-as-intended, at least make an FAQ entry.

@mhofman says that the reporter wasn't doing state sync exports, and even if they were, it's already mitigated.

Validator recovered on restart.

@ivanlei ivanlei assigned ivanlei and unassigned JimLarson Apr 26, 2024

3 participants