fix(consensus): we should panic if finalize block on apply commit fails #966

lklimek · 2024-11-02T18:55:45Z

Issue being fixed or feature implemented

When apply commit fails because of an error in finalize block (the ABCI app returns an error), the ABCI client can be closed.
This leaves the system in a non-operational state, causing future ABCI requests to fail with `client has stopped` errors.

This especially affects nodes that are not in the current validator set, because in that case apply commit is used to finalize blocks
committed by the network.

When quorum rotation happens and the affected nodes become active validators, they are stuck and cannot vote, which can lead to
a chain halt.

What was done?

Panic in case of finalize block failure in apply commit.

Also removed redundant logs.
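
A minimal sketch of the panic-on-failure behavior described above, not the actual Tenderdash code. The names `finalizeFunc`, `applyCommit`, `alwaysFail`, and `alwaysOK` are hypothetical stand-ins for the real ABCI call and its callers. The `recover` in `main` exists only so the demo prints the panic value; a real node would be expected to crash and be restarted.

```go
package main

import (
	"errors"
	"fmt"
)

// finalizeFunc is a hypothetical stand-in for the call that asks the
// ABCI app to finalize a block at the given height.
type finalizeFunc func(height int64) error

// applyCommit panics when finalize fails instead of swallowing the error,
// so the process stops rather than continuing with a stopped ABCI client.
func applyCommit(height int64, finalize finalizeFunc) {
	if err := finalize(height); err != nil {
		panic(fmt.Errorf("failed to finalize block at height %d: %w", height, err))
	}
}

// alwaysFail simulates an ABCI app that returns an error (assumption for the demo).
func alwaysFail(int64) error { return errors.New("abci app returned error") }

// alwaysOK simulates a successful finalize block call.
func alwaysOK(int64) error { return nil }

func main() {
	// Demo only: recover so the panic value can be printed.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r)
		}
	}()
	applyCommit(1069, alwaysOK)   // succeeds silently
	applyCommit(1070, alwaysFail) // panics, mirroring the test at height 1070
}
```

Panicking here trades availability of a single node for a clean restart, which avoids the silently stuck state described in the issue.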

How Has This Been Tested?

Modified the code to fail in finalize block at height 1070 and reproduced the issue. Then applied the fix and confirmed that the node panics, as expected.

Breaking Changes

None

Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have added or updated relevant unit/integration/functional/e2e tests
I have made corresponding changes to the documentation

For repository code-owners and collaborators only

I have assigned this pull request to a milestone

Summary by CodeRabbit

New Features
- Introduced a method to set the application hash size externally for improved block processing.
Bug Fixes
- Enhanced error handling during block finalization to ensure critical failures are addressed with a panic and restart mechanism.
Refactor
- Streamlined error handling in block execution for more efficient processing.
- Updated method signatures to improve type safety across mock implementations.

coderabbitai · 2024-11-02T18:55:52Z

Walkthrough

The changes in this pull request introduce modifications to error handling and method signatures across several files. In internal/consensus/state_apply_commit.go, the Execute method now panics upon encountering an error during block finalization, replacing the previous behavior of returning nil. The internal/state/execution.go file simplifies the execBlockWithoutState function by removing error logging and adds a new method, SetAppHashSize, to the BlockExecutor struct. Additionally, method signatures in mock files have been updated to utilize the new crypto.QuorumHash type.

Changes

| File Path | Change Summary |
| --- | --- |
| internal/consensus/state_apply_commit.go | Modified `Execute` method to panic on errors during block finalization instead of returning `nil`. Added comments for clarity. |
| internal/state/execution.go | Simplified `execBlockWithoutState` by removing error logging. Added `SetAppHashSize` method to `BlockExecutor` for setting app hash size. Minor comment adjustments made. |
| dash/core/mocks/client.go | Updated `QuorumInfo` method signature to change `quorumHash` parameter type from `bytes.HexBytes` to `crypto.QuorumHash`. |
| types/mocks/priv_validator.go | Updated multiple method signatures to change `quorumHash` parameter type from `bytes.HexBytes` to `crypto.QuorumHash`, enhancing type safety. |

Poem

In the land of code where rabbits play,
Errors now panic, no more delay! 🐇✨
Blocks finalize with a hop and a cheer,
Hash sizes set, clarity near.
With each little change, we grow and we learn,
In the world of commits, it's our turn to churn!

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)

internal/consensus/state_apply_commit.go (1)
82-84: Consider adding metrics for monitoring finalize block failures.

To better monitor and alert on these critical failures, consider adding a metric to track finalize block panics.
 		// If something went wrong within ABCI client, it can stop and we can't recover from it.
 		// So, we panic here to ensure that the node will be restarted.
+		c.metrics.FinalizeBlockFailures.Inc()
 		panic(fmt.Errorf("failed to finalize block %X at height %d: %w", block.Hash(), block.Height, err))
internal/state/execution.go (2)
766-766: LGTM! Consider documenting the error handling strategy.

The simplification of error handling aligns with the PR objective. However, since this is a critical consensus function, consider adding a comment explaining that errors are intentionally propagated to the caller for centralized handling.
-	return execBlock(ctx, appConn, block, commit, logger)
+	// Directly propagate errors to caller for centralized handling during consensus
+	return execBlock(ctx, appConn, block, commit, logger)
766-766: Add documentation and validation for SetAppHashSize.

The new method needs documentation explaining its purpose and usage. Also consider adding validation for the size parameter.
+// SetAppHashSize sets the expected size of the application hash.
+// This size is used to validate the hash returned by the ABCI application
+// during block processing. The size must be greater than 0.
 func (blockExec *BlockExecutor) SetAppHashSize(size int) {
+	if size <= 0 {
+		panic(fmt.Sprintf("invalid app hash size: %d", size))
+	}
 	blockExec.appHashSize = size
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f8ebd34 and 34fce4d.

📒 Files selected for processing (2)

internal/consensus/state_apply_commit.go (1 hunks)
internal/state/execution.go (1 hunks)

🔇 Additional comments (3)

internal/consensus/state_apply_commit.go (3)

82-84: LGTM! The panic implementation aligns with PR objectives.

The change correctly implements panic behavior for finalize block failures, which is essential for preventing nodes from becoming stuck in an unrecoverable state. The error message provides good debugging context by including the block hash and height.

82-84: Verify panic recovery mechanisms.

While the panic implementation is correct, we should ensure proper panic recovery exists in upstream code to handle this gracefully during node restart.

#!/bin/bash
# Search for panic recovery mechanisms in consensus package
rg -A 5 "recover\(\)" internal/consensus/

82-84: Verify finalize method error conditions.

Let's ensure we understand all possible error conditions from the finalize method to validate this panic approach.

✅ Verification successful

Let me analyze the error paths from the execution results and generate a final response.

Panic is appropriate for unrecoverable ABCI client errors

The panic approach is justified here because:

- The error occurs in FinalizeBlock which is a critical consensus operation
- Error paths from the execution results show that errors at this stage are unrecoverable and indicate serious issues:
  - Invalid app hash size
  - Failed ABCI client responses
  - Consensus state corruption
  - Failed state commits
- The node cannot safely continue operation if the ABCI client fails to finalize a block, as this would lead to:
  - Inconsistent application state
  - Inability to make progress in consensus
  - Potential chain halt

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for the finalize method implementation and its error paths
ast-grep --pattern 'func $_ finalize($$$) ($_, error) {
  $$$
}'

# Search for all error returns in finalize implementations
rg -A 3 "return.*err" --type go internal/state/

Length of output: 48537

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

dash/core/mocks/client.go (1)

Line range hint 175-241: Consider updating other quorum-related methods for consistency.

For better consistency across the codebase, consider updating the quorumHash parameter type from bytes.HexBytes to crypto.QuorumHash in the following methods as well:

QuorumSign

QuorumVerify

This is a safe change since crypto.QuorumHash is an alias of bytes.HexBytes.
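
A small sketch of why such a parameter type change is source-compatible when one type is an alias of the other. The types below are local stand-ins declared for illustration, on the assumption (stated in the review) that `crypto.QuorumHash` is a type alias of `bytes.HexBytes`. `quorumInfo` is a hypothetical function, not the real mock method.

```go
package main

import "fmt"

// HexBytes stands in for bytes.HexBytes (assumption: a byte-slice type).
type HexBytes []byte

// QuorumHash is declared as a type alias, mirroring what the review
// states about crypto.QuorumHash. An alias is the same type under another name.
type QuorumHash = HexBytes

// quorumInfo is a hypothetical function whose parameter type was changed
// from HexBytes to QuorumHash; it returns the hash length for the demo.
func quorumInfo(quorumHash QuorumHash) int {
	return len(quorumHash)
}

func main() {
	// An existing caller that still holds a HexBytes value compiles unchanged.
	old := HexBytes{0xde, 0xad, 0xbe, 0xef}
	fmt.Println(quorumInfo(old))

	// Assignment in either direction needs no conversion.
	var q QuorumHash = old
	var back HexBytes = q
	fmt.Println(quorumInfo(back))
}
```

Had `QuorumHash` been a defined type (`type QuorumHash HexBytes`) rather than an alias, the calls above would need explicit conversions.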
types/mocks/priv_validator.go (1)
Inconsistent type usage found in dash/core/mocks/client.go

The file dash/core/mocks/client.go is part of the current PR changes but still uses bytes.HexBytes for quorumHash parameters in its methods:

QuorumSign

QuorumVerify

This is inconsistent with the type updates in other files where quorumHash parameters have been changed to crypto.QuorumHash. While we know from the learnings that crypto.QuorumHash is an alias of bytes.HexBytes, maintaining consistency across the codebase is important for readability and maintainability.

🔗 Analysis chain

Line range hint 28-274: LGTM! Type updates are consistent with the codebase.

The changes consistently update the quorumHash parameter type from bytes.HexBytes to crypto.QuorumHash across all methods. This is safe because:

According to previous learnings, crypto.QuorumHash is an alias of bytes.HexBytes

The changes are applied consistently across all affected methods

The mock's return type handling is correctly updated to match the new parameter types

Let's verify that all occurrences of quorumHash parameters have been updated:

Based on the shell script results, I notice there are still some files using bytes.HexBytes for quorumHash parameters, particularly in:

dash/quorum/selectpeers/dip6.go

dash/quorum/selectpeers/dip6_test.go

dash/core/mocks/client.go

dash/core/client.go

Let's verify if these files are part of the current PR changes:
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that all quorumHash parameters use crypto.QuorumHash type
# and there are no remaining bytes.HexBytes usages for quorumHash

# Search for any remaining bytes.HexBytes type used for quorumHash parameters
rg -U 'quorumHash\s+bytes\.HexBytes' || echo "No remaining bytes.HexBytes found for quorumHash parameters"

# Verify all quorumHash parameters use crypto.QuorumHash
rg -U 'quorumHash\s+crypto\.QuorumHash'
Length of output: 14667

Script:
#!/bin/bash
# Check if these files are modified in the current PR
git diff --name-only HEAD^ HEAD | grep -E 'dash/quorum/selectpeers/dip6.go|dash/quorum/selectpeers/dip6_test.go|dash/core/mocks/client.go|dash/core/client.go' || echo "None of these files are modified in the current PR"
Length of output: 188

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 34fce4d and a5cfdfa.

📒 Files selected for processing (3)

.github/workflows/lint.yml (1 hunks)
dash/core/mocks/client.go (3 hunks)
types/mocks/priv_validator.go (15 hunks)

🧰 Additional context used

📓 Learnings (2)

dash/core/mocks/client.go (1)

Learnt from: lklimek
PR: dashpay/tenderdash#955
File: dash/core/mocks/client.go:146-146
Timestamp: 2024-10-16T12:00:25.016Z
Learning: In the Tenderdash codebase, `crypto.QuorumHash` is an alias of `bytes.HexBytes`, so updating method parameter types from `bytes.HexBytes` to `crypto.QuorumHash` does not require changing existing method calls.

types/mocks/priv_validator.go (1)

Learnt from: lklimek
PR: dashpay/tenderdash#955
File: dash/core/mocks/client.go:146-146
Timestamp: 2024-10-16T12:00:25.016Z
Learning: In the Tenderdash codebase, `crypto.QuorumHash` is an alias of `bytes.HexBytes`, so updating method parameter types from `bytes.HexBytes` to `crypto.QuorumHash` does not require changing existing method calls.

🔇 Additional comments (3)

.github/workflows/lint.yml (1)

59-59: Consider the trade-offs of only checking new issues.

While focusing on new issues can reduce noise in PR reviews, it might miss important existing issues that could be relevant to the error handling changes in this PR. Given the critical nature of the changes (introducing panic for finalize block failures), comprehensive linting coverage is important.

Let's verify the impact of this change on linting coverage:

Consider keeping only-new-issues: false for critical changes affecting error handling, or at minimum, run a full lint check locally before merging.

✅ Verification successful

Based on the shell script results, I can now provide a comprehensive analysis of the linting configuration change:

Setting only-new-issues: true is appropriate for this PR

The codebase already has established error handling patterns and panic usage that follow consistent conventions:

Panics are primarily used in:
- Test code and mocks
- Initialization/setup failures
- Unrecoverable states that indicate programming errors
- Generated protobuf code
Error handling follows good practices:
- Proper error propagation
- Descriptive error messages
- Consistent error checking patterns

The PR's changes introducing panic for finalize block failures align with the existing patterns where panics are used for truly unrecoverable states. The linting configuration to check only new issues is appropriate because:

- The existing error handling is well-established and consistent
- The PR's changes follow the same patterns
- Focusing on new issues will help catch any deviations in the new code while avoiding noise from existing patterns

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for existing linting issues that might be relevant to error handling

# Test 1: Look for existing error handling patterns
echo "Checking existing error handling patterns..."
rg -g '*.go' 'if err != nil|return err|panic\(' --stats

# Test 2: Look for existing panic calls
echo "Checking existing panic calls..."
ast-grep --pattern 'panic($_)'

# Test 3: Check for existing error-related comments that might need attention
echo "Checking error-related TODO/FIXME comments..."
rg -g '*.go' 'TODO.*error|FIXME.*error|TODO.*panic|FIXME.*panic'

Length of output: 305923

dash/core/mocks/client.go (2)

9-10: LGTM: Import statement correctly added.

The new import for the crypto package is necessary for using the crypto.QuorumHash type.

Line range hint 146-171: LGTM: Type updates are consistent.

The changes to use crypto.QuorumHash are implemented consistently throughout the QuorumInfo method, including all type assertions.

shumkov

👍

fix(consensus): we should panic on apply commit

34fce4d

lklimek requested review from QuantumExplorer and shumkov as code owners November 2, 2024 18:55

coderabbitai bot reviewed Nov 2, 2024

lklimek added 2 commits November 2, 2024 20:28

build(github): lint only new issues

c9910b8

chore: update mocks

a5cfdfa

coderabbitai bot reviewed Nov 2, 2024

lklimek enabled auto-merge (squash) November 2, 2024 19:43

shumkov approved these changes Nov 2, 2024

lklimek merged commit 71764d5 into v1.3-dev Nov 2, 2024
19 checks passed

lklimek deleted the fix/non-validator-finalize-failure branch November 2, 2024 19:46

lklimek added this to the v1.3 milestone Nov 2, 2024

coderabbitai bot mentioned this pull request Nov 2, 2024

fix(rpc): validators endpoint fail during quorum rotation #959

Merged

5 tasks

This was referenced Nov 4, 2024

fix(drive): uncommitted state if db transaction fails dashpay/platform#2305

Merged

fix(drive): apply batch is not using transaction in remove_all_votes_given_by_identities dashpay/platform#2309

Merged

coderabbitai bot mentioned this pull request Nov 4, 2024

build(deps): replace tendermint/tm-db with cometbft/cometbft-db #973

Merged

5 tasks
