Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(consensus): we should panic if finalize block on apply commit fails #966

Merged
merged 3 commits into from
Nov 2, 2024

Conversation

lklimek
Copy link
Collaborator

@lklimek lklimek commented Nov 2, 2024

Issue being fixed or feature implemented

When apply commit fails due to issue in finalize block (abci app returns error), abci client can be closed.
This leaves system in nonoperational state, causing future abci requests to fail with client has stopped errors.

This especially affects nodes that are not in current validator set, as apply commit is used in this case to finalize blocks
committed by the network.

When quorum rotation happens and affected nodes become active validators, they are stuck and cannot vote, leading to
potential chain halt.

What was done?

Panic in case of finalize block failure in apply commit.

Also removed redundant logs.

How Has This Been Tested?

Modified code to fail on finalize block on height 1070. Reproduced issue. Implemented fix to see that node panics, as expected.

Breaking Changes

None

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have made corresponding changes to the documentation

For repository code-owners and collaborators only

  • I have assigned this pull request to a milestone

Summary by CodeRabbit

  • New Features

    • Introduced a method to set the application hash size externally for improved block processing.
  • Bug Fixes

    • Enhanced error handling during block finalization to ensure critical failures are addressed with a panic and restart mechanism.
  • Refactor

    • Streamlined error handling in block execution for more efficient processing.
    • Updated method signatures to improve type safety across mock implementations.

Copy link
Contributor

coderabbitai bot commented Nov 2, 2024

Walkthrough

The changes in this pull request introduce modifications to error handling and method signatures across several files. In internal/consensus/state_apply_commit.go, the Execute method now panics upon encountering an error during block finalization, replacing the previous behavior of returning nil. The internal/state/execution.go file simplifies the execBlockWithoutState function by removing error logging and adds a new method, SetAppHashSize, to the BlockExecutor struct. Additionally, method signatures in mock files have been updated to utilize the new crypto.QuorumHash type.

Changes

File Path Change Summary
internal/consensus/state_apply_commit.go Modified Execute method to panic on errors during block finalization instead of returning nil. Added comments for clarity.
internal/state/execution.go Simplified execBlockWithoutState by removing error logging. Added SetAppHashSize method to BlockExecutor for setting app hash size. Minor comment adjustments made.
dash/core/mocks/client.go Updated QuorumInfo method signature to change quorumHash parameter type from bytes.HexBytes to crypto.QuorumHash.
types/mocks/priv_validator.go Updated multiple method signatures to change quorumHash parameter type from bytes.HexBytes to crypto.QuorumHash, enhancing type safety.

Poem

In the land of code where rabbits play,
Errors now panic, no more delay! 🐇✨
Blocks finalize with a hop and a cheer,
Hash sizes set, clarity near.
With each little change, we grow and we learn,
In the world of commits, it's our turn to churn!


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
internal/consensus/state_apply_commit.go (1)

82-84: Consider adding metrics for monitoring finalize block failures.

To better monitor and alert on these critical failures, consider adding a metric to track finalize block panics.

 		// If something went wrong within ABCI client, it can stop and we can't recover from it.
 		// So, we panic here to ensure that the node will be restarted.
+		c.metrics.FinalizeBlockFailures.Inc()
 		panic(fmt.Errorf("failed to finalize block %X at height %d: %w", block.Hash(), block.Height, err))
internal/state/execution.go (2)

766-766: LGTM! Consider documenting the error handling strategy.

The simplification of error handling aligns with the PR objective. However, since this is a critical consensus function, consider adding a comment explaining that errors are intentionally propagated to the caller for centralized handling.

-	return execBlock(ctx, appConn, block, commit, logger)
+	// Directly propagate errors to caller for centralized handling during consensus
+	return execBlock(ctx, appConn, block, commit, logger)

766-766: Add documentation and validation for SetAppHashSize.

The new method needs documentation explaining its purpose and usage. Also consider adding validation for the size parameter.

+// SetAppHashSize sets the expected size of the application hash.
+// This size is used to validate the hash returned by the ABCI application
+// during block processing. The size must be greater than 0.
 func (blockExec *BlockExecutor) SetAppHashSize(size int) {
+	if size <= 0 {
+		panic(fmt.Sprintf("invalid app hash size: %d", size))
+	}
 	blockExec.appHashSize = size
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f8ebd34 and 34fce4d.

📒 Files selected for processing (2)
  • internal/consensus/state_apply_commit.go (1 hunks)
  • internal/state/execution.go (1 hunks)
🔇 Additional comments (3)
internal/consensus/state_apply_commit.go (3)

82-84: LGTM! The panic implementation aligns with PR objectives.

The change correctly implements panic behavior for finalize block failures, which is essential for preventing nodes from becoming stuck in an unrecoverable state. The error message provides good debugging context by including the block hash and height.


82-84: Verify panic recovery mechanisms.

While the panic implementation is correct, we should ensure proper panic recovery exists in upstream code to handle this gracefully during node restart.

#!/bin/bash
# Search for panic recovery mechanisms in consensus package
rg -A 5 "recover\(\)" internal/consensus/

82-84: Verify finalize method error conditions.

Let's ensure we understand all possible error conditions from the finalize method to validate this panic approach.

✅ Verification successful

Let me analyze the error paths from the execution results and generate a final response.

Panic is appropriate for unrecoverable ABCI client errors

The panic approach is justified here because:

  1. The error occurs in FinalizeBlock which is a critical consensus operation
  2. Error paths from the execution results show that errors at this stage are unrecoverable and indicate serious issues:
    • Invalid app hash size
    • Failed ABCI client responses
    • Consensus state corruption
    • Failed state commits

The node cannot safely continue operation if the ABCI client fails to finalize a block, as this would lead to:

  • Inconsistent application state
  • Inability to make progress in consensus
  • Potential chain halt
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for the finalize method implementation and its error paths
ast-grep --pattern 'func $_ finalize($$$) ($_, error) {
  $$$
}'

# Search for all error returns in finalize implementations
rg -A 3 "return.*err" --type go internal/state/

Length of output: 48537

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
dash/core/mocks/client.go (1)

Line range hint 175-241: Consider updating other quorum-related methods for consistency.

For better consistency across the codebase, consider updating the quorumHash parameter type from bytes.HexBytes to crypto.QuorumHash in the following methods as well:

  • QuorumSign
  • QuorumVerify

This is a safe change since crypto.QuorumHash is an alias of bytes.HexBytes.

types/mocks/priv_validator.go (1)

Inconsistent type usage found in dash/core/mocks/client.go

The file dash/core/mocks/client.go is part of the current PR changes but still uses bytes.HexBytes for quorumHash parameters in its methods:

  • QuorumSign
  • QuorumVerify

This is inconsistent with the type updates in other files where quorumHash parameters have been changed to crypto.QuorumHash. While we know from the learnings that crypto.QuorumHash is an alias of bytes.HexBytes, maintaining consistency across the codebase is important for readability and maintainability.

🔗 Analysis chain

Line range hint 28-274: LGTM! Type updates are consistent with the codebase.

The changes consistently update the quorumHash parameter type from bytes.HexBytes to crypto.QuorumHash across all methods. This is safe because:

  1. According to previous learnings, crypto.QuorumHash is an alias of bytes.HexBytes
  2. The changes are applied consistently across all affected methods
  3. The mock's return type handling is correctly updated to match the new parameter types

Let's verify that all occurrences of quorumHash parameters have been updated:


Based on the shell script results, I notice there are still some files using bytes.HexBytes for quorumHash parameters, particularly in:

  • dash/quorum/selectpeers/dip6.go
  • dash/quorum/selectpeers/dip6_test.go
  • dash/core/mocks/client.go
  • dash/core/client.go

Let's verify if these files are part of the current PR changes:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that all quorumHash parameters use crypto.QuorumHash type
# and there are no remaining bytes.HexBytes usages for quorumHash

# Search for any remaining bytes.HexBytes type used for quorumHash parameters
rg -U 'quorumHash\s+bytes\.HexBytes' || echo "No remaining bytes.HexBytes found for quorumHash parameters"

# Verify all quorumHash parameters use crypto.QuorumHash
rg -U 'quorumHash\s+crypto\.QuorumHash'

Length of output: 14667


Script:

#!/bin/bash
# Check if these files are modified in the current PR
git diff --name-only HEAD^ HEAD | grep -E 'dash/quorum/selectpeers/dip6.go|dash/quorum/selectpeers/dip6_test.go|dash/core/mocks/client.go|dash/core/client.go' || echo "None of these files are modified in the current PR"

Length of output: 188

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 34fce4d and a5cfdfa.

📒 Files selected for processing (3)
  • .github/workflows/lint.yml (1 hunks)
  • dash/core/mocks/client.go (3 hunks)
  • types/mocks/priv_validator.go (15 hunks)
🧰 Additional context used
📓 Learnings (2)
dash/core/mocks/client.go (1)
Learnt from: lklimek
PR: dashpay/tenderdash#955
File: dash/core/mocks/client.go:146-146
Timestamp: 2024-10-16T12:00:25.016Z
Learning: In the Tenderdash codebase, `crypto.QuorumHash` is an alias of `bytes.HexBytes`, so updating method parameter types from `bytes.HexBytes` to `crypto.QuorumHash` does not require changing existing method calls.
types/mocks/priv_validator.go (1)
Learnt from: lklimek
PR: dashpay/tenderdash#955
File: dash/core/mocks/client.go:146-146
Timestamp: 2024-10-16T12:00:25.016Z
Learning: In the Tenderdash codebase, `crypto.QuorumHash` is an alias of `bytes.HexBytes`, so updating method parameter types from `bytes.HexBytes` to `crypto.QuorumHash` does not require changing existing method calls.
🔇 Additional comments (3)
.github/workflows/lint.yml (1)

59-59: Consider the trade-offs of only checking new issues.

While focusing on new issues can reduce noise in PR reviews, it might miss important existing issues that could be relevant to the error handling changes in this PR. Given the critical nature of the changes (introducing panic for finalize block failures), comprehensive linting coverage is important.

Let's verify the impact of this change on linting coverage:

Consider keeping only-new-issues: false for critical changes affecting error handling, or at minimum, run a full lint check locally before merging.

✅ Verification successful

Based on the shell script results, I can now provide a comprehensive analysis of the linting configuration change:

Setting only-new-issues: true is appropriate for this PR

The codebase already has established error handling patterns and panic usage that follow consistent conventions:

  1. Panics are primarily used in:

    • Test code and mocks
    • Initialization/setup failures
    • Unrecoverable states that indicate programming errors
    • Generated protobuf code
  2. Error handling follows good practices:

    • Proper error propagation
    • Descriptive error messages
    • Consistent error checking patterns

The PR's changes introducing panic for finalize block failures align with the existing patterns where panics are used for truly unrecoverable states. The linting configuration to check only new issues is appropriate because:

  1. The existing error handling is well-established and consistent
  2. The PR's changes follow the same patterns
  3. Focusing on new issues will help catch any deviations in the new code while avoiding noise from existing patterns
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for existing linting issues that might be relevant to error handling

# Test 1: Look for existing error handling patterns
echo "Checking existing error handling patterns..."
rg -g '*.go' 'if err != nil|return err|panic\(' --stats

# Test 2: Look for existing panic calls
echo "Checking existing panic calls..."
ast-grep --pattern 'panic($_)'

# Test 3: Check for existing error-related comments that might need attention
echo "Checking error-related TODO/FIXME comments..."
rg -g '*.go' 'TODO.*error|FIXME.*error|TODO.*panic|FIXME.*panic'

Length of output: 305923

dash/core/mocks/client.go (2)

9-10: LGTM: Import statement correctly added.

The new import for the crypto package is necessary for using the crypto.QuorumHash type.


Line range hint 146-171: LGTM: Type updates are consistent.

The changes to use crypto.QuorumHash are implemented consistently throughout the QuorumInfo method, including all type assertions.

@lklimek lklimek enabled auto-merge (squash) November 2, 2024 19:43
Copy link
Member

@shumkov shumkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants