-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(consensus): we should panic if finalize block on apply commit fails #966
Conversation
WalkthroughThe changes in this pull request introduce modifications to error handling and method signatures across several files. In Changes
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
internal/consensus/state_apply_commit.go (1)
82-84
: Consider adding metrics for monitoring finalize block failures.To better monitor and alert on these critical failures, consider adding a metric to track finalize block panics.
// If something went wrong within ABCI client, it can stop and we can't recover from it. // So, we panic here to ensure that the node will be restarted. + c.metrics.FinalizeBlockFailures.Inc() panic(fmt.Errorf("failed to finalize block %X at height %d: %w", block.Hash(), block.Height, err))
internal/state/execution.go (2)
766-766
: LGTM! Consider documenting the error handling strategy.The simplification of error handling aligns with the PR objective. However, since this is a critical consensus function, consider adding a comment explaining that errors are intentionally propagated to the caller for centralized handling.
- return execBlock(ctx, appConn, block, commit, logger) + // Directly propagate errors to caller for centralized handling during consensus + return execBlock(ctx, appConn, block, commit, logger)
766-766
: Add documentation and validation for SetAppHashSize.The new method needs documentation explaining its purpose and usage. Also consider adding validation for the size parameter.
+// SetAppHashSize sets the expected size of the application hash. +// This size is used to validate the hash returned by the ABCI application +// during block processing. The size must be greater than 0. func (blockExec *BlockExecutor) SetAppHashSize(size int) { + if size <= 0 { + panic(fmt.Sprintf("invalid app hash size: %d", size)) + } blockExec.appHashSize = size }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
internal/consensus/state_apply_commit.go
(1 hunks)internal/state/execution.go
(1 hunks)
🔇 Additional comments (3)
internal/consensus/state_apply_commit.go (3)
82-84
: LGTM! The panic implementation aligns with PR objectives.
The change correctly implements panic behavior for finalize block failures, which is essential for preventing nodes from becoming stuck in an unrecoverable state. The error message provides good debugging context by including the block hash and height.
82-84
: Verify panic recovery mechanisms.
While the panic implementation is correct, we should ensure proper panic recovery exists in upstream code to handle this gracefully during node restart.
#!/bin/bash
# Search for panic recovery mechanisms in consensus package
rg -A 5 "recover\(\)" internal/consensus/
82-84
: Verify finalize method error conditions.
Let's ensure we understand all possible error conditions from the finalize method to validate this panic approach.
✅ Verification successful
Let me analyze the error paths from the execution results and generate a final response.
Panic is appropriate for unrecoverable ABCI client errors
The panic approach is justified here because:
- The error occurs in
FinalizeBlock
which is a critical consensus operation - Error paths from the execution results show that errors at this stage are unrecoverable and indicate serious issues:
- Invalid app hash size
- Failed ABCI client responses
- Consensus state corruption
- Failed state commits
The node cannot safely continue operation if the ABCI client fails to finalize a block, as this would lead to:
- Inconsistent application state
- Inability to make progress in consensus
- Potential chain halt
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for the finalize method implementation and its error paths
ast-grep --pattern 'func $_ finalize($$$) ($_, error) {
$$$
}'
# Search for all error returns in finalize implementations
rg -A 3 "return.*err" --type go internal/state/
Length of output: 48537
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
dash/core/mocks/client.go (1)
Line range hint
175-241
: Consider updating other quorum-related methods for consistency.For better consistency across the codebase, consider updating the
quorumHash
parameter type frombytes.HexBytes
tocrypto.QuorumHash
in the following methods as well:
QuorumSign
QuorumVerify
This is a safe change since
crypto.QuorumHash
is an alias ofbytes.HexBytes
.types/mocks/priv_validator.go (1)
Inconsistent type usage found in
dash/core/mocks/client.go
The file
dash/core/mocks/client.go
is part of the current PR changes but still usesbytes.HexBytes
forquorumHash
parameters in its methods:
QuorumSign
QuorumVerify
This is inconsistent with the type updates in other files where
quorumHash
parameters have been changed tocrypto.QuorumHash
. While we know from the learnings thatcrypto.QuorumHash
is an alias ofbytes.HexBytes
, maintaining consistency across the codebase is important for readability and maintainability.🔗 Analysis chain
Line range hint
28-274
: LGTM! Type updates are consistent with the codebase.The changes consistently update the
quorumHash
parameter type frombytes.HexBytes
tocrypto.QuorumHash
across all methods. This is safe because:
- According to previous learnings,
crypto.QuorumHash
is an alias ofbytes.HexBytes
- The changes are applied consistently across all affected methods
- The mock's return type handling is correctly updated to match the new parameter types
Let's verify that all occurrences of
quorumHash
parameters have been updated:
Based on the shell script results, I notice there are still some files using
bytes.HexBytes
forquorumHash
parameters, particularly in:
dash/quorum/selectpeers/dip6.go
dash/quorum/selectpeers/dip6_test.go
dash/core/mocks/client.go
dash/core/client.go
Let's verify if these files are part of the current PR changes:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that all quorumHash parameters use crypto.QuorumHash type # and there are no remaining bytes.HexBytes usages for quorumHash # Search for any remaining bytes.HexBytes type used for quorumHash parameters rg -U 'quorumHash\s+bytes\.HexBytes' || echo "No remaining bytes.HexBytes found for quorumHash parameters" # Verify all quorumHash parameters use crypto.QuorumHash rg -U 'quorumHash\s+crypto\.QuorumHash'Length of output: 14667
Script:
#!/bin/bash # Check if these files are modified in the current PR git diff --name-only HEAD^ HEAD | grep -E 'dash/quorum/selectpeers/dip6.go|dash/quorum/selectpeers/dip6_test.go|dash/core/mocks/client.go|dash/core/client.go' || echo "None of these files are modified in the current PR"Length of output: 188
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
.github/workflows/lint.yml
(1 hunks)dash/core/mocks/client.go
(3 hunks)types/mocks/priv_validator.go
(15 hunks)
🧰 Additional context used
📓 Learnings (2)
dash/core/mocks/client.go (1)
Learnt from: lklimek
PR: dashpay/tenderdash#955
File: dash/core/mocks/client.go:146-146
Timestamp: 2024-10-16T12:00:25.016Z
Learning: In the Tenderdash codebase, `crypto.QuorumHash` is an alias of `bytes.HexBytes`, so updating method parameter types from `bytes.HexBytes` to `crypto.QuorumHash` does not require changing existing method calls.
types/mocks/priv_validator.go (1)
Learnt from: lklimek
PR: dashpay/tenderdash#955
File: dash/core/mocks/client.go:146-146
Timestamp: 2024-10-16T12:00:25.016Z
Learning: In the Tenderdash codebase, `crypto.QuorumHash` is an alias of `bytes.HexBytes`, so updating method parameter types from `bytes.HexBytes` to `crypto.QuorumHash` does not require changing existing method calls.
🔇 Additional comments (3)
.github/workflows/lint.yml (1)
59-59
: Consider the trade-offs of only checking new issues.
While focusing on new issues can reduce noise in PR reviews, it might miss important existing issues that could be relevant to the error handling changes in this PR. Given the critical nature of the changes (introducing panic for finalize block failures), comprehensive linting coverage is important.
Let's verify the impact of this change on linting coverage:
Consider keeping only-new-issues: false
for critical changes affecting error handling, or at minimum, run a full lint check locally before merging.
✅ Verification successful
Based on the shell script results, I can now provide a comprehensive analysis of the linting configuration change:
Setting only-new-issues: true
is appropriate for this PR
The codebase already has established error handling patterns and panic usage that follow consistent conventions:
-
Panics are primarily used in:
- Test code and mocks
- Initialization/setup failures
- Unrecoverable states that indicate programming errors
- Generated protobuf code
-
Error handling follows good practices:
- Proper error propagation
- Descriptive error messages
- Consistent error checking patterns
The PR's changes introducing panic for finalize block failures align with the existing patterns where panics are used for truly unrecoverable states. The linting configuration to check only new issues is appropriate because:
- The existing error handling is well-established and consistent
- The PR's changes follow the same patterns
- Focusing on new issues will help catch any deviations in the new code while avoiding noise from existing patterns
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for existing linting issues that might be relevant to error handling
# Test 1: Look for existing error handling patterns
echo "Checking existing error handling patterns..."
rg -g '*.go' 'if err != nil|return err|panic\(' --stats
# Test 2: Look for existing panic calls
echo "Checking existing panic calls..."
ast-grep --pattern 'panic($_)'
# Test 3: Check for existing error-related comments that might need attention
echo "Checking error-related TODO/FIXME comments..."
rg -g '*.go' 'TODO.*error|FIXME.*error|TODO.*panic|FIXME.*panic'
Length of output: 305923
dash/core/mocks/client.go (2)
9-10
: LGTM: Import statement correctly added.
The new import for the crypto
package is necessary for using the crypto.QuorumHash
type.
Line range hint 146-171
: LGTM: Type updates are consistent.
The changes to use crypto.QuorumHash
are implemented consistently throughout the QuorumInfo
method, including all type assertions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Issue being fixed or feature implemented
When apply commit fails due to issue in finalize block (abci app returns error), abci client can be closed.
This leaves system in nonoperational state, causing future abci requests to fail with
client has stopped
errors.This especially affects nodes that are not in current validator set, as apply commit is used in this case to finalize blocks
committed by the network.
When quorum rotation happens and affected nodes become active validators, they are stuck and cannot vote, leading to
potential chain halt.
What was done?
Panic in case of finalize block failure in apply commit.
Also removed redundant logs.
How Has This Been Tested?
Modified code to fail on finalize block on height 1070. Reproduced issue. Implemented fix to see that node panics, as expected.
Breaking Changes
None
Checklist:
For repository code-owners and collaborators only
Summary by CodeRabbit
New Features
Bug Fixes
Refactor