Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Node) Enhance raft test codes #2289

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nick-bisonai
Copy link
Collaborator

@nick-bisonai nick-bisonai commented Sep 27, 2024

Description

Minor updates

  • add some properties and functions for testing purposes in raft code
  • add more test codes which simulates certain situation while running raft cluster
  • remove resignLeader channel and unified logic into the function

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.

Deployment

  • Should publish npm package
  • Should publish Docker image

@nick-bisonai nick-bisonai self-assigned this Sep 27, 2024
Copy link
Contributor

coderabbitai bot commented Sep 27, 2024

📝 Walkthrough

Walkthrough

The changes involve significant modifications to the Raft consensus algorithm implementation in the node/pkg/raft package. Key updates include the removal of the Resign channel, enhancements to support a blacklist mechanism for nodes, and the addition of an isDelayedNode feature. The testing framework has also been improved, allowing for dynamic node configuration and introducing functions for isolating and recovering nodes. These changes streamline the leadership transition process and enhance testing scenarios for better validation of Raft behaviors.

Changes

Files Change Summary
node/pkg/raft/raft.go Removed Resign channel from Raft struct; updated ResignLeader and becomeLeader methods; added blacklist functionality and methods for managing it; introduced isDelayedNode flag and related methods.
node/pkg/raft/raft_test.go Added isolateNode and recoverNode functions; modified setup function for dynamic node count; updated test cases to include scenarios with two nodes; improved logging statements.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant RaftNode
    participant Blacklist

    Client->>RaftNode: Send Message
    RaftNode->>Blacklist: Check if sender is blacklisted
    alt Sender is blacklisted
        Blacklist-->>RaftNode: Return true
        RaftNode-->>Client: Ignore message
    else Sender is not blacklisted
        Blacklist-->>RaftNode: Return false
        RaftNode-->>Client: Process message
    end
Loading

🐇 In the realm of Raft, we hop with glee,
With blacklists and delays, oh what a spree!
Leadership flows like a dance in the night,
Testing our nodes, making sure they’re just right.
So let’s celebrate this code, so bright!
Hopping along, everything feels just right! 🌟


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@nick-bisonai nick-bisonai marked this pull request as ready for review September 27, 2024 08:46
@nick-bisonai nick-bisonai requested a review from a team as a code owner September 27, 2024 08:46
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (6)
node/pkg/raft/raft_test.go (6)

72-79: Handle potential errors in isolateNode function

In the isolateNode function, you are calling entry.addBlacklist(node.GetHostId()) without checking for errors. Consider handling any potential errors returned by addBlacklist to ensure robustness and facilitate debugging.

Apply this diff to handle errors:

 for _, entry := range nodes {
     if entry.GetHostId() == node.GetHostId() {
         continue
     }
-    entry.addBlacklist(node.GetHostId())
+    if err := entry.addBlacklist(node.GetHostId()); err != nil {
+        log.Error().Err(err).Msgf("Failed to isolate node %s", node.GetHostId())
+    }
 }

81-88: Handle potential errors in recoverNode function

Similarly, in the recoverNode function, you're calling entry.removeBlacklist(node.GetHostId()) without error handling. It's good practice to handle or log any errors to assist with troubleshooting.

Apply this diff to handle errors:

 for _, entry := range nodes {
     if entry.GetHostId() == node.GetHostId() {
         continue
     }
-    entry.removeBlacklist(node.GetHostId())
+    if err := entry.removeBlacklist(node.GetHostId()); err != nil {
+        log.Error().Err(err).Msgf("Failed to recover node %s", node.GetHostId())
+    }
 }

90-90: Update function comments to reflect new parameter

The setup function signature now includes numOfNodes as a parameter. Updating the function's comment to include this parameter will improve code clarity and maintainability.


161-162: Update documentation for setupRaftCluster

With the addition of numOfNodes to the setupRaftCluster function, ensure that any associated comments or documentation are updated to reflect this change for consistency and clarity.


310-313: Add cleanup for newNode.Topic in tests

In your test case, after creating newNode, you defer newNode.Host.Close() but don't close newNode.Topic. To ensure all resources are released, consider closing the topic as well.

Apply this diff:

 defer newNode.Host.Close()
+defer newNode.Topic.Close()

865-866: Ensure consistency in log messages across tests

In TestRaft_MessageDelay_2_nodes, you correctly log "Ready to test, setup 2 nodes". Review other test functions to ensure all log messages accurately reflect the number of nodes involved, enhancing clarity during test executions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between fd3d9ae and 4735e5e.

📒 Files selected for processing (3)
  • node/pkg/raft/raft.go (6 hunks)
  • node/pkg/raft/raft_test.go (10 hunks)
  • node/pkg/raft/types.go (1 hunks)
🔇 Additional comments (4)
node/pkg/raft/types.go (1)

77-79: Consider separating test-specific fields and clarify changes

  1. The addition of test-specific fields (blacklist and isDelayedNode) to the main Raft struct might not be ideal:

    • It could lead to confusion in production code.
    • It might unnecessarily increase the memory footprint in non-test scenarios.

    Consider moving these fields to a separate test-specific struct or using dependency injection for testing.

  2. Please clarify the purpose and usage of the new fields:

    • What is the blacklist used for in the context of Raft testing?
    • How is the isDelayedNode flag utilized in test scenarios?
  3. The AI summary mentions the removal of the Resign chan interface{} field. Could you please explain the implications of this removal? How will the node step down from leadership without this channel?

To verify the usage and impact of these changes, please run the following script:

This script will help us understand how these new fields are used and if there are any remaining references to the removed Resign channel.

✅ Verification successful

Verification Complete: Changes Confirmed

  1. Test-Specific Fields:

    • The blacklist and isDelayedNode fields are utilized within raft.go and types.go for testing purposes, aligning with their intended use.
  2. Resign Channel Removal:

    • The Resign chan interface{} has been successfully removed, with no remaining references found in the codebase.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for usage of new fields and removed Resign channel

# Check for blacklist usage
echo "Checking blacklist usage:"
rg --type go 'blacklist' ./node/pkg/raft

# Check for isDelayedNode usage
echo "Checking isDelayedNode usage:"
rg --type go 'isDelayedNode' ./node/pkg/raft

# Check for Resign channel usage (to verify removal impact)
echo "Checking Resign channel usage:"
rg --type go 'Resign\s+chan\s+interface{}' ./node/pkg/raft

Length of output: 1113


Script:

#!/bin/bash
# Description: Check for usage of new fields and removed Resign channel

# Check for blacklist usage
echo "Checking blacklist usage:"
rg --type go 'blacklist' ./node/pkg/raft

# Check for isDelayedNode usage
echo "Checking isDelayedNode usage:"
rg --type go 'isDelayedNode' ./node/pkg/raft

# Check for Resign channel usage (corrected regex)
echo "Checking Resign channel usage:"
rg --type go 'Resign\s+chan\s+interface\{\}' ./node/pkg/raft

Length of output: 979

node/pkg/raft/raft_test.go (3)

59-61: Verify term comparison logic when only one node is present

In the allNodesHaveSameTermMin function, when there's only one node, you're checking if its current term is greater than min. Ensure that this condition correctly represents the intended behavior in scenarios with a single node, and that it doesn't cause any unintended side effects in your tests.


799-801: Verify network partition behavior in a two-node cluster

With only two nodes in TestRaft_NetworkPartitionAndRecovery_2_nodes, a network partition can lead to a split-brain scenario where neither node can form a majority. Verify that your Raft implementation correctly handles this case, and the test adequately assesses the cluster's behavior under partition.


743-743: Re-evaluate the necessity of the assertion on newList length

In TestRaft_LeaderDisconnect_2_nodes, you assert that len(newList) equals 1 after removing the old leader. Confirm that this assertion is required for the test and accurately reflects the expected number of nodes remaining in the cluster.

Comment on lines +46 to +47
blacklist: make(map[string]struct{}),
isDelayedNode: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Potential Data Races on 'blacklist' and 'isDelayedNode'

The fields blacklist and isDelayedNode are accessed by multiple goroutines without adequate synchronization, which could lead to data races and panics.

To fix this issue, ensure that all accesses to blacklist and isDelayedNode are protected by the mutex r.Mutex. For example, modify the handleMessage and PublishMessage methods to acquire the lock when accessing these fields.

Comment on lines +451 to +470
func (r *Raft) addBlacklist(id string) {
// used for testing purposes
r.Mutex.Lock()
defer r.Mutex.Unlock()
r.blacklist[id] = struct{}{}
}

func (r *Raft) removeBlacklist(id string) {
// used for testing purposes
r.Mutex.Lock()
defer r.Mutex.Unlock()
delete(r.blacklist, id)
}

func (r *Raft) setDelayedNode(isDelayed bool) {
// used for testing purposes
r.Mutex.Lock()
defer r.Mutex.Unlock()
r.isDelayedNode = isDelayed
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Embedding Testing Code into Production Code

The methods addBlacklist, removeBlacklist, and setDelayedNode are added for testing purposes but are part of the production codebase. Including testing utilities in production code can lead to maintenance challenges and potential misuse.

Consider refactoring to separate testing utilities from production code. Options include:

  • Moving these methods to a build tag dedicated to testing.
  • Using interfaces or dependency injection to mock behaviors in tests without altering the production codebase.
  • Placing testing code within a _test.go file to ensure it's only included during testing.

This separation enhances code maintainability and prevents testing code from impacting production.

Comment on lines +249 to +251
if r.isDelayedNode {
time.Sleep(500 * time.Millisecond)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Unsafe Access to 'isDelayedNode' in 'PublishMessage'

The isDelayedNode flag is accessed without synchronization in PublishMessage, which can result in data races when setDelayedNode is called concurrently.

Apply this diff to synchronize access to isDelayedNode:

 func (r *Raft) PublishMessage(ctx context.Context, msg Message) error {
     data, err := json.Marshal(msg)
     if err != nil {
         return err
     }

+    r.Mutex.Lock()
     if r.isDelayedNode {
         time.Sleep(500 * time.Millisecond)
     }
+    r.Mutex.Unlock()

     return r.Topic.Publish(ctx, data)
 }

Ensure all accesses to isDelayedNode are properly synchronized.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if r.isDelayedNode {
time.Sleep(500 * time.Millisecond)
}
r.Mutex.Lock()
if r.isDelayedNode {
time.Sleep(500 * time.Millisecond)
}
r.Mutex.Unlock()

Comment on lines +147 to 149
if r.Role != Leader {
r.startElectionTimer()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Unnecessary Election Timer Reset in Follower State

In handleHeartbeat, the election timer is restarted even if the node is not the leader. Restarting the election timer upon receiving a heartbeat as a follower may not be necessary and could interfere with the expected timing of elections.

Review the logic to ensure the election timer is only restarted when appropriate. If the node is a follower and receives a heartbeat, the election timer should already be managed correctly. Consider modifying the code as follows:

 func (r *Raft) handleHeartbeat(msg Message) error {
     // existing code...

-    if r.Role != Leader {
+    if r.Role == Candidate {
         r.startElectionTimer()
     }

     // existing code...
 }

This change ensures that only candidates reset their election timers upon receiving a heartbeat, aligning with typical Raft protocol behavior.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if r.Role != Leader {
r.startElectionTimer()
}
if r.Role == Candidate {
r.startElectionTimer()
}

Comment on lines +326 to +332
r.Role = Follower
r.LeaderID = ""

r.HeartbeatTicker.Stop()
r.LeaderJobTicker.Stop()

r.startElectionTimer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Nil Pointer Dereference Risk when Stopping Tickers

In ResignLeader, calling Stop() on r.HeartbeatTicker and r.LeaderJobTicker without checking if they are initialized may lead to nil pointer dereference if these tickers are nil.

Apply this diff to add nil checks before stopping the tickers:

 func (r *Raft) ResignLeader() {
     r.Role = Follower
     r.LeaderID = ""

-    r.HeartbeatTicker.Stop()
-    r.LeaderJobTicker.Stop()
+    if r.HeartbeatTicker != nil {
+        r.HeartbeatTicker.Stop()
+    }
+    if r.LeaderJobTicker != nil {
+        r.LeaderJobTicker.Stop()
+    }

     r.startElectionTimer()
 }

This ensures that Stop() is only called on initialized tickers.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
r.Role = Follower
r.LeaderID = ""
r.HeartbeatTicker.Stop()
r.LeaderJobTicker.Stop()
r.startElectionTimer()
func (r *Raft) ResignLeader() {
r.Role = Follower
r.LeaderID = ""
if r.HeartbeatTicker != nil {
r.HeartbeatTicker.Stop()
}
if r.LeaderJobTicker != nil {
r.LeaderJobTicker.Stop()
}
r.startElectionTimer()
}

Comment on lines +107 to +111
if _, ok := r.blacklist[msg.SentFrom]; ok {
// blacklist for testing to simulate network partition
return nil
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Unsafe Access to 'blacklist' in 'handleMessage'

In the handleMessage method, the access to r.blacklist is not protected by a mutex, leading to potential concurrent map read and write, which can cause runtime panics.

Apply this diff to safely access blacklist:

 func (r *Raft) handleMessage(ctx context.Context, msg Message) error {
+    r.Mutex.Lock()
     if _, ok := r.blacklist[msg.SentFrom]; ok {
         // blacklist for testing to simulate network partition
+        r.Mutex.Unlock()
         return nil
     }
+    r.Mutex.Unlock()

Alternatively, create a helper function that safely checks the blacklist with proper synchronization.

Committable suggestion was skipped due to low confidence.

Comment on lines +785 to +786
log.Info().Msg("Ready to test, setup 3 nodes")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Correct log message to match the test setup

In TestRaft_NetworkPartitionAndRecovery_2_nodes, the log message states "Ready to test, setup 3 nodes", but the test is set up with 2 nodes. Updating the log message will prevent confusion.

Apply this diff to fix the log message:

-	log.Info().Msg("Ready to test, setup 3 nodes")
+	log.Info().Msg("Ready to test, setup 2 nodes")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
log.Info().Msg("Ready to test, setup 3 nodes")
log.Info().Msg("Ready to test, setup 2 nodes")

Comment on lines +170 to +193
func joinNewNode(ctx context.Context, testItems *TestItems) (*Raft, error) {
newHost, err := libp2psetup.NewHost(ctx)
if err != nil {
return nil, err
}

newPs, err := libp2psetup.MakePubsub(ctx, newHost)
if err != nil {
return nil, err
}

for _, host := range testItems.Hosts {
newHost.Connect(ctx, peer.AddrInfo{ID: host.ID(), Addrs: host.Addrs()})
}

newTopic, err := newPs.Join(topicString)
if err != nil {
return nil, err
}

newNode := NewRaftNode(newHost, newPs, newTopic, 100, time.Second)
go newNode.Run(ctx)
return newNode, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Ensure proper cleanup of resources in joinNewNode

The joinNewNode function creates a new node and starts it but doesn't include cleanup logic for the node's resources. To prevent resource leaks, make sure to close the host and topic when they are no longer needed.

Apply this diff to add cleanup:

 func joinNewNode(ctx context.Context, testItems *TestItems) (*Raft, error) {
     // ... existing code ...

     newNode := NewRaftNode(newHost, newPs, newTopic, 100, time.Second)
     go newNode.Run(ctx)
-    return newNode, nil
+    return newNode, nil
+}
+
+// In the calling test function, add defer statements after successful node creation:
+defer func() {
+    newNode.Topic.Close()
+    newNode.Host.Close()
+}()

Committable suggestion was skipped due to low confidence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant