Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix a bug about send notify message to remote node. #16696

Merged
merged 3 commits into from
Jun 12, 2024

Conversation

m-schen
Copy link
Contributor

@m-schen m-schen commented Jun 5, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

https://github.com/matrixorigin/MO-Cloud/issues/2788

What this PR does / why we need it:

调整接收远程节点的dispatch发送的数据时,关闭流和cancel上下文的顺序。确保相关上下文cancel后再关闭流。


PR Type

Bug fix


Description

This PR addresses a bug related to sending notify messages to remote nodes by improving error handling and context management:

  • Renamed errChan to preScopeResultReceiveChan for better clarity in handling pre-scope results.
  • Introduced notifyMessageResultReceiveChan to handle results from notify message routines.
  • Ensured that senders are closed after context cancellation to avoid misreporting errors.
  • Replaced the notifyAndReceiveFromRemote function with sendNotifyMessage to improve context handling and error reporting.

Changes walkthrough 📝

Relevant files
Bug fix
scope.go
Improve error handling and context management in MergeRun and related
functions.

pkg/sql/compile/scope.go

  • Renamed errChan to preScopeResultReceiveChan for clarity.
  • Introduced notifyMessageResultReceiveChan to handle notify message
    results.
  • Added logic to close senders after context cancellation to avoid
    misreporting errors.
  • Replaced notifyAndReceiveFromRemote with sendNotifyMessage for better
    context handling.
  • +56/-27 

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Review 🔍

    ⏱️ Estimated effort to review [1-5]

    3, because the PR involves changes in error handling and context management in a complex function, which requires understanding the flow of asynchronous operations and error propagation. The changes are substantial but localized to specific functions, making the review moderately challenging.

    🧪 Relevant tests

    No

    ⚡ Possible issues

    Possible Bug: The notifyMessageResultReceiveChan is used to handle results from notify message routines, but there is no explicit handling for cases where result.err is nil but result.sender is not, which might lead to unhandled scenarios.

    Resource Leak: The deferred function in MergeRun ensures senders are closed, but if an exception occurs before reaching the defer block, the senders might not be closed, leading to potential resource leaks.

    🔒 Security concerns

    No

    @mergify mergify bot requested a review from sukki37 June 5, 2024 12:30
    @mergify mergify bot added the kind/bug Something isn't working label Jun 5, 2024
    @matrix-meow matrix-meow added the size/S Denotes a PR that changes [10,99] lines label Jun 5, 2024
    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Possible bug
    Add a nil check for notifyMessageResultReceiveChan before reading from it to avoid nil pointer dereference

    Add a check to ensure that notifyMessageResultReceiveChan is not nil before attempting to
    read from it in the defer function to avoid potential nil pointer dereference.

    pkg/sql/compile/scope.go [254-259]

    -for len(notifyMessageResultReceiveChan) > 0 {
    -    result := <-notifyMessageResultReceiveChan
    -    if result.sender != nil {
    -        _ = result.sender.Close(false)
    +if notifyMessageResultReceiveChan != nil {
    +    for len(notifyMessageResultReceiveChan) > 0 {
    +        result := <-notifyMessageResultReceiveChan
    +        if result.sender != nil {
    +            _ = result.sender.Close(false)
    +        }
         }
     }
     
    Suggestion importance[1-10]: 8

    Why: Adding a nil check before using notifyMessageResultReceiveChan is crucial to prevent runtime panics due to nil pointer dereferences. This suggestion addresses a potential major runtime error.

    8
    Best practice
    Move the defer statement to the beginning of the function to ensure cleanup code is executed

    The defer statement should be moved to the beginning of the function to ensure that the
    cleanup code is executed even if an early return occurs due to an error.

    pkg/sql/compile/scope.go [245-261]

    +defer func() {
    +    // should wait all the notify-message-routine and preScopes done.
    +    wg.Wait()
     
    +    // not necessary, but we still clean the preScope error channel here.
    +    for len(preScopeResultReceiveChan) > 0 {
    +        _ = <-preScopeResultReceiveChan
    +    }
     
    +    // clean the notifyMessageResultReceiveChan to make sure all the rpc-sender can be closed.
    +    for len(notifyMessageResultReceiveChan) > 0 {
    +        result := <-notifyMessageResultReceiveChan
    +        if result.sender != nil {
    +            _ = result.sender.Close(false)
    +        }
    +    }
    +}()
    +
    Suggestion importance[1-10]: 7

    Why: Moving the defer statement to the beginning of the function ensures that cleanup operations are always executed, even if an error causes an early return. This is a good practice for resource management.

    7
    Possible issue
    Place wg.Done() inside the closeWithError function to avoid potential goroutine leaks

    Ensure that the wg.Done() call is placed inside the closeWithError function to avoid
    potential goroutine leaks in case of early returns.

    pkg/sql/compile/scope.go [1147-1162]

     closeWithError := func(err error, reg *process.WaitRegister, sender morpc.Stream) {
    +    defer wg.Done()
         if reg != nil {
             select {
             case <-s.Proc.Ctx.Done():
                 resultChan <- notifyMessageResult{err: nil, sender: sender}
             default:
                 resultChan <- notifyMessageResult{err: err, sender: sender}
             }
    -        wg.Done()
         }
     }
     
    Suggestion importance[1-10]: 7

    Why: Ensuring wg.Done() is called in every execution path of closeWithError function prevents goroutine leaks, which is crucial for the stability and performance of the application. This suggestion enhances the reliability of goroutine management.

    7
    Add timeout handling in the select statement to prevent infinite blocking

    Add a timeout or context cancellation handling for the select statement in the for loop to
    prevent potential infinite blocking if neither channel receives any data.

    pkg/sql/compile/scope.go [286-302]

     for {
         select {
         case err := <-preScopeResultReceiveChan:
             if err != nil {
                 return err
             }
             preScopeCount--
     
         case result := <-notifyMessageResultReceiveChan:
             if result.sender != nil {
                 _ = result.sender.Close(false)
             }
             if result.err != nil {
                 return result.err
             }
             remoteScopeCount--
    +
    +    case <-time.After(time.Second * 30):
    +        return fmt.Errorf("timeout waiting for scope results")
         }
     }
     
    Suggestion importance[1-10]: 6

    Why: Implementing a timeout in the select statement is a good practice to avoid potential infinite blocking situations. This suggestion improves the robustness of the code by handling scenarios where channels might not receive data.

    6

    @m-schen
    Copy link
    Contributor Author

    m-schen commented Jun 6, 2024

    这个暂时不合并,等待测试。

    @mergify mergify bot merged commit babf8ea into matrixorigin:1.2-dev Jun 12, 2024
    17 of 18 checks passed
    @m-schen m-schen deleted the try-fix-2788 branch June 18, 2024 02:34
    m-schen added a commit to m-schen/matrixone that referenced this pull request Jun 24, 2024
    调整接收远程节点的dispatch发送的数据时,关闭流和cancel上下文的顺序。确保相关上下文cancel后再关闭流。
    
    Approved by: @aunjgr, @sukki37
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    Bug fix kind/bug Something isn't working Review effort [1-5]: 3 size/S Denotes a PR that changes [10,99] lines
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    4 participants