Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] launch: fix timeout mechanism when TN service startup #16761

Merged
merged 2 commits into from
Jun 8, 2024

Conversation

volgariver6
Copy link
Contributor

@volgariver6 volgariver6 commented Jun 7, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue https://github.com/matrixorigin/MO-Cloud/issues/3450

What this PR does / why we need it:

fix timeout mechanism when TN service startup:
5m timeout for total wait and 5s timeout for each request.


PR Type

Bug fix


Description

  • Fixed the timeout mechanism when starting the TN service by implementing a retry mechanism with a 5-second timeout for each request and a total wait time of 5 minutes.
  • Changed the return value in the startTNServiceCluster function to return an error instead of nil.
  • Improved the client connection logic in the newHAKeeperClient function to handle discovery and service addresses more robustly.

Changes walkthrough 📝

Relevant files
Bug fix
launch.go
Fix timeout and retry mechanism for TN service startup     

cmd/mo-service/launch.go

  • Changed return value in startTNServiceCluster function to return an
    error instead of nil.
  • Implemented a retry mechanism with a 5-second timeout for each request
    and a total wait time of 5 minutes in waitHAKeeperReady.
  • +21/-10 
    hakeeper_client.go
    Improve client connection logic in HAKeeper client             

    pkg/logservice/hakeeper_client.go

  • Modified newHAKeeperClient to handle the discovery address and service
    addresses more robustly.
  • Ensured that the client is returned only if it is successfully
    created.
  • +3/-2     

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Review 🔍

    ⏱️ Estimated effort to review [1-5]

    2, because the changes are localized to specific functions and involve straightforward logic adjustments, such as error handling and timeout settings. The changes are not extensive but require understanding of the context and the impact on the system's behavior.

    🧪 Relevant tests

    No

    ⚡ Possible issues

    Retry Logic: The retry mechanism in waitHAKeeperReady function uses a fixed 5-second timeout for each attempt without any limit on the number of retries within the 5-minute total timeout. This could potentially lead to a large number of retries if the service is not available, impacting system resources.

    Error Handling: The change in startTNServiceCluster to return an error instead of nil is correct, but it's important to ensure that all callers of this function properly handle the error to avoid unintended behavior.

    🔒 Security concerns

    No

    @matrix-meow matrix-meow added the size/S Denotes a PR that changes [10,99] lines label Jun 7, 2024
    @mergify mergify bot requested a review from sukki37 June 7, 2024 12:48
    @mergify mergify bot added the kind/bug Something isn't working label Jun 7, 2024
    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Best practice
    Use a ticker instead of time.Sleep to avoid blocking the goroutine

    Instead of using time.Sleep(time.Second) in the loop, consider using a ticker to avoid
    blocking the goroutine and to make the code more idiomatic.

    cmd/mo-service/launch.go [237]

    -time.Sleep(time.Second)
    +ticker := time.NewTicker(time.Second)
    +defer ticker.Stop()
    +for {
    +    select {
    +    case <-ctx.Done():
    +        return nil, moerr.NewInternalErrorNoCtx("wait hakeeper ready timeout")
    +    case <-ticker.C:
    +        client, err := getClient()
    +        if err == nil {
    +            return client, nil
    +        }
    +        logutil.Errorf("hakeeper not ready, err: %v", err)
    +    }
    +}
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Replacing time.Sleep with a ticker in a loop is a best practice in Go for managing repeated actions without blocking. This suggestion improves the efficiency and responsiveness of the loop.

    7
    Enhancement
    Add a log message before returning the timeout error for better context

    Add a log message before returning the timeout error to provide more context about the
    failure.

    cmd/mo-service/launch.go [231]

    +logutil.Errorf("Timeout waiting for hakeeper to be ready")
     return nil, moerr.NewInternalErrorNoCtx("wait hakeeper ready timeout")
     
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: Adding a log message before returning a timeout error provides additional context which is useful for debugging. It's a good enhancement for better error handling.

    6
    Maintainability
    Combine the if condition and the return statement for conciseness

    Combine the if condition and the return statement to make the code more concise.

    cmd/mo-service/launch.go [124-125]

    +if err := startService(ctx, cfg, stopper, shutdownC); err != nil {
    +    return err
    +}
     
    -
    • Apply this suggestion
    Suggestion importance[1-10]: 5

    Why: The suggestion to combine the if condition and the return statement is valid and makes the code more concise. However, the existing code already uses this pattern, so the suggestion does not introduce a change.

    5

    @mergify mergify bot merged commit cdfd3b7 into matrixorigin:1.2-dev Jun 8, 2024
    18 checks passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    Bug fix kind/bug Something isn't working Review effort [1-5]: 2 size/S Denotes a PR that changes [10,99] lines
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    4 participants