Skip to content

K8s: Improve Node checks for liveness probe and preStop hook #2661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 18, 2025

Conversation

VietND96
Copy link
Member

@VietND96 VietND96 commented Feb 18, 2025

User description

Thanks for contributing to the Docker-Selenium project!
A PR well described will help maintainers to quickly review and merge it

Before submitting your PR, please check our contributing guidelines, applied for this repository.
Avoid large PRs, help reviewers by making them as simple and short as possible.

Description

Motivation and Context

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

PR Type

Enhancement, Bug fix


Description

  • Improved handling of Node liveness probe and preStop hook in Kubernetes.

  • Enhanced error handling and logging for Node and Hub/Router interactions.

  • Added session count checks and retry mechanisms for Node termination readiness.

  • Refined environment variable usage and HTTP response handling in scripts.


Changes walkthrough 📝

Relevant files
Enhancement
nodePreStop.sh
Enhance Node preStop script for better error handling       

charts/selenium-grid/configs/node/nodePreStop.sh

  • Improved logging for Hub/Router authentication and endpoint checks.
  • Added handling for HTTP 200 responses and other statuses from
    Hub/Router.
  • Enhanced session count checks and retry logic for Node termination
    readiness.
  • Refined usage of environment variables and error handling in Node
    drain logic.
  • +22/-9   
    nodeProbe.sh
    Improve Node probe script for robustness                                 

    charts/selenium-grid/configs/node/nodeProbe.sh

  • Improved logging for Node status and session count.
  • Enhanced handling of HTTP responses from Hub/Router.
  • Added fallback logic for Node readiness when Hub/Router is
    unreachable.
  • Refined JSON parsing and error handling in Node probe logic.
  • +12/-6   

    Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.
  • Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
    Copy link

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Error Handling

    The script assumes jq commands will succeed when parsing JSON responses. While there are some error fallbacks with || echo "", more robust error handling for malformed JSON responses would be beneficial.

    SLOT_HAS_SESSION=$(jq -r '[.value.node.slots[]? | select(.session != null)] | length' "${tmp_node_file}" || echo 0)
    if [ "${SLOT_HAS_SESSION}" -eq 0 ] && [ "${endpoint_http_code}" = "200" ]; then
    Edge Case

    The script assumes Node is ready when Hub is unreachable (HTTP non-200). This could mask actual Node registration issues and lead to false positives in node readiness detection.

    elif [ -n "${NODE_ID}" ] && [ "${endpoint_http_code}" != "200" ]; then
      echo "$(date -u +"${ts_format}") [${probe_name}] - Node ID: ${NODE_ID} report its status, but could not double check ID in Hub. Assume that Node is ready."
      exit 0

    Copy link

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Verify drain request success

    The script should validate the HTTP response code from the drain request to
    ensure the node drain signal was successfully processed. Currently, there's no
    check if the drain request succeeded.

    charts/selenium-grid/configs/node/nodePreStop.sh [54]

    -curl --noproxy "*" -m ${max_time} -k -X POST -H "Authorization: Basic ${BASIC_AUTH}" ${grid_url}/se/grid/distributor/node/${NODE_ID}/drain --header "${HEADERS}"
    +if ! curl --noproxy "*" -m ${max_time} -k -X POST -H "Authorization: Basic ${BASIC_AUTH}" ${grid_url}/se/grid/distributor/node/${NODE_ID}/drain --header "${HEADERS}" -w "%{http_code}" | grep -q "^2"; then
    +  echo "$(date -u +"${ts_format}") [${probe_name}] - Failed to signal node drain to Hub/Router"
    +  exit 1
    +fi
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion adds crucial error handling for the drain request, which could prevent silent failures and ensure proper node termination. This is important for maintaining grid stability and preventing orphaned sessions.

    Medium
    Handle JSON parsing errors

    The script should handle potential JSON parsing errors when checking for active
    sessions, as malformed JSON could cause incorrect termination decisions.

    charts/selenium-grid/configs/node/nodePreStop.sh [84-85]

    -SLOT_HAS_SESSION=$(jq -r '[.value.node.slots[]? | select(.session != null)] | length' "${tmp_node_file}" || echo 0)
    +if ! SLOT_HAS_SESSION=$(jq -r '[.value.node.slots[]? | select(.session != null)] | length' "${tmp_node_file}" 2>/dev/null); then
    +  echo "$(date -u +"${ts_format}") [${probe_name}] - Failed to parse node status JSON"
    +  exit 1
    +fi
     if [ "${SLOT_HAS_SESSION}" -eq 0 ] && [ "${endpoint_http_code}" = "200" ]; then
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    __

    Why: The suggestion improves error handling for JSON parsing failures, which could prevent incorrect termination decisions based on malformed data. This enhances the reliability of the node shutdown process.

    Medium
    • More

    Copy link

    qodo-merge-pro bot commented Feb 18, 2025

    CI Feedback 🧐

    (Feedback updated until commit 8e454ba)

    A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

    Action: Rerun workflow when failure

    Failed stage: Authenticate GitHub CLI for PR [❌]

    Failure summary:

    The action failed because the GitHub authentication token provided (GH_CLI_TOKEN_PR) lacks the
    required 'read:org' permission scope. The GitHub CLI (gh) command attempted to authenticate but
    couldn't proceed due to insufficient permissions.

    Relevant error logs:
    1:  ##[group]Operating System
    2:  Ubuntu
    ...
    
    28:  SecurityEvents: write
    29:  Statuses: write
    30:  ##[endgroup]
    31:  Secret source: Actions
    32:  Prepare workflow directory
    33:  Prepare all required actions
    34:  Getting action download info
    35:  Download action repository 'actions/checkout@main' (SHA:85e6279cec87321a52edac9c87bce653a07cf6c2)
    36:  Complete job name: Rerun workflow when failure
    ...
    
    48:  show-progress: true
    49:  lfs: false
    50:  submodules: false
    51:  set-safe-directory: true
    52:  env:
    53:  GH_CLI_TOKEN: ***
    54:  GH_CLI_TOKEN_PR: ***
    55:  RUN_ID: 13386789165
    56:  RERUN_FAILED_ONLY: true
    ...
    
    119:  ##[group]Run sudo apt update
    120:  �[36;1msudo apt update�[0m
    121:  �[36;1msudo apt install gh�[0m
    122:  shell: /usr/bin/bash -e {0}
    123:  env:
    124:  GH_CLI_TOKEN: ***
    125:  GH_CLI_TOKEN_PR: ***
    126:  RUN_ID: 13386789165
    127:  RERUN_FAILED_ONLY: true
    ...
    
    167:  0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
    168:  ##[group]Run echo "$GH_CLI_TOKEN_PR" | gh auth login --with-token
    169:  �[36;1mecho "$GH_CLI_TOKEN_PR" | gh auth login --with-token�[0m
    170:  shell: /usr/bin/bash -e {0}
    171:  env:
    172:  GH_CLI_TOKEN: ***
    173:  GH_CLI_TOKEN_PR: ***
    174:  RUN_ID: 13386789165
    175:  RERUN_FAILED_ONLY: true
    176:  RUN_ATTEMPT: 1
    177:  ##[endgroup]
    178:  error validating token: missing required scope 'read:org'
    179:  ##[error]Process completed with exit code 1.
    

    @VietND96 VietND96 merged commit 0a5b6ec into trunk Feb 18, 2025
    25 of 27 checks passed
    @VietND96 VietND96 deleted the node-probe branch February 18, 2025 10:35
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    1 participant