@nv-nmailhot
Contributor

@nv-nmailhot nv-nmailhot commented Dec 1, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Enhanced container validation workflows with improved error extraction and diagnostics
    • Added comprehensive failure reporting with contextual pod, event, and deployment information
  • Chores

    • Improved CI/CD logging and health monitoring during backend validation
    • Strengthened failure annotations with enhanced diagnostic details for easier troubleshooting

@nv-nmailhot nv-nmailhot requested a review from a team as a code owner December 1, 2025 20:40
@github-actions github-actions bot added the feat label Dec 1, 2025
@coderabbitai
Contributor

coderabbitai bot commented Dec 1, 2025

Walkthrough

This pull request adds a Python script that extracts and analyzes log errors, using LogAI when it is available and falling back to regex matching otherwise. It also updates the container validation workflow to centralize log handling, capture diagnostics when a step fails, and include the extracted error details in failure reports.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Log Error Extraction: .github/scripts/extract_log_errors.py | New script introducing LogErrorExtractor class with methods for LogAI-based and regex-based error extraction. Supports JSON output, context windows, deduplication, and graceful fallback handling. Main entry point processes log files and outputs summaries or JSON objects. |
| Workflow Enhancement: .github/workflows/container-validation-backends.yml | Centralized log handling via tee redirection and error handler function with pod/event diagnostics capture. Added LogAI-based failure analysis with setup-python, LogAI installation, and error extraction steps. Enhanced annotations with error details, namespace, and framework context. Implemented proactive health checks, Helm dependency installation, and retry loops for LLM endpoint verification. Applied error-annotation patterns across operator deployment and test job sections. |
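
For orientation, here is a minimal sketch of what regex-based extraction with context windows and deduplication could look like. It is not the code from extract_log_errors.py, which this page does not show. The pattern list, the function name extract_errors_regex, and the record keys are all assumptions made for illustration.

```python
import re
from typing import Any, Dict, List

# Hypothetical patterns; the script's real pattern list is not shown in this review.
ERROR_PATTERNS = [
    re.compile(r"\bERROR\b", re.IGNORECASE),
    re.compile(r"\bFATAL\b", re.IGNORECASE),
    re.compile(r"Traceback \(most recent call last\)"),
]


def extract_errors_regex(log_text: str, context: int = 2) -> List[Dict[str, Any]]:
    """Return one record per distinct matching line, with surrounding context."""
    lines = log_text.splitlines()
    seen = set()
    errors: List[Dict[str, Any]] = []
    for idx, line in enumerate(lines):
        if not any(p.search(line) for p in ERROR_PATTERNS):
            continue
        message = line.strip()
        if message in seen:  # deduplicate repeated messages
            continue
        seen.add(message)
        start = max(0, idx - context)
        end = min(len(lines), idx + context + 1)
        errors.append(
            {
                "line_number": idx + 1,  # 1-based, as shown in most log viewers
                "message": message,
                "context": lines[start:end],
            }
        )
    return errors
```

Deduplicating on the stripped line keeps retry loops from flooding the summary with the same message, at the cost of hiding how often it occurred.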

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas requiring attention:
    • LogAI integration and fallback logic in extract_log_errors.py — verify graceful degradation and regex pattern completeness
    • Repetitive error-annotation patterns in workflow — ensure consistency across operator and test sections to prevent divergence
    • Trap/ERR handler implementation — confirm proper error propagation and exit codes in workflow steps
    • Python script integration into workflow pipeline — validate dependencies and environment compatibility

Poem

🐰 A log-hunting script hops through the night,
Catching errors with LogAI's light,
Or regex patterns, a trusty backup,
While workflows now sip from the diagnostics cup—
Errors exposed, no more hiding away!

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is essentially empty: only the template structure is present, with no actual content filled in under Overview, Details, or Where should the reviewer start? | Complete the PR description by filling in Overview (what problem this solves), Details (specific changes made), Where should the reviewer start (file callouts), and Related Issues (actual issue number instead of #xxx). |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: error annotations with trap' is concise and directly reflects the main changes: implementing error annotations and trap handling in the workflow. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71%, which is sufficient. The required threshold is 80.00%. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
.github/scripts/extract_log_errors.py (3)

84-86: Use tempfile module instead of hardcoded /tmp path.

Using a fixed path /tmp/analysis.log can cause race conditions if multiple CI jobs run concurrently on the same runner, and may fail in restricted environments.

+import tempfile
+
 # Write log content to a temporary file for LogAI processing
-temp_log = Path("/tmp/analysis.log")
-temp_log.write_text(self.log_content)
+with tempfile.NamedTemporaryFile(mode='w', suffix='.log', delete=False) as f:
+    f.write(self.log_content)
+    temp_log = Path(f.name)

Don't forget to clean up the temp file after use, or use delete=True if the file isn't needed after processing.
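
One way to make the cleanup explicit is a try/finally around the processing step. The sketch below assumes the analysis step only needs a file path. with_temp_log and the process callback are hypothetical names, not part of the PR.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def with_temp_log(log_content: str, process: Callable[[Path], T]) -> T:
    """Write log_content to a uniquely named temp file, run process, then delete it."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".log", delete=False, encoding="utf-8"
    ) as f:
        f.write(log_content)
        temp_log = Path(f.name)
    try:
        # The file is closed here, so another library can reopen it by path.
        return process(temp_log)
    finally:
        # Runs on success and on failure; missing_ok needs Python 3.8+.
        temp_log.unlink(missing_ok=True)
```

Keeping delete=False and unlinking manually lets the file be reopened by name after the with block, which delete=True would not allow because the file is removed as soon as it is closed.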


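130-136: Avoid catching a blanket Exception.

Catching Exception is too broad: it can hide unexpected errors, such as programming bugs that raise NameError or KeyError, behind the regex fallback. Note that KeyboardInterrupt and SystemExit derive from BaseException, so except Exception does not intercept them. Consider catching more specific exceptions.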

-except Exception as e:
+except (ValueError, AttributeError, TypeError, OSError) as e:
     # Log to stderr but don't let it pollute the main output
     print(
         f"LogAI extraction failed: {e}, falling back to regex extraction",
         file=sys.stderr,
     )
     return []
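
The fallback flow with a narrowed exception tuple might look like the sketch below. The tuple mirrors the suggestion above and is an assumption: the exceptions LogAI actually raises would need to be checked. extract_with_fallback and its two callables are hypothetical names.

```python
import sys
from typing import Any, Callable, Dict, List

Errors = List[Dict[str, Any]]

# Assumed set of recoverable failures; adjust to what the primary extractor raises.
RECOVERABLE = (ValueError, AttributeError, TypeError, OSError)


def extract_with_fallback(
    primary: Callable[[], Errors], fallback: Callable[[], Errors]
) -> Errors:
    """Try the primary extractor; use the fallback if it fails or finds nothing."""
    try:
        errors = primary()
    except RECOVERABLE as e:
        # Report on stderr so stdout stays clean for JSON output.
        print(f"primary extraction failed: {e}, falling back", file=sys.stderr)
        errors = []
    return errors or fallback()
```

Any exception outside the tuple propagates, so an unrelated bug in the primary extractor fails the step visibly instead of silently switching to the fallback.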

196-228: Consider caching extracted errors to avoid redundant processing.

Both get_summary() and get_primary_error() call extract_errors() internally. When both are used (as in main()), errors are extracted twice. For a CI utility this is acceptable, but caching would improve efficiency.

 def __init__(self, log_file: Path):
     self.log_file = log_file
     self.log_content = ""
+    self._cached_errors = None

     if log_file.exists():
         with open(log_file, "r", encoding="utf-8", errors="ignore") as f:
             self.log_content = f.read()

 def extract_errors(self) -> List[Dict[str, Any]]:
     """Extract errors using LogAI first, then fallback."""
+    if self._cached_errors is not None:
+        return self._cached_errors
+
     if not self.log_content:
         return []
     # ... rest of method
+    self._cached_errors = errors
+    return self._cached_errors
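
A self-contained version of the same idea, useful for checking that extraction runs only once. CachedExtractor is a stand-in written for this comment, not the PR's LogErrorExtractor. The extraction_runs counter and the simple "ERROR" substring match exist only for the demonstration.

```python
from typing import Any, Dict, List, Optional


class CachedExtractor:
    """Minimal stand-in showing only the caching behavior."""

    def __init__(self, log_content: str):
        self.log_content = log_content
        self._cached_errors: Optional[List[Dict[str, Any]]] = None
        self.extraction_runs = 0  # instrumentation for this example only

    def extract_errors(self) -> List[Dict[str, Any]]:
        if self._cached_errors is not None:
            return self._cached_errors
        self.extraction_runs += 1
        # Placeholder extraction: keep lines containing "ERROR".
        self._cached_errors = [
            {"message": line}
            for line in self.log_content.splitlines()
            if "ERROR" in line
        ]
        return self._cached_errors

    def get_summary(self) -> str:
        return f"{len(self.extract_errors())} error(s) found"

    def get_primary_error(self) -> Optional[str]:
        errors = self.extract_errors()
        return errors[0]["message"] if errors else None
```

Using None as the "not computed yet" sentinel means an empty result list is cached too, so a clean log is not rescanned either.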
.github/workflows/container-validation-backends.yml (1)

961-997: Same issues as deploy-operator annotation: unused variable and hardcoded line numbers.

Line 962 creates ERROR_MESSAGE_JSON which is unused, and lines 990-991 have hardcoded line numbers (593-807) that will become stale.

Additionally, this annotation logic is duplicated between deploy-operator and deploy-test steps. Consider extracting this into a reusable composite action or shared script to reduce maintenance burden.

-        # Safely encode ERROR_MESSAGE for JSON using jq
-        ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg')
-
         # Create a check run with the annotation
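
As a rough sketch of the shared-script option, both steps could call one payload builder and pass only what differs. build_check_run_payload is a hypothetical helper. It leaves out the annotations array altogether, on the assumption that the Checks API accepts an output with only title, summary, and text, which also removes the stale line numbers.

```python
import json
from typing import Any, Dict


def build_check_run_payload(
    *,
    name: str,
    head_sha: str,
    title: str,
    summary: str,
    job: str,
    namespace: str,
    error_msg: str,
    repo: str,
    run_id: str,
) -> Dict[str, Any]:
    """Build the JSON body for a failed check run, without file annotations."""
    run_url = f"https://github.com/{repo}/actions/runs/{run_id}"
    fence = "`" * 3  # Markdown code fence, built here to keep this example readable
    text = (
        f"**Job**: {job}\n**Namespace**: {namespace}\n\n"
        f"**Error Details**:\n{fence}\n{error_msg}\n{fence}\n\n"
        f"[View Job Run]({run_url})"
    )
    return {
        "name": name,
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "failure",
        "output": {"title": title, "summary": summary, "text": text},
    }


if __name__ == "__main__":
    payload = build_check_run_payload(
        name="Deploy Operator",
        head_sha="0000000",  # placeholder value
        title="Operator Deployment Failed (LogAI Analysis)",
        summary="Failed to deploy dynamo-platform operator",
        job="deploy-operator",
        namespace="example-ns",  # placeholder value
        error_msg="example error",
        repo="owner/repo",  # placeholder value
        run_id="1",
    )
    print(json.dumps(payload, indent=2))
```

json.dumps handles quoting of the error text, so no separate pre-encoding step is needed before the payload is sent.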
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d9fc85 and 2baf326.

📒 Files selected for processing (2)
  • .github/scripts/extract_log_errors.py (1 hunks)
  • .github/workflows/container-validation-backends.yml (5 hunks)
🧰 Additional context used
🪛 actionlint (1.7.9)
.github/workflows/container-validation-backends.yml

648-648: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)


906-906: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 Ruff (0.14.6)
.github/scripts/extract_log_errors.py

14-14: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


15-15: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


33-51: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


85-85: Probable insecure usage of temporary file or directory: "/tmp/analysis.log"

(S108)


130-130: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
.github/workflows/container-validation-backends.yml (2)

551-575: Well-structured error handler with diagnostic capture.

The error handler pattern captures useful Kubernetes diagnostics (pod status, events, deployments, helm status) on failure, which will significantly improve debugging. The trap on ERR and the preservation of the original exit code are correctly implemented.


875-902: Good use of trap with continue-on-error for failure diagnostics.

The pattern of using trap 'handle_error' ERR combined with continue-on-error: true allows the workflow to capture diagnostics on any failure while still proceeding to the log analysis and annotation steps. This provides better failure visibility.

Comment on lines +646 to +650
- name: Setup Python for Log Analysis
  if: always() && steps.deploy-operator-step.outcome == 'failure'
  uses: actions/setup-python@v4
  with:
    python-version: '3.10'

🛠️ Refactor suggestion | 🟠 Major

Update actions/setup-python to v5.

The actions/setup-python@v4 action is outdated. Update to v5 to ensure compatibility with current GitHub Actions runners.

     - name: Setup Python for Log Analysis
       if: always() && steps.deploy-operator-step.outcome == 'failure'
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v5
       with:
         python-version: '3.10'

Note: The same change should be applied at line 906 for the test failure log analysis step.
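
Since the same pin appears in more than one step, a quick scan can confirm that no v4 reference is left behind. This is a small sketch that assumes a line-by-line text match is good enough. find_action_pins is a hypothetical helper, not part of the PR.

```python
import re
from typing import List


def find_action_pins(workflow_text: str, action: str, version: str) -> List[int]:
    """Return 1-based line numbers of `uses: <action>@<version>` entries."""
    pattern = re.compile(
        rf"^\s*(?:-\s*)?uses:\s*{re.escape(action)}@{re.escape(version)}\s*(?:#.*)?$"
    )
    return [
        number
        for number, line in enumerate(workflow_text.splitlines(), start=1)
        if pattern.match(line)
    ]
```

Calling it with the workflow file's text, "actions/setup-python", and "v4" should return an empty list once both steps have been updated.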

🤖 Prompt for AI Agents
In .github/workflows/container-validation-backends.yml around lines 646-650 (and
also at line ~906 for the test failure log analysis step), the workflow uses
actions/setup-python@v4 which is outdated; update both occurrences to
actions/setup-python@v5 by changing the action reference from
actions/setup-python@v4 to actions/setup-python@v5 while keeping the same with:
python-version entries so the workflow uses the current setup-python v5
implementation.

Comment on lines +711 to +744
# Safely encode ERROR_MESSAGE for JSON using jq
ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg')
# Create a check run with the annotation
CHECK_RUN_ID=$(curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/${{ github.repository }}/check-runs" \
  -d "$(jq -n \
    --arg name "Deploy Operator" \
    --arg sha "${{ github.sha }}" \
    --arg namespace "${NAMESPACE}" \
    --arg error_msg "$ERROR_MESSAGE" \
    --arg repo "${{ github.repository }}" \
    --arg run_id "${{ github.run_id }}" \
    '{
      "name": $name,
      "head_sha": $sha,
      "status": "completed",
      "conclusion": "failure",
      "output": {
        "title": "Operator Deployment Failed (LogAI Analysis)",
        "summary": ("Failed to deploy dynamo-platform operator to namespace " + $namespace),
        "text": ("**Job**: deploy-operator\n**Namespace**: " + $namespace + "\n**Analysis Method**: LogAI\n\n**Error Details**:\n```\n" + $error_msg + "\n```\n\n[View Job Run](https://github.com/" + $repo + "/actions/runs/" + $run_id + ")"),
        "annotations": [{
          "path": ".github/workflows/container-validation-backends.yml",
          "start_line": 357,
          "end_line": 425,
          "annotation_level": "failure",
          "message": $error_msg,
          "title": "Operator deployment failed (LogAI)"
        }]
      }
    }')" | jq -r '.id')

⚠️ Potential issue | 🟡 Minor

Hardcoded line numbers in annotations will become stale.

The annotation references hardcoded line numbers (357-425 on lines 737-738). These will become incorrect as the workflow file is modified, causing annotations to point to the wrong locations.

Also, ERROR_MESSAGE_JSON on line 712 is created but never used—the jq command uses $error_msg directly via --arg.

Consider one or more of the following:

  1. Removing the hardcoded line numbers and using a more generic annotation approach
  2. Adding a comment noting these need to be updated when the file structure changes
  3. Removing the unused ERROR_MESSAGE_JSON variable

-        # Safely encode ERROR_MESSAGE for JSON using jq
-        ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg')
-
         # Create a check run with the annotation
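
If the annotation should keep pointing at the failing step, the line range could be computed at run time instead of being hardcoded. The sketch below is a naive text scan, not a YAML parser. It assumes the step has an id: key and that steps are list items at a consistent indentation. find_step_lines is a hypothetical helper.

```python
from typing import Optional, Tuple


def find_step_lines(workflow_text: str, step_id: str) -> Optional[Tuple[int, int]]:
    """Return 1-based (start, end) lines of the step whose id equals step_id."""
    lines = workflow_text.splitlines()
    targets = {f"id: {step_id}", f"- id: {step_id}"}
    hit = next((i for i, line in enumerate(lines) if line.strip() in targets), None)
    if hit is None:
        return None

    # Walk back to the list item ("- name: ...") that opens this step.
    start = hit
    while start > 0 and not lines[start].lstrip().startswith("- "):
        start -= 1
    indent = len(lines[start]) - len(lines[start].lstrip())

    # Walk forward until a non-blank line at the same or shallower indentation.
    end = start
    for i in range(start + 1, len(lines)):
        if not lines[i].strip():
            continue
        if len(lines[i]) - len(lines[i].lstrip()) <= indent:
            break
        end = i
    return start + 1, end + 1
```

The result could be passed to jq with --argjson for start_line and end_line. Unusual layouts, such as block scalars containing lines with shallower indentation, would defeat this scan, so a real YAML parser may be the safer choice.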
🤖 Prompt for AI Agents
In .github/workflows/container-validation-backends.yml around lines 711 to 744,
the check-run annotation embeds hardcoded start_line/end_line (357-425) that
will become stale and also defines ERROR_MESSAGE_JSON at line 712 which is never
used; remove the hardcoded line numbers from the annotation payload (omit
start_line and end_line or replace with a generic approach such as no file
location or pointing to the workflow file without line numbers) and either
delete the unused ERROR_MESSAGE_JSON variable or actually use it when building
the jq payload (pass the pre-encoded JSON-safe error text into the jq invocation
instead of the raw $ERROR_MESSAGE).
