feat: error annotations with trap #4672

nv-nmailhot · 2025-12-01T20:40:09Z

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

New Features
- Enhanced container validation workflows with improved error extraction and diagnostics
- Added comprehensive failure reporting with contextual pod, event, and deployment information
Chores
- Improved CI/CD logging and health monitoring during backend validation
- Strengthened failure annotations with enhanced diagnostic details for easier troubleshooting


coderabbitai · 2025-12-01T20:45:15Z

Walkthrough

This pull request introduces a Python script for extracting and analyzing log errors with optional LogAI integration and regex fallback, and updates the container validation workflow to centralize log handling, implement robust error diagnostics capture on failures, and enhance failure reporting with extracted error details.

Changes

Cohort / File(s)	Change Summary
Log Error Extraction `.github/scripts/extract_log_errors.py`	New script introducing `LogErrorExtractor` class with methods for LogAI-based and regex-based error extraction. Supports JSON output, context windows, deduplication, and graceful fallback handling. Main entry point processes log files and outputs summaries or JSON objects.
Workflow Enhancement `.github/workflows/container-validation-backends.yml`	Centralized log handling via tee redirection and error handler function with pod/event diagnostics capture. Added LogAI-based failure analysis with setup-python, LogAI installation, and error extraction steps. Enhanced annotations with error details, namespace, and framework context. Implemented proactive health checks, Helm dependency installation, and retry loops for LLM endpoint verification. Applied error-annotation patterns across operator deployment and test job sections.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring attention:
- LogAI integration and fallback logic in extract_log_errors.py — verify graceful degradation and regex pattern completeness
- Repetitive error-annotation patterns in workflow — ensure consistency across operator and test sections to prevent divergence
- Trap/ERR handler implementation — confirm proper error propagation and exit codes in workflow steps
- Python script integration into workflow pipeline — validate dependencies and environment compatibility
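For reviewers checking the regex fallback, the kind of behavior described in the walkthrough (pattern matching, context windows, deduplication) can be sketched roughly as follows. This is a minimal illustration only: the pattern list, the `extract_errors_regex` name, and the output keys are assumptions made for the example, not the contents of extract_log_errors.py.

```python
import re
from typing import Any, Dict, List

# Hypothetical patterns; the script's real pattern list is not shown in this review.
ERROR_PATTERNS = [
    re.compile(r"\bERROR\b", re.IGNORECASE),
    re.compile(r"Traceback \(most recent call last\)"),
    re.compile(r"\bCrashLoopBackOff\b"),
]


def extract_errors_regex(log_text: str, context: int = 1) -> List[Dict[str, Any]]:
    """Return de-duplicated error lines with `context` lines on each side."""
    lines = log_text.splitlines()
    seen = set()
    errors: List[Dict[str, Any]] = []
    for idx, line in enumerate(lines):
        if not any(p.search(line) for p in ERROR_PATTERNS):
            continue
        key = line.strip()
        if key in seen:  # skip repeats of the same message
            continue
        seen.add(key)
        start = max(0, idx - context)
        end = min(len(lines), idx + context + 1)
        errors.append(
            {"line": idx + 1, "message": key, "context": lines[start:end]}
        )
    return errors


SAMPLE_LOG = (
    "starting deploy\n"
    "ERROR: image pull failed\n"
    "retrying\n"
    "ERROR: image pull failed\n"
    "done"
)
RESULT = extract_errors_regex(SAMPLE_LOG)
```

Deduplicating on the stripped line keeps the first occurrence and its context, which is usually the most useful one in a CI log.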

Poem

🐰 A log-hunting script hops through the night,
Catching errors with LogAI's light,
Or regex patterns, a trusty backup,
While workflows now sip from the diagnostics cup—
Errors exposed, no more hiding away!

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is essentially empty—only the template structure is present with no actual content filled in under Overview, Details, or Where should the reviewer start?	Complete the PR description by filling in Overview (what problem this solves), Details (specific changes made), Where should the reviewer start (file callouts), and Related Issues (actual issue number instead of #xxx).

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: error annotations with trap' is concise and directly reflects the main changes—implementing error annotations and trap handling in the workflow.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

.github/scripts/extract_log_errors.py (3)
84-86: Use tempfile module instead of hardcoded /tmp path.

Using a fixed path /tmp/analysis.log can cause race conditions if multiple CI jobs run concurrently on the same runner, and may fail in restricted environments.
+import tempfile
+
 # Write log content to a temporary file for LogAI processing
-temp_log = Path("/tmp/analysis.log")
-temp_log.write_text(self.log_content)
+with tempfile.NamedTemporaryFile(mode='w', suffix='.log', delete=False) as f:
+    f.write(self.log_content)
+    temp_log = Path(f.name)
Remember to delete the temp file after use (delete=False leaves it on disk), or use delete=True and keep all processing inside the with block, since the file is removed as soon as it is closed.
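A minimal sketch of the cleanup, assuming the LogAI step only needs a file path for the duration of one call. The `with_temp_log` helper and the `count_lines` stand-in are hypothetical names used for illustration; they are not part of the script under review.

```python
import tempfile
from pathlib import Path
from typing import Any, Callable


def with_temp_log(log_content: str, process: Callable[[Path], Any]) -> Any:
    """Write log_content to a unique temp file, run process(path), then delete it."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".log", delete=False, encoding="utf-8"
    ) as f:
        f.write(log_content)
        temp_log = Path(f.name)
    try:
        return process(temp_log)
    finally:
        # Always remove the file, even if processing raises.
        temp_log.unlink(missing_ok=True)


seen_paths = []


def count_lines(path: Path) -> int:
    """Stand-in for the real analysis step; records the path it was given."""
    seen_paths.append(path)
    return len(path.read_text(encoding="utf-8").splitlines())


line_count = with_temp_log("a\nb\nc\n", count_lines)
```

Closing the file before handing the path to the processing step avoids problems on platforms where an open temporary file cannot be reopened by name.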

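130-136: Avoid catching broad Exception.

Catching Exception is too broad: it swallows unexpected failures such as programming errors and makes them look like an ordinary LogAI fallback. (KeyboardInterrupt and SystemExit derive from BaseException, so except Exception does not intercept them.) Consider catching more specific exceptions.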
-except Exception as e:
+except (ValueError, AttributeError, TypeError, OSError) as e:
     # Log to stderr but don't let it pollute the main output
     print(
         f"LogAI extraction failed: {e}, falling back to regex extraction",
         file=sys.stderr,
     )
     return []
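The effect of narrowing the except clause can be sketched as below. This is an illustration under assumptions: `extract_with_fallback`, `broken_primary`, and `regex_fallback` are hypothetical stand-ins (the first callable plays the role of the LogAI path), and the exception tuple simply mirrors the one suggested above.

```python
import sys
from typing import Callable, List

Extractor = Callable[[str], List[str]]


def extract_with_fallback(primary: Extractor, fallback: Extractor, text: str) -> List[str]:
    """Try the primary extractor; fall back only on expected failure types."""
    try:
        return primary(text)
    except (ValueError, AttributeError, TypeError, OSError) as e:
        # Report on stderr so the main output stays clean.
        print(f"primary extraction failed: {e}, falling back", file=sys.stderr)
        return fallback(text)


def broken_primary(text: str) -> List[str]:
    """Stand-in for a LogAI call that fails on unparseable input."""
    raise ValueError("unparseable log")


def regex_fallback(text: str) -> List[str]:
    """Stand-in for the regex path: keep lines that mention ERROR."""
    return [line for line in text.splitlines() if "ERROR" in line]


fallback_result = extract_with_fallback(broken_primary, regex_fallback, "ok\nERROR: boom")
```

Any exception outside the listed types, for example a RuntimeError from a genuine bug, propagates to the caller instead of being silently converted into a fallback.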
196-228: Consider caching extracted errors to avoid redundant processing.

Both get_summary() and get_primary_error() call extract_errors() internally. When both are used (as in main()), errors are extracted twice. For a CI utility this is acceptable, but caching would improve efficiency.
 def __init__(self, log_file: Path):
     self.log_file = log_file
     self.log_content = ""
+    self._cached_errors = None

     if log_file.exists():
         with open(log_file, "r", encoding="utf-8", errors="ignore") as f:
             self.log_content = f.read()

 def extract_errors(self) -> List[Dict[str, Any]]:
     """Extract errors using LogAI first, then fallback."""
+    if self._cached_errors is not None:
+        return self._cached_errors
+
     if not self.log_content:
         return []
     # ... rest of method
+    self._cached_errors = errors
+    return self._cached_errors
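A self-contained sketch of the caching idea, with a counter added purely to show that the log is scanned once. `CachedExtractor` and `scan_count` are hypothetical names for this example, and the simple "ERROR" substring match stands in for the real extraction logic.

```python
from typing import Any, Dict, List, Optional


class CachedExtractor:
    """Toy extractor that scans its log at most once."""

    def __init__(self, log_content: str):
        self.log_content = log_content
        self._cached_errors: Optional[List[Dict[str, Any]]] = None
        self.scan_count = 0  # instrumentation for this example only

    def extract_errors(self) -> List[Dict[str, Any]]:
        if self._cached_errors is not None:
            return self._cached_errors
        self.scan_count += 1
        self._cached_errors = [
            {"message": line}
            for line in self.log_content.splitlines()
            if "ERROR" in line
        ]
        return self._cached_errors

    def get_primary_error(self) -> str:
        errors = self.extract_errors()
        return errors[0]["message"] if errors else ""

    def get_summary(self) -> str:
        return f"{len(self.extract_errors())} error(s) found"


extractor = CachedExtractor("ok\nERROR: pod crashed\nERROR: timeout")
summary = extractor.get_summary()
primary = extractor.get_primary_error()
```

Comparing against None rather than truthiness means an empty result is cached too, so a clean log is not rescanned on every call.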
.github/workflows/container-validation-backends.yml (1)
961-997: Same issues as deploy-operator annotation: unused variable and hardcoded line numbers.

Line 962 creates ERROR_MESSAGE_JSON which is unused, and lines 990-991 have hardcoded line numbers (593-807) that will become stale.

Additionally, this annotation logic is duplicated between deploy-operator and deploy-test steps. Consider extracting this into a reusable composite action or shared script to reduce maintenance burden.
-        # Safely encode ERROR_MESSAGE for JSON using jq
-        ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg')
-
         # Create a check run with the annotation
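One possible shape for the shared script is sketched below. The field names follow the check-run payload shown in this review; the `build_check_run_payload` function, its parameters, and the sample values are assumptions made for illustration. To my understanding, check-run annotations need explicit start and end lines, so this sketch leaves the annotations array out and carries the error text in the output body instead.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # Markdown code fence, built here to keep the example readable


def build_check_run_payload(
    name: str,
    job: str,
    head_sha: str,
    namespace: str,
    error_msg: str,
    repo: str,
    run_id: str,
) -> Dict[str, Any]:
    """Build a failed check-run payload shared by the operator and test jobs."""
    run_url = f"https://github.com/{repo}/actions/runs/{run_id}"
    text = (
        f"**Job**: {job}\n**Namespace**: {namespace}\n\n"
        f"**Error Details**:\n{FENCE}\n{error_msg}\n{FENCE}\n\n"
        f"[View Job Run]({run_url})"
    )
    return {
        "name": name,
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "failure",
        "output": {
            "title": f"{name} Failed",
            "summary": f"{job} failed in namespace {namespace}",
            "text": text,
            # No "annotations" entry: nothing depends on workflow line numbers.
        },
    }


payload = build_check_run_payload(
    name="Deploy Operator",
    job="deploy-operator",
    head_sha="0000000",  # placeholder values for illustration
    namespace="example-namespace",
    error_msg='helm: release "x" failed',
    repo="example-org/example-repo",
    run_id="1",
)
body = json.dumps(payload)  # json.dumps takes care of quoting the error text
```

Both jobs could call the same function with different `name` and `job` values, which keeps the two annotation steps from drifting apart.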

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d9fc85 and 2baf326.

📒 Files selected for processing (2)

.github/scripts/extract_log_errors.py (1 hunks)
.github/workflows/container-validation-backends.yml (5 hunks)

🧰 Additional context used

🪛 actionlint (1.7.9)

.github/workflows/container-validation-backends.yml

648-648: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

906-906: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 Ruff (0.14.6)

.github/scripts/extract_log_errors.py

14-14: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

15-15: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

33-51: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

85-85: Probable insecure usage of temporary file or directory: "/tmp/analysis.log"

(S108)

130-130: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: trtllm (arm64)
GitHub Check: sglang (amd64)
GitHub Check: sglang (arm64)
GitHub Check: vllm (arm64)
GitHub Check: operator (arm64)
GitHub Check: trtllm (amd64)
GitHub Check: operator (amd64)
GitHub Check: vllm (amd64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (2)

.github/workflows/container-validation-backends.yml (2)

551-575: Well-structured error handler with diagnostic capture.

The error handler pattern captures useful Kubernetes diagnostics (pod status, events, deployments, helm status) on failure, which will significantly improve debugging. The trap on ERR and the preservation of the original exit code are correctly implemented.

875-902: Good use of trap with continue-on-error for failure diagnostics.

The pattern of using trap 'handle_error' ERR combined with continue-on-error: true allows the workflow to capture diagnostics on any failure while still proceeding to the log analysis and annotation steps. This provides better failure visibility.
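The exit-code behavior praised here can be checked locally with a small harness like the one below. It is a sketch under assumptions: the embedded Bash script is a simplified stand-in for the workflow step (the real handler also collects kubectl and helm diagnostics), the handler body is hypothetical, and the harness assumes `bash` is available on the PATH.

```python
import subprocess

# Simplified stand-in for a workflow step; the real diagnostics commands
# (pods, events, helm status) would go inside handle_error.
BASH_SCRIPT = r"""
handle_error() {
  local exit_code=$1
  echo "diagnostics: step failed with exit code ${exit_code}" >&2
  exit "${exit_code}"
}
trap 'handle_error $?' ERR
echo "deploying"
(exit 7)
echo "not reached"
"""

completed = subprocess.run(
    ["bash", "-c", BASH_SCRIPT],
    capture_output=True,
    text=True,
    check=False,  # a non-zero exit is the expected outcome here
)
```

Passing `$?` into the handler as an argument captures the failing command's status before any command inside the handler can overwrite it.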

coderabbitai · 2025-12-01T20:45:18Z

.github/workflows/container-validation-backends.yml

+    - name: Setup Python for Log Analysis
+      if: always() && steps.deploy-operator-step.outcome == 'failure'
+      uses: actions/setup-python@v4
+      with:
+        python-version: '3.10'


🛠️ Refactor suggestion | 🟠 Major

Update actions/setup-python to v5.

The actions/setup-python@v4 action is outdated. Update to v5 to ensure compatibility with current GitHub Actions runners.

     - name: Setup Python for Log Analysis
       if: always() && steps.deploy-operator-step.outcome == 'failure'
-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v5
       with:
         python-version: '3.10'

Note: The same change should be applied at line 906 for the test failure log analysis step.

🧰 Tools

🪛 actionlint (1.7.9)

648-648: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents

In .github/workflows/container-validation-backends.yml around lines 646-650 (and also at line ~906 for the test failure log analysis step), the workflow uses actions/setup-python@v4 which is outdated; update both occurrences to actions/setup-python@v5 by changing the action reference from actions/setup-python@v4 to actions/setup-python@v5 while keeping the same with: python-version entries so the workflow uses the current setup-python v5 implementation.

coderabbitai · 2025-12-01T20:45:18Z

.github/workflows/container-validation-backends.yml

+        # Safely encode ERROR_MESSAGE for JSON using jq
+        ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg')
+
+        # Create a check run with the annotation
+        CHECK_RUN_ID=$(curl -s -X POST \
+          -H "Authorization: token $GITHUB_TOKEN" \
+          -H "Accept: application/vnd.github.v3+json" \
+          "https://api.github.com/repos/${{ github.repository }}/check-runs" \
+          -d "$(jq -n \
+            --arg name "Deploy Operator" \
+            --arg sha "${{ github.sha }}" \
+            --arg namespace "${NAMESPACE}" \
+            --arg error_msg "$ERROR_MESSAGE" \
+            --arg repo "${{ github.repository }}" \
+            --arg run_id "${{ github.run_id }}" \
+            '{
+              "name": $name,
+              "head_sha": $sha,
+              "status": "completed",
+              "conclusion": "failure",
+              "output": {
+                "title": "Operator Deployment Failed (LogAI Analysis)",
+                "summary": ("Failed to deploy dynamo-platform operator to namespace " + $namespace),
+                "text": ("**Job**: deploy-operator\n**Namespace**: " + $namespace + "\n**Analysis Method**: LogAI\n\n**Error Details**:\n```\n" + $error_msg + "\n```\n\n[View Job Run](https://github.com/" + $repo + "/actions/runs/" + $run_id + ")"),
+                "annotations": [{
+                  "path": ".github/workflows/container-validation-backends.yml",
+                  "start_line": 357,
+                  "end_line": 425,
+                  "annotation_level": "failure",
+                  "message": $error_msg,
+                  "title": "Operator deployment failed (LogAI)"
+                }]
+              }
+            }')" | jq -r '.id')


⚠️ Potential issue | 🟡 Minor

Hardcoded line numbers in annotations will become stale.

The annotation references hardcoded line numbers (357-425 on lines 737-738). These will become incorrect as the workflow file is modified, causing annotations to point to the wrong locations.

Also, ERROR_MESSAGE_JSON on line 712 is created but never used—the jq command uses $error_msg directly via --arg.

Consider:
- Removing the hardcoded line numbers and using a more generic annotation approach
- Adding a comment noting these need to be updated when the file structure changes
- Removing the unused ERROR_MESSAGE_JSON variable

-        # Safely encode ERROR_MESSAGE for JSON using jq
-        ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg')
-
         # Create a check run with the annotation

🤖 Prompt for AI Agents

In .github/workflows/container-validation-backends.yml around lines 711 to 744, the check-run annotation embeds hardcoded start_line/end_line (357-425) that will become stale and also defines ERROR_MESSAGE_JSON at line 712 which is never used; remove the hardcoded line numbers from the annotation payload (omit start_line and end_line or replace with a generic approach such as no file location or pointing to the workflow file without line numbers) and either delete the unused ERROR_MESSAGE_JSON variable or actually use it when building the jq payload (pass the pre-encoded JSON-safe error text into the jq invocation instead of the raw $ERROR_MESSAGE).

nv-nmailhot added 17 commits November 7, 2025 15:09

add error message propogation

959ec2b

extract error with ai

11e20fd

always run deploy test to test temporarily

ada3992

change rules to test

38cc68d

force trigger always

4e01be8

test log ai

50926d8

logai fixes

556308d

Merge branch 'main' into nmailhot/ai-logs

a591a54

add more error identifiers

8704fef

remove unneeded files

9cf62b9

change step name and revert temporary testing changes

810dc25

safely encode error message

d8b1601

fix precommit issues

55a2966

minor cleanup

b87a752

Merge branch 'main' into nmailhot/final-error-log

a2c2469

Merge branch 'main' into nmailhot/final-error-log

c147af1

use trap and remove excess error injections

2baf326

nv-nmailhot requested a review from a team as a code owner December 1, 2025 20:40

pull-request-size bot added the size/XL label Dec 1, 2025

github-actions bot added the feat label Dec 1, 2025

coderabbitai bot reviewed Dec 1, 2025

View reviewed changes

add intended error to test

d4e166a

copy-pr-bot bot temporarily deployed to GITLAB December 2, 2025 15:48 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 2, 2025 15:49 Inactive
