-
Notifications
You must be signed in to change notification settings - Fork 725
feat: error annotations with trap #4672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThis pull request introduces a Python script for extracting and analyzing log errors with optional LogAI integration and regex fallback, and updates the container validation workflow to centralize log handling, implement robust error diagnostics capture on failures, and enhance failure reporting with extracted error details. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (4)
.github/scripts/extract_log_errors.py (3)
84-86: Usetempfilemodule instead of hardcoded/tmppath.Using a fixed path
/tmp/analysis.logcan cause race conditions if multiple CI jobs run concurrently on the same runner, and may fail in restricted environments.+import tempfile + # Write log content to a temporary file for LogAI processing -temp_log = Path("/tmp/analysis.log") -temp_log.write_text(self.log_content) +with tempfile.NamedTemporaryFile(mode='w', suffix='.log', delete=False) as f: + f.write(self.log_content) + temp_log = Path(f.name)Don't forget to clean up the temp file after use, or use
delete=Trueif the file isn't needed after processing.
130-136: Avoid catching bareException.Catching
Exceptionis too broad and can mask unexpected errors likeKeyboardInterruptorSystemExit. Consider catching more specific exceptions.-except Exception as e: +except (ValueError, AttributeError, TypeError, OSError) as e: # Log to stderr but don't let it pollute the main output print( f"LogAI extraction failed: {e}, falling back to regex extraction", file=sys.stderr, ) return []
196-228: Consider caching extracted errors to avoid redundant processing.Both
get_summary()andget_primary_error()callextract_errors()internally. When both are used (as inmain()), errors are extracted twice. For a CI utility this is acceptable, but caching would improve efficiency.def __init__(self, log_file: Path): self.log_file = log_file self.log_content = "" + self._cached_errors = None if log_file.exists(): with open(log_file, "r", encoding="utf-8", errors="ignore") as f: self.log_content = f.read() def extract_errors(self) -> List[Dict[str, Any]]: """Extract errors using LogAI first, then fallback.""" + if self._cached_errors is not None: + return self._cached_errors + if not self.log_content: return [] # ... rest of method + self._cached_errors = errors + return self._cached_errors.github/workflows/container-validation-backends.yml (1)
961-997: Same issues as deploy-operator annotation: unused variable and hardcoded line numbers.Line 962 creates
ERROR_MESSAGE_JSONwhich is unused, and lines 990-991 have hardcoded line numbers (593-807) that will become stale.Additionally, this annotation logic is duplicated between deploy-operator and deploy-test steps. Consider extracting this into a reusable composite action or shared script to reduce maintenance burden.
- # Safely encode ERROR_MESSAGE for JSON using jq - ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg') - # Create a check run with the annotation
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
.github/scripts/extract_log_errors.py(1 hunks).github/workflows/container-validation-backends.yml(5 hunks)
🧰 Additional context used
🪛 actionlint (1.7.9)
.github/workflows/container-validation-backends.yml
648-648: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
906-906: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 Ruff (0.14.6)
.github/scripts/extract_log_errors.py
14-14: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
15-15: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
33-51: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
85-85: Probable insecure usage of temporary file or directory: "/tmp/analysis.log"
(S108)
130-130: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
.github/workflows/container-validation-backends.yml (2)
551-575: Well-structured error handler with diagnostic capture.The error handler pattern captures useful Kubernetes diagnostics (pod status, events, deployments, helm status) on failure, which will significantly improve debugging. The trap on ERR and preservation of the original exit code is correctly implemented.
875-902: Good use of trap withcontinue-on-errorfor failure diagnostics.The pattern of using
trap 'handle_error' ERRcombined withcontinue-on-error: trueallows the workflow to capture diagnostics on any failure while still proceeding to the log analysis and annotation steps. This provides better failure visibility.
| - name: Setup Python for Log Analysis | ||
| if: always() && steps.deploy-operator-step.outcome == 'failure' | ||
| uses: actions/setup-python@v4 | ||
| with: | ||
| python-version: '3.10' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Update actions/setup-python to v5.
The actions/setup-python@v4 action is outdated. Update to v5 to ensure compatibility with current GitHub Actions runners.
- name: Setup Python for Log Analysis
if: always() && steps.deploy-operator-step.outcome == 'failure'
- uses: actions/setup-python@v4
+ uses: actions/setup-python@v5
with:
python-version: '3.10'Note: The same change should be applied at line 906 for the test failure log analysis step.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: Setup Python for Log Analysis | |
| if: always() && steps.deploy-operator-step.outcome == 'failure' | |
| uses: actions/setup-python@v4 | |
| with: | |
| python-version: '3.10' | |
| - name: Setup Python for Log Analysis | |
| if: always() && steps.deploy-operator-step.outcome == 'failure' | |
| uses: actions/setup-python@v5 | |
| with: | |
| python-version: '3.10' |
🧰 Tools
🪛 actionlint (1.7.9)
648-648: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🤖 Prompt for AI Agents
In .github/workflows/container-validation-backends.yml around lines 646-650 (and
also at line ~906 for the test failure log analysis step), the workflow uses
actions/setup-python@v4 which is outdated; update both occurrences to
actions/setup-python@v5 by changing the action reference from
actions/setup-python@v4 to actions/setup-python@v5 while keeping the same with:
python-version entries so the workflow uses the current setup-python v5
implementation.
| # Safely encode ERROR_MESSAGE for JSON using jq | ||
| ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg') | ||
| # Create a check run with the annotation | ||
| CHECK_RUN_ID=$(curl -s -X POST \ | ||
| -H "Authorization: token $GITHUB_TOKEN" \ | ||
| -H "Accept: application/vnd.github.v3+json" \ | ||
| "https://api.github.com/repos/${{ github.repository }}/check-runs" \ | ||
| -d "$(jq -n \ | ||
| --arg name "Deploy Operator" \ | ||
| --arg sha "${{ github.sha }}" \ | ||
| --arg namespace "${NAMESPACE}" \ | ||
| --arg error_msg "$ERROR_MESSAGE" \ | ||
| --arg repo "${{ github.repository }}" \ | ||
| --arg run_id "${{ github.run_id }}" \ | ||
| '{ | ||
| "name": $name, | ||
| "head_sha": $sha, | ||
| "status": "completed", | ||
| "conclusion": "failure", | ||
| "output": { | ||
| "title": "Operator Deployment Failed (LogAI Analysis)", | ||
| "summary": ("Failed to deploy dynamo-platform operator to namespace " + $namespace), | ||
| "text": ("**Job**: deploy-operator\n**Namespace**: " + $namespace + "\n**Analysis Method**: LogAI\n\n**Error Details**:\n```\n" + $error_msg + "\n```\n\n[View Job Run](https://github.com/" + $repo + "/actions/runs/" + $run_id + ")"), | ||
| "annotations": [{ | ||
| "path": ".github/workflows/container-validation-backends.yml", | ||
| "start_line": 357, | ||
| "end_line": 425, | ||
| "annotation_level": "failure", | ||
| "message": $error_msg, | ||
| "title": "Operator deployment failed (LogAI)" | ||
| }] | ||
| } | ||
| }')" | jq -r '.id') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded line numbers in annotations will become stale.
The annotation references hardcoded line numbers (357-425 on lines 737-738). These will become incorrect as the workflow file is modified, causing annotations to point to the wrong locations.
Also, ERROR_MESSAGE_JSON on line 712 is created but never used—the jq command uses $error_msg directly via --arg.
Consider either:
- Removing the hardcoded line numbers and using a more generic annotation approach
- Adding a comment noting these need to be updated when the file structure changes
- Removing the unused
ERROR_MESSAGE_JSONvariable
- # Safely encode ERROR_MESSAGE for JSON using jq
- ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg')
-
# Create a check run with the annotation📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Safely encode ERROR_MESSAGE for JSON using jq | |
| ERROR_MESSAGE_JSON=$(jq -n --arg msg "$ERROR_MESSAGE" '$msg') | |
| # Create a check run with the annotation | |
| CHECK_RUN_ID=$(curl -s -X POST \ | |
| -H "Authorization: token $GITHUB_TOKEN" \ | |
| -H "Accept: application/vnd.github.v3+json" \ | |
| "https://api.github.com/repos/${{ github.repository }}/check-runs" \ | |
| -d "$(jq -n \ | |
| --arg name "Deploy Operator" \ | |
| --arg sha "${{ github.sha }}" \ | |
| --arg namespace "${NAMESPACE}" \ | |
| --arg error_msg "$ERROR_MESSAGE" \ | |
| --arg repo "${{ github.repository }}" \ | |
| --arg run_id "${{ github.run_id }}" \ | |
| '{ | |
| "name": $name, | |
| "head_sha": $sha, | |
| "status": "completed", | |
| "conclusion": "failure", | |
| "output": { | |
| "title": "Operator Deployment Failed (LogAI Analysis)", | |
| "summary": ("Failed to deploy dynamo-platform operator to namespace " + $namespace), | |
| "text": ("**Job**: deploy-operator\n**Namespace**: " + $namespace + "\n**Analysis Method**: LogAI\n\n**Error Details**:\n```\n" + $error_msg + "\n```\n\n[View Job Run](https://github.com/" + $repo + "/actions/runs/" + $run_id + ")"), | |
| "annotations": [{ | |
| "path": ".github/workflows/container-validation-backends.yml", | |
| "start_line": 357, | |
| "end_line": 425, | |
| "annotation_level": "failure", | |
| "message": $error_msg, | |
| "title": "Operator deployment failed (LogAI)" | |
| }] | |
| } | |
| }')" | jq -r '.id') | |
| # Create a check run with the annotation | |
| CHECK_RUN_ID=$(curl -s -X POST \ | |
| -H "Authorization: token $GITHUB_TOKEN" \ | |
| -H "Accept: application/vnd.github.v3+json" \ | |
| "https://api.github.com/repos/${{ github.repository }}/check-runs" \ | |
| -d "$(jq -n \ | |
| --arg name "Deploy Operator" \ | |
| --arg sha "${{ github.sha }}" \ | |
| --arg namespace "${NAMESPACE}" \ | |
| --arg error_msg "$ERROR_MESSAGE" \ | |
| --arg repo "${{ github.repository }}" \ | |
| --arg run_id "${{ github.run_id }}" \ | |
| '{ | |
| "name": $name, | |
| "head_sha": $sha, | |
| "status": "completed", | |
| "conclusion": "failure", | |
| "output": { | |
| "title": "Operator Deployment Failed (LogAI Analysis)", | |
| "summary": ("Failed to deploy dynamo-platform operator to namespace " + $namespace), | |
| "text": ("**Job**: deploy-operator\n**Namespace**: " + $namespace + "\n**Analysis Method**: LogAI\n\n**Error Details**:\n |
🤖 Prompt for AI Agents
In .github/workflows/container-validation-backends.yml around lines 711 to 744,
the check-run annotation embeds hardcoded start_line/end_line (357-425) that
will become stale and also defines ERROR_MESSAGE_JSON at line 712 which is never
used; remove the hardcoded line numbers from the annotation payload (omit
start_line and end_line or replace with a generic approach such as no file
location or pointing to the workflow file without line numbers) and either
delete the unused ERROR_MESSAGE_JSON variable or actually use it when building
the jq payload (pass the pre-encoded JSON-safe error text into the jq invocation
instead of the raw $ERROR_MESSAGE).
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.