51 commits
0365ee9
Add an azdo failure skill
lewing Feb 2, 2026
3c3adec
Improve Get-HelixFailures.ps1 script
lewing Feb 2, 2026
c040d21
Fix PowerShell variable scope issue
lewing Feb 2, 2026
e5f365b
Add build error extraction and failure classification
lewing Feb 2, 2026
2bdfcc5
Additional improvements to failure analysis skill
lewing Feb 2, 2026
80911b5
Fix URL parsing issues in Get-HelixFailures.ps1
lewing Feb 2, 2026
94a8925
Add Docker image pull failure pattern to classification
lewing Feb 2, 2026
b9144e8
Merge best features from both skill PRs
lewing Feb 2, 2026
53724c2
Add support for local test failures and test run URL extraction
lewing Feb 2, 2026
eeec65c
Add Azure DevOps CLI support for fetching failed test names
lewing Feb 3, 2026
3c86381
Include Helix console log links in failure output
lewing Feb 3, 2026
f501745
Add known issue search using Build Analysis label
lewing Feb 3, 2026
983b399
Document known issue search feature
lewing Feb 3, 2026
3c1f01b
Add build links alongside log URLs for failed jobs
lewing Feb 3, 2026
4bc9376
Add request caching for faster repeated analysis
lewing Feb 3, 2026
35b4f7b
Show build status (in-progress/completed) in output
lewing Feb 3, 2026
271fbca
Add cache cleanup (-ClearCache parameter and auto-cleanup on startup)
lewing Feb 3, 2026
c3268d3
Add cross-platform temp directory detection
lewing Feb 3, 2026
a71d363
Improve error handling and caching behavior
lewing Feb 3, 2026
be505ff
Be more conservative about transient failure guidance
lewing Feb 3, 2026
7a2ba86
Address PR review comments
lewing Feb 3, 2026
09e593c
Add guidance to read PR context before analyzing failures
lewing Feb 3, 2026
39840e3
Add build and log URLs to failure output
lewing Feb 3, 2026
2d1e21b
Refactor: organize script with regions and remove whitespace
lewing Feb 3, 2026
0bfa4aa
Improve C++/native build error detection
lewing Feb 3, 2026
d0525db
Analyze all failing builds for a PR, not just the first
lewing Feb 3, 2026
1ed1241
Add Build Analysis check for known issues
lewing Feb 3, 2026
b4defa5
Update SKILL.md with Build Analysis and multi-build docs
lewing Feb 3, 2026
ebf291b
Fix: use explicit default value for Context parameter
lewing Feb 3, 2026
021de3e
Add PR change correlation for failure analysis
lewing Feb 3, 2026
22ec99f
Update SKILL.md examples to include links
lewing Feb 3, 2026
2158808
Simplify skill: reduce cache TTL to 30s, remove severity classification
lewing Feb 3, 2026
bed5862
Add MihuBot semantic search integration for related issues
lewing Feb 3, 2026
ab7e37b
Highlight binlog artifacts and add MSBuild analysis guidance
lewing Feb 3, 2026
bacfb86
Restructure skill following Anthropic best practices
lewing Feb 4, 2026
83e5ace
Add canceled job detection and smart retry recommendations
lewing Feb 4, 2026
b5b1caa
Generalize skill examples for all dotnet repositories
lewing Feb 4, 2026
0f02fda
Address PR review comments
lewing Feb 4, 2026
d2e6302
Address additional PR review comments
lewing Feb 4, 2026
b98b0c0
Improve known issue search for local test failures
lewing Feb 4, 2026
8e43185
Fix indentation in test failure extraction block
lewing Feb 4, 2026
f789e9e
Address PR review comments (batch 3)
lewing Feb 4, 2026
3499f2f
Fix artifact file property name (Name -> FileName)
lewing Feb 4, 2026
60d1419
Add helix-artifacts.md reference documentation
lewing Feb 4, 2026
5fe1450
Simplify helix-artifacts.md - focus on patterns not specifics
lewing Feb 4, 2026
ff97b6d
Address Copilot review: security and robustness improvements
lewing Feb 4, 2026
a07dc98
Use relative paths in skill documentation
lewing Feb 4, 2026
af34913
Address remaining code quality review items
lewing Feb 4, 2026
da7a287
Fix documentation errors
lewing Feb 4, 2026
822621e
Improve azdo-helix-failures skill per Agent Skills spec
lewing Feb 4, 2026
319efed
Add guidance for reviewing facts before presenting conclusions
lewing Feb 4, 2026
118 changes: 118 additions & 0 deletions .github/skills/azdo-helix-failures/SKILL.md
@@ -0,0 +1,118 @@
---
name: azdo-helix-failures
description: Retrieve and analyze test failures from Azure DevOps builds and Helix test runs for dotnet repositories. Use when investigating CI failures, debugging failing PRs, or given URLs containing dev.azure.com or helix.dot.net.
---

# Azure DevOps and Helix Failure Analysis

Analyze CI test failures in Azure DevOps and Helix for dotnet repositories (runtime, sdk, aspnetcore, roslyn, and more).

## When to Use This Skill

Use this skill when:
- Investigating CI failures or checking why a PR's tests are failing
- Debugging Helix test issues or analyzing build errors
- Given URLs containing `dev.azure.com`, `helix.dot.net`, or GitHub PR links with failing checks
- Asked questions like "why is this PR failing", "analyze the CI failures", or "what's wrong with this build"

## Quick Start

```powershell
# Analyze PR failures (most common) - defaults to dotnet/runtime
./scripts/Get-HelixFailures.ps1 -PRNumber 123445 -ShowLogs

# Analyze by build ID
./scripts/Get-HelixFailures.ps1 -BuildId 1276327 -ShowLogs

# Query specific Helix work item
./scripts/Get-HelixFailures.ps1 -HelixJob "4b24b2c2-..." -WorkItem "System.Net.Http.Tests"

# Other dotnet repositories
./scripts/Get-HelixFailures.ps1 -PRNumber 12345 -Repository "dotnet/aspnetcore"
./scripts/Get-HelixFailures.ps1 -PRNumber 67890 -Repository "dotnet/sdk"
./scripts/Get-HelixFailures.ps1 -PRNumber 11111 -Repository "dotnet/roslyn"
```

## Key Parameters

| Parameter | Description |
|-----------|-------------|
| `-PRNumber` | GitHub PR number to analyze |
| `-BuildId` | Azure DevOps build ID |
| `-ShowLogs` | Fetch and display Helix console logs |
| `-Repository` | Target repo (default: dotnet/runtime) |
| `-MaxJobs` | Max failed jobs to show (default: 5) |
| `-SearchMihuBot` | Search MihuBot for related issues |

## What the Script Does

1. Fetches Build Analysis for known issues
2. Gets failed jobs from Azure DevOps timeline
3. **Separates canceled jobs from failed jobs** (canceled jobs usually result from earlier-stage failures or timeouts)
4. Extracts Helix work item failures
5. Fetches console logs (with `-ShowLogs`)
6. Searches for known issues with "Known Build Error" label
7. Correlates failures with PR changes
8. **Provides smart retry recommendations**
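
Step 3 above can be pictured with a minimal sketch. It assumes the timeline has already been reduced to `name|result` lines, which is a hypothetical intermediate format and not the script's actual data structure. `classify_jobs` is an illustrative name; `failed` and `canceled` are result values that Azure DevOps reports for timeline records.

```shell
# Sketch only: separate canceled jobs from failed jobs.
# Input (assumed format): one "name|result" pair per line on stdin.
classify_jobs() {
  while IFS='|' read -r job_name job_result; do
    case "$job_result" in
      failed)   printf 'investigate: %s\n' "$job_name" ;;
      canceled) printf 'skip (canceled): %s\n' "$job_name" ;;
      # Any other result (e.g. succeeded) is ignored.
    esac
  done
}

# Job names below are made up for illustration.
printf '%s\n' \
  'linux-x64 Release Libraries|failed' \
  'osx-arm64 Debug Tests|canceled' \
  'windows-x64 Build|succeeded' | classify_jobs
```

Keeping canceled jobs in their own group means follow-up effort goes to the jobs that actually failed.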

## Interpreting Results

**Known Issues section**: Failures matching existing GitHub issues - these are tracked and being investigated.

**Canceled jobs**: Jobs that were canceled (not failed) due to earlier stage failures or timeouts. These don't need separate investigation.

**PR Change Correlation**: Files changed by the PR that also appear in failure output. These failures are likely PR-related.

**Build errors**: Compilation failures need code fixes.

**Helix failures**: Test failures on distributed infrastructure.

**Local test failures**: Some repos (e.g., dotnet/sdk) run tests directly on build agents. These can also match known issues - search for the test name with the "Known Build Error" label.

## Retry Recommendations

The script provides a recommendation at the end:

| Recommendation | Meaning |
|----------------|---------|
| **KNOWN ISSUES DETECTED** | Tracked issues found that may correlate with failures. Review details. |
| **LIKELY PR-RELATED** | Failures correlate with PR changes. Fix issues first. |
| **POSSIBLY TRANSIENT** | No clear cause - check main branch, search for issues. |
| **REVIEW REQUIRED** | Could not auto-determine cause. Manual review needed. |

## Analysis Workflow

1. **Read PR context first** - Check title, description, comments
2. **Run the script** with `-ShowLogs` for detailed failure info
3. **Check Build Analysis** - Known issues are safe to retry
4. **Correlate with PR changes** - Same files failing = likely PR-related
5. **Interpret patterns**:
   - Same error across many jobs → Real code issue
   - Device failures (iOS/Android/tvOS) → Often transient infrastructure
   - Docker/container image pull failures → Infrastructure issue
   - Network timeouts, "host not found" → Transient infrastructure
   - Test timeout but tests passed → Executor issue, not test failure
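
The pattern list above can be turned into a rough first-pass triage. The following is a sketch under stated assumptions: `triage_line` is a hypothetical helper, and the match strings are illustrative examples rather than the rules the script actually applies.

```shell
# Sketch only: rough triage of a single log line.
# Match strings are assumptions chosen to illustrate the patterns above.
triage_line() {
  case "$1" in
    *"host not found"*|*"timed out"*|*"Connection reset"*)
      echo "transient-infrastructure" ;;
    *"docker pull"*|*"manifest unknown"*|*"pull access denied"*)
      echo "container-infrastructure" ;;
    *"error CS"*|*"error MSB"*)
      echo "build-error" ;;
    *)
      echo "needs-review" ;;
  esac
}

triage_line "SocketException: host not found"
triage_line "Program.cs(12,5): error CS1002: ; expected"
```

A match here is only a hint. A line that falls through to `needs-review` still has to be read in context.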

## Presenting Results

The script provides a recommendation at the end, but this is based on heuristics and may be incomplete. Before presenting conclusions to the user:

1. Review the detailed failure information, not just the summary
2. Look for patterns the script may have missed (e.g., related failures across jobs)
3. Consider the PR context (what files changed, what the PR is trying to do)
4. Present findings with appropriate caveats - state what is known vs. uncertain
5. If the script's recommendation seems inconsistent with the details, trust the details

## References

- **Helix artifacts & binlogs**: See [references/helix-artifacts.md](references/helix-artifacts.md)
- **Manual investigation steps**: See [references/manual-investigation.md](references/manual-investigation.md)
- **AzDO/Helix details**: See [references/azdo-helix-reference.md](references/azdo-helix-reference.md)

## Tips

1. Read PR description and comments first for context
2. Check if same test fails on main branch before assuming transient
3. Look for `[ActiveIssue]` attributes for known skipped tests
4. Use `-SearchMihuBot` for semantic search of related issues
5. Binlogs in artifacts help diagnose MSB4018 task failures
93 changes: 93 additions & 0 deletions .github/skills/azdo-helix-failures/references/azdo-helix-reference.md
@@ -0,0 +1,93 @@
# Azure DevOps and Helix Reference

## Supported Repositories

The script works with any dotnet repository that uses Azure DevOps and Helix:

| Repository | Common Pipelines |
|------------|-----------------|
| `dotnet/runtime` | runtime, runtime-dev-innerloop, dotnet-linker-tests |
| `dotnet/sdk` | dotnet-sdk (mix of local and Helix tests) |
| `dotnet/aspnetcore` | aspnetcore-ci |
| `dotnet/roslyn` | roslyn-CI |
| `dotnet/maui` | maui-public |

Use `-Repository` to specify the target:
```powershell
./scripts/Get-HelixFailures.ps1 -PRNumber 12345 -Repository "dotnet/aspnetcore"
```

## Build Definition IDs (Example: dotnet/runtime)

Each repository has its own build definition IDs. Here are common ones for dotnet/runtime:

| Definition ID | Name | Description |
|---------------|------|-------------|
| `129` | runtime | Main PR validation build |
| `133` | runtime-dev-innerloop | Fast innerloop validation |
| `139` | dotnet-linker-tests | ILLinker/trimming tests |

**Note:** The script auto-discovers builds for a PR, so you rarely need to know definition IDs.

## Azure DevOps Organizations

**Public builds (default):**
- Organization: `dnceng-public`
- Project GUID: `cbb18261-c48f-4abb-8651-8cdcb5474649`

**Internal/private builds:**
- Organization: `dnceng`
- Project GUID: Varies by pipeline

Override with:
```powershell
./scripts/Get-HelixFailures.ps1 -BuildId 1276327 -Organization "dnceng" -Project "internal-project-guid"
```
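
To show how the organization and project values fit together, here is a minimal sketch that builds Azure DevOps URLs from them. The function names are hypothetical, and whether the script calls these exact endpoints is an assumption. The URL shapes follow the public Azure DevOps web and REST conventions.

```shell
# Sketch only: build Azure DevOps URLs from organization, project, and build ID.
# Arguments: $1 = organization, $2 = project name or GUID, $3 = build ID.
azdo_build_url() {
  printf 'https://dev.azure.com/%s/%s/_build/results?buildId=%s\n' "$1" "$2" "$3"
}

# REST endpoint that returns the build timeline (jobs, tasks, results).
azdo_timeline_api_url() {
  printf 'https://dev.azure.com/%s/%s/_apis/build/builds/%s/timeline?api-version=7.0\n' "$1" "$2" "$3"
}

azdo_build_url "dnceng-public" "cbb18261-c48f-4abb-8651-8cdcb5474649" 1276327
azdo_timeline_api_url "dnceng-public" "cbb18261-c48f-4abb-8651-8cdcb5474649" 1276327
```

Internal builds under `dnceng` generally require authentication, so an unauthenticated request to those URLs is likely to be rejected.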

## Common Pipeline Names (Example: dotnet/runtime)

| Pipeline | Description |
|----------|-------------|
| `runtime` | Main PR validation build |
| `runtime-dev-innerloop` | Fast innerloop validation |
| `dotnet-linker-tests` | ILLinker/trimming tests |
| `runtime-wasm-perf` | WASM performance tests |
| `runtime-libraries enterprise-linux` | Enterprise Linux compatibility |

Other repos have different pipelines - the script discovers them automatically from the PR.

## Useful Links

- [Helix Portal](https://helix.dot.net/): View Helix jobs and work items (all repos)
- [Helix API Documentation](https://helix.dot.net/swagger/): Swagger docs for Helix REST API
- [Build Analysis](https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/LandingPage.md): Known issues tracking (arcade infrastructure)
- [dnceng-public AzDO](https://dev.azure.com/dnceng-public/public/_build): Public builds for all dotnet repos

### Repository-Specific Docs
- [runtime: Triaging Failures](https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/triaging-failures.md)
- [runtime: Area Owners](https://github.com/dotnet/runtime/blob/main/docs/area-owners.md)

## Test Execution Types

### Helix Tests
Tests run on Helix distributed test infrastructure. The script extracts console log URLs and can fetch detailed failure info with `-ShowLogs`.

### Local Tests (Non-Helix)
Some repositories (e.g., dotnet/sdk) run tests directly on the build agent. The script detects these and extracts Azure DevOps Test Run URLs.

## Known Issue Labels

- `Known Build Error` - Used by Build Analysis across all dotnet repositories
- Search syntax: `repo:<owner>/<repo> is:issue is:open label:"Known Build Error" <test-name>`

Example searches:
```bash
# Search in runtime
gh issue list --repo dotnet/runtime --label "Known Build Error" --search "FileSystemWatcher"

# Search in aspnetcore
gh issue list --repo dotnet/aspnetcore --label "Known Build Error" --search "Blazor"

# Search in sdk
gh issue list --repo dotnet/sdk --label "Known Build Error" --search "template"
```
110 changes: 110 additions & 0 deletions .github/skills/azdo-helix-failures/references/helix-artifacts.md
@@ -0,0 +1,110 @@
# Helix Work Item Artifacts

Guide to finding and analyzing artifacts from Helix test runs.

## Accessing Artifacts

### Via the Script

Query a specific work item to see its artifacts:

```powershell
./scripts/Get-HelixFailures.ps1 -HelixJob "4b24b2c2-..." -WorkItem "Microsoft.NET.Sdk.Tests.dll.1" -ShowLogs
```

### Via API

```bash
# Get work item details including Files array
curl -s "https://helix.dot.net/api/2019-06-17/jobs/{jobId}/workitems/{workItemName}"
```

The `Files` array contains artifacts with `FileName` and `Uri` properties.

## Artifact Availability Varies

**Not all test types produce the same artifacts.** What you see depends on the repo, test type, and configuration:

- **Build/publish tests** (SDK, WASM) → Multiple binlogs
- **AOT compilation tests** (iOS/Android) → `AOTBuild.binlog` plus device logs
- **Standard unit tests** → Console logs only, no binlogs
- **Crash failures** (exit code 134) → Core dumps may be present

Always query the specific work item to see what's available rather than assuming a fixed structure.

## Common Artifact Patterns

| File Pattern | Purpose | When Useful |
|--------------|---------|-------------|
| `*.binlog` | MSBuild binary logs | AOT/build failures, MSB4018 errors |
| `console.*.log` | Console output | Always available, general output |
| `run-*.log` | XHarness execution logs | Mobile test failures |
| `device-*.log` | Device-specific logs | iOS/Android device issues |
| `dotnetTestLog.*.log` | dotnet test output | Test framework issues |
| `vstest.*.log` | VSTest output | aspnetcore/SDK test issues |
| `core.*`, `*.dmp` | Core dumps | Crashes, hangs |
| `testResults.xml` | Test results | Detailed pass/fail info |

Artifacts may be at the root level or nested in subdirectories like `xharness-output/logs/`.
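
The table above maps naturally onto a small lookup. The sketch below uses a hypothetical helper, `artifact_kind`, with made-up category labels. It matches on the base name so that nested paths are handled the same way as root-level files.

```shell
# Sketch only: map an artifact file name to the purpose listed in the table.
artifact_kind() {
  artifact_base=${1##*/}   # strip any leading directories
  case "$artifact_base" in
    *.binlog)             echo "msbuild-binlog" ;;
    console.*.log)        echo "console-output" ;;
    run-*.log)            echo "xharness-execution" ;;
    device-*.log)         echo "device-log" ;;
    dotnetTestLog.*.log)  echo "dotnet-test-output" ;;
    vstest.*.log)         echo "vstest-output" ;;
    core.*|*.dmp)         echo "core-dump" ;;
    testResults.xml)      echo "test-results" ;;
    *)                    echo "other" ;;
  esac
}

artifact_kind "xharness-output/logs/run-ios-device.log"
artifact_kind "AOTBuild.binlog"
```

Anything that lands in `other` is worth opening manually, since artifact sets differ by repo and test type.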

## Binlog Files

Binlogs are **only present for tests that invoke MSBuild** (build/publish tests, AOT compilation). Standard unit tests don't produce binlogs.

### Common Names

| File | Description |
|------|-------------|
| `build.msbuild.binlog` | Build phase |
| `publish.msbuild.binlog` | Publish phase |
| `AOTBuild.binlog` | AOT compilation |
| `msbuild.binlog` | General MSBuild operations |
| `msbuild0.binlog`, `msbuild1.binlog` | Per-test-run logs (numbered) |

### Analyzing Binlogs

**Online viewer (no download):**
1. Copy the binlog URI from the script output
2. Go to https://live.msbuildlog.com/
3. Paste the URL to load and analyze

**Download and view locally:**
```bash
# -L follows redirects, in case the file is served from a separate storage URL
curl -L -o build.binlog "https://helix.dot.net/api/jobs/{jobId}/workitems/{workItem}/files/build.msbuild.binlog?api-version=2019-06-17"
# Open with MSBuild Structured Log Viewer
```

**AI-assisted analysis:**
Use the MSBuild MCP server to analyze binlogs for errors and warnings.

## Core Dumps

Core dumps appear when tests crash (typically exit code 134 on Linux/macOS):

```
core.1000.34 # Format: core.{uid}.{pid}
```
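
If the user and process IDs are needed separately, the name can be split with plain parameter expansion. This is a small sketch assuming the `core.{uid}.{pid}` format shown above. `parse_core_name` is a hypothetical helper.

```shell
# Sketch only: split a core dump name of the form core.{uid}.{pid}.
parse_core_name() {
  core_rest=${1#core.}        # drop the "core." prefix
  core_uid=${core_rest%%.*}   # text before the first remaining dot
  core_pid=${core_rest#*.}    # text after the first remaining dot
  echo "uid=$core_uid pid=$core_pid"
}

parse_core_name "core.1000.34"   # prints: uid=1000 pid=34
```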

## Mobile Test Artifacts (iOS/Android)

Mobile device tests typically include XHarness orchestration logs:

- `run-ios-device.log` / `run-android.log` - Execution log
- `device-{machine}-*.log` - Device output
- `list-ios-device-*.log` - Device discovery
- `AOTBuild.binlog` - AOT compilation (when applicable)
- `*.crash` - iOS crash reports

## Finding the Right Work Item

1. Run the script with `-ShowLogs` to see Helix job/work item info
2. Look for lines like:
   ```
   Helix Job: 4b24b2c2-ad5a-4c46-8a84-844be03b1d51
   Work Item: Microsoft.NET.Sdk.Tests.dll.1
   ```
3. Query that specific work item for full artifact list
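
The steps above can be sketched as a helper that reads the two lines from the script output and builds the work item URL from the "Via API" section. It assumes the `Helix Job:` / `Work Item:` line format shown in step 2. `helix_workitem_url` is a hypothetical name.

```shell
# Sketch only: extract the job ID and work item name from script output
# and build the Helix work item details URL.
helix_workitem_url() {
  helix_job=$(printf '%s\n' "$1" | sed -n 's/^ *Helix Job: *//p' | head -n 1)
  helix_item=$(printf '%s\n' "$1" | sed -n 's/^ *Work Item: *//p' | head -n 1)
  # Fail if either value is missing from the input.
  [ -n "$helix_job" ] && [ -n "$helix_item" ] || return 1
  printf 'https://helix.dot.net/api/2019-06-17/jobs/%s/workitems/%s\n' "$helix_job" "$helix_item"
}

sample_output='Helix Job: 4b24b2c2-ad5a-4c46-8a84-844be03b1d51
Work Item: Microsoft.NET.Sdk.Tests.dll.1'
helix_workitem_url "$sample_output"
```

The resulting URL can be passed to `curl -s` to list the `Files` array. Work item names containing characters that are not URL-safe would need encoding first, which this sketch does not do.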

## Artifact Retention

Helix artifacts are retained for a limited time (typically 30 days). Download important artifacts promptly if needed for long-term analysis.