[User Story] CI Health: Redefining CI investigations and Health #75243

hoyosjs · 2022-09-08T06:48:27Z · edited by JulieLeeMSFT

[User Story] CI Health: Redefining CI investigations and Merge on Green

The purpose of this issue is to document the different work streams happening throughout the runtime to improve the UX and reliability
of the CI system. The main goal is to help developers feel productive while keeping product risk low. The aim of this project is to
achieve a system where 80%+ of all PRs are merged with a green "Build Analysis" check, with all issues understood, while lowering
the time wasted in repetitive investigation of known issues.

The work streams are roughly:

Future Work

Adding crashdump and hang dump in Libraries tests
V1 & V2 test system: Enable crash collection on macOS (e.g., Singlefile, exception handling)
V2 test system: Hang dump collection and integrate symbolication from V1
- 4.3 [Moved to Future Item] [Owner: Runtime] Ensure timeouts and hang dumps are properly handled in the new testing system, and that they are surfaced in a way build analysis can upload them.
Move from ASP.NET to dotnet/arcade (repo with all the shared infrastructure) for test level retry
No crashdump and hang dump support for mono and wasm. They don't have a good crash mechanism yet. @SamMonoRT @lewing @BrzVlad
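
To illustrate the hang dump item above, here is a minimal sketch of a watchdog that runs a test process with a timeout and gives a dump collector a chance to run before the process is killed. It is not how the V1 or V2 test systems work; `run_with_hang_dump` and the `collect_dump` callback are hypothetical names, and the callback stands in for whatever dump tool a real runner would invoke.

```python
import subprocess
import sys


def run_with_hang_dump(cmd, timeout_s, collect_dump):
    """Run cmd; if it exceeds timeout_s, call collect_dump(pid), then kill it.

    Returns (exit_code, hung). collect_dump is a hypothetical hook that
    stands in for a real dump collection tool.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        try:
            # The process is still alive here, so a dump can capture its state.
            collect_dump(proc.pid)
        finally:
            proc.kill()
            proc.wait()  # reap the killed process
        return proc.returncode, True


# A child that sleeps far longer than the timeout simulates a hung test.
dumped_pids = []
exit_code, hung = run_with_hang_dump(
    [sys.executable, "-c", "import time; time.sleep(30)"],
    timeout_s=1.0,
    collect_dump=dumped_pids.append,
)

# A child that exits promptly is not treated as a hang.
ok_code, ok_hung = run_with_hang_dump(
    [sys.executable, "-c", "pass"],
    timeout_s=30.0,
    collect_dump=dumped_pids.append,
)
```

Collecting the dump before killing matters: once the process is gone there is nothing left to inspect, which is why hangs without dumps tend not to be actionable.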

cc: @JulieLeeMSFT @tommcdon @markwilkie

cc: @AlitzelMendez @missymessa @ulisesh @ChadNedzlek


ghost · 2022-09-08T06:48:32Z

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Issue Details

[User Story] CI Health: Redefining CI investigations and Merge on Green

The purpose of this issue is to document the different work streams happening throughout the runtime to improve the UX and reliability
of the CI system. The main goal is to help developers feel productive while keeping product risk low. The aim of this project is to
achieve a system where 80%+ of all PRs are merged with a green "Build Analysis" check, with all issues understood, while lowering
the time wasted in repetitive investigation of known issues.

The work streams are roughly:

Issues can be easily searched for throughout the different components of a PR:
- Build issue search within AzDO has been deployed.
- Helix test log searching has been rolled out and the tab identifies issues, but issue counts are not accurate yet and don't update properly.
It's easy to report issues directly from the Build Analysis check tab:
- Build issues are easy to report as infrastructure issues (for issues like AzDO feeds), with retry capability.
- Test issues are easy to report from the failed build. The report includes all relevant information, and all the end user has to do is provide identifying information for automation to find the correct issue.
- Issues should contain an accurate accounting of occurrences, as this helps teams prioritize impactful issues. The table is still missing a source (i.e. a PR backlink) and an accurate count of hits over a sliding window.
Update docs to account for opening issues, assessing if an issue is known, and how to proceed if issues are found:
- "Update documentation for analyzing PR failures" (#74615) largely achieved this work, but the docs still need to be updated for the opening-issues workflow that got enabled, as well as some of the timing expectations for the system.
Tests should have failures logged in a format that :
- The legacy system (xUnit) is not properly surfacing asserts. Ensure that StdErr for the child process is properly redirected wherever possible.
- Ensure the new source-generated testing framework allows for proper attribution at the test level. This includes an analysis of catastrophic-style issues that are now reported as work item failures. @davidwrighton was taking a cursory look at this.
- Ensure timeouts and hang dumps are properly handled in the new testing system and reported in a way build analysis can surface them.
Redefine merge on red: Make build analysis the definition for merge on red
- Turning 'Build Analysis' into a required check requires:
  - Reporting an issue should rerun the check against it to move it to the known column.
  - Correlating an issue manually should be possible (even if undesirable) to unblock merging.
  - Re-running a check is necessary to some extent; otherwise PRs need to wait 1+ hours for DWV to rerun.
- Define a metric that measures how successful this new definition is at helping people quickly distinguish errors caused by their PRs from known issues.
- Find a way to help people discover this definition easily: if all failures are known issues, it should be obvious to the end user that they can merge.
- Define a mechanism to study what failures need hardening and what issues should be invested in. The dashboard could surface
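
The known-issue workflow in the list above (a reporter supplies an identifying error string, automation matches it against failure logs, and each issue tracks hits over a sliding window with a PR backlink) might be sketched roughly as follows. This is only an illustration of the idea, not how Build Analysis is implemented; `KnownIssue`, `Hit`, `record_failure`, the sample error strings, the example URLs, and the 7-day window are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Hit:
    when: datetime
    pr_url: str  # backlink to the PR where the failure occurred


@dataclass
class KnownIssue:
    # Hypothetical model of a tracked issue; not the Build Analysis schema.
    title: str
    error_message: str  # identifying substring supplied by the reporter
    hits: list = field(default_factory=list)

    def matches(self, log_text: str) -> bool:
        return self.error_message in log_text

    def hits_in_window(self, now: datetime, window: timedelta) -> int:
        # Count only hits that fall inside the sliding window ending at `now`.
        return sum(1 for h in self.hits if now - window <= h.when <= now)


def record_failure(issues, log_text, pr_url, when):
    """Attach the failure to the first matching known issue, if any."""
    for issue in issues:
        if issue.matches(log_text):
            issue.hits.append(Hit(when, pr_url))
            return issue
    return None  # unknown failure: needs investigation or a new issue


issues = [KnownIssue("Feed restore failure", "Failed to retrieve information about")]
now = datetime(2022, 9, 15)

record_failure(issues, "error: Failed to retrieve information about 'X'",
               "https://example.invalid/pr/1", now - timedelta(days=10))
record_failure(issues, "error: Failed to retrieve information about 'Y'",
               "https://example.invalid/pr/2", now - timedelta(days=1))
unknown = record_failure(issues, "Assert.Equal() Failure",
                         "https://example.invalid/pr/3", now)

recent_hits = issues[0].hits_in_window(now, timedelta(days=7))
total_hits = len(issues[0].hits)
```

Keeping the timestamp and PR backlink on every hit is what makes both the windowed count and the source column possible; a bare counter would lose that information.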

cc: @JulieLeeMSFT @tommcdon @markwilkie

cc: @AlitzelMendez @missymessa @ulisesh @ChadNedzlek

Author:	hoyosjs
Assignees:	-
Labels:	`area-Infrastructure`, `User Story`
Milestone:	8.0.0

JulieLeeMSFT · 2022-09-08T21:47:18Z

cc @jeffschwMSFT @mangod9.

danmoseley · 2022-09-15T18:31:43Z

> Ensure timeouts and hang dumps are properly handled in the new testing system,

Does this include making sure hangs lead to dumps? Or we believe that's now the case (I wasn't aware). I do agree this would really help a category of test failures that aren't currently actionable.

hoyosjs · 2022-09-15T20:03:34Z

@danmoseley This happens for coreclr tests. Libraries tests have no provision for this, other than dotnet-test based runs and I think there were reasons not to move to that?

danmoseley · 2022-09-15T21:59:49Z

Ah. Yes, for moving to dotnet-test I think we discussed that we need some more lightweight runner due to being bottom of the stack. @ViktorHofer do we have anything like that on the backlog still?

ViktorHofer · 2022-09-16T10:25:47Z

> @ViktorHofer do we have anything like that on the backlog still?

I filed microsoft/vstest#3595 for that a while ago. We basically need a way to run our tests in-proc with a minimal set of dependencies.

hoyosjs added the area-Infrastructure and User Story labels Sep 8, 2022

hoyosjs added this to the 8.0.0 milestone Sep 8, 2022

kunalspathak mentioned this issue Oct 31, 2022: Produce crashreport.json and use llvm-symbolizer to create stack trace #77578 (Merged)

ivdiazsa mentioned this issue Nov 9, 2022: Add PyYAML 5.3.1 as a dependency to Helix machines running the Python scripts dotnet/arcade#11559 (Closed, 3 tasks)

ivdiazsa mentioned this issue Dec 1, 2022: Merge-on-Red: Implemented YAML log reader alongside the XML ones dotnet/arcade#11807 (Closed)

kunalspathak mentioned this issue Dec 7, 2022: Install public version of windbg/cdb on windows helix machines dotnet/arcade#11868 (Open, 5 tasks)

agocke modified the milestones: 8.0.0, 9.0.0 Sep 5, 2023

agocke added this to Runtime Infra Jul 8, 2024

agocke modified the milestones: 9.0.0, 10.0.0 Aug 22, 2024
