[User Story] CI Health: Redefining CI investigations and Health #75243

Open
16 of 28 tasks
hoyosjs opened this issue Sep 8, 2022 · 6 comments
Labels: area-Infrastructure, User Story (a single user-facing feature; can be grouped under an epic)
Milestone: 8.0.0

Comments

@hoyosjs
Member

hoyosjs commented Sep 8, 2022

[User Story] CI Health: Redefining CI investigations and Merge on Green

The purpose of this issue is to document the different work streams happening throughout the runtime to improve the UX and reliability
of the CI system. The main goal is to help developers feel productive while keeping product risk low. The aim of this project is to
achieve a system where 80%+ of all PRs are merged with a green "Build Analysis" check, with all issues understood, and to lower
the time wasted in repetitive investigation of known issues.
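
As a rough illustration of the 80%+ target, the share of merged PRs with a green "Build Analysis" check could be computed as below. This is only a sketch: the `MergedPR` type, its `build_analysis_green` field, and the sample data are hypothetical and do not reflect any real AzDO or GitHub API.

```python
from dataclasses import dataclass


@dataclass
class MergedPR:
    number: int                 # hypothetical PR number
    build_analysis_green: bool  # hypothetical: was the check green at merge time?


def green_merge_rate(prs):
    """Return the fraction of merged PRs whose Build Analysis check was green."""
    if not prs:
        return 0.0
    return sum(pr.build_analysis_green for pr in prs) / len(prs)


# Made-up sample data, for illustration only.
prs = [
    MergedPR(1, True),
    MergedPR(2, True),
    MergedPR(3, False),
    MergedPR(4, True),
    MergedPR(5, True),
]
rate = green_merge_rate(prs)
print(f"{rate:.0%} merged green; target met: {rate >= 0.80}")
```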

The work streams are roughly:

  • 1. Issues can be easily searched for throughout the different components of a PR to reason about failures:

    • 1.1 Build issue search within AzDO has been deployed.
    • 1.2 [Owner: DevWF] Helix test log searching. Rolled out, and the tab identifies issues, but issue counts are not accurate yet and it doesn't properly update the failure table on the tracking issue.
  • 2. It's easy to report issues directly from the Build Analysis check tab:

    • 2.1 Build issues are easy to report as infrastructure issues, with retry capability for issues like AzDO feeds.
    • 2.2 Test issues are easy to report from the failed build. This includes all relevant information and all the end user has to do is provide identifiable information for automation to find the correct issue.
    • 2.3 [Owner: DevWF] Issues should contain an accurate accounting of occurrences, as this helps teams prioritize impactful issues. The table is still missing the source of each hit (i.e. a PR backlink) and an accurate count of hits over a sliding window.
  • 3. Update docs to account for opening issues, assessing if an issue is known, and how to proceed if issues are found:

  • 4. Tests should have failures logged in a format that Build Analysis can easily reason about and surface in the check tab:

  • 5. Redefine merge on red: Make build analysis the definition for merge on red

    • 5.1 [Owner: DevWF] Turning 'Build Analysis' into a required check requires:
      • 5.1.1 Reporting an issue should rerun the check against it to move it to the known column.
      • 5.1.2 Correlating an issue manually is possible (even if undesirable) to unblock merging.
      • 5.1.3 Re-running a check is necessary to some extent - otherwise PRs need to wait 1+ hours for DWV to rerun.
    • 5.2 [Owner: Runtime/DevWF] Define a metric that measures how successful this new definition is at helping people quickly distinguish errors caused by their PRs from known issues.
    • 5.3 [Owner: Runtime] Find a way to help people discover this definition easily: if all failures are known issues, it should be obvious to the end user that they can merge. Specify in documentation to mark this as completed.
    • 5.4 [Owner: Runtime/DevWF] Define a mechanism to study which failures need hardening and which issues should be invested in. The dashboard could surface
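
To make the idea in items 5.1 to 5.3 concrete, here is a minimal sketch of how failures might be sorted into "known" and "unknown" columns, with merging unblocked only when nothing is unknown. It is not how Build Analysis is actually implemented: the `KnownIssue` type, the regex-based matching, the issue numbers, and the log lines are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class KnownIssue:
    issue_id: int  # hypothetical tracking-issue number
    pattern: str   # hypothetical regex matched against failure log text


def classify(failure_logs, known_issues):
    """Split failure logs into those matching a known issue and the rest."""
    known, unknown = {}, []
    for log in failure_logs:
        # The first known issue whose pattern matches claims the failure.
        match = next((k for k in known_issues if re.search(k.pattern, log)), None)
        if match is not None:
            known.setdefault(match.issue_id, []).append(log)
        else:
            unknown.append(log)
    return known, unknown


def can_merge(failure_logs, known_issues):
    """Assumed rule: the check is green when every failure is a known issue."""
    _, unknown = classify(failure_logs, known_issues)
    return not unknown


# Made-up known issues and log lines, for illustration only.
known_issues = [
    KnownIssue(1001, r"feed .* timed out"),
    KnownIssue(1002, r"Helix work item .* crashed"),
]
logs = [
    "restore failed: feed dotnet-public timed out",
    "Assert.Equal() Failure in MyNewTest",
]
known, unknown = classify(logs, known_issues)
print(known)                          # failures grouped by known issue
print(unknown)                        # failures that still need investigation
print(can_merge(logs, known_issues))  # False while any failure is unknown
```

Reporting a new issue with a matching pattern and re-running the classification moves a failure from the unknown list to the known one, which is the behavior item 5.1.1 asks for.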

Future Work

  • Adding crashdump and hang dump in Libraries tests
  • V1 & V2 test system: Enable crash collection on macOS (e.g., Singlefile, exception handling)
  • V2 test system: Hang dump collection and integrate symbolication from V1
    • 4.3 [Moved to Future Item] [Owner: Runtime] Ensure timeouts and hang dumps are properly handled in the new testing system, and that they are surfaced in a way build analysis can upload them.
  • Move from ASP.NET to dotnet/arcade (repo with all the shared infrastructure) for test level retry
  • No crashdump and hang dump support for mono and wasm. They don't have a good crash mechanism yet. @SamMonoRT @lewing @BrzVlad

cc: @JulieLeeMSFT @tommcdon @markwilkie

cc: @AlitzelMendez @missymessa @ulisesh @ChadNedzlek

@hoyosjs added the area-Infrastructure and User Story labels Sep 8, 2022
@hoyosjs added this to the 8.0.0 milestone Sep 8, 2022
@ghost

ghost commented Sep 8, 2022

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Issue Details

[User Story] CI Health: Redefining CI investigations and Merge on Green

The purpose of this issue is to document the different work streams happening throughout the runtime to improve the UX and reliability
of the CI system. The main goal is to help developers feel productive while keeping product risk low. The aim of this project is to
achieve a system where 80%+ of all PRs are merged with a green "Build Analysis" check, with all issues understood, and to lower
the time wasted in repetitive investigation of known issues.

The work streams are roughly:

  • Issues can be easily searched for throughout the different components of a PR:

    • Build issue search within AzDO has been deployed.
    • Helix test log searching. Rolled out, and the tab identifies issues, but issue counts are not accurate yet and it doesn't properly update
  • It's easy to report issues directly from the Build Analysis check tab

    • Build issues are easy to report as infrastructure issues, with retry capability for issues like AzDO feeds.
    • Test issues are easy to report from the failed build. This includes all relevant information and all the end user has to do is provide identifiable information for automation to find the correct issue.
    • Issues should contain an accurate accounting of occurrences, as this helps teams prioritize impactful issues. The table is still missing the source of each hit (i.e. a PR backlink) and an accurate count of hits over a sliding window.
  • Update docs to account for opening issues, assessing if an issue is known, and how to proceed if issues are found:

  • Tests should have failures logged in a format that :

    • The legacy system (xUnit) is not properly surfacing asserts. Ensure that StdErr for the child process is properly being redirected as much as possible.
    • Ensure the new source-generated testing framework allows for proper attribution at the test level. This includes an analysis of catastrophe style issues that are now reported as workitem failures. @davidwrighton was taking a cursory look at this.
    • Ensure timeouts and hang dumps are properly handled in the new testing system and reported in a way build analysis can surface them.
  • Redefine merge on red: Make build analysis the definition for merge on red

    • Turning 'Build Analysis' into a required check requires:
      • Reporting an issue should rerun the check against it to move it to the known column.
      • Correlating an issue manually is possible (even if undesirable) to unblock merging.
      • Re-running a check is necessary to some extent - otherwise PRs need to wait 1+ hours for DWV to rerun.
    • Define a metric that measures how successful this new definition is at helping people quickly distinguish errors caused by their PRs from known issues.
    • Find a way to help people discover this definition easily: if all failures are known issues, it should be obvious to the end user that they can merge.
    • Define a mechanism to study which failures need hardening and which issues should be invested in. The dashboard could surface

cc: @JulieLeeMSFT @tommcdon @markwilkie

cc: @AlitzelMendez @missymessa @ulisesh @ChadNedzlek

Author: hoyosjs
Assignees: -
Labels: area-Infrastructure, User Story

Milestone: 8.0.0

@JulieLeeMSFT
Member

cc @jeffschwMSFT @mangod9.

@danmoseley
Member

Ensure timeouts and hang dumps are properly handled in the new testing system,

Does this include making sure hangs lead to dumps? Or do we believe that's already the case? (I wasn't aware.) I do agree this would really help a category of test failures that aren't currently actionable.

@hoyosjs
Member Author

hoyosjs commented Sep 15, 2022

@danmoseley This happens for coreclr tests. Libraries tests have no provision for this, other than dotnet-test based runs, and I think there were reasons not to move to that?

@danmoseley
Member

Ah. Yes, for moving to dotnet-test I think we discussed that we need some more lightweight runner due to being bottom of the stack. @ViktorHofer do we have anything like that on the backlog still?

@ViktorHofer
Member

@ViktorHofer do we have anything like that on the backlog still?

I filed microsoft/vstest#3595 for that a while ago. We basically need a way to run our tests in-proc with a minimal set of dependencies.
