HTTP 500s from AzDO Test Results API #10358

Closed
MattGal opened this issue Aug 9, 2022 · 18 comments
Labels: Azure Dev Ops (Failures related to Azure Devops agents or pipelines), Critical, FC - Infrastructure (A build failure caused by apparent infrastructure failures.), Ops - First Responder

Comments

@MattGal
Member

MattGal commented Aug 9, 2022

Tracking IcM: https://portal.microsofticm.com/imp/v3/incidents/details/326663396/home

This seems to be impacting approximately 1 in 3 builds:

Query

image

@missymessa
Member

Should we turn this into a Known Issue?

@MattGal
Member Author

MattGal commented Aug 9, 2022

> Should we turn this into a Known Issue?

I'd say only if we can make those grep through the Helix (non-console) logs; otherwise it's indistinguishable from a crash or other non-test-failure failure.

@markwilkie
Member

> only if we can make those grep through the Helix (non-console) logs,

Which should be rolling out this week!

MattGal added a commit that referenced this issue Aug 11, 2022
* Fix versioning errors in workloads

* Disable TRX tests while reporting to AZDO is broken (#10358) (#10380)

Co-authored-by: Matt Galbraith <MattGal@users.noreply.github.com>
MattGal added a commit to MattGal/arcade that referenced this issue Aug 12, 2022
joeloff added a commit to joeloff/arcade that referenced this issue Aug 12, 2022
* Fix versioning errors in workloads

* Disable TRX tests while reporting to AZDO is broken (dotnet#10358) (dotnet#10380)

Co-authored-by: Matt Galbraith <MattGal@users.noreply.github.com>
mmitche pushed a commit that referenced this issue Aug 15, 2022
* Refactoring workload build tasks (#8645)

* Refactoring workload build tasks

* Fix source build and some random cleanup

* Updating tests, code cleanup

* Minor fixes, unit test conversion

* Mark tests as Windows only, fix missing content for Helix

* Hide WiX and test packages from Solution Explorer

* Fix duplicate publish items

* Fix link target for helix

* Fix link metadata for WiX

* Pass ICE suppressions to Light, more cleanup

* Fix file extraction for packs, add unit test for template pack MSI

* Pass ICE suppressions to Light (#9061)

* Create workload pack group installers (#9514)

* Remove duplicate PackageReference

* Create MSIs for workload pack groups

* Build NuGet wrapper packages for workload pack group MSIs

* Generate WorkloadPackGroups.json in manifest MSIs

* Add swix authoring for workload pack groups

* De-duplicate workload pack group creation

* Put braces around ProductCode and UpgradeCode registry values

* Write registry keys for pack groups

* Fix swix dependencies for pack groups

* Use correct GUID format when setting candle variables

* Add test for creating pack group dependency in SWR file

* Support building with missing workload packs (#9628)

* Support building with missing workload packs

* Include extracted manifest files in manifest MSI payload nupkg

* Fix versioning errors in workloads (#10363)

* Fix versioning errors in workloads

* Disable TRX tests while reporting to AZDO is broken (#10358) (#10380)

Co-authored-by: Matt Galbraith <MattGal@users.noreply.github.com>

* clean up, api changes

Co-authored-by: Daniel Plaisted <dsplaisted@gmail.com>
Co-authored-by: Matt Galbraith <MattGal@users.noreply.github.com>
@MattGal
Member Author

MattGal commented Aug 18, 2022

No updates on the IcM; the problem continues to occur sporadically for scenarios outside of the Arcade artificial TRX scenario.

@ChadNedzlek
Member

This problem started Aug 9 and has happened 3,000 times a day. Hopefully we can get some traction there. Given how "big" PRs are, even a small incidence of this bubbles up into a lot of failed PRs.

@ChadNedzlek
Member

Here's a chart showing how many jobs are impacted: [Jobs impacted]

It's nearly 300 builds a day; this is unacceptable and needs to be elevated.

MattGal added the Critical, Azure Dev Ops, and FC - Infrastructure labels Aug 18, 2022
@ChadNedzlek
Member

Quick chart (red is builds that crashed into this problem at least once):

image

@ChadNedzlek
Member

Updated the description to reflect the current severity.

@markwilkie
Member

Shouldn't this be a sev2? cc/ @Chrisboh

Is it possible to have a known issue tracking the builds affected? cc/ @ulisesh

thanks a ton @ChadNedzlek for figuring the impact here

@Chrisboh
Member

Yeah, Chad got us the data last night to confirm this is sev 2, and Stu is raising that now and getting on the bridge.

@ulisesh
Contributor

ulisesh commented Aug 19, 2022

> Shouldn't this be a sev2? cc/ @Chrisboh
>
> Is it possible to have a known issue tracking the builds affected? cc/ @ulisesh
>
> thanks a ton @ChadNedzlek for figuring the impact here

Unfortunately, the error happens in the Helix client, and test Known Issues is designed to identify problems in the tests.

garath self-assigned this Aug 20, 2022
@garath
Member

garath commented Aug 20, 2022

The team has evidence that the root cause is related to an incomplete fix for the problems described in #9865. They are rolling back the fix, which should resolve this issue but, unfortunately, bring back the original problem. They will continue to treat this as Sev 2.

@MattGal
Member Author

MattGal commented Aug 22, 2022

Error graph looks great today; the rollback seems to have helped.

image

@garath
Member

garath commented Aug 22, 2022

The rollback was successful; no hits on this over the weekend.

garath closed this as completed Aug 22, 2022
@MattGal
Member Author

MattGal commented Sep 15, 2022

This came back on 9/1/2022 and we didn't notice. Reopening (@Chrisboh for visibility)

MattGal reopened this Sep 15, 2022
@MattGal
Member Author

MattGal commented Sep 15, 2022

It's back in dnceng-public, so they asked me to file a new IcM, as they claim the root cause is different (we can't tell; we get 500s). Filed https://portal.microsofticm.com/imp/v3/incidents/details/335170304/home to track this.

@MattGal
Member Author

MattGal commented Sep 19, 2022

I think I see the actual issue; created #10916 to track this.

@MattGal
Member Author

MattGal commented Sep 19, 2022

Chad is pursuing #10916; closing this one in favor of that, as it's a new variation.

MattGal closed this as completed Sep 19, 2022
7 participants