Improve alerts for Build Analysis functionality #9078

missymessa · 2022-04-18T20:31:23Z

A couple of our alerts are probably not functioning in the best way possible, so let's figure out how to improve them:

Link to dashboard for reference: https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/buildAnalysis/build-analysis?orgId=1&from=now-2d&to=now

Test Reporting Services Monitoring

This is a tricky alert. We want to be notified of exceptions that occur in our test reporting services (three are currently running). However, I don't believe it's reasonable to expect the alert to self-resolve after 5 minutes when the services run on a schedule of every hour or every day. We should consider breaking this out into three separate alerts, one per service, or, at a minimum, change the evaluation to look at the result every hour instead of every minute for five minutes.
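To make the per-service idea concrete, here is a minimal sketch, not the actual Grafana configuration. It assumes each service gets its own look-back window sized to its schedule, so an exception stays visible until the next scheduled run could have happened. The service names, schedules, and the `firing_alerts` helper are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical services and run intervals; the real schedules are not stated in this issue.
SCHEDULES = {
    "service-a": timedelta(hours=1),
    "service-b": timedelta(hours=1),
    "service-c": timedelta(days=1),
}

def firing_alerts(exceptions, now):
    """Return the sorted names of services with an exception inside their own window.

    exceptions: iterable of (service_name, timestamp) pairs.
    """
    firing = set()
    for service, when in exceptions:
        window = SCHEDULES[service]
        # An exception counts only while it is newer than one schedule interval.
        if timedelta(0) <= now - when <= window:
            firing.add(service)
    return sorted(firing)
```

With this shape, an hourly service's exception ages out after an hour, while a daily service's exception stays visible for a day instead of resolving after 5 minutes.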

Service Hook Deliveries Alert

It doesn't look like this alert actually does what we want it to. The goal for these alerts is to tell us when Azure DevOps stops sending us events through the Service Hooks functionality, so that we can better detect issues like this: https://github.com/dotnet/core-eng/issues/15712

I'm not certain that's going to be possible, because we don't know whether we should be getting events at the moment we check for them (e.g. should we expect events on the weekend? Probably not), and our services are pretty far down the chain of services that ultimately receive this event. On top of that, creating a separate pipeline to detect these failures is not ideal: when the above issue was occurring, the service hook we used for our integration test repo/pipelines was working as intended, and the issue was isolated to the service hooks we had created for the production use cases.

And if we come to discover that what we want to track isn't possible, then that's okay. We can assume that our third-party dependencies are operating as intended, even though my gut says it would be better for us to be able to know if this functionality breaks for us before our customers do.

Alerts that Flip Flop

We have alerts that fire when an error message occurs, which could happen every time the service runs. The alert resolves itself after some time, but it fires again the next time the error occurs, which means the alert constantly flip flops until a fix is rolled out (for example: #9847). This is very noisy for the team as a whole, and we should figure out a way to streamline it. (This might be solved by #10000.)
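One possible way to damp the flip flopping, sketched here as an assumption rather than anything the team has adopted, is to resolve an alert only after several consecutive clean runs. The `alert_states` function and the `clean_runs_to_resolve` threshold are hypothetical names for illustration.

```python
def alert_states(run_results, clean_runs_to_resolve=3):
    """Yield the alert state (True = firing) after each service run.

    run_results: iterable of booleans, True when the run logged the error.
    The alert fires on the first error and resolves only after
    `clean_runs_to_resolve` consecutive error-free runs.
    """
    firing = False
    clean_streak = 0
    for had_error in run_results:
        if had_error:
            firing = True
            clean_streak = 0
        elif firing:
            clean_streak += 1
            if clean_streak >= clean_runs_to_resolve:
                firing = False
                clean_streak = 0
        yield firing
```

With intermittent errors, the alert then changes state once when the problem starts and once when it is really gone, instead of on every run.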

missymessa · 2022-04-18T20:32:27Z

/cc @AlitzelMendez @ChadNedzlek @ulisesh @garath

garath · 2022-04-19T20:16:01Z

> It doesn't look like this alert actually does what we want it to. The goal for these alerts is to tell us when Azure DevOps stops sending us events through the Service Hooks functionality

Sorta. The intent was to detect when AzDO tries to send a hook to the service but fails to complete for some reason (our service is unreachable, or it responded with an error code). You are correct that it can't detect AzDO failing to generate the event in the first place. We assume that AzDO will hold up its end of the contract. (Certainly I'm all for a "trust but verify" monitor whenever it's reasonable, but in this case I don't see a path to that.)
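As a rough illustration of that distinction, here is a sketch over a hypothetical list of delivery attempts. The record shape, with a `status_code` that is `None` when the service was unreachable, is an assumption for the example and not the actual AzDO data model.

```python
def failed_deliveries(deliveries):
    """Return delivery attempts that never reached the service or got an error response.

    deliveries: list of dicts with a hypothetical "status_code" key holding
    the HTTP status, or None when the service was unreachable.
    """
    return [
        d for d in deliveries
        if d["status_code"] is None or d["status_code"] >= 400
    ]
```

If AzDO never generates the event, there is no delivery attempt to inspect, so the list is empty and nothing is flagged. That is the gap described above.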

garath · 2022-04-19T20:23:22Z

> We want to be notified of exceptions that occur in our test reporting services (of which, there are three that are currently running), however, I don't believe it's reasonable to expect the alert to self-resolve after 5 minutes if the services run on a schedule every hour or every day.

The idea is a little weird, I agree. Most alerts are concerned with things happening over a period of time, which makes it easy to define "start" and "end". This alert is trying to describe a single event that occurs at an instant in time. There is no duration, so there is no real "end" time, and we just declared 5 minutes.

garath · 2022-04-19T20:23:29Z

> increase the timing to look at the result every hour instead of every minute for five minutes.

This seems like a good idea just to save resources. No reason to query if it's not possible for the answer to have changed.

missymessa · 2022-05-10T21:23:55Z

Updating the Test Reporting Services Monitoring part in here: https://dnceng.visualstudio.com/internal/_git/dotnet-helix-service/pullrequest/22817

Still need to address how we want to handle Service Hook Delivery alerts, so leaving this issue open for that.

missymessa · 2022-08-26T18:26:15Z

Missy to review if this is still necessary or if these issues are already being covered by other fixes.

missymessa · 2023-01-23T23:41:50Z

I think all of these have been covered (where possible) or fixed to not be so noisy. Going to close this issue. (We can always open a new one if we feel there are gaps, but I think we have a good amount of coverage for the DevWF services).

missymessa mentioned this issue Jul 13, 2022

The alert state change comments from grafana are too noisy and discourage useful discussion #10000

Closed

AlitzelMendez assigned missymessa Jan 20, 2023

missymessa closed this as completed Jan 23, 2023
