Improve alerts for Build Analysis functionality #9078

Closed
missymessa opened this issue Apr 18, 2022 · 7 comments
@missymessa
Member

missymessa commented Apr 18, 2022

A couple of our alerts are probably not functioning in the best way possible, so let's figure out how to improve them:

Link to dashboard for reference: https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/buildAnalysis/build-analysis?orgId=1&from=now-2d&to=now

Test Reporting Services Monitoring

This is a tricky alert. We want to be notified of exceptions that occur in our test reporting services (of which, there are three that are currently running), however, I don't believe it's reasonable to expect the alert to self-resolve after 5 minutes if the services run on a schedule every hour or every day. We should consider breaking this out into three separate alerts, one for each of the services, or, at a minimum, increase the timing to look at the result every hour instead of every minute for five minutes.
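
A rough sketch of what "one alert per service, timed to its schedule" could look like. The service names, the schedules, and the `alert_rules` helper are all made up for illustration; this is not how our alerts are actually defined.

```python
from datetime import timedelta

# Hypothetical service names and schedules; the issue only says there are
# three services that run every hour or every day.
SCHEDULES = {
    "service-a": timedelta(hours=1),
    "service-b": timedelta(hours=1),
    "service-c": timedelta(days=1),
}


def alert_rules(schedules, slack=timedelta(minutes=5)):
    """Build one alert rule per service, timed to that service's schedule."""
    rules = {}
    for name, run_interval in schedules.items():
        rules[name] = {
            # Querying more often than the service runs cannot give a new answer.
            "evaluate_every": run_interval,
            # Look back a little more than one run so a late run is still seen.
            "lookback": run_interval + slack,
        }
    return rules


RULES = alert_rules(SCHEDULES)
```

With this shape, an hourly service is evaluated hourly with a 65-minute look-back, and the daily one is evaluated once a day.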

Service Hook Deliveries Alert

It doesn't look like this alert actually does what we want it to. The goal for these alerts is to tell us when Azure DevOps stops sending us events through the Service Hooks functionality, so that we can better detect issues like this: https://github.com/dotnet/core-eng/issues/15712

I'm not certain that's going to be possible, because we don't know whether we should be getting events at the moment we check for them (e.g. should we expect events on the weekend? Probably not), and our services are pretty far down the chain of services that ultimately receive this event. On top of that, creating a separate pipeline to detect these failures is not ideal: when the above issue was occurring, the service hook we used for our integration test repo/pipelines was working as intended, and the issue was isolated to the service hooks we had created for the production use cases.

And if we come to discover that what we want to track isn't possible, then that's okay. We can assume that our third-party dependencies are operating as intended, even though my gut says it would be better for us to be able to know if this functionality breaks for us before our customers do.
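
To make the "should we even expect events right now?" problem concrete, here is a minimal sketch of a silence check that skips weekends. It assumes that events are expected on weekdays only and that we can get the timestamp of the last received event. The function name, the state strings, and the threshold are hypothetical, and timestamps are naive datetimes in a single time zone. It would not help with the case above where only the production hooks were affected, unless it is fed the production event timestamps.

```python
from datetime import datetime, timedelta


def hook_silence_state(last_event, now, max_gap):
    """Classify how worrying the current silence from service hooks is."""
    if now.weekday() >= 5:  # Saturday or Sunday: assume no events are expected
        return "not-expected"
    # Only count silence since Monday 00:00 so a quiet weekend is not held
    # against us on Monday morning.
    midnight = datetime(now.year, now.month, now.day)
    week_start = midnight - timedelta(days=now.weekday())
    silent_since = max(last_event, week_start)
    return "stale" if now - silent_since > max_gap else "ok"
```

For example, with `max_gap=timedelta(hours=6)` and a last event on Friday afternoon, a check at 03:00 on Monday is still "ok", while a check at 09:00 on Monday is "stale".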

Alerts that Flip Flop

We have alerts that fire when an error message occurs, which could happen every time the service runs. The alert resolves itself after some time, but it fires again the next time the error occurs, which means the alert constantly flip-flops until a fix is rolled out (for example: #9847). This is very noisy for the team as a whole, and we should figure out a way to streamline it. (This might be solved by #10000.)
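
One possible way to damp the flip-flopping, sketched under the assumption that we can choose how long an alert stays open after the last error: keep it firing until a hold period longer than the service's run interval has passed with no new error. The `alert_states` function and its parameters are hypothetical, and this is not necessarily what #10000 does.

```python
from datetime import datetime, timedelta


def alert_states(error_times, check_times, hold):
    """Return "firing" or "ok" for each check time.

    The alert fires when an error has been seen and stays firing until
    `hold` has elapsed with no further error.
    """
    states = []
    for now in check_times:
        recent = [t for t in error_times if timedelta(0) <= now - t < hold]
        states.append("firing" if recent else "ok")
    return states


start = datetime(2022, 4, 18)
errors = [start + timedelta(hours=h) for h in range(3)]       # fails on every hourly run
checks = [start + timedelta(minutes=30 * i) for i in range(6)]

flappy = alert_states(errors, checks, hold=timedelta(minutes=5))
steady = alert_states(errors, checks, hold=timedelta(hours=2))
```

With a 5-minute hold the state alternates between "firing" and "ok" on every check; with a 2-hour hold it stays "firing" for the whole period, so the team is notified once rather than on every run.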

@garath
Member

garath commented Apr 19, 2022

> It doesn't look like this alert actually does what we want it to. The goal for these alerts is to tell us when Azure DevOps stops sending us events through the Service Hooks functionality

Sorta. The intent was to detect when AzDO tries to send a hook to the service but the delivery fails to complete for some reason (our service is unreachable, or it responded with an error code). You are correct that it can't detect AzDO failing to generate the event in the first place. We assume that AzDO will hold up its end of the contract. (Certainly I'm all for a "trust but verify" monitor whenever it's reasonable, but in this case I don't see a path to that.)
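
As an illustration only, the "delivery failed" condition could be expressed like this. The record shape (a dict with an optional `status_code`) is made up for the sketch and is not the AzDO data format.

```python
def failed_deliveries(deliveries):
    """Return deliveries that never completed or that got an error response."""
    failed = []
    for delivery in deliveries:
        status = delivery.get("status_code")  # None: the service was unreachable
        if status is None or status >= 400:
            failed.append(delivery)
    return failed


sample = [
    {"id": 1, "status_code": 200},
    {"id": 2, "status_code": 500},
    {"id": 3, "status_code": None},
]
failed_ids = [d["id"] for d in failed_deliveries(sample)]
```

Note that an event AzDO never generated produces no record at all, which is why a check like this cannot see that case.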

@garath
Member

garath commented Apr 19, 2022

> We want to be notified of exceptions that occur in our test reporting services (of which, there are three that are currently running), however, I don't believe it's reasonable to expect the alert to self-resolve after 5 minutes if the services run on a schedule every hour or every day.

The idea is a little weird, I agree. Most alerts are concerned about things happening over a period of time. Then it's easy to define "start" and "end". This alert is trying to talk about a single event that occurs at an instant in time. No duration. So, no real "end" time, thus just declaring 5 minutes.

@garath
Member

garath commented Apr 19, 2022

> increase the timing to look at the result every hour instead of every minute for five minutes.

This seems like a good idea just to save resources. No reason to query if it's not possible for the answer to have changed.

@missymessa
Member Author

Updating the Test Reporting Services Monitoring part in here: https://dnceng.visualstudio.com/internal/_git/dotnet-helix-service/pullrequest/22817

Still need to address how we want to handle Service Hook Delivery alerts, so leaving this issue open for that.

@missymessa
Member Author

Missy to review if this is still necessary or if these issues are already being covered by other fixes.

@missymessa
Member Author

I think all of these have been covered (where possible) or fixed to not be so noisy. Going to close this issue. (We can always open a new one if we feel there are gaps, but I think we have a good amount of coverage for the DevWF services).
