
Improve alerting for broken mobile devices #9092

Closed · Tracked by #9136
premun opened this issue Apr 19, 2022 · 5 comments
premun (Member) commented Apr 19, 2022

This issue is the outcome of the investigation done in #8714 and the discussion that happened here.

This is a proposal to improve alerting around broken Helix machines so that alert issues are easier to understand and incidents are more manageable.

Context

We started detecting broken mobile devices and have queries that yield lists of Helix machines that are in a bad state. These machines need to be taken offline and DDFUN needs to investigate them.
Due to some limitations of Grafana, the only way we can deal with the changing list of broken devices makes it hard to navigate the alert issue and keep track of its state. As an example, consider this situation:

  1. Machines A and B are detected as broken and an issue for that specific alert is opened. The body of the issue lists these two machines.
  2. We take the machines offline, but they may still appear broken from the query's point of view, since the query waits for successful jobs.
  3. The alert stays in the “alerting” state, but it does not keep track of which devices caused it.
  4. After some time, machine C can go into a bad state and the query now yields A, B and C. But since the alert is still in the alerting state, Grafana takes no action.

We worked around this by making the alert send a notification every 12 hours, which results in a new comment on the alert issue with a fresh list of disabled machines each time. You can see an example issue here. This, however, means that the alert issue fills up with comments, each containing a list of broken devices. FR devs must then manually compare these and keep track of the full set of devices so that none slips through.

Goal

Create a "smart" notification handler that will compare lists of broken machines and only show updates when there are some in a way so that's easy to understand the set of broken machines. The handler should also offline machines automatically using the Helix API.

Implementation steps

The following steps are short, high-level versions of what we want to create. Please refer to the Word document for more details.

I expect this list to be broken up into separate issues.

  1. Create a new controller, similar to AlertHookController, with the following properties (see the sketch after this list):
    • The controller will accept a predefined payload, such as a list of machines + queues, that can be produced by Grafana notifications
    • The controller will accept repeating notifications, compare which machines are reported as broken, and make it easy to understand which machines are currently broken
    • It will either create a new issue or work with an existing one (using the notification ID like we already do)
    • In all comments/issue descriptions, it will embed the list of machines so that it can be parsed later
    • For existing issues, it won't just blindly comment; it will scan all previous alert comments and compare their machine lists with the currently alerting one
      • It will only post a comment if new machines that were not previously reported are found
      • It will make sure that it is clear which machines are newly reported as broken (it can also repeat the list of all previously detected machines)
      • Optional: Also show the Helix status of each machine (online/offline)
      • Optional: When the alert goes green again, remove the notification ID from the issue so that newly broken machines open a new alert. This means that once we resolve a batch of devices (or once we open IcM tickets for all of them and they fall off the radar but stay offline), new machines don’t prolong the life of the current issue and are handled separately. This could prevent long chains of failures that creep in one after the other.
    • If a query parameter is set, it will also offline all newly detected broken machines via the Helix API
      • This action must be visibly stated in the comment so that it's clear it happened
    • There should be unit tests for this controller. Everything can be developed in staging, but since it's a new controller, it is quite safe to proceed either way
  2. Create a new notification channel in Grafana that will point to this controller
    • This notification channel should fire every hour while not OK (so the controller can comment on newly detected broken machines)
    • There could be two channels: one that also sends the "disable the machines" parameter and one without it
  3. The current Apple/Android mobile device alerts should be changed to use this new notification channel
    • This will require small changes in what exact data they produce
  4. Documentation should be created describing how to use this new controller, with example queries that can be used together with it
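To make step 1 more concrete, below is a rough sketch of what the comparison-and-comment flow could look like. This is not the actual implementation: the payload shape, the IIssueClient/IHelixMachineClient interfaces, and the hidden HTML-comment marker are illustrative assumptions, not existing APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical shape of one entry coming from the Grafana notification.
public record BrokenMachine(string MachineName, string QueueName);

// Illustrative abstractions over the GitHub and Helix clients the controller would use.
public interface IIssueClient
{
    Task<IReadOnlyList<string>> GetIssueAndCommentBodiesAsync(int issueNumber);
    Task PostCommentAsync(int issueNumber, string body);
}

public interface IHelixMachineClient
{
    Task TakeOfflineAsync(string machineName);
}

public class BrokenMachineNotificationHandler
{
    // Hidden marker embedded in every comment so that later notifications can
    // reconstruct which machines have already been reported.
    private static readonly Regex s_machineListMarker =
        new(@"<!--\s*broken-machines:\s*(?<machines>[^>]*)-->", RegexOptions.IgnoreCase);

    private readonly IIssueClient _issues;
    private readonly IHelixMachineClient _helix;

    public BrokenMachineNotificationHandler(IIssueClient issues, IHelixMachineClient helix)
    {
        _issues = issues;
        _helix = helix;
    }

    public async Task HandleNotificationAsync(
        int issueNumber,
        IReadOnlyCollection<BrokenMachine> currentlyBroken,
        bool offlineMachines)
    {
        // 1. Collect every machine that any previous comment already reported.
        var previouslyReported = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string body in await _issues.GetIssueAndCommentBodiesAsync(issueNumber))
        {
            Match match = s_machineListMarker.Match(body);
            if (match.Success)
            {
                previouslyReported.UnionWith(
                    match.Groups["machines"].Value
                        .Split(',')
                        .Select(m => m.Trim())
                        .Where(m => m.Length > 0));
            }
        }

        // 2. Only machines not reported before warrant a new comment.
        var newlyBroken = currentlyBroken
            .Where(m => !previouslyReported.Contains(m.MachineName))
            .ToList();

        if (newlyBroken.Count == 0)
        {
            return; // Same set as before - stay quiet, no duplicate comment.
        }

        // 3. Optionally take the new machines offline via the Helix API; the
        //    comment below states this explicitly so the action is auditable.
        if (offlineMachines)
        {
            foreach (BrokenMachine machine in newlyBroken)
            {
                await _helix.TakeOfflineAsync(machine.MachineName);
            }
        }

        // 4. Post a comment that lists the new machines and re-embeds the full set.
        var allReported = previouslyReported
            .Concat(newlyBroken.Select(m => m.MachineName))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();

        string comment =
            $"Newly broken machines: {string.Join(", ", newlyBroken.Select(m => $"{m.MachineName} ({m.QueueName})"))}\n" +
            (offlineMachines ? "These machines were taken offline automatically.\n" : "") +
            $"All machines tracked by this issue: {string.Join(", ", allReported)}\n" +
            $"<!-- broken-machines: {string.Join(", ", allReported)} -->";

        await _issues.PostCommentAsync(issueNumber, comment);
    }
}
```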

Example alert issue updates

The resulting issue updates would look something like this (an illustrative comment for case 3 follows the list):

  1. We detected broken machines, alert fires for the first time
    Issue is opened and machines are listed
    [… instructions on how to handle the alert coming from the notification body …]

  2. Alert fires again but the set of machines is the same
    No action, no comment is posted in the issue

  3. Alert fires again and there are newly broken machines detected
    New comment about newly broken machines is posted in the issue
    The issue clearly states that machines were offlined automatically
    List of previously tracked machines can be summarized too

  4. Alert state changed to OK
    Comment about alert going green
    Optionally removing the notification ID
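For illustration only, the comment posted in case 3 might look something like the following. The wording, the placeholder machine/queue names, and the hidden marker are assumptions following the sketch above, not a final format:

```
New broken machines detected by this alert:

  • machine-c (example-ios-queue)

These machines were taken offline automatically via the Helix API.

Previously reported machines still tracked by this issue: machine-a, machine-b

<!-- broken-machines: machine-a, machine-b, machine-c -->
```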

premun added the Needs Triage label (a new issue that needs to be associated with an epic) Apr 26, 2022
premun changed the title from “Alerting for broken Helix machines” to “Improve alerting for broken Helix machines” Apr 26, 2022
ChadNedzlek (Member) commented
This feels like a thing we need to just make part of our operations work (or DDFun's operations work). It seems... excessive for us to build a whole controller, just for this... to fire an alert, for some vendor/FR person on our side to open a ticket to send to yet another vendor on their side?

premun (Member, Author) commented Apr 28, 2022

We unfortunately need this piece of logic somewhere because Grafana just doesn't have it (no alert state for continuity).
Also once we have this, we're just one step away from opening IcM tickets for DDFUN/MLS automatically.

Or am I missing your point and you mean we should somehow make this information available to DDFUN directly and just not care? AFAIK they are not able to disable machines in Helix yet (and we don't have this functionality either).

ChadNedzlek (Member) commented
If we can open IcMs automatically, that would be really awesome, but I'm not sure why we need such a complex system that uses Grafana as a middle man in that scenario. My understanding right now is that machine -> heartbeat -> some service to do heartbeat exporting -> Kusto -> Grafana -> our frontend -> github... (and now we want to add -> our service again -> github again right here, parsing our own output out of github) -> FR people -> IcM. It seems like we could just cut the whole middle out there, and the thing that's reading the heartbeat to detect a bad machine can just open the ticket then and there (and even record that it did so in the heartbeat table directly!), and we can cut quite a few steps out...

My "cheap" suggestion is yes, just let DDFun look at the list of broken machines in Grafana or with a simple Kusto query, and just manage that. My understanding from @ilyas1974 / @MattGal / @Chrisboh is that is already how some other scenarios work, so maybe it makes sense to bundle this into that, by whatever means is most useful for them.

premun (Member, Author) commented Apr 29, 2022

Ah, I think I can see a misunderstanding. This issue was created because of how the Android/Apple alerts currently work. In this case, Grafana is not just reading the heartbeat table but running a rather complex query over aggregate data that decides when mobile devices actually go bad. It is more like this:

Machines -> Kusto -> Grafana -> Notification service -> GitHub

We don't really want to add anything; we just want to change the very last arrow (which already reads from GitHub) to read a bit more. The code change there is not large.

I was thinking about using the heartbeats table as the source of data rather than GitHub comments (which are not ideal) but this would go against our FR process where we need to sort of group incidents so that we can keep track of things. Right now this happens via the alert issues where we always wait for IcM ticket resolution. So the timelines of when we detect, offline, report and get a device fixed are quite messy.

Regardless, we still need a middle man (Grafana) because whether a mobile device is broken or not is not decided based on one work item's result. We analyze a series of operations in aggregate. This means the machine cannot decide its state inside that work item and change it (in the heartbeats table). The decision happens over the aggregate, so we need a place that hosts this logic. Currently that is Grafana.

One more reason to have a middle man is that we need to learn about machines going offline (have auditing) - this is quite important because turning off machines automatically is a risky business (we actually agreed on this with @ilyas1974 as a hard requirement).

All this said, thanks for the input; it's good that you have brought a different perspective and I will think about how to simplify things. I agree with the sentiment that it is best to keep things simple and that it is not ideal to keep adding things we need to maintain. I guess it would be possible to achieve this goal by different means, but this proposal builds on already existing processes/systems and aims to improve the current process in a reasonable amount of time (I estimate around one week, at most two, end to end).
Regardless, the bottom line is that we currently need to add a piece of logic that Grafana is missing (this "alert statefulness") and that we don't have anywhere else, so I don't see a way around that.

michellemcdaniel removed the Needs Triage label Apr 29, 2022
premun self-assigned this May 3, 2022
premun changed the title from “Improve alerting for broken Helix machines” to “Improve alerting for broken mobile devices” May 3, 2022
premun (Member, Author) commented May 3, 2022

We have agreed that Matt will add the queries to the broken-machine detection tool he has given DDFUN. I will have to adjust the alerts so that DDFUN gets alerts sooner and we get fewer of them. This should be enough for the time being and we won't need this extra controller.
