Improve alerting for broken mobile devices #9092
This feels like a thing we need to just make part of our operations work (or DDFun's operations work). It seems... excessive for us to build a whole controller just for this... to fire an alert, for some vendor/FR person on our side to open a ticket to send to yet another vendor on their side?
We unfortunately need this piece of logic somewhere because Grafana just doesn't have it (no alert state for continuity). Or am I missing your point, and you mean we should somehow make this information available to DDFUN directly and just not care? AFAIK they are not able to disable machines from Helix yet (and we don't have this functionality either).
If we can open IcMs automatically, that would be really awesome, but I'm not sure why we need such a complex system that uses Grafana as a middleman in that scenario. My understanding right now is that the flow is: machine -> heartbeat -> some service that exports the heartbeat -> Kusto -> Grafana -> our frontend -> GitHub... (and now we want to add -> our service again -> GitHub again right here, parsing our own output out of GitHub) -> FR people -> IcM. It seems like we could just cut out the whole middle: the thing that reads the heartbeat to detect a bad machine could open the ticket then and there (and even record that it did so in the heartbeat table directly!), and we would cut quite a few steps out... My "cheap" suggestion is yes, just let DDFun look at the list of broken machines in Grafana or with a simple Kusto query and manage it that way. My understanding from @ilyas1974 / @MattGal / @Chrisboh is that this is already how some other scenarios work, so maybe it makes sense to bundle this into that, by whatever means is most useful for them.
Ah, I think I see the misunderstanding. This issue was created because of how the Android/Apple alerts currently work. In this case, Grafana is not just reading the heartbeat table but is running a rather complex query over aggregate data that decides when mobile devices have actually gone bad. It is more like this:
We don't really want to add anything; we just want to change the very last arrow (which already reads from GitHub) to read a bit more. The code change there is not large. I was thinking about using the heartbeats table as the source of data rather than GitHub comments (which are not ideal), but this would go against our FR process, where we need to group incidents so that we can keep track of things. Right now this happens via the alert issues, where we always wait for IcM ticket resolution. So the timelines of when we detect, offline, report and get a device fixed are quite messy.

Regardless, we still need a middleman (Grafana), because whether a mobile device is broken is not decided based on one work item's result. We analyze a series of operations in aggregate. This means the machine cannot decide its state inside of that work item and change it (in the heartbeats table). The decision comes from the aggregate, so we need a place that hosts this logic; currently that is Grafana. One more reason to have a middleman is that we need to learn about machines going offline (have auditing) - this is quite important because turning off machines automatically is risky business (we actually agreed on this with @ilyas1974 as a hard requirement).

All this said, thanks for the input; it's good you have brought a different perspective, and I will think about how to simplify things. I agree with the sentiment that it is best to keep things simple, and that it is not ideal to keep adding things we need to maintain. It would probably be possible to achieve this goal via different means, but this proposal builds on already existing processes/systems and aims to improve the current process in a reasonable amount of time (I estimate around one, maximum two weeks for this, E2E).
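For illustration only, here is a minimal sketch in Python of the kind of aggregate decision described above. This is not the actual Grafana/Kusto query; the type, the thresholds, and the function name are made up. The point is that a device is only judged broken after a window of work items is evaluated as a whole, never from a single work item result.

```python
from dataclasses import dataclass


@dataclass
class WorkItemResult:
    machine: str
    succeeded: bool


# Hypothetical thresholds; the real values live in the Grafana/Kusto query.
MIN_WORK_ITEMS = 10
MAX_FAILURE_RATIO = 0.5


def broken_machines(results: list[WorkItemResult]) -> set[str]:
    """Decide which machines are broken from a window of work item
    results taken in aggregate, never from a single work item."""
    per_machine: dict[str, list[bool]] = {}
    for r in results:
        per_machine.setdefault(r.machine, []).append(r.succeeded)

    broken = set()
    for machine, outcomes in per_machine.items():
        if len(outcomes) < MIN_WORK_ITEMS:
            continue  # not enough signal to judge this machine yet
        failure_ratio = outcomes.count(False) / len(outcomes)
        if failure_ratio > MAX_FAILURE_RATIO:
            broken.add(machine)
    return broken
```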
We have agreed that Matt will add the queries to the broken machine detection tool he has given DDFUN. I will have to adjust the alerts so that DDFUN gets alerts sooner and we get fewer alerts. This should be enough for the time being, and we won't need this extra controller.
This is a proposal to improve alerting around broken Helix machines so that the alert issues are easier to understand and incidents are more manageable.
Context
We started detecting broken mobile devices and have queries that yield lists of Helix machines that are in a bad state. These machines need to be taken offline and DDFUN needs to investigate them.
Due to some limitations of Grafana, the only way we have to deal with the changing list of broken devices makes it hard to navigate the alert issue and keep track of its state. As an example, we have this situation:
The problem above has been resolved by making the alert send a notification every 12 hours, which results in a new comment on the alert issue, always containing a fresh list of disabled machines. You can see an example issue here. This, however, means that the alert issue fills up with comments, each containing a list of broken devices. FR devs must then manually compare these lists and keep track of the full set of devices so that none slips through.
Goal
Create a "smart" notification handler that will compare lists of broken machines and only show updates when there are some in a way so that's easy to understand the set of broken machines. The handler should also offline machines automatically using the Helix API.
Implementation steps
The following steps are short, high-level versions of what we want to create. Please refer to the Word document for more details.
I expect this list to be broken up into separate issues.
Example alert issue updates
The resulting issue updates would look something like this (a rough sketch of this flow follows the list):

We detect broken machines and the alert fires for the first time
- An issue is opened and the machines are listed
- [… instructions on how to handle the alert coming from the notification body …]

The alert fires again but the set of machines is the same
- No action, no comment is posted in the issue

The alert fires again and newly broken machines are detected
- A new comment about the newly broken machines is posted in the issue
- The issue clearly states that the machines were offlined automatically
- The list of previously tracked machines can be summarized too

The alert state changes to OK
- A comment about the alert going green is posted
- Optionally, the notification ID is removed
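A rough sketch of how these scenarios could map onto handler actions, again with made-up function names and return values; it only illustrates the flow above, not the actual implementation.

```python
def on_alert_fired(issue_open: bool,
                   previously_reported: set[str],
                   currently_broken: set[str]) -> str:
    """Return the action the handler would take for one alert firing."""
    if not issue_open:
        return "open issue listing: " + ", ".join(sorted(currently_broken))
    newly_broken = currently_broken - previously_reported
    if not newly_broken:
        return "no action, no comment"
    return ("comment with newly broken (auto-offlined) machines: "
            + ", ".join(sorted(newly_broken)))


def on_alert_ok() -> str:
    """Alert state changed to OK."""
    return "comment that the alert went green (optionally remove the notification ID)"


# Walking through the scenarios above:
print(on_alert_fired(False, set(), {"mobile-01"}))                      # first firing
print(on_alert_fired(True, {"mobile-01"}, {"mobile-01"}))               # same set
print(on_alert_fired(True, {"mobile-01"}, {"mobile-01", "mobile-02"}))  # new machine
print(on_alert_ok())                                                    # back to OK
```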