Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The alert state change comments from grafana are too noisy and discourage useful discussion #10000

Closed
ChadNedzlek opened this issue Jul 13, 2022 · 28 comments
Assignees

Comments

@ChadNedzlek
Copy link
Member

ChadNedzlek commented Jul 13, 2022

Illustative example: #9847 (comment)

There are useful dev comments in there, but the flipping of the alert is causing those all to get lost, and the flipping notifications aren't really that helpful.

I propose:

  • Instead of adding a new comment, just update a section of the description with something like: This alert has changed state 30 times since it initially triggered, most recently at 5:45 GMT, and maybe any additional tidbits. Maybe with some fancy graphical representation if someone was feeling particularly ambitious.
  • Stop adding the active/inactive labels, I don't think anyone uses them for anything (other than @ChadNedzlek looking at the inactive label for curiosity), and they will add a bunch of noise to the discussion as well.
@garath
Copy link
Member

garath commented Jul 13, 2022

@premun and I have discussed this in the context of mobile devices, which also triggers a lot, and had similar ideas of "make the controller smarter".

Mobile devices would benefit from something more clearly showing the changes in an alert over time (since a single alert covers multiple machines, which go in and out of alerting).

Maybe also it could just update the graph in the issue, as that includes the "alerting" and "non alerting" highlights.

@garath
Copy link
Member

garath commented Jul 13, 2022

The example alert in #9847 is also unusual in that it is an alert that "a single event happened" instead of "this metric is in a bad state". That makes me think it should open a brand new alert for each instance. Would that be... helpful?

@missymessa
Copy link
Member

brand new alert for each instance

As in a new issue each time that alert fires?

@garath
Copy link
Member

garath commented Jul 13, 2022

As in a new issue each time that alert fires?

Yes. They all represent separate bad things happening.

@missymessa
Copy link
Member

I think that might be just as bad considering the number of times that alert has fired in a single day :(

@ChadNedzlek
Copy link
Member Author

It's usually the same bad thing every time too. And all of the alerts would point to the same panel... so it would be basically impossible to know which issue was about which problem.

@ChadNedzlek
Copy link
Member Author

We could try making a little graphic of "alert history", but that's not something provided by Grafana, so we'd have to create, and then store... an image. That's fancy pants. :-)

@garath
Copy link
Member

garath commented Jul 13, 2022

I think that might be just as bad considering the number of times that alert has fired in a single day :(

I think this is the tricky point. Is it reasonable to say all alerts should get the behavior changes discussed here? Or is it more that this alert should be treated special?

@garath
Copy link
Member

garath commented Jul 13, 2022

What are people's thoughts on the noise if (1) comments did not post a graph, and (2) we stopped using active/inactive labels? Is that better?

@missymessa
Copy link
Member

Personally, that alert should be tuned to not be so noisy, but I'm not sure what's the best way of handling it, so I made mention of it here: #9078

@garath
Copy link
Member

garath commented Jul 13, 2022

Personally, that alert should be tuned to not be so noisy, but I'm not sure what's the best way of handling it,

Good point. If this was an alert handled by FR it likely would have been changed. But since it's still in the "assign this to DevWF" mode it's been in the v-team's court.

@ChadNedzlek
Copy link
Member Author

I think all alerts should. I'm not sure the giant comment for switching states has every really been useful to anyone, other than to say "this alert is annoying because it's sending too many comments". Which seems like a self-defeating feature. If it wasn't so spammy, then it would matter that it was too spammy. :-)

Also, I don't think we're ever going to be perfect. We have a lot of alerts that flop, and nuking the ability to discuss those alerts is sad, and the payoff not really high.

@missymessa
Copy link
Member

Right, which is why I made mention of wanting to fix that alert in particular.

@michellemcdaniel michellemcdaniel changed the title Grafana alerts are just too dang noisy Gather telemetry around usefulness of alerts Sep 1, 2022
@ChadNedzlek
Copy link
Member Author

ChadNedzlek commented Sep 2, 2022

I don't think this issue any longer reflect what it was created for. The issue wasn't about the general usefulness of alerts. It was just that the commenting in flipping alerts isn't useful. We shouldn't use this issue for that, since now the original issue is lost.

@alexperovich had the right of it in the triage, this is about the fact that the alert commenting is pointlessly noisy (and doesn't seem to be providing value to anyone when I discussed it in Teams) and it hides useful discussion.

@ilyas1974, @Chrisboh, @markwilkie. If we want a generic "look at all our alerts" issue, that's fine, but I don't think it should be this one. I'm going to change the title back.

@ChadNedzlek ChadNedzlek changed the title Gather telemetry around usefulness of alerts The alert state change comments from grafana are too noisy and discourage useful discussion Sep 2, 2022
@ChadNedzlek
Copy link
Member Author

ChadNedzlek commented Sep 2, 2022

I made the title a little more specific and less flippant (even if it's sort of overlong now). Hopefully it more precisely conveys what I was trying to get at with the original issue.

@ChadNedzlek
Copy link
Member Author

It should also be pretty trivial to enact this. We can just delete the code that's making those comments, and just update the description instead if we wanted to be extra fancy. But at the very least, the comments aren't really useful, since someone generally goes to the alert on grafana anyway as part of investigation.

@Chrisboh
Copy link
Member

Chrisboh commented Sep 2, 2022

Thank you for the added clarity Chad as we did take this in a different direction based on our interpretation of the original title. @ilyas1974 do you want to create a different issue that is tracking what you are looking at?

@ChadNedzlek
Copy link
Member Author

Haven't seen many updates here, and I really think we should investigate at least a partial quick fix in FR (since FR is the people impacted by getting a bunch of noisy updates to issues). It shouldn't be that hard to at least remove the comment when state changes (and probably the active/inactive label too), since that's just deleting code.

But it seems like this issue is a bit stalled.

@missymessa
Copy link
Member

Alerts in FR has been fairly quiet this week. Both were... semi legit? So nothing super noisy so far, but I agree that we should tune alerts we get in FR if 1) they're not related to epics that are currently still open (example: Build Analysis alerts should be managed by the Build Analysis team until the epic is closed) and 2) they're noisy.

@ChadNedzlek
Copy link
Member Author

I mostly just want the back and forth comments not to get added, I don't think they add any value, and removing them is trivial: just delete this else: https://github.com/dotnet/arcade-services/blob/main/src/DotNet.Status.Web/Controllers/AlertHookController.cs#L89

@ChadNedzlek
Copy link
Member Author

You know what, I'm just going to do it...

@ChadNedzlek ChadNedzlek assigned ChadNedzlek and unassigned ilyas1974 Oct 14, 2022
@premun
Copy link
Member

premun commented Oct 17, 2022

What is this going to do to alerts such as Android devices where the value of the alerting metric (disconnected devices) matters as the set of offline devices can change while the alert is active usually for couple of days?

@premun
Copy link
Member

premun commented Oct 17, 2022

Can we maybe do something like add a different notification channel that would not do this and we could move alerts that suffer from this on this channel?

@ChadNedzlek
Copy link
Member Author

Do you have an example one of the GitHub issues that you think the comments are useful? If an alert is open, FR should be looking at it anyway, which will generally involve going to the alert for further diagnostics. I don't know that I've seen an alert where the comment was sufficient to fix the problem.

@premun
Copy link
Member

premun commented Oct 18, 2022

We've had an alert called Android devices disconnected for almost a year now that fires basically every week and the issue contains all you need, no more investigations or following any link is necessary (unless you need to go in the wiki to learn about the alert itself).

The alert lists a set of phones that need to be offlined + some DDFUN's attention. The set of machines can change during the alert being up and we need these new comments to see the new set of machines and compare it with the previous to see if there are newly broken phones. Grafana doesn't have a capability to handle this, or I haven't found a way.

We've already had a discussion about this here: #9092
(I proposed an improvement and I can see you've been part of this dicussion too)

Example alert where these comments are necessary: #10804

@ChadNedzlek
Copy link
Member Author

Would replacing the description, rather than adding a comment, be helpful? It seems like in this case, the historical list isn't super helpful, just having the most up to date one?

@premun
Copy link
Member

premun commented Oct 18, 2022

No, that wouldn't work. You need to know what devices were already handled and which are new so you need to know have a clear timeline of people reacting with the alert and the alert evolving. People are disabling machines and opening IcMs for the devices so it must be obvious what happened in which order. If description kept changing, you'd easily miss something.

I suggest adding a new notification channel which would just add a query parameter when sending the hook. The controller would then not post another comment if this query parameter is set. You can then pick the noisy alerts and move them onto this new channel.
How does that sound?

@ChadNedzlek
Copy link
Member Author

That sounds too complicated of a system. It's unfortunate that this one alert is special, but I guess we just shouldn't do anythign then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants