Implement events acknowledgement #13

Closed
dannycohen opened this issue Sep 24, 2013 · 30 comments

@dannycohen

As Opie, I want to view alert details and mark an event as "acknowledged" after I have taken corrective action.

Visualizations:

  1. An event can be set to be "active" or "acknowledged"
  2. Event list in the dashboard or in the event management page displays only events whose status is "active"
  3. To acknowledge an event - select one or more events in the event management page by checking the checkbox in the first column of each row, and click the "Acknowledge events" button
  4. If no heartbeat events remain active for an endpoint, the relevant indicator reverts to green
  5. An indicator may be green while it still has an active event
  • See case 2 below which covers the following flow of events:
    1. Heartbeat failure causes the creation of an event and the indicator is red
    2. The endpoint recovers after a few minutes and heartbeat messages are received
    3. The previously created event is still active - indicating there was a failure at some point in the past
    4. The indicator is green - indicating that heartbeat messages are currently being received

Notes:

  • This applies to all event types (e.g. heartbeats, failed messages etc.)
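
A minimal sketch of the status model described above, assuming an in-memory list of events. The `Event` fields, the `acknowledge` and `visible_events` helpers, and the event type names are illustrative assumptions, not ServicePulse or ServiceControl APIs.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    ACKNOWLEDGED = "acknowledged"


@dataclass
class Event:
    event_id: int
    endpoint: str
    event_type: str  # hypothetical type names, e.g. "heartbeat", "failed_message"
    status: Status = Status.ACTIVE


def acknowledge(events, selected_ids):
    """Mark every selected event as acknowledged (the "Acknowledge events" button)."""
    for event in events:
        if event.event_id in selected_ids:
            event.status = Status.ACKNOWLEDGED


def visible_events(events, endpoint=None):
    """Events shown in the dashboard / event management list: active ones only."""
    return [
        e for e in events
        if e.status is Status.ACTIVE and (endpoint is None or e.endpoint == endpoint)
    ]


events = [
    Event(1, "Sales", "heartbeat"),
    Event(2, "Sales", "heartbeat"),
    Event(3, "ContentManagement", "failed_message"),
]
acknowledge(events, {1, 2})
print([e.event_id for e in visible_events(events)])  # [3]
```

Acknowledged events stay in the list object, so an event history view could still show them; only the default listing filters them out.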

Demo / Acceptance tests:

Case 1:

  1. Deploy the heartbeat plugin in 3 of the 5 Video Store sample endpoints (in all except the ContentManagement and Operations endpoints)
  2. Run the Video Store sample
  3. Kill the "Sales" endpoint
  4. Indicator should turn red within 1 minute
  5. Number below Heartbeat Indicator should be 2 in green and 1 in red
  6. Click on heartbeat indicator
  7. 3 endpoint heartbeat indicators should be displayed, 2 green, 1 red
  8. The name of each endpoint and the number of seconds elapsed since the last heartbeat message was received are noted next to each indicator
  9. Click on the endpoint name "Sales"
  10. The events list in the event management page is displayed, filtered to show only the heartbeat events for the "Sales" endpoint
  11. Select all the events for the "Sales" endpoint
  12. Click the "Acknowledge events" button
  13. The acknowledged events are removed from the events list, both in the event management page and in the dashboard
  14. The endpoint heartbeat indicator for "Sales" endpoint is green (i.e. 3 green indicators) and so is the overall heartbeat indicator (Check supported SC version and report correct error #40)
  15. Since we did not revive the "Sales" endpoint, the "Sales" heartbeat Indicator should turn red within 1 minute (and a new heartbeat event should be created)

Case 2:

  1. Deploy the heartbeat plugin in 3 of the 5 Video Store sample endpoints (in all except the ContentManagement and Operations endpoints)
  2. Run the Video Store sample
  3. Kill the "Sales" endpoint
  4. Indicator should turn red within 1 minute
  5. Number below Heartbeat Indicator should be 2 in green and 1 in red
  6. Click on heartbeat indicator
  7. 3 endpoint heartbeat indicators should be displayed, 2 green, 1 red
  8. The name of each endpoint and the number of seconds elapsed since the last heartbeat message was received are noted next to each indicator
  9. Click on the endpoint name "Sales"
  10. The events list is displayed, filtered to show only the heartbeat events for the "Sales" endpoint
  11. An active heartbeat event exists for the "Sales" endpoint
  12. Create a new instance of the "Sales" endpoint
  13. "Sales" endpoint Indicator should turn green within 1 minute (since it starts receiving heartbeat messages from the endpoint)
  14. Next to the "Sales" endpoint indicator there is a small exclamation mark indicating there is an active event
  15. Click on the endpoint name "Sales"
  16. The events list is displayed, filtered to show only the heartbeat events for the "Sales" endpoint
  17. The same active heartbeat alert exists for the "Sales" endpoint (see step 11 above)
  18. Select the events and click the "Acknowledge events" button
  19. The active event is removed from the alerts list
  20. In the "Sales" endpoint indicator the small exclamation mark is no longer visible (indicating there are no longer any active events)
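
The distinction Case 2 draws between the indicator color (the present) and the exclamation mark (unacknowledged history) could be sketched as follows. This is an illustration under stated assumptions: the 60-second timeout is taken from the "within 1 minute" wording, `EndpointView` and `endpoint_view` are hypothetical names, and the Case 1 behavior where acknowledging temporarily turns a dead endpoint green is not modeled.

```python
from dataclasses import dataclass

# Assumption: "within 1 minute" in the test cases maps to a 60-second timeout.
HEARTBEAT_TIMEOUT_SECONDS = 60


@dataclass
class EndpointView:
    color: str         # "green" or "red"
    has_warning: bool  # the small exclamation mark next to the indicator


def endpoint_view(seconds_since_last_heartbeat, active_heartbeat_events):
    """Color reflects current heartbeats; the warning reflects unacknowledged events."""
    is_alive = seconds_since_last_heartbeat <= HEARTBEAT_TIMEOUT_SECONDS
    return EndpointView(
        color="green" if is_alive else "red",
        has_warning=active_heartbeat_events > 0,
    )


# Case 2, steps 7, 13-14 and 20 in order:
down = endpoint_view(seconds_since_last_heartbeat=300, active_heartbeat_events=1)
revived = endpoint_view(seconds_since_last_heartbeat=5, active_heartbeat_events=1)
acknowledged = endpoint_view(seconds_since_last_heartbeat=5, active_heartbeat_events=0)
```

Keeping the two signals separate is what lets an indicator be green while an active event still exists.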

@ghost assigned indualagarsamy on Sep 24, 2013

@dannycohen
Author

replaces Particular/ServiceControl#43

@indualagarsamy
Contributor

The more I think about this, the more this feature does not make sense to me

In the following use case or in any other use case, where the indicators can self correct, for example:
When the heartbeat indicator turns green again, there is not much work for Opie.
Life is good.
Why does he need to ack / clear this alert that endpoint A was down at 16:04:33?
Opie can always look at the alert history to know that there was a problem if he needs to from a history perspective.

Also, the top 10 events on the dashboard show that heartbeats got restored at xx:yy:zz. So Opie has context for both the failure and the restored event. If they happen close enough, he'll still see it.

I'd like to downvote this feature for the minimum viable product. Or what am I missing here?

@andreasohlund @johnsimons @dannycohen - thoughts?

@dannycohen
Author

Why does he need to ack / clear this alert that endpoint A was down at 16:04:33?

IMO - because it was down at 16:04:33
The problem with events is that nobody pays them any attention, and they are easy to disregard.
In the case of the endpoint being down: this is an excellent case, IMO, of something that Opie must know about (why was it down? did someone make changes to the endpoint or was it a failure?), and acknowledge.

From a burden perspective - I'd say it's the same as having emails sent and marked as read or deleted.
Optimally - there should not be many events like "endpoint was down" and if there are - then by all means, indicate it!

The same applies to custom checks: if there was a change, and a custom check failed and then recovered, the event may be buried well below the 10 most recent events. What then? How would Opie know about it? (By actively looking in the events list? That will not happen...)

@andreasohlund
Member

So it seems like we need to classify the events that Opie should need to
ack? (since not all of them will need it)

Or should we mark events as "seen" as Opie scrolls through the list of
events?

@dannycohen
Author

So it seems like we need to classify the events that Opie should need to
ack? (since not all of them will need it)

I'd say we need to ack everything we currently have:

  • Heartbeat misses
  • Failed Messages (ack or retry)
  • Custom checks (ack historical custom checks failures)
  • Process SLA violations

In the future we may (/will) add events that are less critical, like SLR attempts trends, or configuration changes etc.

Or should we mark events as "seen" as Opie scrolls through the list of
events?

This is an "implicit ack", which may be good enough. I'd say keep it as simple as possible for Alpha and have an explicit check box to ack (like the Read / Unread marker in email apps: it expects you to open or click on an email for it to be marked as read).

@andreasohlund
Member

This is an "implicit ack", which may be good enough. I'd say keep it as simple as possible for Alpha and have an explicit check box to ack (like the Read / Unread marker in email apps: it expects you to open or click on an email for it to be marked as read).

I'd say they are equally easy to do technically. Which one is best from Opie's point of view?

@dannycohen
Author

I'd say they are equally easy to do technically. Which one is best from
Opie's point of view?

Let's go with the explicit "checkbox and click ack", and get feedback on that. We can always change to "implicit ack" later or have it be configurable, based on event criticality, Opie's preferences settings etc.

@indualagarsamy
Contributor

@dannycohen - I think your concern is valid. But I think it's more of a notification concern. I'd argue that it belongs with the New Relic plugin, i.e. heartbeats went down for an endpoint: notify Opie if the condition persists for x time, etc. via email, SMS, etc. Notify Opie's supervisor if Opie hasn't taken action, etc.

So, I am still not seeing value in the ack. When Opie acts, let's say fixes the endpoint, the indicator will get the right status. In which case, it is fixed. Opie acting on it is ack enough, without forcing Opie to go click an ack button.

@dannycohen
Author

I think your concern is valid. But I think it's more of a notification concern. I'd argue that it belongs with the New Relic plugin

For beta and RC, there will be no New Relic plugin, so we need to provide something that allows SP to stand on its own.
In addition, even with NR, I see value in this.

@indualagarsamy
Contributor

Asking Opie to click a button after the fact is still not helping Opie.
To get Opie's attention, you'd have to notify Opie. If ack is a big concern, then I'd say not having notifications about the alert is an even bigger gap IMO.

The way I see it is this:
Opie looks at the dashboard (everything looks green)
Why is he going to click the Alerts screen to go and click acks on alerts that have resolved themselves?
What would make Opie want to go to the alerts screen? Opie didn't get notified. Opie looks at the dashboard 2 hours later, all green.
In this state of affairs, what is the value for Opie to go in and click on acks on the list of alerts that resolved themselves?

@dannycohen
Author

I'd say not having notifications about the alert is an even bigger gap IMO.

Notifications will be provided by 3rd party tools.

Opie looks at the dashboard (everything looks green)
Why is he going to click the Alerts screen to go and click acks on alerts that have resolved themselves?

That's the wrong description for the scenario.
Here's the same scenario with a minor change:

  • Opie looks at the dashboard.
  • Everything looks green, but there's an indication that something happened earlier in custom checks.
  • Opie goes to custom checks events, sees an issue that occurred 3 hours ago
  • After checking the root cause Opie decides to ack the events that occurred earlier

@dannycohen
Author

Here's the alternative scenario, without the requirement to ack:

  • During the night, the system suffered intermittent connection issues and custom checks failed. In the early morning hours all returned to normal.
  • In the morning, Opie looks at the dashboard.
  • Everything looks green now, and there's no indication that something happened earlier in custom checks.
  • Opie is unaware that something happened, and happily continues on his daily routine

I agree that SMS / Email notifications are missing, but this will be provided via 3rd party integration. Still, this scenario is not an acceptable one for SP.

@indualagarsamy
Contributor

I think notifications and acks go hand in hand. Implementing something partially here without the other is not making sense to me.
When you have the 3rd party notifications wired in, I think this would be solved.
Also, I don't want to pollute the individual indicators on the dashboard with potential false positives, i.e. indicating something that happened, but is currently not a problem. I think those indicators are meant for what's the current situation.

@indualagarsamy
Contributor

I'd also strongly argue that this is not a critical feature for an MVP.

@andreasohlund
Member

+1. Can we put this on hold until we get some more user feedback?

@dannycohen
Author

Can we put this on hold until we get some more user feedback?

NP. Let's do that.

@indualagarsamy
Contributor

@dannycohen - is this still targeted for beta? I thought we moved this out?

@dannycohen
Author

@indualagarsamy - Given the schedule, I do not see how we can complete this for Beta. Moving to RC.

We've had some discussions about this in the past, so I would like to make sure that we agree (or not) that from a requirements perspective, this makes sense to you.

//CC @andreasohlund

@dannycohen
Author

@indualagarsamy -
For the time being, I am leaving the references to acknowledgements in #12 (I want to reach agreement on the acknowledgement implementation issue).

@indualagarsamy
Contributor

@dannycohen - can we talk about this?

@dannycohen
Author

I'll schedule a chat for this evening (your morning)

@dannycohen
Author

@indualagarsamy - until we manage to discuss this topic, please advise regarding the following issue:

This is how my Custom Checks look right now:

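[image: https://f.cloud.github.com/assets/3889023/1495626/f04933b4-47ed-11e3-8af0-503a730e27dd.png]

[image: https://f.cloud.github.com/assets/3889023/1495633/feb2d964-47ed-11e3-99dd-cf82dc950b87.png]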

  1. The problem is that the 2 failed custom checks cannot be "fixed" because I changed the code in the custom check.
    • i.e. in order to "reset" these events (or make them disappear) I would need to re-deploy the old version of the custom check so that it resets the custom check status.
    • As you can imagine, this is unacceptable in a prod environment and also not doable, since the root cause of the custom check failure may be unfixable...
  2. Another option is that I can go into the database and reset the event manually
    • Scratch that... bad idea... ahhh...

So, how can I, as Opie, make the Custom Check indicator green again, given that it is currently flashing red for failed custom checks that are obsolete and not fixable?

//CC @andreasohlund / @johnsimons

@johnsimons
Member

Make sure the custom check id is the same across different revisions

@dannycohen
Author

Make sure the custom check id is the same across different revisions

That's not a realistic solution.
In this case I intentionally changed the custom check id since I originally gave it a misleading name (unintentionally...).
Should I have kept the misleading name there just to avoid falling into this issue? How do you suggest we make sure users do not fall into this trap, and if they do (and they will), how do we get them out?

Another similar scenario is when a custom check needs to be retired (e.g. it checks something that always fails).
You can tell me to leave the custom check there, always returning a pass result, but that is as unattractive as leaving a misleading custom check id as-is because I can't get rid of old and obsolete events...

@dannycohen
Author

@indualagarsamy / @johnsimons -
Another issue:

I have these 3 failed messages which cannot succeed on retry due to compatibility issues. They are basically poison messages.

image

image

How can I make them disappear?
(they are obsolete, and there's nothing I can or need to do to fix the messages themselves; I just want to get the failed messages indicator to be green, since there are no failed messages - the ones that are there are irrelevant and obsolete, and they obstruct the use of the ServicePulse failed messages indicator...)

@andreasohlund
Member

We need an "Archive error message" command

@andreasohlund
Member

Seems like we need a "manage custom checks" page where you can delete old definitions?

@dannycohen
Author

We need an "Archive error message" command

We can call it archive, hide, "mark as obsolete" etc.
Essentially we need a flag per event saying whether that event should no longer affect the indicator display / count.
I called this proposed flag "acknowledgement", but any name will do.

Seems like we need a "manage custom checks" page where you can delete old definitions?

Same as for error messages.
All other indicators will have the same issue. For example, endpoint instances that are no longer available (because the machine on which the endpoint instance resided is no longer in use; see #15).
So we can do it differently for every event / indicator type, or we can do it once for all types.
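
To illustrate the "once for all types" option, here is a rough sketch in which a single per-event flag decides whether an event still counts toward its indicator. The dictionary layout, the `acknowledged` key, and the type names are assumptions made for the example only.

```python
from collections import Counter


def indicator_counts(events):
    """Count, per event type, the events that still affect an indicator."""
    return Counter(e["type"] for e in events if not e["acknowledged"])


events = [
    {"type": "failed_message", "acknowledged": True},   # obsolete poison message
    {"type": "failed_message", "acknowledged": False},
    {"type": "custom_check", "acknowledged": True},     # retired custom check
    {"type": "heartbeat", "acknowledged": False},
]
counts = indicator_counts(events)
print(counts["failed_message"], counts["custom_check"], counts["heartbeat"])  # 1 0 1
```

Because the flag lives on the event rather than on the indicator, the same function serves every indicator type without per-type logic.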

@andreasohlund
Member

I see multiple requirements in play here.

How about:

  1. A way to reset the indicator (right click on the indicator itself - reset?)
  2. Add smart context menus on the Events (we already have support for this)
    • For custom checks: Ignore/Reset/Remove
    • For errors: Retry/Archive
    • For heartbeats: Reset/Ignore/Remove Endpoint
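
The per-type context menus proposed above could be driven by a simple lookup table. This is only a sketch: the action labels come from the list, while the type keys and the `actions_for` helper are hypothetical.

```python
# Action labels are taken from the proposal; the dictionary keys are assumed names.
CONTEXT_ACTIONS = {
    "custom_check": ["Ignore", "Reset", "Remove"],
    "error": ["Retry", "Archive"],
    "heartbeat": ["Reset", "Ignore", "Remove Endpoint"],
}


def actions_for(event_type):
    """Return the context-menu actions offered for an event of the given type."""
    return CONTEXT_ACTIONS.get(event_type, [])


print(actions_for("error"))  # ['Retry', 'Archive']
```

An unknown event type yields an empty menu rather than an error, which leaves room for event types added later.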

@dannycohen
Author

Closed and replaced by per-indicator GH issues using the terminology proposed by @andreasohlund in #13 (comment)
