Implement events acknowledgement #13

dannycohen · 2013-09-24T06:46:11Z

As Opie, I want to view alert details and mark an event as "acknowledged" after I have taken corrective action.

Visualizations:

An event can be set to be "active" or "acknowledged"
Event list in the dashboard or in the event management page displays only events whose status is "active"
To acknowledge an event - select one or more events in the event management page by checking the checkbox in the first column of each row, and click the "Acknowledge events" button
If no heartbeat events remain active for an endpoint, the relevant indicators revert to green
It may be possible that an indicator is green but it has an active event.

See case 2 below which covers the following flow of events:
1. Heartbeat failure causes the creation of an event and the indicator is red
2. The endpoint recuperates after a few minutes and heartbeat messages are received
3. The previously created event is still active - indicating there was a failure at some point in the past
4. The indicator is green - indicating that heartbeat messages are being received at present
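
The flow above separates two things: the indicator colour (what is happening now) and the event status (what happened and has not yet been acknowledged). A minimal sketch of that separation, in Python for illustration only; the names `Endpoint`, `Event`, `receiving_heartbeats` and `has_active_event` are hypothetical and are not taken from ServicePulse or ServiceControl. It models only the case 2 flow and does not cover case 1, where acknowledging events also resets the indicator.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class EventStatus(Enum):
    ACTIVE = "active"
    ACKNOWLEDGED = "acknowledged"


@dataclass
class Event:
    description: str
    status: EventStatus = EventStatus.ACTIVE  # new events start as "active"


@dataclass
class Endpoint:
    name: str
    receiving_heartbeats: bool = True  # assumption: drives the indicator colour
    events: List[Event] = field(default_factory=list)

    def indicator(self) -> str:
        # The indicator reflects the present state only.
        return "green" if self.receiving_heartbeats else "red"

    def has_active_event(self) -> bool:
        # The exclamation mark reflects unacknowledged history.
        return any(e.status is EventStatus.ACTIVE for e in self.events)


sales = Endpoint("Sales")

# Step 1: heartbeat failure creates an event and the indicator turns red.
sales.receiving_heartbeats = False
sales.events.append(Event("heartbeat missed"))
after_failure = (sales.indicator(), sales.has_active_event())

# Steps 2-4: heartbeats resume; the indicator is green, the event stays active.
sales.receiving_heartbeats = True
after_recovery = (sales.indicator(), sales.has_active_event())

# Opie acknowledges the event; nothing remains to flag.
for event in sales.events:
    event.status = EventStatus.ACKNOWLEDGED
after_ack = (sales.indicator(), sales.has_active_event())

print(after_failure, after_recovery, after_ack)
```

Keeping the two values independent is what allows a green indicator to coexist with an active event.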

Notes:

This applies to all event types (e.g. heartbeats, failed messages etc.)

Demo / Acceptance tests:

Case 1:

Deploy the heartbeat plugin in 3 of the 5 Video Store sample endpoints (in all except the ContentManagement and Operations endpoints)
Run the Video Store sample
Kill the "Sales" endpoint
Indicator should turn red within 1 minute
Number below Heartbeat Indicator should be 2 in green and 1 in red
Click on heartbeat indicator
3 endpoint heartbeat indicators should be displayed, 2 green, 1 red
The name of each endpoint and the number of seconds elapsed since the last heartbeat message was received are noted next to each indicator
Click on the endpoint name "Sales"
The events list in the event management page is displayed, filtered to show only the heartbeat events for the "Sales" endpoint
Select all the events for the "Sales" endpoint
Click the "Acknowledge events" button
The acknowledged events are removed from the events list, both in the event management page and in the dashboard
The endpoint heartbeat indicator for "Sales" endpoint is green (i.e. 3 green indicators) and so is the overall heartbeat indicator (Check supported SC version and report correct error #40)
Since we did not revive the "Sales" endpoint, the "Sales" heartbeat Indicator should turn red within 1 minute (and a new heartbeat event should be created)

Case 2:

Deploy the heartbeat plugin in 3 of the 5 Video Store sample endpoints (in all except the ContentManagement and Operations endpoints)
Run the Video Store sample
Kill the "Sales" endpoint
Indicator should turn red within 1 minute
Number below Heartbeat Indicator should be 2 in green and 1 in red
Click on heartbeat indicator
3 endpoint heartbeat indicators should be displayed, 2 green, 1 red
The name of each endpoint and the number of seconds elapsed since the last heartbeat message was received are noted next to each indicator
Click on the endpoint name "Sales"
The events list is displayed, filtered to show only the heartbeat events for the "Sales" endpoint
An active heartbeat event exists for the "Sales" endpoint
Create a new instance of the "Sales" endpoint
"Sales" endpoint Indicator should turn green within 1 minute (since it starts receiving heartbeat messages from the endpoint)
Next to the "Sales" endpoint indicator there is a small exclamation mark indicating there is an active event
Click on the endpoint name "Sales"
The events list is displayed, filtered to show only the heartbeat events for the "Sales" endpoint
The same active heartbeat alert exists for the "Sales" endpoint (see step 9 above)
Select the events and click the "Acknowledge events" button
The active event is removed from the alerts list
In the "Sales" endpoint indicator the small exclamation mark is no longer visible (indicating there are no longer any active events)

dannycohen · 2013-09-24T06:54:14Z

replaces Particular/ServiceControl#43

indualagarsamy · 2013-10-02T09:36:11Z

The more I think about this, the more this feature does not make sense to me

In the following use case or in any other use case, where the indicators can self correct, for example:
When the heartbeat indicator turns green again, there is not much work for Opie.
Life is good.
Why does he need to ack / clear this alert that endpoint A was down at 16:04:33?
Opie can always look at the alert history if he needs to know that there was a problem.

Also, the top 10 events on the dashboard show that heartbeats got restored at xx:yy:zz. So Opie has context for both the failure and the restored event. If they happen close enough, he'll still see it.

I'd like to downvote this feature for the minimum viable product or what am I missing here?

@andreasohlund @johnsimons @dannycohen - thoughts?

dannycohen · 2013-10-02T11:00:46Z

Why does he need to ack / clear this alert that endpoint A was down at 16:04:33?

IMO - because it was down at 16:04:33
The problem with events is that nobody pays them any attention, and it's easy to disregard them.
In the case of the endpoint being down: this is an excellent case, IMO, of something that Opie must know about (why was it down? did someone make changes to the endpoint or was it a failure?), and acknowledge.

From a burden perspective - I'd say it's the same as having emails sent and marked as read or deleted.
Optimally - there should not be many events like "endpoint was down" and if there are - then by all means, indicate it!

Same applies to custom checks: if there was a change, a custom check failed and then returned to normal operation, it may be buried well below the 10 most recent events. What then? How would Opie know about it? (by actively looking in the events list? that will not happen...)

andreasohlund · 2013-10-02T11:03:57Z

So it seems like we need to classify the events that Opie should need to
ack? (since not all of them will need it)

Or should we mark events as "seen" as Opie scrolls through the list of
events?

dannycohen · 2013-10-02T11:11:47Z

So it seems like we need to classify the events that Opie should need to
ack? (since not all of them will need it)

I'd say we need to ack everything we currently have:

Heartbeat misses
Failed Messages (ack or retry)
Custom checks (ack historical custom checks failures)
Process SLA violations

In the future we may (/will) add events that are less critical, like SLR attempts trends, or configuration changes etc.

Or should we mark events as "seen" as Opie scrolls through the list of
events?

This is an "implicit ack", which may be good enough. I'd say keep it as simple as possible for Alpha and have an explicit check box to ack (like the Read / Unread marker in email apps: It expects you to open or click on an email to be marked as read).

andreasohlund · 2013-10-02T11:14:55Z

This is an "implicit ack", which may be good enough. I'd say keep it as
simple as possible for Alpha and have an explicit check box to ack (like
the Read / Unread marker in email apps: It expects you to open or click on
an email to be marked as read).

I'd say they are equally easy to do technically. Which one is best from
Opie's point of view?

dannycohen · 2013-10-02T11:21:09Z

I'd say they are equally easy to do technically. Which one is best from
Opie's point of view?

Let's go with the explicit "checkbox and click ack", and get feedback on that. We can always change to "implicit ack" later or have it be configurable, based on event criticality, Opie's preference settings etc.

indualagarsamy · 2013-10-02T16:31:03Z

@dannycohen - I think your concern is valid. But I think it's more of a notification concern. I'd argue that it belongs with the New Relic plugin, i.e. heartbeats went down for an endpoint: notify Opie if the condition persists for x time, etc. via email, SMS, etc. Notify Opie's super if Opie hasn't taken action, and so on.

So, I am still not seeing value in the ack. When Opie acts, let's say fixes the endpoint, the indicator will get the right status. In which case, it is fixed. Opie acting on it is ack enough, without forcing Opie to go click an ack button.

dannycohen · 2013-10-02T16:38:31Z

I think your concern is valid. But I think it's more of a notification concern. I'd argue that it belongs with the New Relic plugin

For beta and RC, there will be no New Relic plugin, so we need to provide something that allows SP to stand on its own.
In addition, even with NR, I see value in this.

indualagarsamy · 2013-10-02T16:46:32Z

Asking Opie to click a button after the fact is still not helping Opie.
To get Opie's attention, you'd have to notify Opie. If ack is a big concern, then I'd say not having notifications about the alert is an equally big, if not bigger, gap IMO.

The way I see it is this:
Opie looks at the dashboard (everything looks green)
Why is he going to click the Alerts screen to go and click acks on alerts that have resolved themselves?
What would make Opie want to go to the alerts screen? Opie didn't get notified. Opie looks at the dashboard 2 hours later, all green.
In this state of affairs, what is the value for Opie to go in and click on Acks on the list of alerts that resolved themselves?

dannycohen · 2013-10-02T16:53:47Z

I'd say not having notifications about the alert is an equally big, if not bigger, gap IMO.

Notifications will be provided by 3rd party tools.

Opie looks at the dashboard (everything looks green)
Why is he going to click the Alerts screen to go and click acks on alerts that have resolved themselves?

That's the wrong description for the scenario.
Here's the same scenario with a minor change:

Opie looks at the dashboard.
Everything looks green, but there's an indication that something happened earlier in custom checks.
Opie goes to custom checks events and sees an issue that occurred 3 hours ago.
After checking the root cause, Opie decides to ack the events that occurred earlier.

dannycohen · 2013-10-02T16:58:16Z

Here's the alternative scenario, without the requirement to ack:

During the night, the system suffered intermittent connection issues and custom checks failed. In the early morning hours all returned to normal.
In the morning, Opie looks at the dashboard.
Everything looks green now, but there's no indication that something happened earlier in custom checks.
Opie is unaware that something happened, and happily continues on his daily routine.

I agree that SMS / Email notifications are missing, but this will be provided via 3rd party integration. Still, this scenario is not a valid one for SP.

indualagarsamy · 2013-10-02T17:24:23Z

I think notifications and acks go hand in hand. Implementing something partially here without the other is not making sense to me.
When you have the 3rd party notifications wired in, I think this would be solved.
Also, I don't want to pollute the individual indicators on the dashboard with potential false positives, i.e. indicating something that happened but is currently not a problem. I think those indicators are meant for what's the current situation.

indualagarsamy · 2013-10-02T17:27:58Z

I'd also strongly argue that this is not a critical feature for an MVP.

andreasohlund · 2013-10-02T17:32:48Z

+1. Can we put this on hold until we get some more user feedback?

dannycohen · 2013-10-02T19:44:56Z

Can we put this on hold until we get some more user feedback?

NP. Let's do that.

indualagarsamy · 2013-10-15T23:58:30Z

@dannycohen - is this still targeted for beta? I thought we moved this out?

dannycohen · 2013-10-16T19:05:22Z

@indualagarsamy - Given the schedule, I do not see how we can complete this for Beta. Moving to RC.

We've had some discussions about this in the past, so I would like to make sure that we agree (or not) that from a requirements perspective, this makes sense to you.

//CC @andreasohlund

dannycohen · 2013-10-16T19:15:33Z

@indualagarsamy -
For the time being, I am leaving the references to acknowledgements in #12; (I want to reach agreement on the acknowledgement implementation issue)

indualagarsamy · 2013-11-06T09:10:14Z

@dannycohen - can we talk about this?

dannycohen · 2013-11-06T11:56:47Z

I'll schedule a chat for this evening (your morning)

dannycohen · 2013-11-07T20:54:18Z

@indualagarsamy - until we manage to discuss this topic, please advise regarding the following issue:

This is how my Custom Checks look right now:

[image: image] https://f.cloud.github.com/assets/3889023/1495626/f04933b4-47ed-11e3-8af0-503a730e27dd.png

[image: image] https://f.cloud.github.com/assets/3889023/1495633/feb2d964-47ed-11e3-99dd-cf82dc950b87.png

The problem is that the 2 failed custom checks cannot be "fixed" because I changed the code in the custom check.
- i.e. in order to "reset" these events (or make them disappear) I will need to re-deploy the old version of the custom check so that it will reset the custom check status.
- As you can imagine - this is unacceptable in a prod environment and also undoable since the root cause of the custom check failure may be unfixable...
Another option is that I can go into the database and reset the event manually
- Scratch that... bad idea... ahhh...

So, how can I, as Opie, make the Custom Check indicator green again, given that it is currently flashing red for failed custom checks that are obsolete and not fixable?

//CC @andreasohlund / @johnsimons

johnsimons · 2013-11-07T21:13:57Z

Make sure the custom check id is the same across different revisions

dannycohen · 2013-11-07T21:24:06Z

Make sure the custom check id is the same across different revisions

That's not a realistic solution.
In this case I intentionally changed the custom check id since I originally gave it a misleading name (unintentionally...).
Should I have kept the misleading name there just to avoid falling into this issue ? How do you suggest we can make sure users do not fall into this trap, and if they do (and they will) how do we get them out ?

Another similar scenario is when a custom check needs to be retired (e.g. it checks something that always fails).
You can tell me to leave the custom check there, always returning a pass result, but that is as unattractive as leaving a misleading custom check id as-is because I can't get rid of old and obsolete events...
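
To make the trap concrete, here is a rough sketch of how a renamed check leaves a stale failure behind. It assumes, purely for illustration, that the latest result is stored per custom check id; the ids `DiskCheck` and `DataDriveFreeSpaceCheck` and the functions `report` and `indicator` are made up and do not reflect the actual ServiceControl storage.

```python
# Assumption for illustration: one stored result per custom check id.
latest_result = {}  # check id -> True (passed) / False (failed)


def report(check_id: str, passed: bool) -> None:
    """Record the most recent result reported under a given check id."""
    latest_result[check_id] = passed


def indicator() -> str:
    """Red if any stored check result is a failure, green otherwise."""
    return "green" if all(latest_result.values()) else "red"


# The check first runs under a misleading id and fails.
report("DiskCheck", False)

# The check is renamed; the new id reports a pass from now on.
report("DataDriveFreeSpaceCheck", True)

# Nothing ever reports under the old id again, so its failure never clears.
stale_failures = [cid for cid, ok in latest_result.items() if not ok]
state_after_rename = indicator()
print(state_after_rename, stale_failures)
```

Under that assumption the only ways out are to report a pass under the old id or to remove the old entry, which is the gap described above.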

dannycohen · 2013-11-07T21:49:50Z

@indualagarsamy / @johnsimons -
Another issue:

I have these 3 failed messages which cannot succeed on retry due to compatibility issues. They are basically poison messages.

How can I make them disappear ?
(they are obsolete, and there's nothing I can or need to do to fix the messages themselves; I just want to get the failed messages indicator to be green, since there are no failed messages - the ones that are there are irrelevant, obsolete messages that obstruct the use of the ServicePulse failed messages indicator...)

andreasohlund · 2013-11-08T05:54:17Z

We need an Archive error message command

andreasohlund · 2013-11-08T05:55:43Z

Seems like we need a manage custom checks page where you can delete old definitions?

dannycohen · 2013-11-08T06:42:28Z

We need an Archive error message command

We can call it archive, hide, "mark as obsolete" etc.
Essentially we need a flag per event saying whether that event should no longer affect the indicator display / count.
I called this proposed flag "acknowledgement", but any name will do.

Seems like we need a manage custom checks page where you can delete old definitions?

Same as for error messages.
All other indicators will have the same issue. For example, endpoint instances that are no longer available (because the machine on which the endpoint instance resided is no longer in use; see #15).
So we can do it differently for every event / indicator type, or we can do it once for all types.
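
A rough sketch of the "once for all types" option: a single per-event flag that every indicator consults. This is Python for illustration only; the `Event` fields, the `acknowledged` flag name and the two helper functions are hypothetical, not an existing ServicePulse API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    event_type: str             # e.g. "heartbeat", "failed_message", "custom_check"
    source: str
    acknowledged: bool = False  # the proposed flag; any name will do


def indicator_count(events: List[Event], event_type: str) -> int:
    """Count events of one type that should still affect the indicator."""
    return sum(
        1 for e in events
        if e.event_type == event_type and not e.acknowledged
    )


def indicator_colour(events: List[Event], event_type: str) -> str:
    return "red" if indicator_count(events, event_type) > 0 else "green"


events = [
    Event("failed_message", "poison message 1"),
    Event("failed_message", "poison message 2"),
    Event("failed_message", "poison message 3"),
    Event("heartbeat", "Sales"),
]

before = indicator_colour(events, "failed_message")

# Flag the obsolete failed messages; the heartbeat event is left untouched.
for e in events:
    if e.event_type == "failed_message":
        e.acknowledged = True

after = indicator_colour(events, "failed_message")
heartbeat_count = indicator_count(events, "heartbeat")
print(before, after, heartbeat_count)
```

The same two functions serve every event type, so no indicator needs its own clearing mechanism.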

andreasohlund · 2013-11-08T08:13:53Z

I see multiple requirements in play here.

How about:

A way to reset an indicator (right-click on the indicator itself - reset?)
Add smart context menus on the Events (we already have support for this)
- For custom checks: Ignore/Reset/Remove
- For errors: Retry/Archive
- For heartbeats: Reset/Ignore/Remove Endpoint
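
The per-type menus listed above could be expressed as a simple lookup. This is only a sketch in Python; the dictionary keys and the `actions_for` helper are hypothetical names, and only the action labels come from the list above.

```python
from typing import Dict, List

# Action labels taken from the proposal; the keys are illustrative.
CONTEXT_ACTIONS: Dict[str, List[str]] = {
    "custom_check": ["Ignore", "Reset", "Remove"],
    "error": ["Retry", "Archive"],
    "heartbeat": ["Reset", "Ignore", "Remove Endpoint"],
}


def actions_for(event_type: str) -> List[str]:
    """Return the context-menu actions for an event type (none if unknown)."""
    return list(CONTEXT_ACTIONS.get(event_type, []))


for event_type in ("custom_check", "error", "heartbeat"):
    print(event_type, "->", "/".join(actions_for(event_type)))
```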

dannycohen · 2013-12-03T12:41:45Z

Closed and replaced by per-indicator GH issue using the terminology proposed by @andreasohlund in #13 (comment)

ghost assigned indualagarsamy Sep 24, 2013

This was referenced Sep 24, 2013

Implement alerts clearing Particular/ServiceControl#43

Closed

Implement Failed Messages alerts viewing Particular/ServiceControl#48

Closed

dannycohen mentioned this issue Oct 10, 2013

Implement Process SLA indicator and events #21

Closed

dannycohen mentioned this issue Oct 16, 2013

Implement custom checks & corresponding alerts Particular/ServiceControl#45

Closed

dannycohen mentioned this issue Nov 6, 2013

Implement endpoint configuration Indicator and management #15

Closed

dannycohen mentioned this issue Nov 11, 2013

Implement Heartbeat Events viewing #12

Closed

dannycohen closed this as completed Dec 3, 2013

dannycohen mentioned this issue Dec 3, 2013

Support Archiving of Failed Messages #73

Closed
