
Telemetry & KPI's for beta, to be defined #49832

Closed
mikecote opened this issue Oct 31, 2019 · 8 comments · Fixed by #58081
Labels: Feature:Alerting, Team:ResponseOps, v7.7.0

Comments

mikecote commented Oct 31, 2019

cc @alexfrancoeur

@elasticmachine

Pinging @elastic/kibana-stack-services (Team:Stack Services)

pmuellr commented Oct 31, 2019

I'm not familiar at all with how apps feed telemetry data now, but it appears that some apps collect stats along the way and then either dump them at regular intervals or are polled by telemetry for the data.

One alternative to collecting internal stats would be to query the event log for them. There are likely reasons why this doesn't make sense, but I'm going to pretend it's something we'd like to be able to do at some point. So I'll be interested to see what we'll be feeding telemetry, to see whether we can actually get that back out of the event log.
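
For reference, here is a minimal sketch of the polling model described above, using the usageCollection plugin. The exact fetch context and whether a schema is required vary by Kibana version, and getAlertingStats is a hypothetical helper, not an existing function:

import { UsageCollectionSetup } from 'src/plugins/usage_collection/server';

interface AlertingUsage {
  count_total: number;
  count_active: number;
  executions_total: number;
}

export function registerAlertingUsageCollector(
  usageCollection: UsageCollectionSetup,
  // hypothetical helper that gathers the stats (from saved objects, the event log, etc.)
  getAlertingStats: () => Promise<AlertingUsage>
) {
  const collector = usageCollection.makeUsageCollector({
    type: 'alerts',
    isReady: () => true,
    // telemetry polls this at its own interval and folds the result into the usage payload
    fetch: async () => getAlertingStats(),
  });
  usageCollection.registerCollector(collector);
}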

@alexfrancoeur

Our telemetry today is fairly high level and, for the sake of time, I think we can avoid ui_metrics for beta. Here are some initial thoughts on metrics to start capturing and goals we'd like to achieve.

Goals

  • Understand general product usage and the scale at which the alerting framework is being utilized
  • To have X percentage of unique clusters using alerts by Y date
  • [more to come]

For our first release, I'm not sure how much effort we should put into more granular telemetry, given that usage within the apps themselves will be limited. These are some basic metrics that come to mind, and they're all up for debate; I thought they'd help kick off a discussion. If we need to trim them down or add better ones, we can. A rough sketch of the payload shape follows the lists below.

  • Alerting plugin
    • enabled: true or false
  • Alerts overall
    • Total count
    • Total count active (in use)
    • Total executions
  • Alert by type
    • Total count
    • Total count active (in use)
    • Total executions
  • Actions overall
    • Total count
    • Total count active (in use)
    • Total executions
  • Actions by type
    • Total count
    • Total count active (in use)
    • Total executions
  • Tasks
    • Total count
    • Total count active (in use)
    • Total executions
  • Alert instances
    • I'm not sure if total counts make sense here. Min/max/avg listed below might be our best option to start.

Nice to have metrics

  • Min / max / avg - tasks per second
  • Min / max / avg - alert schedule time
  • Min / max / avg - throttle time
  • Min / max / avg - connectors per alert
  • Min / max / avg - alert instances
  • Total count of alerts enabled / disabled
  • How frequently actions on an alert are taken. Acked, muted, etc.
  • Event log doc count (not for the initial release)
  • Lots more..
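
To make the lists above concrete, here is a rough sketch of what the telemetry payload could look like; the field names are illustrative, not a committed schema:

interface AlertingTelemetry {
  enabled: boolean;                          // alerting plugin enabled
  alerts: {
    count_total: number;
    count_active: number;
    executions_total: number;
    count_by_type: Record<string, number>;   // e.g. { '.index-threshold': 12 }
  };
  actions: {
    count_total: number;
    count_active: number;
    executions_total: number;
    count_by_type: Record<string, number>;   // e.g. { '.slack': 3, '.email': 5 }
  };
  tasks: {
    count_total: number;
    count_active: number;
    executions_total: number;
  };
}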

@peterschretlen what do you think?

@peterschretlen

@alexfrancoeur that's a comprehensive list!

I think the metrics are good (assuming we can instrument and collect them). In addition, I think it would be important to segment the metrics by spaces and solutions.

Certain solutions might generate a lot of alerts automatically (like SIEM, for example), which might skew the numbers. Quantity may not equate to usage; I think it will depend on the app.

And for spaces: since alerts are segmented by space, it would be good to know that this is being used (for example, alerts for an app appearing in multiple spaces could suggest the isolation of spaces is being put to use, especially if there are different types or quantities of alerts).

I agree with @pmuellr that the activity log would be a good place to get some of this information. In fact, from an admin's perspective, a lot of these metrics would be very useful to show in the management view, and we might want to expose some or all of this in a stats API we could surface in the UI.
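
As a rough illustration of the segmentation idea, alert saved objects could be bucketed by consumer and by space with a single aggregation over the saved-objects index. The field names (alert.consumer, namespace) and the 7.x Elasticsearch client response shape are assumptions here, not a confirmed mapping:

import { Client } from '@elastic/elasticsearch';

export async function countAlertsBySolutionAndSpace(client: Client, kibanaIndex = '.kibana') {
  // 7.x client style: the response payload is nested under `body`
  const { body } = await client.search({
    index: kibanaIndex,
    size: 0,
    body: {
      query: { term: { type: 'alert' } },
      aggs: {
        by_consumer: { terms: { field: 'alert.consumer' } },             // e.g. siem, apm
        by_space: { terms: { field: 'namespace', missing: 'default' } }, // default space has no namespace field
      },
    },
  });
  return body.aggregations;
}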

@bmcconaghy added the Team:ResponseOps label and removed the Team:Stack Services label on Dec 12, 2019
pmuellr commented Jan 8, 2020

Some notes:

Actions overall -> Total count - I think this would be the number of actions created (# of action saved objects)

Actions overall -> Total count active (in use) - number of actions created that are actually used in an alert (or somewhere else, but currently just alerts AFAIK).

Alerts overall -> Total count - would be like the actions one - number of alerts created (# of alert saved objects)

Alerts overall -> Total count active (in use) - number of alerts created that are not disabled

We probably want to track execution failures as well as successes - assume Total executions is the successes + failures, then add a new metric for Total execution failures or such. For both Actions and Alerts.

Not clear if we really need the “overall” stats for Alerts/Actions, since that’s just denormalized sums of the “by type” ones. Can we just get those values for free somehow, wherever we make these stats available? Though it will be simple to calculate the “overall” stats, presumably, given the “by type” ones, so not a big deal.
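
If we only collect the "by type" stats, the "overall" ones can indeed be derived for free; a minimal sketch of that roll-up (names are illustrative):

interface PerTypeStats {
  count_total: number;
  count_active: number;
  executions_total: number;
}

// sum the per-type stats to produce the denormalized "overall" stats
function rollUpOverall(byType: Record<string, PerTypeStats>): PerTypeStats {
  return Object.values(byType).reduce(
    (overall, t) => ({
      count_total: overall.count_total + t.count_total,
      count_active: overall.count_active + t.count_active,
      executions_total: overall.executions_total + t.executions_total,
    }),
    { count_total: 0, count_active: 0, executions_total: 0 }
  );
}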

It feels like Alert Instances should actually be under Alerts overall and Alerts by type instead of by itself. I think a total count here is probably a good start on those.

I suspect we can add all sorts of stats for Tasks, eh @gidi Morris???

pmuellr commented Jan 8, 2020

Did a little digging on where alertInstances might be available, but not much luck. I was thinking they were probably persisted, but am not seeing them in the alert or task SO's. They might not be persisted.

If not, or even if they are, you can see what alertInstances are "in use" by looking at the alertInstances object after this call completes:

const updatedAlertTypeState = await this.alertType.executor({
  alertId,
  services: {
    ...services,
    alertInstanceFactory: createAlertInstanceFactory(alertInstances),
  },
  params,
  state: alertTypeState,
  startedAt: this.taskInstance.startedAt!,
  previousStartedAt,
});

That object is just a {} where every key is an alertInstance (a string). They are specific to the alert, so you'd combine them with the alert id (e.g. ${alertId}::${alertInstance}), store those in a set somewhere, and count them to get the total (distinct) number.
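
A minimal sketch of that counting approach, keying each instance by alert id plus instance id and counting the distinct keys (the recordInstances hook is hypothetical):

const seenInstances = new Set<string>();

// hypothetical hook, called with the alertInstances object after the executor completes
function recordInstances(alertId: string, alertInstances: Record<string, unknown>) {
  for (const instanceId of Object.keys(alertInstances)) {
    seenInstances.add(`${alertId}::${instanceId}`);
  }
}

// total (distinct) alert instances observed so far
const totalDistinctInstances = () => seenInstances.size;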

@mikecote may have a better answer

gmmorris commented Jan 9, 2020

@pmuellr you tagged some random Gidi :)

Regarding Tasks, I'm not sure what we'd want to track as we could easily end up with lots of data that doesn't tell us much.

I don't know much about how we're using and visualising telemetry data, but can we differentiate between Task stats in systems under heavy use vs. light use? Many alerts vs. none? Large clusters vs. random single-node installations?
Understanding these things could help me get a better grasp of what data to collect. 🤷‍♂

Regarding @alexfrancoeur's list:

  • Tasks

    • Total count
    • Total count active (in use)
    • Total executions

Total Count doesn't include completed Tasks, as we don't keep Task history. It would include all scheduled tasks (one-time and interval) and failed tasks (as we don't clean these out).
That means Total count active (in use) would be Total Count without the failed ones?
What does Total executions mean? The total number of task runs since startup? Should be fine... 🤔 until we overflow 😆, so perhaps we should do Total executions in past hour to avoid overflowing? That's how you'd usually track this kind of thing when monitoring a system, and it still gives you a good idea of overall executions.
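
A minimal sketch of the "Total executions in past hour" idea, assuming we can hook each task run as it happens; old timestamps are pruned before reporting, so the counter never grows without bound:

const executionTimestamps: number[] = [];

function recordExecution(now = Date.now()) {
  executionTimestamps.push(now);
}

function executionsInPastHour(now = Date.now()): number {
  const cutoff = now - 60 * 60 * 1000;
  // drop entries older than the one-hour window
  while (executionTimestamps.length > 0 && executionTimestamps[0] < cutoff) {
    executionTimestamps.shift();
  }
  return executionTimestamps.length;
}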

@mikecote

@pmuellr

I was thinking they were probably persisted, but am not seeing them in the alert or task SO's. They might not be persisted.

They are persisted within the alert's task saved object, under the attributes.state.alertInstances attribute. It's possible the value wasn't there if no instances existed yet.

The updated state object gets built / created starting here:

return {
  alertTypeState: updatedAlertTypeState,
  alertInstances: instancesWithScheduledActions,
};
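
Based on that, a hedged sketch of counting persisted instances straight from the task document; the saved objects client call is illustrative (task documents live in the task manager index, so the real lookup may need to go through task manager instead), and the state handling allows for it being stored as a JSON string:

import { SavedObjectsClientContract } from 'src/core/server';

interface TaskAttributes {
  state?: string | { alertInstances?: Record<string, unknown> };
}

async function countPersistedInstances(
  savedObjectsClient: SavedObjectsClientContract,
  taskId: string
): Promise<number> {
  const task = await savedObjectsClient.get<TaskAttributes>('task', taskId);
  // state may be serialized as a JSON string depending on how task manager stores it
  const raw = task.attributes.state;
  const state = typeof raw === 'string' ? JSON.parse(raw) : raw ?? {};
  // the keys of alertInstances are the instance ids
  return Object.keys(state.alertInstances ?? {}).length;
}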

@kobelb added the needs-team label on Jan 31, 2022
@botelastic removed the needs-team label on Jan 31, 2022