
Telemetry & KPI's for beta, to be defined #49832

Closed
mikecote opened this issue Oct 31, 2019 · 8 comments · Fixed by #58081
Labels: Feature:Alerting, Team:ResponseOps, v7.7.0

Comments

mikecote commented Oct 31, 2019

cc @alexfrancoeur

@elasticmachine

Pinging @elastic/kibana-stack-services (Team:Stack Services)

pmuellr commented Oct 31, 2019

I'm not familiar at all with how apps feed telemetry data now, but it appears that some apps collect stats along the way and then either dump them at regular intervals or are polled by telemetry for the data.

One alternative to collecting internal stats would be to query the event log for them. There are likely reasons why this doesn't make sense, but I'm going to pretend it's something we'd like to be able to do at some point. So I'll be interested to see what we'll be feeding telemetry, to see whether we can actually get that back out of the event log.
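
For reference, here is a minimal sketch of the polling model described above, using the usageCollection plugin. The exact fetch context and whether a schema is required vary by Kibana version, and getAlertingStats is a hypothetical helper, not an existing function:

import { UsageCollectionSetup } from 'src/plugins/usage_collection/server';

interface AlertingUsage {
  count_total: number;
  count_active: number;
  executions_total: number;
}

export function registerAlertingUsageCollector(
  usageCollection: UsageCollectionSetup,
  // hypothetical helper that gathers the stats (from saved objects, the event log, etc.)
  getAlertingStats: () => Promise<AlertingUsage>
) {
  const collector = usageCollection.makeUsageCollector({
    type: 'alerts',
    isReady: () => true,
    // telemetry polls this at its own interval and folds the result into the usage payload
    fetch: async () => getAlertingStats(),
  });
  usageCollection.registerCollector(collector);
}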

@alexfrancoeur

Our telemetry today is fairly high level and, for the sake of time, I think we can avoid ui_metrics for beta. Here are some initial thoughts on metrics to start capturing and goals we'd like to achieve.

Goals

  • Understand general product usage and the scale at which the alerting framework is being utilized
  • To have X percentage of unique clusters using alerts by Y date
  • [more to come]

For our first release, I'm not sure how much effort we should put into more granular telemetry, given that usage within the apps themselves will be limited. These are some basic metrics that come to mind, and they're all up for debate; I thought they'd help kick off a discussion. If we need to trim them down or add better ones, we can. A rough sketch of the payload shape follows the lists below.

  • Alerting plugin
    • enabled: true or false
  • Alerts overall
    • Total count
    • Total count active (in use)
    • Total executions
  • Alert by type
    • Total count
    • Total count active (in use)
    • Total executions
  • Actions overall
    • Total count
    • Total count active (in use)
    • Total executions
  • Actions by type
    • Total count
    • Total count active (in use)
    • Total executions
  • Tasks
    • Total count
    • Total count active (in use)
    • Total executions
  • Alert instances
    • I'm not sure if total counts make sense here. Min/max/avg listed below might be our best option to start.

Nice to have metrics

  • Min / max / avg - tasks per second
  • Min / max / avg - alert schedule time
  • Min / max / avg - throttle time
  • Min / max / avg - connectors per alert
  • Min / max / avg - alert instances
  • Total count of alerts enabled / disabled
  • How frequently actions on an alert are taken. Acked, muted, etc.
  • Event log doc count (not for the initial release)
  • Lots more..
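
To make the lists above concrete, here is a rough sketch of what the telemetry payload could look like; the field names are illustrative, not a committed schema:

interface AlertingTelemetry {
  enabled: boolean;                          // alerting plugin enabled
  alerts: {
    count_total: number;
    count_active: number;
    executions_total: number;
    count_by_type: Record<string, number>;   // e.g. { '.index-threshold': 12 }
  };
  actions: {
    count_total: number;
    count_active: number;
    executions_total: number;
    count_by_type: Record<string, number>;   // e.g. { '.slack': 3, '.email': 5 }
  };
  tasks: {
    count_total: number;
    count_active: number;
    executions_total: number;
  };
}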

@peterschretlen what do you think?

@peterschretlen

@alexfrancoeur that's a comprehensive list!

I think the metrics are good (assuming we can instrument and collect them). In addition, I think it would be important to segment the metrics by spaces and solutions.

Certain solutions might generate a lot of alerts automatically (like SIEM, for example), which might skew the numbers. Quantity may not equate to usage; I think it will depend on the app.

And for spaces: since alerts are segmented by space, it would be good to know that this is being used (for example, alerts for an app appearing in multiple spaces could suggest the isolation of spaces is being put to use, especially if there are different types or quantities of alerts).

I agree with @pmuellr that the activity log would be a good place to get some of this information. In fact, from an admin's perspective, a lot of these metrics would be very useful to show in the management view, and we might want to expose some or all of this in a stats API we could surface in the UI.
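
As a rough illustration of the segmentation idea, alert saved objects could be bucketed by consumer and by space with a single aggregation over the saved-objects index. The field names (alert.consumer, namespace) and the 7.x Elasticsearch client response shape are assumptions here, not a confirmed mapping:

import { Client } from '@elastic/elasticsearch';

export async function countAlertsBySolutionAndSpace(client: Client, kibanaIndex = '.kibana') {
  // 7.x client style: the response payload is nested under `body`
  const { body } = await client.search({
    index: kibanaIndex,
    size: 0,
    body: {
      query: { term: { type: 'alert' } },
      aggs: {
        by_consumer: { terms: { field: 'alert.consumer' } },             // e.g. siem, apm
        by_space: { terms: { field: 'namespace', missing: 'default' } }, // default space has no namespace field
      },
    },
  });
  return body.aggregations;
}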

@bmcconaghy added the Team:ResponseOps label and removed the Team:Stack Services label on Dec 12, 2019
pmuellr commented Jan 8, 2020

Some notes:

Actions overall -> Total count - I think this would be the number of actions created (# of action saved objects)

Actions overall -> Total count active (in use) - number of actions created that are actually used in an alert (or somewhere else, but currently just alerts AFAIK).

Alerts overall -> Total count - would be like the actions one - number of alerts created (# of alert saved objects)

Alerts overall -> Total count active (in use) - number of alerts created that are not disabled

We probably want to track execution failures as well as successes - assume Total executions is the successes + failures, then add a new metric for Total execution failures or such. For both Actions and Alerts.

Not clear if we really need the “overall” stats for Alerts/Actions, since that’s just denormalized sums of the “by type” ones. Can we just get those values for free somehow, wherever we make these stats available? Though it will be simple to calculate the “overall” stats, presumably, given the “by type” ones, so not a big deal.
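
If we only collect the "by type" stats, the "overall" ones can indeed be derived for free; a minimal sketch of that roll-up (names are illustrative):

interface PerTypeStats {
  count_total: number;
  count_active: number;
  executions_total: number;
}

// sum the per-type stats to produce the denormalized "overall" stats
function rollUpOverall(byType: Record<string, PerTypeStats>): PerTypeStats {
  return Object.values(byType).reduce(
    (overall, t) => ({
      count_total: overall.count_total + t.count_total,
      count_active: overall.count_active + t.count_active,
      executions_total: overall.executions_total + t.executions_total,
    }),
    { count_total: 0, count_active: 0, executions_total: 0 }
  );
}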

It feels like Alert Instances should actually be under Alerts overall and Alerts by type instead of by itself. I think a total count here is probably a good start on those.

I suspect we can add all sorts of stats for Tasks, eh @gidi Morris???

pmuellr commented Jan 8, 2020

Did a little digging on where alertInstances might be available, but not much luck. I was thinking they were probably persisted, but am not seeing them in the alert or task SO's. They might not be persisted.

If not, or even if they are, you can see what alertInstances are "in use" by looking at the alertInstances object after this call completes:

const updatedAlertTypeState = await this.alertType.executor({
  alertId,
  services: {
    ...services,
    alertInstanceFactory: createAlertInstanceFactory(alertInstances),
  },
  params,
  state: alertTypeState,
  startedAt: this.taskInstance.startedAt!,
  previousStartedAt,
});

That object is just a {} where every key is an alertInstance (a string). They are specific to the alert, so you'd combine them with the alert id (e.g. ${alertId}::${alertInstance}), store those in a set somewhere, and count them to get the total (distinct) number.
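
A minimal sketch of that counting approach, keying each instance by alert id plus instance id and counting the distinct keys (the recordInstances hook is hypothetical):

const seenInstances = new Set<string>();

// hypothetical hook, called with the alertInstances object after the executor completes
function recordInstances(alertId: string, alertInstances: Record<string, unknown>) {
  for (const instanceId of Object.keys(alertInstances)) {
    seenInstances.add(`${alertId}::${instanceId}`);
  }
}

// total (distinct) alert instances observed so far
const totalDistinctInstances = () => seenInstances.size;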

@mikecote may have a better answer

gmmorris commented Jan 9, 2020

@pmuellr you tagged some random Gidi :)

Regarding Tasks, I'm not sure what we'd want to track as we could easily end up with lots of data that doesn't tell us much.

I don't know much about how we're using and visualising telemetry data, but can we differentiate between Task stats in systems under heavy use vs. light use? Many alerts vs. none? Large clusters vs. random single-node installations?
Understanding these things could help me get a better grasp of what data to collect. 🤷‍♂

Regarding @alexfrancoeur's list:

  • Tasks

    • Total count
    • Total count active (in use)
    • Total executions

Total Count doesn't include completed Tasks, as we don't keep Task history. It would include all scheduled tasks (one-time and interval) and failed tasks (as we don't clean these out).
That means Total count active (in use) would be Total Count without the failed ones?
What does Total executions mean? The total number of task runs since startup? Should be fine... 🤔 until we overflow 😆, so perhaps we should do Total executions in past hour to avoid overflowing? That's how you'd usually track this kind of thing when monitoring a system, and it still gives you a good idea of overall executions.
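
A minimal sketch of the "Total executions in past hour" idea, assuming we can hook each task run as it happens; old timestamps are pruned before reporting, so the counter never grows without bound:

const executionTimestamps: number[] = [];

function recordExecution(now = Date.now()) {
  executionTimestamps.push(now);
}

function executionsInPastHour(now = Date.now()): number {
  const cutoff = now - 60 * 60 * 1000;
  // drop entries older than the one-hour window
  while (executionTimestamps.length > 0 && executionTimestamps[0] < cutoff) {
    executionTimestamps.shift();
  }
  return executionTimestamps.length;
}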

@mikecote

@pmuellr

I was thinking they were probably persisted, but am not seeing them in the alert or task SO's. They might not be persisted.

They are persisted within the alert's task saved object, under the attributes.state.alertInstances attribute. It's possible the value wasn't there if no instances existed yet.

The updated state object gets built / created starting here:

return {
  alertTypeState: updatedAlertTypeState,
  alertInstances: instancesWithScheduledActions,
};
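
Based on that, a hedged sketch of counting persisted instances straight from the task document; the saved objects client call is illustrative (task documents live in the task manager index, so the real lookup may need to go through task manager instead), and the state handling allows for it being stored as a JSON string:

import { SavedObjectsClientContract } from 'src/core/server';

interface TaskAttributes {
  state?: string | { alertInstances?: Record<string, unknown> };
}

async function countPersistedInstances(
  savedObjectsClient: SavedObjectsClientContract,
  taskId: string
): Promise<number> {
  const task = await savedObjectsClient.get<TaskAttributes>('task', taskId);
  // state may be serialized as a JSON string depending on how task manager stores it
  const raw = task.attributes.state;
  const state = typeof raw === 'string' ? JSON.parse(raw) : raw ?? {};
  // the keys of alertInstances are the instance ids
  return Object.keys(state.alertInstances ?? {}).length;
}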

@kobelb added the needs-team label on Jan 31, 2022
@botelastic removed the needs-team label on Jan 31, 2022