
API to get all active instances from Observability consumers #70169

Closed
cauemarcondes opened this issue Jun 29, 2020 · 16 comments · Fixed by #87596
Assignees
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@cauemarcondes
Contributor

cauemarcondes commented Jun 29, 2020

In the new Observability Overview page, we're planning to show two charts to give the user a clear picture of which alerts are active at the moment.

In this chart, we want to show all active instances for all observability plugins (APM/Logs/Uptime/Metrics) grouped by type.
[Screenshot 2020-06-25 at 12:40:51]

And in this one, we want to show some alert detail and the number of active instances next to it.
[Screenshot 2020-06-29 at 10:21:32]

Current situation:
With the current API, to get this information I have to first call _find to get all created alerts, then filter by the Observability plugins (APM/Logs/Uptime/Metrics), and finally make an HTTP call for each alert to get its active instances.

What I need:
An API that returns all active instances along with the alert details, with the ability to filter by consumer and alert type.

Example API:

alerting.getInstances({ active: true, consumers: ['apm', 'uptime', 'metrics'] })

Example response:

[
  {
    "id": "b5ef31a1-7c9f-47f5-a0d4-69169fc2f407",
    "params": {
      "threshold": 1,
      "aggregationType": "avg",
      "windowSize": 5,
      "windowUnit": "m",
      "transactionType": "request",
      "environment": "ENVIRONMENT_ALL",
      "serviceName": "opbeans-java"
    },
    "consumer": "apm",
    "alertTypeId": "apm.transaction_duration",
    "schedule": {
      "interval": "10s"
    },
    "actions": [
      {
        "actionTypeId": ".webhook",
        "group": "threshold_met",
        "params": {
          "body": "{\"transaction\": \"transaction\"}"
        },
        "id": "4e6a507f-1238-49c1-8b55-c19e42076543"
      }
    ],
    "tags": ["apm", "service.name:opbeans-java"],
    "name": "Transaction duration | opbeans-java",
    "throttle": "15s",
    "enabled": true,
    "apiKeyOwner": "elastic",
    "createdBy": "elastic",
    "updatedBy": "elastic",
    "createdAt": "2020-06-25T14:27:19.820Z",
    "muteAll": false,
    "mutedInstanceIds": [],
    "scheduledTaskId": "fad2cf20-b6ef-11ea-9623-a57005710a46",
    "updatedAt": "2020-06-25T14:27:21.257Z",

    // All active instances
    "alertInstances": [
      {
        "state": {},
        "meta": {
          "lastScheduledActions": {
            "group": "threshold_met",
            "date": "2020-06-29T08:31:38.802Z"
          }
        }
      }
    ]
  }
]
@pmuellr pmuellr added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Jun 29, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ogupte
Contributor

ogupte commented Jun 29, 2020

Hello, APM service maps also needs this capability for 7.9. We need to be able to display health indicators for all services in the service map that have active alert violations. Right now, we can only get it to work by calling getAlertState in parallel for each id we get from find, but that is prohibitively inefficient, especially for very large service maps. Something that lets us get all the alert statuses in one go is required before we can integrate.

@sorenlouv
Member

@mikecote I see you've added this to "Long Term". This is something we hope to be able to have available in 7.10. Is that possible?

@mikecote
Contributor

@sqren I went over the recording of the triage session we had for this issue. I think we needed more clarification on whether this issue was still needed, or whether your requirements had changed based on the scope adjustment the homepage team made for 7.9 / 7.10. We placed it with the bulk APIs story (long term) and had an approach we believe could work for you without waiting on this API (see the email thread from a few weeks ago).

@pmuellr can help on this. The approach that could work for now is to use the alert find API to get all observability related alerts (filter by alert type and/or consumer) and then use the task manager's fetch API for the alert's scheduledTaskId. With that result, each task will contain the state of an alert and you can then extract the instances from there.

We can always revisit and prioritize this issue no problem, probably in the scope of 7.11 once our work for GA is complete.

@kobelb
Contributor

kobelb commented Jul 23, 2020

@XavierM has a PR which adds aggregations to the SavedObjectsClient, can we take advantage of this here?

@ogupte
Contributor

ogupte commented Jul 28, 2020

@mikecote

... The approach that could work for now is to use the alert find API to get all observability related alerts (filter by alert type and/or consumer) and then use the task manager's fetch API for the alert's scheduledTaskId. With that result, each task will contain the state of an alert and you can then extract the instances from there.

From another thread:

I believe the workaround that has been suggested was similar to the approach mentioned earlier. It works by fetching the task state for each matching alert returned in the find call. In setups with few alerts configured, it will add a few requests to the page load. But what if we're trying to load a service map with 10, 20, or more services where each has an alert configured? Are we OK adding x number of single requests to our initial page load?

From the alerting plugin context, it might be possible to obtain multiple alerts states in a single request, but it would require querying the task manager index filtered by job ids obtained in the initial find. This would result in the initial page load adding a constant 2 additional requests instead of the suggested 1 + x requests.
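The join described above (one find call plus one bulk task fetch, instead of 1 + x requests) can be sketched as follows. This is a minimal illustration over already-fetched data; the interfaces here are simplified assumptions, not the real Kibana alerting or task manager types.

```typescript
// Simplified, hypothetical shapes — the real Kibana types differ.
interface FoundAlert {
  id: string;
  scheduledTaskId: string;
}

interface FetchedTask {
  id: string;
  state: { alertInstances?: Record<string, unknown> };
}

// Join alerts from a single find() call with task states from a single
// bulk task-manager fetch, yielding active instance ids per alert id.
function activeInstancesByAlert(
  alerts: FoundAlert[],
  tasks: FetchedTask[]
): Record<string, string[]> {
  const taskById = new Map<string, FetchedTask>(tasks.map((t) => [t.id, t]));
  const result: Record<string, string[]> = {};
  for (const alert of alerts) {
    const task = taskById.get(alert.scheduledTaskId);
    // An alert whose task is missing or has no instances is simply inactive.
    result[alert.id] = Object.keys(task?.state.alertInstances ?? {});
  }
  return result;
}
```

Whatever the exact types, the point is that the instance lookup becomes an in-memory join keyed by scheduledTaskId, so the request count stays constant as the number of services grows.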

@pmuellr
Member

pmuellr commented Aug 18, 2020

We've re-prioritized some work that, I think, will work out very well for this requirement.

We will be formalizing the notion of an alert "status" per issue #51099. We'll add a new status object to the alert saved object, which means you should be able to get the status from the alertClient find() API (or equivalent http call), including usual saved object filtering, fields, etc. I think this would mean having to retrieve all the alerts with find(), and manually generating the numbers, based on the alert type and status.

That gets us from 1+x (or even 2) API calls down to 1! (but with more data than actually required, I think)

I'm going to start working on this shortly, will note the PR here once it's under way.

@formgeist
Contributor

@pmuellr Thanks for the heads up - it sounds very exciting!

@pmuellr
Member

pmuellr commented Sep 30, 2020

Doing some bookkeeping, realized I didn't post the PR with the new 'alert status' field - it's here: #75553

But also, re-reading this and the original request, I realized that this still doesn't give us instance data, just the alert data. So that still leaves us in a 1 + n requests state: 1 find() request to get the alerts, and then n calls to get the instance data.

It feels to me like we'll end up needing some new APIs, and I don't think we've talked about what those might look like, so here's a rough sketch:

  • new method on alerts client that takes find() parameters, and returns instance data about all the matching alerts; this would internally use find(), then make a single call (well, probably have to deal with pagination, but one "virtual" call) to the event log to query against all the alert SO's returned from find(). We'd likely need to process the events returned to get whatever data we're looking for, much like the current "get instance status" API (which returns instance data for a single alert)

  • http API that calls that new alerts client API

  • some changes to the event log to bypass the current checks on the saved object being queried for event data - that's done for security reasons (you need to be able to read an alert to see its events) - because we've already done that check in the find() call to get the list of alerts

I should note this would be to get instance data beyond just the current state of known instances (eg, it could return data about recent instances which are no longer active, like the current "get instance status" API). If we only need the current list of instances, or count of instances, it's possible we could do a query over task manager to get the current alert instance data. This also wouldn't contain any instance status data like errors. Here's what that task manager data looks like (note, it's stored as a JSON string today, so we'd need to parse it after fetching and can't search over these "fields"); this shows an alert with one active instance, host-1:

{
  "alertInstances": {
    "host-1": {
      "state": {},
      "meta": {
        "lastScheduledActions": {
          "group": "threshold met",
          "date": "2020-09-30T21:40:14.771Z"
        }
      }
    }
  },
  "previousStartedAt": "2020-09-30T21:40:14.664Z"
}
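Because that task state is stored as a JSON string (as noted above), a consumer would have to parse it after fetching before reading the instance keys. A minimal sketch, with an assumed shape rather than the real Kibana task manager type:

```typescript
// Parse the JSON-string task state and list currently active instance ids.
// The shape here is an assumption based on the example above, not a
// documented Kibana type.
function activeInstanceIds(rawState: string): string[] {
  const parsed = JSON.parse(rawState) as {
    alertInstances?: Record<string, unknown>;
  };
  // Each key of alertInstances is an active instance id, e.g. "host-1".
  return Object.keys(parsed.alertInstances ?? {});
}
```

Note that because the state is an opaque string in the index, these "fields" cannot be searched server-side; filtering has to happen after fetching and parsing.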

@sorenlouv
Member

Another issue which depends on being able to retrieve alert instances: #85479

Let me know if everything is clear or I should add more details.

@pmuellr
Member

pmuellr commented Dec 15, 2020

Thx @sqren !

From #85479:

The alert instance should be displayed at the time it activated. On hover it should be possible to see the threshold and the value that exceeded the threshold.

So you'll need the time, threshold, and actual value.

Today, you can get the active-instance events from the event log to get the alert id, instance, I think action group (and a bit more). We don't currently store the threshold or the actual value, since there's no common value across alerts for those - but I have been thinking that it makes sense, if you can boil everything down to "simple values", and preferably numbers :-). That would be a new concept for alerting, but think it makes sense.

Presumably the application knows the "value that exceeded the threshold", unless it's no longer available (eg, ILM). But then the app wouldn't be able to show a pretty graph to annotate in the first place.

But if we're storing the threshold value (where else would an "older" version of a threshold value, if changed over time, be available?), it makes sense to store the metric value as well, so we should add those both at the same time.

In terms of "progressive enhancement" then, I'd hope we'll make those values available at some point in the future in the event log, but for today, all you'll have is the timestamp of when the alert/instance was "active".
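Given only the activation timestamp for now, the chart annotations discussed above could be derived as follows. The event shape here is purely illustrative - it is not the real event log schema - and the threshold/value fields would only be added later, per the "progressive enhancement" note.

```typescript
// Hypothetical shape for an event-log "active instance" entry; field names
// are illustrative, not the real event-log schema.
interface ActiveInstanceEvent {
  timestamp: string; // ISO 8601 time at which the instance became active
  alertId: string;
  instanceId: string;
}

// With only timestamps available today, a chart annotation can carry the
// activation time plus an identifying label (no threshold or actual value yet).
function toAnnotations(
  events: ActiveInstanceEvent[]
): Array<{ x: number; label: string }> {
  return events.map((e) => ({
    x: Date.parse(e.timestamp),
    label: `${e.alertId} / ${e.instanceId}`,
  }));
}
```

Once threshold and actual value are stored as "simple values" in the event log, extending the annotation with those fields would be a straightforward addition to this mapping.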

@sorenlouv
Member

but for today, all you'll have is the timestamp of when the alert/instance was "active".

Sounds great! Having the timestamp will still allow us to add alert annotations to charts which is a great start. Then we can enhance this down the road with the actual values.

@mikecote
Contributor

Some notes from the 7.12 planning session

@peterschretlen

One of the outcomes of the working group workstreams was to have these instances as data, rather than state in a saved object. That might be worth considering here. Pulling the instance state out of a set of rules sounds challenging, and may not align with the long term direction. Having instances as data (or maybe as data in addition to state) might be worth considering.

@pmuellr

We may need to scope this issue down to "an API to get all instances for all visible alerts" - today, clients need to separately get a list of all the visible alerts, and then get the instances for each of them; 1 + N calls; I want to build an API to do this in one call (from the client's perspective). Adding additional data to the instances is something we can do independently.

@mikecote mikecote removed the v7.10.0 label Dec 18, 2020
@mikecote
Contributor

Moved from 8.x - Candidates to To-Do in order to start working on this in 7.12.

@pmuellr
Member

pmuellr commented Jan 27, 2021

note: I originally opened this as issue #88908, but moving here since it's really just relevant to this overall issue

It's not clear that this will be needed, but I thought I'd outline how generating instance data might work when searching through multiple alerts. The thought here is that if the best we can do for now is to generate a list of all the events for all the alerts, we'll need a standard way of processing those events.

For the "Alert Details" page, we generate the list of instances and data from them, via this function, which is not currently exposed as an API:

export interface AlertInstanceSummaryFromEventLogParams {
  alert: SanitizedAlert<{ bar: boolean }>;
  events: IEvent[];
  dateStart: string;
  dateEnd: string;
}

export function alertInstanceSummaryFromEventLog(
  params: AlertInstanceSummaryFromEventLogParams
): AlertInstanceSummary {

As we are getting more consumers of the event log coming online, this function - or similar ones, or perhaps this one with more parameters/capabilities - could be useful if we only end up providing a way to get ALL the event log docs (eg, if we don't support a richer search mechanism). Otherwise, those consumers will be forced to implement similar logic in their own plugins.

We'd need to clean this up a bit to turn it into an API, and presumably if we did this, we'd also change it to support events from multiple alerts, and not just a single alert. And presumably, it would be a function on the alertsClient.

@mikecote
Contributor

After discussing with @sqren yesterday: "alerts as data" is necessary for the 7.x Observability workflows, which removes the need for these APIs as a half-way measure.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022