[Monitoring] Missing monitoring data alert firing for version upgrade and configuration changes for Kibana in Cloud #83309

chrisronline · 2020-11-12T16:58:00Z

Right now, the missing monitoring data alert will fire when you upgrade a stack product in cloud because the stack product's uuid also changes.

I'm not exactly sure what we should do about that, but it's not a great UX for cloud users.

elasticmachine · 2020-11-12T16:58:03Z

Pinging @elastic/stack-monitoring (Team:Monitoring)

chrisronline · 2020-11-12T20:20:09Z

@ravikesarwani I'm curious about your thoughts here. Technically, the alert is working as intended.

This query:

POST .monitoring-kibana-*/_search?filter_path=aggregations.versions.buckets
{
  "size": 0,
  "aggs": {
    "versions": {
      "terms": {
        "field": "kibana_stats.kibana.version",
        "size": 10
      },
      "aggs": {
        "uuids": {
          "terms": {
            "field": "kibana_stats.kibana.uuid",
            "size": 10
          }
        },
        "latest": {
          "max": {
            "field": "timestamp"
          }
        }
      }
    }
  }
}

yields:

{
  "aggregations" : {
    "versions" : {
      "buckets" : [
        {
          "key" : "7.9.2",
          "doc_count" : 23061,
          "latest" : {
            "value" : 1.605197028651E12,
            "value_as_string" : "2020-11-12T16:03:48.651Z"
          },
          "uuids" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cf491eb4-90fc-4430-8510-6fdb24402a16",
                "doc_count" : 23061
              }
            ]
          }
        },
        {
          "key" : "7.10.0",
          "doc_count" : 1513,
          "latest" : {
            "value" : 1.605212208223E12,
            "value_as_string" : "2020-11-12T20:16:48.223Z"
          },
          "uuids" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "662bf661-0add-43c2-956e-59ff218ed37e",
                "doc_count" : 1513
              }
            ]
          }
        }
      ]
    }
  }
}

We can't really know if the 7.9.2 instance is unavailable intentionally or accidentally from the data we have.

Maybe we can talk to the cloud team and see if there are some APIs available to detect this scenario?
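To make the ambiguity concrete, here is a minimal Python sketch that applies a staleness check to a trimmed copy of the response above. The function name `stale_uuids`, the 15-minute limit, and the fixed `now` are assumptions for illustration only; this is not the alert's actual implementation. Note that `latest` is computed per version bucket in the query above, so the sketch treats every uuid in a bucket as sharing that timestamp.

```python
from datetime import datetime, timedelta, timezone

# Trimmed copy of the aggregation response shown above.
response = {
    "aggregations": {"versions": {"buckets": [
        {"key": "7.9.2",
         "latest": {"value": 1.605197028651e12},
         "uuids": {"buckets": [{"key": "cf491eb4-90fc-4430-8510-6fdb24402a16"}]}},
        {"key": "7.10.0",
         "latest": {"value": 1.605212208223e12},
         "uuids": {"buckets": [{"key": "662bf661-0add-43c2-956e-59ff218ed37e"}]}},
    ]}}
}


def stale_uuids(response, now, limit=timedelta(minutes=15)):
    """Return (version, uuid) pairs whose newest document is older than `limit`.

    The 15-minute default is a hypothetical threshold, not the alert's setting.
    """
    stale = []
    for bucket in response["aggregations"]["versions"]["buckets"]:
        # `latest.value` is epoch milliseconds.
        latest = datetime.fromtimestamp(bucket["latest"]["value"] / 1000, tz=timezone.utc)
        if now - latest > limit:
            stale.extend((bucket["key"], u["key"]) for u in bucket["uuids"]["buckets"])
    return stale


# Pretend "now" is shortly after the query above was run.
now = datetime(2020, 11, 12, 20, 20, tzinfo=timezone.utc)
print(stale_uuids(response, now))
# → [('7.9.2', 'cf491eb4-90fc-4430-8510-6fdb24402a16')]
```

The old 7.9.2 uuid is flagged simply because it stopped reporting. Nothing in the data says whether it was replaced by the 7.10.0 instance or went down on its own.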

ravikesarwani · 2020-11-16T15:27:56Z

@Kushmaro Any thoughts from the Cloud team?
It looks like Kibana instances in Cloud are upgraded differently than ES, APM, etc.: new instances are created (from the perspective of the data that we have).
This causes the "Missing monitoring data" alert to fire. We look back 1 day (by default) of monitoring data, so users will have this alert firing for 1 day after an upgrade in Cloud.

We need some way to identify and not alert in Cloud for this scenario.

Kushmaro · 2020-11-17T13:57:13Z

Thanks for the mention @ravikesarwani, looping in @zanbel as he's now the owner of the project (fka make-it-action).
If this is the case for all Cloud Deployments, then yeah, it's definitely an issue I think.

Currently, I'm not even sure if Kibana can query cloud APIs. Technically speaking (if I'm not mistaken), the GET /deployment API should return the status of all Kibana instances.

chrisronline · 2020-11-17T15:08:44Z

@zanbel @Kushmaro I don't know what's possible, but it'd be great if the cloud Kibana plugin could expose APIs that return some data that can help us detect this.

Kushmaro · 2020-11-17T15:25:05Z

@chrisronline we're actually working on a way to allow Kibana to make API calls to Cloud in the cloud platform team.
Mainly for the purpose of improving UX and making the experience more seamless, but this is of course also a very important case.

/cc @bevacqua (who's leading this project) & @jowiho

ravikesarwani · 2020-11-17T18:48:46Z

@chrisronline Looks like this issue also happens when a configuration change is applied to a Kibana instance in Cloud (based on new comments in SDH https://github.com/elastic/sdh-kibana/issues/958).
As a workaround, my take would be to exclude "Kibana" for this alert type. This change, in my view, should be made for the next 7.10.x release.

Working with Cloud team we can figure out a solution and then enable the alert for Kibana.

jowiho · 2020-11-17T18:51:10Z

Before we head towards a particular solution, let's make sure we understand the problem. Does Cloud update Kibana the wrong way? Or does Kibana have the wrong expectations of how it gets updated?

ravikesarwani · 2020-11-17T20:12:05Z

@jowiho thanks for your comments. Makes sense.
Can you or someone on the Cloud team comment on how Kibana is updated in ESS? Of particular interest is the uuid, as that's what ties monitoring data to the instance we know of.

chrisronline · 2020-11-17T20:12:49Z

@jowiho How does Cloud update Kibana right now? Kibana persists the uuid inside of a data/uuid file but I'm not sure if Cloud maintains this during the upgrade. It seems to maintain it for APM and ES as we don't see this behavior for upgrades on those stack products on Cloud.
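For readers unfamiliar with the pattern, here is a minimal Python sketch of "persist an identifier in a data directory". It is an illustration only, not Kibana's code: `load_or_create_uuid` and the directory layout are hypothetical. It shows why a fresh container without the old data directory ends up with a new identity.

```python
import tempfile
import uuid
from pathlib import Path


def load_or_create_uuid(data_dir: Path) -> str:
    """Reuse the uuid stored in `data_dir/uuid`, or generate and store a new one."""
    uuid_file = data_dir / "uuid"
    if uuid_file.exists():
        return uuid_file.read_text().strip()
    new_id = str(uuid.uuid4())
    data_dir.mkdir(parents=True, exist_ok=True)
    uuid_file.write_text(new_id)
    return new_id


# Same data directory across restarts: the identity is stable.
old_instance = Path(tempfile.mkdtemp()) / "data"
first = load_or_create_uuid(old_instance)
assert load_or_create_uuid(old_instance) == first

# A brand new instance with an empty data directory gets a new identity,
# which monitoring then sees as a different instance.
new_instance = Path(tempfile.mkdtemp()) / "data"
assert load_or_create_uuid(new_instance) != first
```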

chrisronline · 2020-11-17T21:08:12Z

After speaking briefly with @AlexP-Elastic, it doesn't seem like they intentionally persist uuids across upgrades.

It appears that the Elasticsearch node_id is persisted when upgraded in Cloud, but I can verify that the ephemeral_id changes (we don't check this in our alert though).

It appears that APM's upgrade works the same way as Kibana, where a brand new uuid is generated and it's most likely a bug on our side that the alert isn't working in that case.

The alert is working as intended, in that it will detect missing monitoring data, but we are not currently capturing the upgrade scenario (which doesn't just affect Cloud). I'm not sure if we have a way of detecting the difference between a legitimate upgrade and an instance/node going down.

Maybe we can think about solving this by providing additional configuration for the alert, such as "Only alert if a node/instance is not reporting AND the total number of nodes/instances is not x", so a user with three Kibana instances can configure x=3 and we can use that to ensure we do not alert unnecessarily.
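A rough Python sketch of that proposed condition, to show how it would behave in the upgrade case. The function `should_alert` and its parameters are hypothetical names for illustration; nothing like this exists in the alert today.

```python
def should_alert(missing_uuids, reporting_uuids, expected_count):
    """Alert only if something stopped reporting AND the number of
    instances currently reporting differs from the configured count (x)."""
    return bool(missing_uuids) and len(reporting_uuids) != expected_count


# Upgrade with one Kibana instance (x=1): the old uuid is missing, but one
# new uuid is reporting, so the expected count is met and no alert fires.
print(should_alert({"old-uuid"}, {"new-uuid"}, expected_count=1))  # → False

# Real outage with three Kibana instances (x=3): one uuid is missing and
# only two are reporting, so the alert fires.
print(should_alert({"kb-3"}, {"kb-1", "kb-2"}, expected_count=3))  # → True
```

The trade-off is that the user has to keep x in sync with the deployment size, which is the extra configuration burden being weighed here.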

ravikesarwani · 2020-11-18T01:06:43Z

This alert is critical for Elasticsearch, and if it's working there, I would say we make it available for ES. I am not in favor of making the alert configuration complicated.

As a next step we can expand to other stack components.
We should investigate if we can persist uuids across upgrade.
We should also investigate why APM upgrade works. As I understood we don't have any special code for APM server.

ravikesarwani · 2020-11-18T01:09:58Z

BTW, is this alert applicable to Beats? If it is, we need to understand the behavior there as well.

bevacqua · 2020-11-18T01:52:47Z

Related https://github.com/elastic/cloud/issues/67400

chrisronline · 2020-11-18T15:10:54Z

> We should also investigate why APM upgrade works. As I understood we don't have any special code for APM server.

It's indeed a bug, and the fix is #83646

ravikesarwani · 2020-11-18T16:16:12Z

As part of upgrade we should be backing up the data and config directories in Cloud.
In fact this is something that we ask users to do for beats upgrade.
"Back up the data and config directories by copying them to another location."

Cloud team, can we look at doing this in the Cloud for Kibana and APM server upgrade?
This makes logical sense as well. The "data" directory can be used by the processes to store temporary/cache data and backing up and restoring that directory helps to recreate the original state after the upgrade.

Chris and I discussed this, and for 7.10.1 we will make this alert applicable only to Elasticsearch.
We need to do this as a stopgap; otherwise these false positives can lead customers to disable these alerts.
Once we resolve the issue working with the Cloud team, we can enable this alert for other products.
I think it's critical to have this alert for APM, Beats, Logstash, etc. as well.

AlexP-Elastic · 2020-11-18T16:25:34Z

> Cloud team, can we look at doing this in the Cloud for Kibana and APM server upgrade?

This would be a massive change to our infrastructure. Currently we only persist data (other than the global YAML config) across containers where Elasticsearch does that for us (I mentioned to Chris, we don't even have the concept of "we are moving instance X as part of an upgrade", we view it as "we are creating some new instances and then deleting the old ones"). There is little prospect of this happening in the foreseeable future. (cc @andrew-moldovan )

A smaller change that might be useful (it's not clear to me, see below) would be to switch APM/Kibana/etc from "grow-shrink" by default to "rolling in place" by default - I think they work this way just for legacy reasons. (cc @anyasabo @jhalterman not sure if that is planned/in progress/done?).

This would decrease the chance that any given configuration change would trigger an alert (but "moves" due to hardware failure and some capacity increases would still do this)

anyasabo · 2020-11-18T16:48:29Z

Is there any other functionality for beats/apm/kibana that requires persistent data storage, other than the identifier? My mental model of beats and kibana is that they can be considered ephemeral and I can scale up and down as necessary. If we're supposed to be considering them stateful and want to persist data solely so we can make a particular alert work, that seems Not Great. Please help me out if my mental model here is wrong or if I'm misunderstanding something.

chrisronline · 2020-11-18T17:16:41Z

@anyasabo I honestly don't know, but it sounds like adapting how we think about these products to serve a single stack monitoring alert doesn't make much sense.

Perhaps we can solve this by relying on additional cloud APIs to give us more data. Cloud deployments know how many unique instances/nodes should exist and if we can access that data within the Stack Monitoring plugin, we can make smarter decisions about when to alert.

anyasabo · 2020-11-18T17:19:25Z

I also ask because it is not just ECE, ECK is in play as well. So I am not sure we want to rely on a cloud API as our first direction.

ravikesarwani · 2020-12-18T16:55:29Z

Can we revisit this for 7.12?
Looks like we have issues with Kibana and APM (it's working, but that may be accidental because of an oversight) in ESS.
For ESS, we need to solve the issue by working with the Cloud team.

Beats is another area that is not affected by ESS, and this alert may really be helpful there.
My take is that in a Kubernetes environment this may be noisy.

We need to discuss this in the 7.12 timeline and see what we can do.

Some options we need to explore from our side and working with Cloud team:

User can enable/disable alert for each product independently: ES, Kibana, Logstash, APM, Beats
The default configuration is tailored for each product and default optimized for that environment
Remove false positives in ESS (don't alert in known cases)

chrisronline · 2020-12-21T20:15:00Z

I opened #86683 to track the work around creating separate alerts for each stack product to satisfy:

> The default configuration is tailored for each product and default optimized for that environment

> User can enable/disable alert for each product independently: ES, Kibana, Logstash, APM, Beats

chrisronline added bug Fixes for quality problems that affect the customer experience Team:Monitoring Stack Monitoring team labels Nov 12, 2020

sgrodzicki removed the bug Fixes for quality problems that affect the customer experience label Nov 16, 2020

ravikesarwani changed the title ~~[Monitoring] Missing monitoring data alert firing for version upgrade~~ [Monitoring] Missing monitoring data alert firing for version upgrade and configuration changes for Kibana in Cloud Nov 17, 2020

chrisronline mentioned this issue Nov 18, 2020

[Monitoring] Fix small issue with detecting missing monitoring data from APM #83646

Merged

chrisronline mentioned this issue Nov 18, 2020

[Monitoring] Only look at ES for the missing data alert for now #83659

Closed

chrisronline mentioned this issue Nov 19, 2020

[Monitoring] Only look at ES for the missing data alert for now #83839

Merged

chrisronline mentioned this issue Dec 15, 2020

[Stack Monitoring] [Test Scenario] Out of the box alerting #85841

Closed

23 tasks

ravikesarwani added the v7.12.0 label Dec 18, 2020

chrisronline mentioned this issue Mar 1, 2021

[Stack Monitoring] [Test Scenario] Out of the box alerting #93072

Closed

24 tasks

simianhacker mentioned this issue Apr 29, 2021

[Stack Monitoring] [Test Scenario] Out of the box alerting #98765

Closed

24 tasks

neptunian mentioned this issue Jul 6, 2021

[Stack Monitoring] [Test Scenario] Out of the box alerting #104440

Closed

35 tasks

jasonrhodes mentioned this issue Feb 28, 2022

[Stack Monitoring] Kibana should not report healthy when recent data is missing #126386

Closed

neptunian mentioned this issue Mar 3, 2022

[Stack Monitoring] Improve Missing Monitoring Data rule #126709

Open
