
[Monitoring] Cluster Status alert triggers on transient yellow status #34814

Closed
DaveyDevOps opened this issue Oct 24, 2018 · 9 comments

DaveyDevOps commented Oct 24, 2018

Elasticsearch version (bin/elasticsearch --version):
Version: 6.3.1, Build: default/zip/eb782d0/2018-06-29T21:59:26.107521Z, JVM: 1.8.0_162

Plugins installed: [x-pack, hdfs-repository]

JVM version (java -version):
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b34)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b34, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux hostname 4.4.114-92.64-default #1 SMP Thu Feb 1 19:18:19 UTC 2018 (c6ce5db) x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
/x-pack/plugin/monitoring/src/main/resources/monitoring/watches/elasticsearch_cluster_status.json
The included watch triggers on a transient yellow status and then clears at the next run. When a new index is created, it is natural for replicas to be "missing" briefly while the shards are started.

I would like to see the watch account for this behavior; maybe the status should be checked twice before triggering?

Waiting for active shards (wait_for_active_shards) might be an option to avoid the yellow status, but I would prefer not to set that across all indices and for all subsequent writes when I am only concerned about index creation.
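
For illustration, a rough sketch of the per-request form I have in mind (the index name is just a placeholder): with 1 replica, waiting for 2 active shards makes the create call wait until both the primary and the replica are allocated (or the request times out), without changing the setting for subsequent writes.

    PUT /my-new-index?wait_for_active_shards=2
    {
      "settings": {
        "number_of_replicas": 1
      }
    }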

Steps to reproduce:
We have set the default number of replicas to 1 using a default index template. The monitoring collection interval has been increased to 1 minute to reduce load on the system (a sketch of the setting follows the steps below).

  1. Create a new index (with at least 1 replica); the status change should be observable in the active master's logs.
  2. The watch will only trigger if the monitoring data collection coincides with the status being yellow.
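
For reference, a sketch of how the collection interval mentioned above was changed, assuming the dynamic xpack.monitoring.collection.interval cluster setting (it can also be set in elasticsearch.yml; adjust to however your deployment manages settings):

    PUT _cluster/settings
    {
      "persistent": {
        "xpack.monitoring.collection.interval": "1m"
      }
    }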

Provide logs (if relevant):

Sent: Monday, October 22, 2018 12:08 PM
To: AdminEmail
Subject: [NEW] X-Pack Monitoring: Cluster Status (UUID) [YELLOW]

Elasticsearch cluster status is yellow. Allocate missing replica shards.

Sent: Monday, October 22, 2018 12:10 PM
To: AdminEmail
Subject: [RESOLVED] X-Pack Monitoring: Cluster Status (UUID) [GREEN]

This cluster alert has been resolved: Elasticsearch cluster status is yellow. Allocate missing replica shards.

Log entries showing the status change as new indices are created:

[2018-10-21T19:00:04,341][INFO ][o.e.c.r.a.AllocationService] [master-hostname] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-6-2018.10.22][0]] ...]).
[2018-10-21T19:00:06,931][INFO ][o.e.c.r.a.AllocationService] [master-hostname] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2018.10.22][0]] ...]).
[2018-10-21T19:00:23,796][INFO ][o.e.c.r.a.AllocationService] [master-hostname] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-logstash-6-2018.10.22][0]] ...]).
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@chrisronline
Contributor

Interesting.

Is there historical reasoning behind this @pickypg?

@pickypg pickypg changed the title X-Pack Monitoring: Cluster Status triggers on transient yellow status [Monitoring] Cluster Status alert triggers on transient yellow status Oct 26, 2018
@pickypg
Member

pickypg commented Oct 26, 2018

@chrisronline It used to be that upon index creation the cluster would be red, but now it's yellow (#18737 changed this in 5.0).

Not to be pedantic, but monitoring is reporting that the cluster stayed yellow long enough for it to catch it. One thing that may be worthwhile is for the existing cluster alert to fetch the two most recent cluster_stats documents and only alert if there are two consecutive reports of yellow status.

This would add further complexity to the Watch's condition, but it wouldn't be impossible.

My only problem with that approach as the default behavior is that it would hide legitimate cases of shards flapping. You would certainly not want that behavior for a red state. Personally, I think waiting for the planned improvements to Cluster Alerts to fix this behavior by putting the user in more direct control would be superior to tweaking the existing Cluster Alerts, but I can definitely see how that may not be a universally agreed-upon belief.
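
To make the two-consecutive-reports idea concrete, here is a rough sketch of what the input and condition could look like, assuming a Watcher search input over the monitoring indices; the index pattern, field names, and sort key are assumptions for illustration, not the shipped watch:

    "input": {
      "search": {
        "request": {
          "indices": [ ".monitoring-es-*" ],
          "body": {
            "size": 2,
            "sort": [ { "timestamp": { "order": "desc" } } ],
            "query": { "term": { "type": "cluster_stats" } },
            "_source": [ "cluster_state.status" ]
          }
        }
      }
    },
    "condition": {
      "script": {
        "source": "def hits = ctx.payload.hits.hits; return hits.size() == 2 && hits[0]._source.cluster_state.status == 'yellow' && hits[1]._source.cluster_state.status == 'yellow'"
      }
    }

The point is only that a single yellow sample would be ignored, and the alert would fire once two consecutive cluster_stats samples report yellow.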

@DaveyDevOps
Author

cluster stayed yellow long enough for it to catch it

Isn't it more just a matter of timing, i.e. whether the monitoring data happens to be collected right when the cluster status goes yellow? We don't see alerts every time a new index is created, just when the "stars align".

planned improvements to Cluster Alerts [...] putting the user in more direct control

Those sound lovely. Can you provide more information/references?

@pickypg
Member

pickypg commented Oct 26, 2018

Isn't it more just a matter of timing, i.e. whether the monitoring data happens to be collected right when the cluster status goes yellow? We don't see alerts every time a new index is created, just when the "stars align".

Correct. That's what I meant by "the cluster stayed yellow long enough for it to catch it". Under ordinary conditions, the primary should be created in milliseconds, followed by the same story for the replica, so the likelihood of catching it should be pretty low. An overburdened cluster will be slower, which increases the likelihood of this happening.

Those sound lovely, can you provide more information/references?

I don't think the team is quite ready to discuss it, but needless to say the inability to tweak your own Cluster Alerts is a bit of a frustration that we share.

@DaveyDevOps
Author

If the planned improvements are being tracked, I think this issue could be closed. If they don't yet have a "home", maybe they could be tracked on this issue (or not).

Other random thoughts...
Would including the cluster's "human" name in alerts be one of those improvements?
More generally, perhaps Watcher could have a check count, where the condition has to be true a "check count" number of times before the action is triggered; maybe this could be set at the action level. That might allow an escalation path, e.g. if the issue is not resolved in an hour, start emailing your boss.

@pickypg
Member

pickypg commented Oct 26, 2018

Would including the cluster's "human" name in alerts be one of those improvements?

Yes. :)

@pickypg pickypg closed this as completed Oct 26, 2018
@ypid-geberit

ypid-geberit commented Jul 8, 2020

Is there any GitHub issue I can "watch" for current progress?

Compared to fully-featured monitoring systems, the watches fall short. Something like max_check_attempts has been used for decades to control notification volume. (I once tried to write my own watch to do metrics alerting, but I consider it a failed attempt; the Watcher infrastructure is not ideal for this use case, and I am not even sure that it should be extended for it.)

It seems that the second at which the watch is scheduled is not deterministic. Is this correct? I checked two clusters: one schedules the watch "X-Pack Monitoring: Cluster Status" at second 17, the other at second 32. If there is the possibility that the watch gets scheduled close to second 0, the false positive notification rate will increase.

That is just to put:

Not to be pedantic, but monitoring is reporting that the cluster stayed yellow long enough for it to catch it.

into perspective. I am not saying it should not be investigated; I am just saying that admins might not have the time to do this immediately, and are thus left with the choice of either ignoring this email and potentially missing longer-lasting, real issues, or getting false positives, as long as max_check_attempts is not supported.

@chrisronline
Contributor

@ypid-geberit Thanks for your feedback.

This would be a good ticket to add this feedback: elastic/kibana#42960

We have an outstanding PR to change the underlying technology powering the alerts and, as a result, we will have full control over the alert definition and execution within Kibana. Unfortunately, we are not able to fully convert the existing watches over until we get resolution here. I'd suggest adding your thoughts to the above Kibana issue, and we will take them into account when fully converting these watches over.
