
[Monitoring] Cluster Status alert triggers on transient yellow status #34814

Closed
DaveyDevOps opened this issue Oct 24, 2018 · 9 comments

DaveyDevOps commented Oct 24, 2018

Elasticsearch version (bin/elasticsearch --version):
Version: 6.3.1, Build: default/zip/eb782d0/2018-06-29T21:59:26.107521Z, JVM: 1.8.0_162

Plugins installed: [x-pack, hdfs-repository]

JVM version (java -version):
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b34)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b34, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux hostname 4.4.114-92.64-default #1 SMP Thu Feb 1 19:18:19 UTC 2018 (c6ce5db) x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
/x-pack/plugin/monitoring/src/main/resources/monitoring/watches/elasticsearch_cluster_status.json
The included watch triggers on a transient yellow status and then clears at the next run. When a new index is created, it is natural for replicas to be "missing" briefly while the shards are started.

I would like to see the watch account for this behavior; maybe the status should be checked twice before triggering?

Waiting for active shards (wait_for_active_shards) might be an option to avoid the yellow status, but I would prefer not to set that across all indices and for all subsequent writes when I am only concerned about index creation.
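
For illustration, a rough sketch of the per-request form I have in mind (the index name is just a placeholder): with 1 replica, waiting for 2 active shards makes the create call wait until both the primary and the replica are allocated (or the request times out), without changing the setting for subsequent writes.

    PUT /my-new-index?wait_for_active_shards=2
    {
      "settings": {
        "number_of_replicas": 1
      }
    }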

Steps to reproduce:
We have set the default number of replicas to 1 using a default index template. The monitoring collection interval has been increased to 1 minute to reduce load on the system (a sketch of the setting follows the steps below).

  1. Create a new index (with at least 1 replica); the status change should be observable in the active master's logs.
  2. The watch will only trigger if the monitoring data collection coincides with the status being yellow.
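
For reference, a sketch of how the collection interval mentioned above was changed, assuming the dynamic xpack.monitoring.collection.interval cluster setting (it can also be set in elasticsearch.yml; adjust to however your deployment manages settings):

    PUT _cluster/settings
    {
      "persistent": {
        "xpack.monitoring.collection.interval": "1m"
      }
    }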

Provide logs (if relevant):

Sent: Monday, October 22, 2018 12:08 PM
To: AdminEmail
Subject: [NEW] X-Pack Monitoring: Cluster Status (UUID) [YELLOW]

Elasticsearch cluster status is yellow. Allocate missing replica shards.

Sent: Monday, October 22, 2018 12:10 PM
To: AdminEmail
Subject: [RESOLVED] X-Pack Monitoring: Cluster Status (UUID) [GREEN]

This cluster alert has been resolved: Elasticsearch cluster status is yellow. Allocate missing replica shards.

Log entries showing the status change as new indices are created:

[2018-10-21T19:00:04,341][INFO ][o.e.c.r.a.AllocationService] [master-hostname] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-6-2018.10.22][0]] ...]).
[2018-10-21T19:00:06,931][INFO ][o.e.c.r.a.AllocationService] [master-hostname] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-6-2018.10.22][0]] ...]).
[2018-10-21T19:00:23,796][INFO ][o.e.c.r.a.AllocationService] [master-hostname] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-logstash-6-2018.10.22][0]] ...]).
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@chrisronline
Contributor

Interesting.

Is there historical reasoning behind this @pickypg?

@pickypg pickypg changed the title X-Pack Monitoring: Cluster Status triggers on transient yellow status [Monitoring] Cluster Status alert triggers on transient yellow status Oct 26, 2018
@pickypg
Member

pickypg commented Oct 26, 2018

@chrisronline It used to be that upon index creation the cluster would be red, but now it's yellow (#18737 changed this in 5.0).

Not to be pedantic, but monitoring is reporting that the cluster stayed yellow long enough for it to catch it. One thing that may be worthwhile is for the existing cluster alert to fetch the two most recent cluster_stats documents and only alert if there are two consecutive reports of yellow status.

This would add further complexity to the Watch's condition, but it wouldn't be impossible.

My only problem with that approach as the default behavior is that it would hide legitimate cases of shards flapping. You would certainly not want that behavior for a red state. Personally, I think waiting for the planned improvements to Cluster Alerts to fix this behavior by putting the user in more direct control would be superior to tweaking the existing Cluster Alerts, but I can definitely see how that may not be a universally agreed-upon belief.
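
To make the two-consecutive-reports idea concrete, here is a rough sketch of what the input and condition could look like, assuming a Watcher search input over the monitoring indices; the index pattern, field names, and sort key are assumptions for illustration, not the shipped watch:

    "input": {
      "search": {
        "request": {
          "indices": [ ".monitoring-es-*" ],
          "body": {
            "size": 2,
            "sort": [ { "timestamp": { "order": "desc" } } ],
            "query": { "term": { "type": "cluster_stats" } },
            "_source": [ "cluster_state.status" ]
          }
        }
      }
    },
    "condition": {
      "script": {
        "source": "def hits = ctx.payload.hits.hits; return hits.size() == 2 && hits[0]._source.cluster_state.status == 'yellow' && hits[1]._source.cluster_state.status == 'yellow'"
      }
    }

The point is only that a single yellow sample would be ignored, and the alert would fire once two consecutive cluster_stats samples report yellow.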

@DaveyDevOps
Author

cluster stayed yellow long enough for it to catch it

Isn't it more just a matter of timing, i.e. whether the monitoring data happens to be collected right when the cluster status goes yellow? We don't see alerts every time a new index is created, just when the "stars align".

planned improvements to Cluster Alerts [...] putting the user in more direct control

Those sound lovely. Can you provide more information/references?

@pickypg
Member

pickypg commented Oct 26, 2018

Isn't it more just a matter of timing, i.e. whether the monitoring data happens to be collected right when the cluster status goes yellow? We don't see alerts every time a new index is created, just when the "stars align".

Correct. That's what I meant by "the cluster stayed yellow long enough for it to catch it". Under ordinary conditions, the primary should be created in milliseconds, followed by the same story for the replica, so the likelihood of catching it should be pretty low. An overburdened cluster will be slower, which increases the likelihood of this happening.

Those sound lovely, can you provide more information/references?

I don't think the team is quite ready to discuss it, but needless to say the inability to tweak your own Cluster Alerts is a bit of a frustration that we share.

@DaveyDevOps
Author

If the planned improvements are being tracked, I think this issue could be closed. If they don't yet have a "home", maybe they could be tracked on this issue (or not).

Other random thoughts...
Would including the cluster's "human" name in alerts be one of those improvements?
More generally, perhaps Watcher could have a check count, where the condition has to be true a "check count" number of times before the action is triggered; maybe this could be set at the action level. That might allow an escalation path, e.g. if the issue is not resolved in an hour, start emailing your boss.

@pickypg
Member

pickypg commented Oct 26, 2018

Would including the cluster's "human" name in alerts be one of those improvements?

Yes. :)

@pickypg pickypg closed this as completed Oct 26, 2018
@ypid-geberit

ypid-geberit commented Jul 8, 2020

Is there any GitHub issue I can "watch" for current progress?

Compared to fully-featured monitoring systems, the watches fall short. Something like max_check_attempts has been used for decades to control notification volume. (I once tried to write my own watch to do metrics alerting, but I consider it a failed attempt; the Watcher infrastructure is not ideal for this use case, and I am not even sure that it should be extended for it.)

It seems that the second at which the watch is scheduled is not deterministic. Is this correct? I checked two clusters: one schedules the watch "X-Pack Monitoring: Cluster Status" at second 17, the other at second 32. If there is the possibility that the watch gets scheduled close to second 0, the false positive notification rate will increase.

That is just to put:

Not to be pedantic, but monitoring is reporting that the cluster stayed yellow long enough for it to catch it.

into perspective. I am not saying it should not be investigated; I am just saying that admins might not have the time to do this immediately, and are thus left with the choice of either ignoring this email and potentially missing longer-lasting, real issues, or getting false positives, as long as max_check_attempts is not supported.

@chrisronline
Contributor

@ypid-geberit Thanks for your feedback.

This would be a good ticket to add this feedback: elastic/kibana#42960

We have an outstanding PR to change the underlying technology powering the alerts and, as a result, we will have full control over the alert definition and execution within Kibana. Unfortunately, we are not able to fully convert the existing watches over until we get resolution here. I'd suggest adding your thoughts to the above Kibana issue, and we will take them into account when fully converting these watches over.
