[Monitoring] Cluster Status alert triggers on transient yellow status #34814
Comments
Pinging @elastic/es-core-infra
Interesting. Is there historical reasoning behind this @pickypg?
@chrisronline It used to be that upon index creation the cluster would be red, but now it's yellow (#18737 changed this in 5.0). Not to be pedantic, but monitoring is reporting that the cluster stayed yellow long enough for it to catch it. One thing that may be worthwhile is for the existing cluster alert to fetch the two most recent status documents (line 30 in 5e0b524).
This would add further complexity to the Watch's logic. My only problem with that approach as the default behavior is that it would hide legitimate cases of shards flapping. You would certainly not want that behavior with a red status.
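The "two most recent documents" idea above can be sketched as a small decision helper: only alert when the last N observed statuses are all non-green, so a single transient yellow sample is debounced. This is a hypothetical illustration of the approach, not the actual Watch condition; the function name and threshold are assumptions.

```python
def should_alert(recent_statuses, threshold=2):
    """Return True only if the `threshold` most recent cluster statuses
    (newest first) are all non-green.

    A single transient "yellow" sample is ignored; two consecutive
    non-green samples trigger the alert.
    """
    if len(recent_statuses) < threshold:
        return False
    return all(status != "green" for status in recent_statuses[:threshold])
```

As noted above, the trade-off is that this default would also delay alerting on legitimate shard flapping by one collection interval.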
Isn't it more just a matter of timing? If monitoring data happens to be collected right when the cluster status is yellow, the alert fires. We don't see alerts every time a new index is created, just when the "stars align".
Those sound lovely, can you provide more information/references?
Correct. That's what I meant by "the cluster stayed yellow long enough for it to catch it". Under ordinary conditions, the primary should be created in milliseconds, followed by the same story for the replica, so the likelihood of catching it should be pretty low. An overburdened cluster will be slower, which increases the likelihood of this happening.
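The timing argument can be made concrete with a back-of-the-envelope estimate: if the collection phase is effectively random relative to index creation, the chance that one periodic sample lands inside a transient yellow window is roughly the window length divided by the collection interval, capped at 1. A small sketch (the function is hypothetical, for illustration only):

```python
def catch_probability(yellow_seconds, interval_seconds=60.0):
    """Rough chance that a single periodic collection samples the
    cluster while it is inside a transient yellow window of the
    given length, assuming a uniformly random sampling phase."""
    return min(yellow_seconds / interval_seconds, 1.0)
```

With millisecond shard allocation the probability is tiny, which matches the observation that alerts only fire when the "stars align"; on an overburdened cluster the yellow window grows and so does the probability.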
I don't think the team is quite ready to discuss it, but needless to say the inability to tweak your own Cluster Alerts is a bit of a frustration that we share.
If the planned improvements are being tracked, I think this issue could be closed. If they don't yet have a "home", maybe track them on this issue (or not). Other random thoughts...
Yes. :)
Is there any GitHub issue I can "watch" for current progress?

Compared to fully-featured monitoring systems, the watches fall short. Something like max_check_attempts has been used for decades to control notification volume. (I tried to write my own watch once to do metrics alerting, but I see it as a failed attempt. The watcher infrastructure is not ideal for this use case, and I am not even sure that it should be extended for it.)

It also seems non-deterministic at which second the watch is scheduled. Is this correct? At least I checked two clusters: one schedules the "X-Pack Monitoring: Cluster Status" watch at second 17, the other at second 32. If there is the possibility that the watch gets scheduled close to second 0, the false-positive notification rate will increase.

That is just to put things into perspective. I am not saying it should not be investigated; I am just saying that admins might not have the time to do this immediately, and thus have the option of ignoring this email and potentially missing longer-lasting, real issues, or possibly getting false positives as long as max_check_attempts is not supported.
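For reference, the max_check_attempts idea from Nagios-style monitors boils down to "notify only after N consecutive failed checks". A minimal sketch of that semantics (a hypothetical helper, not part of Watcher or any planned API):

```python
class CheckDebouncer:
    """Nagios-style max_check_attempts: suppress notifications until
    N consecutive checks have failed; any OK check resets the count."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.consecutive_failures = 0

    def record(self, check_ok):
        """Record one check result; return True if a notification
        should be sent now."""
        self.consecutive_failures = 0 if check_ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.max_attempts
```

With max_attempts=2 or 3, a single transient yellow sample would never page anyone, while a persistent yellow still would after one or two more intervals.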
@ypid-geberit Thanks for your feedback. elastic/kibana#42960 would be a good ticket to add this feedback to. We have an outstanding PR to change the underlying technology powering the alerts, and as a result we will have full control over the alert definition and execution within Kibana. Unfortunately, we are not able to fully convert the existing watches over until we get resolution there. I'd suggest adding your thoughts to the above Kibana issue and we will take them into account when fully converting these watches over.
Elasticsearch version (`bin/elasticsearch --version`): Version: 6.3.1, Build: default/zip/eb782d0/2018-06-29T21:59:26.107521Z, JVM: 1.8.0_162

Plugins installed: [x-pack, hdfs-repository]

JVM version (`java -version`):
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b34)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b34, mixed mode)

OS version (`uname -a` if on a Unix-like system):
Linux hostname 4.4.114-92.64-default #1 SMP Thu Feb 1 19:18:19 UTC 2018 (c6ce5db) x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
/x-pack/plugin/monitoring/src/main/resources/monitoring/watches/elasticsearch_cluster_status.json
The included watch will trigger on yellow status but clear at the next run.
When a new index is created, it is natural for there to be "missing" replicas while shards are started.
Would like to see the watch account for this behavior; maybe status should be checked twice before triggering?
wait_for_active_shards might be an option to avoid the yellow status, but I would prefer not to set that across all indices and for all subsequent writes when I am only concerned about index creation.
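For illustration, wait_for_active_shards can be passed as a per-request parameter on index creation, so the PUT does not return until the requested number of shard copies are active, without changing behavior for other indices or subsequent writes. The sketch below only builds such a request; the helper name and defaults are assumptions, not an Elasticsearch client API.

```python
import json

def create_index_request(name, replicas=1, wait_for="all", timeout="30s"):
    """Build a hypothetical index-creation request that waits for all
    shard copies (primary plus replicas) to be active before the call
    returns. Scoped to this one request, not a cluster-wide setting."""
    url = "/{0}?wait_for_active_shards={1}&timeout={2}".format(
        name, wait_for, timeout)
    body = {"settings": {"number_of_replicas": replicas}}
    return "PUT", url, json.dumps(body)
```

With number_of_replicas set to 1, `wait_for_active_shards=all` means the request completes only once both the primary and the replica are started, so monitoring should never sample a yellow status caused by this index's creation.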
Steps to reproduce:
We have set the default number of replicas to 1 using a default template. The monitoring collection interval has been increased to 1 minute to reduce load on the system.
Provide logs (if relevant):
Logs showing the status change as new indices are created.