
Provide migration path for monitoring alerting from Watcher to Kibana alerting #50032

Closed
jakelandis opened this issue Dec 10, 2019 · 40 comments

Comments

@jakelandis
Contributor

jakelandis commented Dec 10, 2019

A new alerting system in Kibana is currently under development: https://www.elastic.co/blog/alerting-in-the-elastic-stack

As part of this effort we want to give users a migration path from the default Watches that ship with Monitoring to this new system. This requires some way for Kibana to tell Elasticsearch that it is now handling the default alerts. Once Kibana is handling the default alerts, Elasticsearch should a) remove the existing default monitoring Watches and b) prevent new default monitoring Watches from being created.

Note - this only includes the 6 default Watches that ship with monitoring and does not include custom watches.

@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Monitoring)

@chrisronline
Contributor

Any update on when this might be available? Does 7.8 look realistic?

@rjernst added the Team:Data Management label (Meta label for data/management team) on May 4, 2020
@chrisronline
Contributor

To provide a little color on this issue, I wanted to explain the need for this change for all interested parties.

Elasticsearch exporters will attempt to create the six monitoring cluster alert watches every time the exporter is executed (on a master node). This code will check a few notable things before creating the watches:

  1. Does the user have the right license for Watcher? (source)
  2. Is the watch in the blacklist? If it is, skip adding it. (source)

The blacklisting feature is configured like this:

cluster_alerts.management.blacklist
Prevents the creation of specific cluster alerts. It also removes any applicable watches that already exist in the current cluster.

You can add any of the following watch identifiers to the blacklist:

  • elasticsearch_cluster_status
  • elasticsearch_version_mismatch
  • elasticsearch_nodes
  • kibana_version_mismatch
  • logstash_version_mismatch
  • xpack_license_expiration

For example: ["elasticsearch_version_mismatch","xpack_license_expiration"].

This feature is nice, but it's configured at the exporter level (instead of at the Elasticsearch cluster level), meaning that the blacklist only applies to existing exporters. If a user were to add a new exporter in the future, this blacklist wouldn't carry over.
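
To make the exporter-level scoping concrete, here is a minimal sketch of applying that blacklist to a single exporter, assuming a hypothetical exporter named my_remote and that the exporter settings on your version can be updated dynamically (otherwise the same key belongs in elasticsearch.yml):

```
PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.exporters.my_remote.cluster_alerts.management.blacklist": [
      "elasticsearch_version_mismatch",
      "xpack_license_expiration"
    ]
  }
}
```

Because the key lives under a specific exporter's namespace, a second exporter (or one added later) would need its own copy of the blacklist, which is exactly the limitation described above.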

So we aren't left with many options for disabling these alerts. We can disable Watcher on the cluster, but that will also disable any custom watches the user has set up.

We need a way to delete any existing watcher-based cluster alerts, as well as a way to disable new ones from being created.

FYI, there is some background in this thread.

Please correct me if I'm mistaken on any of these points, as I'm not an expert with how Elasticsearch does this.

@ravikesarwani

Had a quick sync-up between @jakelandis, @chrisronline, @igoristic, and @ravikesarwani to discuss this issue.
We will try to get this delivered for 7.10, with changes on both the ES and Stack Monitoring sides.

  • The ES team will provide an API/cluster setting that, when set, will remove the 6 watches, and will also modify the code so that the 6 watches aren't created again.
  • The Stack Monitoring team will create the 6 replacement Kibana alerts when the user visits the Stack Monitoring UI, and will call the API/cluster setting that the ES team provides.

@ravikesarwani

ravikesarwani commented Aug 31, 2020

@jakelandis Any update on the progress on the API? Is this getting completed for 7.10?

@sgrodzicki

@jakelandis Any update on the progress on the API? Is this getting completed for 7.10?

Also, is this the right issue to track or will there be another implementation issue?

@jakelandis
Contributor Author

Also, is this the right issue to track or will there be another implementation issue?

This is the right issue. @jbaiera will be leading the work.

@sgrodzicki

@jbaiera do you already have an idea of when this might land in master?

@jbaiera
Member

jbaiera commented Sep 4, 2020

@sgrodzicki I'm getting a plan of action together for how this migration is going to work in Elasticsearch. For the moment, my hope is to get this in before the 7.10 feature freeze, but I'm still getting up to speed on the inner workings of the Monitoring plugin. I'll have a more confident answer soon.

@sgrodzicki

Getting this in just before feature freeze wouldn't give us enough time to implement this in Stack Monitoring. Can we prioritize this work, or document those APIs so that we can work with mocks?

@jbaiera
Member

jbaiera commented Sep 9, 2020

Agreed on the timing for sure. I'm starting the implementation work for this, but I had a couple of questions regarding the migration strategy:

Each exporter type (local and HTTP) will remove configured alerts if they are excluded via the blacklist. The easiest way we could approach this is to add a setting that treats all monitoring alerts as blacklisted, and let each exporter remove them on its own. The downside is that these will only be removed if

  1. The legacy monitoring exporters are still enabled
  2. Either the legacy Elasticsearch collectors are running or monitoring data arrives at the monitoring bulk action

For us to clean up the watches, the exporter configurations need to be present (the watches may exist in a remote cluster that we can't access if we don't have its address/auth). The bigger issue is that the exporters are configured to set these alerts up lazily, when they receive data to send to a monitoring index, which does not happen unless the old collectors are enabled and sending data. This leads to a decision we have to make, and one I'd like some feedback on:

Do we expect there to be a situation where we need to remove these watches, but legacy stats collection is already disabled? If so, we'll need to move the logic that creates these resources so that they can be cleaned up outside of the regular exporting loop on top of adding the setting to prohibit their creation going forward.
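
For reference, "legacy stats collection is already disabled" corresponds to a cluster where the collection setting has been switched off, for example:

```
PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": false
  }
}
```

In that state, the exporters never receive data, so the lazy setup path that creates or removes the watches never runs.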

@ravikesarwani

ravikesarwani commented Sep 9, 2020

The scenario I am thinking about is:

  • The user was using internal collection, and hence the watches were created.
  • The user has now disabled internal collection and migrated to Metricbeat (which is what we are promoting).
    In the new version (7.10+) we will create the replacement Kibana alerts, but ES cannot delete the watches since the HTTP exporter is disabled.

What is the workaround for the user in this case? Can these be easily deleted manually by the user?

@ravikesarwani

Looks like a user cannot deactivate or delete a system watch, so it feels like there may not be a simple workaround. Am I missing something?

@jbaiera
Member

jbaiera commented Sep 9, 2020

Well, in the case that a user has removed any configuration that the monitoring plugin would use to contact a monitoring cluster, there's not much we can do from the plugin to delete any watches left behind. Even if we extract the necessary removal logic from the exporters into a REST action, we're still dependent on the remote monitoring cluster settings being present.

Looks like a user cannot deactivate or delete a system watch, so it feels like there may not be a simple workaround. Am I missing something?

I don't think you can remove the watch from the Kibana UI (the options are greyed out), but I was able to remove the watch by hitting the DELETE endpoint on Watcher.

Though, that still assumes that whatever process is deleting the watch knows the address of the cluster it's hosted on. Are we thinking that if monitoring is disabled/removed, the user just performs those deletes on their own? There's little risk of the watches reappearing if their exporter configuration has already been removed or decommissioned.
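
As a concrete sketch of that manual route: the default cluster alert watches can be removed directly with the Watcher DELETE API on the cluster that hosts them. The ID below is illustrative only; the assumption here is that the default watch IDs are prefixed with the monitored cluster's UUID.

```
# The root endpoint returns the cluster_uuid of the monitored cluster
GET /

# Delete one of the six default cluster alert watches
# (the <cluster_uuid>_ prefix is an assumed ID format, shown for illustration)
DELETE _watcher/watch/<cluster_uuid>_elasticsearch_cluster_status
```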

@ravikesarwani

Are we thinking that if monitoring is disabled/removed, the user just performs those deletes on their own? There's little risk of the watches reappearing if their exporter configuration has already been removed or decommissioned.

@jbaiera Do you see any other possible options?

If we delete the watches in the cases where we can, and also loosen up the Kibana UI so that users can easily delete "system watches" (assuming these 6 monitoring watches are the only system watches we have), then I think we have a workable solution.

@chrisronline Is it easy to relax the Kibana UI to allow deletion of "system watches" when the automated way is not possible? Is that an acceptable solution, or do you see any side effects?

@chrisronline
Contributor

Is it easy to relax the Kibana UI to allow deletion of "system watches" when the automated way is not possible? Is that an acceptable solution, or do you see any side effects?

This is fine to do.

Do we expect there to be a situation where we need to remove these watches, but legacy stats collection is already disabled?

Yes, this will be the case, and it is also possible for these watches to exist on clusters where there isn't a dedicated Kibana, so those users will not have the luxury of a UI.

I don't think we can solve this for every user, but we can certainly help the majority by cleaning up these unnecessary watches.

@jbaiera
Member

jbaiera commented Sep 10, 2020

I see this migration going one of a few ways:

  1. The monitoring plugin is enabled on a cluster and is actively sending data to a monitoring index. This configuration will continuously make sure that the cluster alerts are created whenever it is able to. We can add a setting here that tells the plugin to remove any existing cluster alerts and stop creating them. We also add a migrate API that toggles this setting on. There are some benefits to this being an endpoint, as detailed in the next scenario.

  2. The exporting of stats has been disabled for any reason. While disabled, the plugin will not attempt to recreate any cluster alerts. In this case, the plugin still has all of the information about its exporters. The migrate API would be called, which would toggle the above-mentioned setting to keep the plugin from creating alerts if it is ever re-enabled. The API would then iterate over each of the currently configured exporters, removing their alerts directly.

  3. Settings that define exporters in monitoring have been removed, and thus the plugin can no longer reach one or more remote clusters. The alerts on those remote clusters would need to be removed manually by an administrator. In the migrate API we attempt to clean up all that we can (which may include a local exporter), but any remote clusters not in the configuration cannot be cleaned up automatically.

Does this seem like a reasonable way to handle the migration of cluster alerts?
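
To make the proposal concrete, here is a rough sketch of what scenarios 1 and 2 might look like from the caller's side. The setting and endpoint names are placeholders that roughly track what the later PRs in this thread (#62668, #64373) introduced, not a finalized contract:

```
# Tell the monitoring plugin to stop creating the default cluster alert watches
# (setting name is a placeholder)
PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.migration.decommission_alerts": true
  }
}

# One-shot migrate call that also attempts to remove existing watches
# through the currently configured exporters (path is a placeholder)
POST _monitoring/migrate/alerts
```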

@chrisronline
Contributor

@jbaiera Yes, that sounds perfect. The API call is exactly what we're looking for: it ensures that all existing (or at least all known) watches are removed and guarantees that no new ones are created.

@ravikesarwani

Yes, this is great and covers most cases automatically.

@jbaiera
Member

jbaiera commented Sep 18, 2020

First part of this work is up as a PR (#62668). I'll begin working on the migration API next.

I have some questions regarding the finer points of the migration API's implementation. One concern I have after finishing up the setting PR is that watch removal can fail for any number of reasons. A remote monitoring cluster can experience network failures or no longer be available. The local exporter seems least likely to fail when deleting these alerts, but the Watcher index could be unavailable. These failures are usually transient for the plugin because, in the case of a setup failure, it just blocks until the next collection cycle. APIs are a one-shot attempt.

We discussed that the API would make its best attempt at cleaning out the watches on any configured exporters. How far should that best attempt go? Should we block and retry until we're reasonably confident we can or can't delete all the watches, or just make one attempt and shrug if some attempts fail? Additionally, what kind of information should we return? I was thinking of a list of all configured exporters and whether they were successfully cleaned, but I'm not sure whether that's even helpful for the Stack Monitoring solution. The less we return, the faster this work can get done.
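
For discussion purposes, here is a hypothetical response shape along the lines described above, listing each configured exporter and whether its watches were cleaned up; the field names are made up for illustration and are not the final API:

```
{
  "exporters": [
    { "name": "default_local", "migration_complete": true },
    { "name": "remote_monitoring", "migration_complete": false, "reason": "remote cluster unreachable" }
  ]
}
```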

@jbaiera
Member

jbaiera commented Oct 28, 2020

@ravikesarwani I'm wrapping up the work required for accepting the REST request. I'll check back here once the PR is up, and again once it's merged.

@jbaiera
Member

jbaiera commented Oct 29, 2020

PR for the migrate action is up now! #64373

@sgrodzicki

sgrodzicki commented Nov 19, 2020

PR for the migrate action is up now! #64373

@jbaiera what would be the ETA for this? We want to assess whether we will be able to make changes in Kibana for 7.11.

@jbaiera
Member

jbaiera commented Dec 17, 2020

This work is completed now with #64373.

@chrisronline
Contributor

Thanks for the work here folks!

In terms of #50032 (comment), are there plans to add this in 7.12?

@chrisronline
Contributor

@jbaiera @jakelandis Ping on #50032 (comment)

@jbaiera
Member

jbaiera commented Jan 11, 2021

@chrisronline This should be available starting in 7.11

@chrisronline
Contributor

@jbaiera Are you referring to #64373? I'm not seeing the functionality we discussed in #50032 (comment) but maybe I'm missing it?

@jbaiera
Member

jbaiera commented Jan 13, 2021

Ah, sorry @chrisronline, I misunderstood. I'll reopen this until that work is complete. I'm not sure if it's slotted in for 7.12 at the moment.

@jbaiera
Member

jbaiera commented Jul 29, 2021

Circling back here to close this out: we won't be updating the migrate API to hand off emails. We consider those credentials, and we're not keen on exposing them through an API call for the time being.
