
Add action to decommission legacy monitoring cluster alerts #64373

Merged: 21 commits into elastic:master on Dec 14, 2020

Conversation

@jbaiera (Member) commented Oct 29, 2020

This follows on from #62668 by adding an action that will proactively remove any watches that monitoring has configured. The action toggles on the new setting that informs the cluster to tear down any previously created cluster alerts, and after that is accepted, the action immediately attempts a best-effort refresh of cluster alert resources in order to force their removal in case collection is disabled or delayed.
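
For context, a minimal sketch of the kind of dynamic cluster setting the description refers to; the setting key and holder class here are assumptions for illustration, not necessarily the exact ones introduced by this change:

```java
import org.elasticsearch.common.settings.Setting;

// Hedged sketch only: the key name and holder class are assumed for illustration.
public final class MonitoringMigrationSettings {
    // A dynamic, node-scoped boolean; the migration action would flip this to true
    // to tell the cluster to tear down previously created cluster alerts.
    public static final Setting<Boolean> MIGRATION_DECOMMISSION_ALERTS =
        Setting.boolSetting(
            "xpack.monitoring.migration.decommission_alerts",
            false,
            Setting.Property.Dynamic,
            Setting.Property.NodeScope);
}
```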

Since resources are controlled lazily by the existing monitoring exporters, extra care was taken to ensure that any in-flight resource management operations do not race against any resource actions taken by the migration action. Resource installation code was updated with callbacks to report any errors instead of just logging them.
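
As a rough illustration of the callback change described above (reporting installation errors to the caller instead of only logging them), here is a sketch using Elasticsearch's ActionListener; the client interface and method names are hypothetical:

```java
import org.elasticsearch.action.ActionListener;

// Illustrative names only; not the actual classes touched by this PR.
interface WatchClient {
    void putWatch(String watchId, String watchSource, ActionListener<Boolean> listener);
}

class ClusterAlertInstaller {
    private final WatchClient client;

    ClusterAlertInstaller(WatchClient client) {
        this.client = client;
    }

    // Installs one cluster alert watch and surfaces any failure to the caller
    // through the listener rather than swallowing it in a log statement.
    void installClusterAlertWatch(String watchId, String watchSource, ActionListener<Void> listener) {
        client.putWatch(watchId, watchSource, ActionListener.wrap(
            created -> listener.onResponse(null),
            listener::onFailure));
    }
}
```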

…porters.

Specifically refresh alerts as part of the migration instead of re-running resource installation. We don't want to re-publish old templates if all the old monitoring resources have already been removed.
@elasticmachine (Collaborator) commented

Pinging @elastic/es-core-features (:Core/Features/Monitoring)

@elasticmachine added the Team:Data Management label Oct 29, 2020
@jakelandis (Contributor) left a comment

I'm getting a bit lost in the introduction of more state via ExporterResourceStatus.

Having multiple possible deploy states seems to add quite a bit of complexity. Under what circumstances wouldn't doing the main work on the success of setting MIGRATION_DECOMMISSION_ALERTS be sufficient to avoid race conditions? Is there a reasonable trade-off here, such that we could simplify the code to only check a single state and error out early via the migrate REST endpoint if we are not in that state (or cannot obtain that state in a reasonable amount of time)?

EDIT: For example, would it be possible to try to acquire (with a timeout) the semaphore from MonitoringExecution, effectively pausing monitoring while we do the migration and avoiding the race conditions? Since this is a one-time call, I think that would be an OK trade-off. Monitoring is generally no faster than every 10s, and in a healthy system it executes pretty quickly.

EDIT2: ^^ That won't actually work exactly as described, because the "router" also uses the exporters, but the idea is the same: is there a trade-off where we can make the use of the contended resources mutually exclusive and avoid the complexity of managing so much state?
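
A plain-Java illustration of the pause-then-migrate idea from the EDIT above, acquiring a semaphore with a timeout around the one-time migration; the semaphore and names here are stand-ins, not the actual MonitoringExecution internals:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch of the reviewer's suggestion: briefly pause the contended work by taking
// its permit with a timeout, run the one-time migration, then release the permit.
class PauseForMigration {
    static void runMigration(Semaphore monitoringPermit, Runnable migration) throws InterruptedException {
        if (monitoringPermit.tryAcquire(30, TimeUnit.SECONDS) == false) {
            throw new IllegalStateException("timed out waiting for monitoring to pause");
        }
        try {
            migration.run();            // migration runs while monitoring is excluded
        } finally {
            monitoringPermit.release(); // let monitoring resume
        }
    }
}
```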

@jbaiera (Member, Author) commented Nov 11, 2020

I've gone through and made some changes. There is now a migration coordinator object that is acquired when a migration action begins. Every exporter checks this coordinator before running its installation steps. If the coordinator is acquired already, the exporter blocks as if its resources are not ready yet. Once the migration operations complete, the migration action releases the coordinator, and exporters are able to perform their installation tasks as normal.

Since we no longer have to worry about exporters running their installation tasks DURING a migration, I've removed the retry code and the IN_PROGRESS states. If an exporter is ever in progress during a migration (which should be impossible with the coordinator in place), I have it fail the migration as a sanity check.
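
A rough sketch of the coordination described in this comment, assuming a single-permit semaphore; the class and method names are illustrative and may not match the ones in the PR:

```java
import java.util.concurrent.Semaphore;

// The migration action tries to take the permit for the duration of the migration,
// and exporters only install resources while no migration holds it.
class MigrationCoordinator {
    private final Semaphore permit = new Semaphore(1);

    /** Called by the migration action; returns false if a migration is already running. */
    boolean tryBlockInstallationTasks() {
        return permit.tryAcquire();
    }

    /** Called by the migration action once its cleanup work has completed. */
    void unblockInstallationTasks() {
        permit.release();
    }

    /** Exporters check this before installing resources; if false they report "not ready". */
    boolean canInstall() {
        return permit.availablePermits() > 0;
    }
}
```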

@jbaiera (Member, Author) commented Nov 12, 2020

@elasticmachine update branch

@jakelandis (Contributor) left a comment

a few comments

@sgrodzicki commented

@jbaiera what would be the ETA for this? We want to assess whether we will be able to make changes in Kibana for 7.11.

@jbaiera (Member, Author) commented Dec 7, 2020

@sgrodzicki Sorry for the late response, I've been on vacation for a bit. This should be good to go once it passes review, which I hope won't take much longer.

@jbaiera (Member, Author) commented Dec 7, 2020

@elasticmachine update branch

@jbaiera (Member, Author) commented Dec 10, 2020

@elasticmachine update branch

@jbaiera (Member, Author) commented Dec 10, 2020

@elasticmachine run elasticsearch-ci/2

@jbaiera (Member, Author) commented Dec 10, 2020

@elasticmachine run elasticsearch-ci/bwc

@jakelandis (Contributor) commented

@elasticmachine update branch

@jakelandis (Contributor) left a comment

LGTM (a couple of nitpicks around naming), thanks for the iterations.

@jbaiera jbaiera merged commit abda989 into elastic:master Dec 14, 2020
@jbaiera jbaiera deleted the monitoring-decommission-watch-action branch December 14, 2020 18:37
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Dec 14, 2020
* elastic/master: (33 commits)
  Add searchable snapshot cache folder to NodeEnvironment (elastic#66297)
  [DOCS] Add dynamic runtime fields to docs (elastic#66194)
  Add HDFS searchable snapshot integration (elastic#66185)
  Support canceling cross-clusters search requests (elastic#66206)
  Mute testCacheSurviveRestart (elastic#66289)
  Fix cat tasks api params in spec and handler (elastic#66272)
  Snapshot of a searchable snapshot should be empty (elastic#66162)
  [ML] DFA _explain API should not fail when none field is included (elastic#66281)
  Add action to decommission legacy monitoring cluster alerts (elastic#64373)
  move rollup_index param out of RollupActionConfig (elastic#66139)
  Improve FieldFetcher retrieval of fields (elastic#66160)
  Remove unused fields in `RestAnalyzeAction` (elastic#66215)
  Simplify searchable snapshot CacheKey (elastic#66263)
  Autoscaling remove feature flags (elastic#65973)
  Improve searchable snapshot mount time (elastic#66198)
  [ML] Report cause when datafeed extraction encounters error (elastic#66167)
  Remove suggest reference in some API specs (elastic#66180)
  Fix warning when installing a plugin for different ES version (elastic#66146)
  [ML] make `xpack.ml.max_ml_node_size` and `xpack.ml.use_auto_machine_memory_percent` dynamically settable (elastic#66132)
  [DOCS] Add `require_alias` to Bulk API (elastic#66259)
  ...
jbaiera added a commit that referenced this pull request Dec 15, 2020
…4373) (#66309)

Adds an action that will proactively remove any watches that monitoring has configured. The action
toggles on a new setting that informs the cluster to tear down any previously created cluster alerts,
and after that is accepted, the action immediately attempts a best-effort refresh of cluster alert
resources in order to force their removal in case collection is disabled or delayed.

Since resources are controlled lazily by the existing monitoring exporters, extra care was taken to
ensure that any in-flight resource management operations do not race against any resource actions
taken by the migration action. Resource installation code was updated with callbacks to report any
errors instead of just logging them.