Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new operator flag to control Elasticsearch health observation intervals #5861

Merged
merged 5 commits into from
Jul 18, 2022

Conversation

pebrc
Copy link
Collaborator

@pebrc pebrc commented Jul 6, 2022

Fixes #5839

Introduce new global flag to conrol Elasticsearch health observation interval.

  • annotations on individual Elasticsearch resources take precedence to avoid breaking existing customisations for users
  • a non-positve value disables asynchronous observation completely. Only one synchronous observation happens during reconciliation. This disables also timely automatic pod disruption budget adjustment on health changes. As a side effect of client-go cache refreshes observations still happen every 10 hours due to reconciliation. This means disabling the asynchronous observation has the same effect as setting the observation interval to 10 hours.

@pebrc pebrc added >enhancement Enhancement of existing functionality v2.4.0 labels Jul 6, 2022
@thbkrkr
Copy link
Contributor

thbkrkr commented Jul 6, 2022

I understand that we need to do something if the provided interval is negative but I'm not sure that we should disable the observer. What does it mean to manage an ES for ECK without an observer? Isnt it dangerous to reconcile ES without knowing the real health?
How about validating that the provided interval is in a range that makes sense (5s < i < 1h) and crashing the operator if it's not.

@pebrc
Copy link
Collaborator Author

pebrc commented Jul 6, 2022

What does it mean to manage an ES for ECK without an observer? Isnt it dangerous to reconcile ES without knowing the real health?

I think you are right in general that disabling the observer is maybe a step too far. At least without compensating for it by having a synchronous observation in the reconcilation loop. I forgot that the synchronous observation only happens when the observer is first constructed which only happens on settings changes (e.g. certificate changed or similar)

However I also think that the notion of "real health" is flawed. We are alway working with a health observation that is by default up 10 seconds old and if Elasticsearch is slow to respond potentially even older. So adjusting the observation interval just moves the needle on the staleness scale from at worst 10 seconds to maybe 10 hours stale.

I am moving this back to draft mode to see if I can come up with a solution and also address the negative value issue for the annotation.

@pebrc pebrc marked this pull request as draft July 6, 2022 13:28
@pebrc pebrc marked this pull request as ready for review July 7, 2022 07:43
@pebrc
Copy link
Collaborator Author

pebrc commented Jul 7, 2022

@thbkrkr I have made it so that what I wrote in the OP is now true: when asynchronous observation is diabled a synchronous observation is made on each reconciliation. This stil has some drawbacks as the operator cannot react to changes in Elasticsearch health but at least each reconciliation is working with non-stale health data when it happens.

But I think I am also open to going back to your idea of simply validating a positive interval.

Copy link
Contributor

@thbkrkr thbkrkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviour looks good to me.
I left some minor comments on names and constants.

pkg/controller/elasticsearch/observer/manager_test.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/observer/observer.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/observer/observer.go Outdated Show resolved Hide resolved
cmd/manager/main.go Show resolved Hide resolved
@pebrc pebrc merged commit a228626 into elastic:main Jul 18, 2022
fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
…ervals (elastic#5861)

Annotations on individual Elasticsearch resources take precedence to avoid breaking existing customisations for users.
A non-positve value disables asynchronous observation completely. Only one synchronous observation happens during reconciliation. This disables also timely automatic pod disruption budget adjustment on health changes. As a side effect of client-go cache refreshes observations still happen every 10 hours due to reconciliation. This means disabling the asynchronous observation has the same effect as setting the observation interval to 10 hours.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement Enhancement of existing functionality v2.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose health check interval as operator-level configuration option
2 participants