Autoscaling Elasticsearch: Introduce a dedicated custom resource #5978
Conversation
Last commit enables advanced validation:
status:
  conditions:
  - lastTransitionTime: "2022-08-31T09:58:03Z"
    message: Autoscaler is unhealthy
    status: "True"
    type: Active
  - lastTransitionTime: "2022-08-31T09:58:03Z"
    message: 'ElasticsearchAutoscaler.autoscaling.k8s.elastic.co "autoscaling-sample"
      is invalid: spec.policies[0].resources.nodeCount.min: Invalid value: -1: min
      count must be equal or greater than 0'
    status: "False"
    type: Healthy
  - lastTransitionTime: "2022-08-31T09:58:03Z"
    message: Autoscaler is unhealthy
    status: "False"
    type: Online
  - lastTransitionTime: "2022-08-30T11:53:45Z"
    status: "False"
    type: Limited
  observedGeneration: 2
Note that the custom resource is considered as "unhealthy" in that case.
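For context, a minimal fragment of a spec that would trigger the error reported in the Healthy condition above; the field layout here is an assumption reconstructed from that error message (spec.policies[0].resources.nodeCount.min), not copied from the PR:
spec:
  policies:
    - name: data
      roles: ["data"]
      resources:
        nodeCount:
          min: -1   # rejected: min count must be equal or greater than 0
          max: 3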
Force-pushed from 6c6c958 to fb4f91a.
Unless I'm missing something, there is nothing specific to the generation of the OpenAPI v3 schema of Quantity values in this PR.
@naemono see also:
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"
I don't understand how the example you mentioned is not rejected by the API server tbh.
I think I was simply assuming that the validation behaves the same in both cases. It's probably not worth digging in to find out why the validation is slightly different, but it certainly seems to be.
I tried to look through the source code and it looks really good; for me it is ready to merge. I only found a few nits here and there and one incorrect usage of the log API. I also ran some tests and found some issues. However, I am not sure if they are related to this PR or whether we already have the same issues with annotation-based autoscaling.
What I am seeing is that the Elasticsearch resource never fully reconciles (or at least only briefly). This is because the desired nodes API integration throws errors like the following:
2022-09-19T20:45:16.341+0200 ERROR manager.eck-operator Reconciler error {"service.version": "2.5.0-SNAPSHOT+8831df33", "controller": "elasticsearch-controller", "object": {"name":"es","namespace":"default"}, "namespace": "default", "name": "es", "reconcileID": "896949ed-99d7-4f71-a02f-c839d853a9aa", "error": "elasticsearch client failed for https://es-es-internal-http.default.svc:9200/_internal/desired_nodes/651bb9ea-360b-462f-87a1-fe5415d78ac5/8?error_trace=true: 400 Bad Request: {Status:400 Error:{CausedBy:{Reason: Type:} Reason:Desired nodes with history [651bb9ea-360b-462f-87a1-fe5415d78ac5] and version [8] already exists with a different definition Type:
It seems that the same generation of the ES resource is reconciled multiple times (expected, I would say) but that the desired nodes differ within one generation. I have not dug deeper yet into why exactly this is. It seems likely that this is related to changing resources for the autoscaled node sets (the part that confuses me here is that if the resources change I would also expect the generation to change):
< "storage": "3221225472b",
---
> "storage": "2147483648b",
I think this is because we take the desired storage from the volume claims, which are being resized one by one while the Elasticsearch resource is not changing.
Because the nodes reconciliation has already happened at this point, the autoscaling itself is not negatively affected and the cluster keeps scaling correctly. But the Elasticsearch status is stuck in applying changes and the log is full of reconciler errors.
pkg/utils/stringsutil/strings.go
truncated := "" | ||
count := 0 | ||
for _, char := range s { | ||
truncated += string(char) |
Nit: could you not just re-slice the underlying byte array once you have reached n, instead of concatenating the runes?
Let me know if this is what you had in mind: fe77306
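For reference, a minimal sketch of the re-slicing approach suggested above; the function name and signature are illustrative and not necessarily what fe77306 implements:
package stringsutil

// Truncate returns at most maxLen runes of s. Instead of concatenating
// runes one by one, it re-slices the original string once the rune
// budget is reached.
func Truncate(s string, maxLen int) string {
	count := 0
	for i := range s { // i is the byte offset of the current rune
		if count == maxLen {
			return s[:i]
		}
		count++
	}
	return s
}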
},
},
checker: yesCheck,
},
Missing the wantValidationError here?
We actually support the case where the user is not relying on the default volume claim. I added a comment in the unit test here.
I hit a similar issue in #5979. I think we need to revisit the way we use the desired nodes API for the next release 😕
Very nice work!
Nit: almost all K8s resources displayed via kubectl have an "AGE" column, but not this one. No big deal, but my eyes aren't used to it.
I spotted a small limitation, I think: we can't create an autoscaler for an ES that has a nodeSet without explicit `node.roles`, which seems acceptable to me:
nodeSets:
- name: default
  count: 3
  config:
    node.store.allow_mmap: false
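For illustration, declaring the roles explicitly on that nodeSet would make it compatible with the autoscaler; the role list below is just an example, not a recommendation:
nodeSets:
- name: default
  count: 3
  config:
    node.store.allow_mmap: false
    node.roles: ["master", "data", "ingest"]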
LGTM!
@naemono I think I addressed your comments, please let me know if I missed something 🙇
This looks great. Nice work @barkbay
Follow-up of #5978, which has been merged into main with the wrong controller-tools version.
This PR introduces a dedicated Kubernetes Custom Resource Definition (CRD), and an associated controller, to configure Elasticsearch autoscaling with ECK.
Naming and group
This PR introduces a new resource named ElasticsearchAutoscaler. It lives in a group named autoscaling.k8s.elastic.co. This is mostly to leave the door open to any additional autoscaling resources that we may want to add in the future.
ElasticsearchAutoscaler specification
The new CRD is very similar to the existing autoscaling annotation, with the obvious difference that an elasticsearchRef must be provided by the user.
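A sketch of what such a resource could look like; the apiVersion and the exact policy fields are assumptions based on the existing annotation format, not necessarily the final schema:
apiVersion: autoscaling.k8s.elastic.co/v1alpha1
kind: ElasticsearchAutoscaler
metadata:
  name: autoscaling-sample
spec:
  elasticsearchRef:
    name: es            # the single Elasticsearch cluster managed by this autoscaler
  policies:
    - name: data
      roles: ["data"]
      resources:
        nodeCount:
          min: 1
          max: 5
        memory:
          min: 2Gi
          max: 8Gi
        storage:
          min: 10Gi
          max: 50Gi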
Only one cluster can be managed by a given ElasticsearchAutoscaler. This is similar to the K8s HorizontalPodAutoscaler and VerticalPodAutoscaler. It also makes it easier to understand the autoscaler status and the relationship between the autoscaler and the Elasticsearch cluster.
Status
The status consists of 2 main elements:
- conditions provides an overall state of the reconciliation status.
- policies holds the calculated resources and any error/important messages.
Printed columns
The conditions Limited, Active and Healthy are printed as part of the output of kubectl get elasticsearchautoscaler.autoscaling.k8s.elastic.co/<autoscaler_name>:
- TARGET: the name of the Elasticsearch cluster which is autoscaled.
- ACTIVE: True when the ElasticsearchAutoscaler resource is managed by the operator, and the target Elasticsearch cluster does exist.
- HEALTHY: True if resources have been calculated for all the autoscaling policies and no error has been encountered during the reconciliation process.
- LIMITED: True when a resource limit is reached.
For each printed condition there is an additional message in the related condition.
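An illustrative example of that output; the values shown here are made up:
$ kubectl get elasticsearchautoscaler.autoscaling.k8s.elastic.co/autoscaling-sample
NAME                 TARGET   ACTIVE   HEALTHY   LIMITED
autoscaling-sample   es       True     True      False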
Noteworthy change
By default there is now a ratio of 1:1 between the CPU limit and the request. This is to comply with the current desired nodes API implementation.
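To illustrate the effect on the generated Elasticsearch container (the values are made up): with a computed CPU request of 2, the container resources would look roughly like this:
resources:
  requests:
    cpu: "2"
  limits:
    cpu: "2"   # CPU limit equals the CPU request by default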
Testing
cloud-on-k8s/test/e2e/es/autoscaling_test.go (lines 89 to 90 in 79ef072)
TODO: