Data Transport Cert Secret Size Overrun With Big Scale Out #6954
Comments
One thing you can do to work around this limitation is to create multiple node sets with the data role and scale each of them up until you start running into the size limitation of k8s secrets, which seems to be around 150-200 nodes. You can then keep adding node sets until you reach the desired scale. See this issue for more context on the current model of one secret for transport certificates per node set. |
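A minimal sketch of that workaround, assuming a cluster named `es` with a dedicated `master` node set (both hypothetical) and an illustrative count of 150 data nodes per node set. The counts are placeholders chosen to stay under the observed limit, not tested values:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es                      # hypothetical cluster name
spec:
  version: 8.6.2
  nodeSets:
  - name: master                # assumed dedicated master-eligible nodes
    count: 3
    config:
      node.roles: ["master"]
  # Each data node set gets its own transport certificate secret,
  # so no single secret has to hold certificates for every data node.
  - name: data-0
    count: 150                  # illustrative; keep below the per-secret limit
    config:
      node.roles: ["data"]
  - name: data-1
    count: 150
    config:
      node.roles: ["data"]
```

To grow further, add `data-2`, `data-3`, and so on rather than raising `count` on an existing node set.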
@pebrc are there any plans to address this? it's been several years since the workaround was implemented. we run a very large deployment of many ES clusters (for which this operator has been fantastically helpful), so when adding some of our larger clusters, i bumped into this error. quite a surprise, you can imagine. |
I'm wondering if we could stop reconciling that |
The workaround did "work", but it is a whole lot of unnecessary complexity for something we don't even use (we disable security and don't use the certs at all, as we use our own network framework on k8s). There's just a lot of extra tooling we have to update to ensure that node sets "data-0", "data-1", ..., "data-N" are all found and reconciled correctly. We are still finding some bugs due to this. |
Related to #6954. It offers users a workaround for the problem of too many certificates in the transport certificate secret: they can configure external transport cert provisioning and disable the self-signed transport certificates. With a solution such as cert-manager's csi-driver, as [documented here](https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-transport-settings.html#k8s-transport-third-party-tools), this should allow for larger node sets of more than 250 nodes.

The large cluster scenario is certainly an edge case, but on smaller clusters disabling certificate provisioning might still be attractive, [reducing the amount of work the operator has to do in this area](#1841).

Note the new option to disable the self-signed transport certificates below:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es
spec:
  version: 8.6.2
  transport:
    tls:
      certificateAuthorities:
        configMapName: trust
      selfSignedCertificates:
        disabled: true # <<<< new option
  nodeSets:
  - name: mixed
    count: 3
    config:
      xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/cert-manager-certs/tls.key
      xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/cert-manager-certs/tls.crt
      node.store.allow_mmap: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: PRE_STOP_ADDITIONAL_WAIT_SECONDS
            value: "5"
          volumeMounts:
          - name: transport-certs
            mountPath: /usr/share/elasticsearch/config/cert-manager-certs
        volumes:
        - name: transport-certs
          csi:
            driver: csi.cert-manager.io
            readOnly: true
            volumeAttributes:
              csi.cert-manager.io/issuer-name: ca-cluster-issuer
              csi.cert-manager.io/issuer-kind: ClusterIssuer
              csi.cert-manager.io/dns-names: "${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local"
```

The option does not remove existing certificates from the secret, so the cluster keeps working during the transition if this option is turned on for an existing cluster.
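The manifest references two objects that are not shown: the `ca-cluster-issuer` ClusterIssuer and the `trust` ConfigMap. A rough sketch of what they might look like follows, assuming a CA-backed cert-manager issuer. The secret name `root-ca-secret` is hypothetical, and the `ca.crt` key in the ConfigMap is an assumption based on the linked documentation:

```yaml
# cert-manager issuer that signs the per-pod transport certificates.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ca-cluster-issuer
spec:
  ca:
    secretName: root-ca-secret   # hypothetical secret holding the CA key pair
---
# CA certificate that Elasticsearch should trust on the transport layer,
# referenced by spec.transport.tls.certificateAuthorities.configMapName.
apiVersion: v1
kind: ConfigMap
metadata:
  name: trust
data:
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    (PEM-encoded CA certificate goes here)
    -----END CERTIFICATE-----
```

The ConfigMap would need to live in the same namespace as the Elasticsearch resource.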
I also opted to remove the symlinking of certificates into the `emptyDir` config volume. I tried to figure out why we did this in the first place and am not sure. The reason I could think of was that we wanted to have static and predictable certificate and key file names across all nodes (`transport.tls.crt` and `transport.tls.key`). But we can just use the `POD_NAME` environment variable to link directly into the mounted certificate secret volume.

The reason to change this behaviour now is again to support the transition between externally provisioned certs and self-signed certs provisioned by ECK: if a user flips the switch to disable and then re-enable the self-signed certs, but does this accidentally without also configuring the config settings for the transport layer, there is an edge case where an Elasticsearch pod will crashloop and cannot recover if we use symlinking:

1. disable self-signed transport certs
2. scale the cluster up by one or more nodes
3. new nodes won't come up because certs are missing (user error)
4. user tries to recover by re-enabling self-signed certs
5. ES keeps bootlooping on the new nodes because the symlink is missing

By removing the symlinking, the node can recover as soon as the certificates appear in the filesystem.

---------

Co-authored-by: Michael Morello <michael.morello@gmail.com>
Co-authored-by: Michael Montgomery <mmontg1@gmail.com>
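As an illustration of linking directly via `POD_NAME`, the generated Elasticsearch settings could look roughly like the fragment below. The mount path and the `<pod name>.tls.*` file naming are assumptions made for this sketch, not taken from the change itself:

```yaml
# elasticsearch.yml fragment (sketch). Elasticsearch expands ${POD_NAME}
# from the container environment when it loads the configuration.
# Assumed: the transport certificate secret is mounted at this path and
# holds one "<pod name>.tls.crt" / "<pod name>.tls.key" pair per pod.
xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/transport-certs/${POD_NAME}.tls.crt
xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/transport-certs/${POD_NAME}.tls.key
```

Because the paths point at the mounted secret volume rather than at a symlink created once at startup, a restarting pod can pick up the files as soon as they appear.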
We have implemented an option to turn off the ECK managed self-signed certificates in #7925 which is going to ship with the next release of ECK. This should cover the case you mentioned @nullren. This means we now have two workarounds for large clusters: Either:
My vote would be to close this issue unless there are additional concerns we did not address with these changes. |
@pebrc that works for me. Thank you! |
Bug Report
What did you do?
What did you expect to see?
What did you see instead? Under which circumstances?
Failed remediations
Environment
ECK version: 2.8.0
Kubernetes information:
kubectl version: v1.27.2
Resource definition:
Continuous loop of reconciliation failures and timeout accompanied by the following.