es-transport-certificates secret hits 1048576 character limit #3734
Comments
Some thoughts, from a quick look: each instance has a certificate (~4KB) plus a key (~2KB) in that secret, so a bit more than 6KB per instance. ~150 nodes therefore does seem to be the maximum, which is definitely a bug we must fix.

One way to improve the situation could be to set up one secret per NodeSet instead of one cluster-wide secret. This would not totally fix the problem, since a NodeSet with 150 nodes would still hit the 1MB size limit, but at least it offers a direct workaround: if that limit gets reached, it is possible to split the NodeSet into 2 different NodeSets, leading to 2 secrets of half the size.

Another way to fix it would be to maintain more than one secret for transport certificates, and to fill up those secrets according to the number of nodes (e.g. the first 100 nodes' certs go into the first secret, the next 100 into the second, etc.). However, if we don't want to rotate all Pods once a new secret is required, we must make sure empty additional secrets are pre-mounted on the Pod.

We could also look at compressing the certificates, but I'm not sure there's much we can gain there, and it would bring some additional complexity to handle decompression.

Another option would be to stop serving certificates through Secrets, and instead use e.g. an HTTP API in each Elasticsearch Pod that ECK would use to serve the cert once the Pod is up. That's something we used to do in the past by running an additional process in the Pod, but we stopped doing it due to the additional complexity involved.
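The size math above, together with the "fill secrets by node index" idea, can be sketched as follows. The ~4KB/~2KB figures are the rough averages from the comment, not exact sizes, and the `secret_index` helper is hypothetical, not ECK code:

```python
# Rough capacity math: how many nodes' transport certs fit in one Secret.
# Per-file sizes are the approximate averages mentioned above (assumed).
SECRET_LIMIT = 1048576          # Kubernetes Secret size limit (1 MiB)
CERT_BYTES = 4 * 1024           # ~4KB certificate per node (assumed)
KEY_BYTES = 2 * 1024            # ~2KB private key per node (assumed)
PER_NODE = CERT_BYTES + KEY_BYTES

max_nodes_per_secret = SECRET_LIMIT // PER_NODE
print(max_nodes_per_secret)     # 170 with these assumed sizes; real PEM
                                # sizes vary, hence "~150" in practice

# Hypothetical helper: assign node ordinals to secrets in fixed-size
# buckets (e.g. first 100 nodes to secret 0, next 100 to secret 1, ...).
def secret_index(node_ordinal: int, nodes_per_secret: int = 100) -> int:
    return node_ordinal // nodes_per_secret

print(secret_index(99), secret_index(100), secret_index(250))  # 0 1 2
```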
In newer versions of Kubernetes, Generic Ephemeral Volumes might be another solution that does not suffer from the size limitation on secrets and can be pre-populated with cert data. But we need to support pre-1.19 clusters as well. I think the one-secret-per-NodeSet approach strikes the best balance between offering a workaround to this problem and adding a lot of extra complexity.

Unrelated, and not relevant to the solution of this problem: I would be curious about the size of the individual nodes in the 300-node cluster mentioned in the OP. Especially if they are not 64G RAM (i.e. 32G heap) each, it would probably also be an option to see whether scaling the nodes up first would make sense, thus reducing the overall size of the cluster.
@sebgl I like the "one secret per NodeSet" approach. It would also be helpful if the operator didn't fail when creating the secret and instead just capped the number of certificates at some point. That would cause some nodes to not join the cluster, but ~150-200 of them could still join.

We created a 200-node NodeSet first, which somehow didn't hit the limit yet. Then we tried to change the cluster to 3 NodeSets with 50 nodes each, and made the mistake of doing it in one big change: the previous StatefulSet was still there with 200 nodes while 2 new StatefulSets were created with 50 nodes each. Only 4 nodes joined the cluster successfully, as that was probably the number of certificates that could still fit into the secret. Reducing the first StatefulSet would probably have helped, but the operator probably couldn't do that, since it still tried to reconcile the cluster by updating the secret with 300 certificates before making any other changes. We didn't have enough patience to try to recover, so we just deleted the cluster and created a new one with 3x 50 nodes per NodeSet.

@pebrc we were using 32G VMs with ~15G heap. Increasing RAM could have been an option, but eventually we figured out that for our use case we need fewer than 200 VMs, so it was not an issue anymore.
Summarizing the options we discussed out of band today; if I missed any, please correct me:

A) Creating one secret per NodeSet.

B) Compressing the certificates in the secret.

C) Creating one secret per Pod, and having a sidecar that watches for the appropriate secret (since the Pod knows its hostname and can derive the secret name) and pulls it from the k8s API. The downsides are the complexity, the number of secrets we create, and primarily that we would now need to give Elasticsearch Pods permissions to the k8s API. The upside is that there is no limit on the number of nodes in a NodeSet.

IMO we at least want to do A because it's simple and should cover ~all use cases. The others seem more complex and not necessary with the current usage data we have. If we had information that users wanted to create NodeSets with >150 nodes, then I could see something like C being worthwhile. When we discussed it in person, our memory was that even 150 node clusters were not unheard of but not rare. We did not have data at the time on how that was broken down by NodeSet, though. We may be able to find more. Since we only use
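Option B could be prototyped with standard gzip. A minimal sketch follows; note that the payload below is synthetic and highly repetitive, while real PEM data is base64-encoded and compresses far less well, which is part of why the earlier comment doubts there is much to gain:

```python
import gzip

# Sketch of option B: store transport certificates gzip-compressed in the
# Secret. The payload here is synthetic; real base64 PEM data compresses
# much less than this repetitive example.
pem_like = (
    "-----BEGIN CERTIFICATE-----\n"
    + "MIIDazCCAlOgAwIBAgIUf\n" * 200
    + "-----END CERTIFICATE-----\n"
).encode()

compressed = gzip.compress(pem_like)
print(len(pem_like), len(compressed))
assert len(compressed) < len(pem_like)

# Whatever consumes the secret (init container or sidecar) would then
# need the inverse step, which is the added complexity mentioned above:
restored = gzip.decompress(compressed)
assert restored == pem_like
```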
#3828 should help to mitigate this issue.
Bug Report
What did you do?
created a cluster with 200 nodes, then tried adding two more NodeSets with 50 nodes each (to better distribute data nodes across availability zones)
What did you expect to see?
I expected to see new nodes joining the cluster
What did you see instead? Under which circumstances?
Most nodes were stuck in the init phase. The elastic-internal-init-filesystem initContainers got stuck with the following log message:

elastic-operator-0 logs:
After recreating the cluster with 150 nodes, I see that the *-es-transport-certificates secret is already near the 1MB limit:

Environment
ECK version: 1.2.1-b5316231
Kubernetes: GKE (1.16.13-gke.1)
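A quick way to check how close such a secret is to the limit is to sum the decoded sizes of its .data entries. A sketch with synthetic data; the helper, key names, and per-node sizes are all hypothetical stand-ins for the real *-es-transport-certificates contents:

```python
import base64

# Hypothetical helper: compute a Secret's decoded payload size from its
# .data map (values are base64-encoded strings), to compare against the
# 1048576-byte limit from the issue title.
def secret_payload_bytes(data: dict) -> int:
    return sum(len(base64.b64decode(v)) for v in data.values())

# Synthetic stand-in for 150 nodes at ~4KB cert + ~2KB key each (assumed).
data = {}
for i in range(150):
    data[f"node-{i}.tls.crt"] = base64.b64encode(b"c" * 4096).decode()
    data[f"node-{i}.tls.key"] = base64.b64encode(b"k" * 2048).decode()

size = secret_payload_bytes(data)
print(size)  # 921600 bytes: already near the 1 MiB (1048576) limit
```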