
es-transport-certificates secrets hits 1048576 character limit #3734

Closed
markauskas opened this issue Sep 8, 2020 · 5 comments
Labels
>bug Something isn't working

Comments


markauskas commented Sep 8, 2020

Bug Report

What did you do?

created a cluster with 200 nodes, then tried adding two more NodeSets with 50 nodes each (to better distribute data nodes across availability zones)

What did you expect to see?

I expected to see new nodes joining the cluster

What did you see instead? Under which circumstances?

Most nodes were stuck in the init phase; the elastic-internal-init-filesystem initContainers hung with the following log message:

waiting for the transport certificates (/mnt/elastic-internal/transport-certificates/my-cluster-es-data-node-c-9.tls.key)

elastic-operator-0 logs:

"error":"Secret \"my-cluster-es-transport-certificates\" is invalid: data: Too long: must have at most 1048576 characters"

After recreating the cluster with 150 nodes, I see that the *-es-transport-certificates secret is already near the 1MB limit:

% kubectl get secret my-cluster-es-transport-certificates -o yaml | wc -c
  900618

% kubectl get secret my-cluster-es-transport-certificates
NAME                                  TYPE     DATA   AGE
my-cluster-es-transport-certificates   Opaque   307    145m
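
A rough way to check how much of the 1 MiB limit the secret data uses (assuming jq 1.6+ is available; the limit applies to the combined size of the decoded data values, and PEM content is ASCII so the character count matches the byte count):

% kubectl get secret my-cluster-es-transport-certificates -o json \
    | jq '[.data[] | @base64d | length] | add'   # total decoded bytes across all entries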

Environment

  • ECK version:

1.2.1-b5316231

  • Kubernetes information:

GKE (1.16.13-gke.1)

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T21:52:18Z", GoVersion:"go1.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-gke.1", GitCommit:"688c6543aa4b285355723f100302d80431e411cc", GitTreeState:"clean", BuildDate:"2020-07-21T02:37:26Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}
  • Logs:
Reconciler error","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","controller":"elasticsearch-controller","request":"elasticsearch/my-cluster","error":"Secret \"my-cluster-es-transport-certificates\" is invalid: data: Too long: must have at most 1048576 characters","errorCauses":[{"error":"Secret \"my-cluster-es-transport-certificates\" is invalid: data: Too long: must have at most 1048576 characters"}],"error.stack_trace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.2/pkg/util/wait/wait.go:88"}

sebgl commented Sep 10, 2020

Some thoughts:

From a quick look: each instance has a certificate (~4 KB) plus a key (~2 KB) in that secret, so a bit more than 6 KB per instance. That makes ~150 nodes an effective maximum, which is definitely a bug we must fix.
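
As a rough sanity check of that estimate, the decoded size of each entry can be listed, largest first (again assuming jq 1.6+):

% kubectl get secret my-cluster-es-transport-certificates -o json \
    | jq -r '.data | to_entries[] | "\(.value | @base64d | length) \(.key)"' \
    | sort -rn | head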

One way to improve the situation could be to set up one secret per NodeSet instead of one cluster-wide secret. This would not totally fix the problem, since a NodeSet with 150 nodes would still hit the 1 MB size limit. But at least it offers a direct workaround: if that limit is reached, the nodeSet can be split into 2 different nodeSets, leading to 2 secrets of half the size.
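
For illustration, if certificates were stored per NodeSet, splitting one large nodeSet in two could look roughly like this (hypothetical names, counts and version; a real manifest would also carry the usual config and podTemplate sections):

% kubectl apply -f - <<EOF
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: my-cluster
spec:
  version: 7.9.0          # placeholder version
  nodeSets:
  - name: data-a          # previously a single 150-node nodeSet
    count: 75
  - name: data-b          # identical spec under a different name
    count: 75
EOF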

Another way to fix it would be to maintain more than one secret for transport certificates and to fill those secrets up according to the number of nodes (e.g. certificates for the first 100 nodes go into the first secret, the next 100 into the second, etc.). However, if we don't want to rotate all Pods once a new secret is required, we must make sure empty additional secrets are pre-mounted on the Pod.

We could also look at certificate compression, but I'm not sure there's much to gain there, and it would bring additional complexity to handle decompression.

Another option would be to stop serving certificates through Secrets and instead use e.g. an HTTP API in each Elasticsearch Pod that ECK would use to serve the cert once the Pod is up. That's something we used to do in the past by running an additional process in the Pod, but we stopped doing it because of the additional complexity involved.


pebrc commented Sep 10, 2020

In newer versions of Kubernetes, Generic Ephemeral Volumes might be another solution that does not suffer from the size limitation on secrets and can be pre-populated with cert data. But we need to support pre-1.19 clusters as well.

I think the one-secret-per-NodeSet approach strikes the best balance between offering a workaround for this problem and not adding a lot of extra complexity.

Unrelated and not relevant for the solution of this problem: I would be curious about the size of the individual nodes in the 300-node cluster mentioned in the OP. Especially if they are not 64 GB RAM (i.e. 32 GB heap) each, it might also be worth checking whether scaling the nodes up first would make sense, thus reducing the overall size of the cluster.

markauskas (Author) commented:

@sebgl I like the "one secret per NodeSet" approach. It would also be helpful if the operator didn't fail outright when creating the secret, but instead capped the number of certificates it puts in at some point. Some nodes would then not join the cluster, but ~150-200 of them still could.

We created a 200-node NodeSet first, which somehow didn't hit the limit yet. Then we tried to change the cluster to 3 NodeSets with 50 nodes each and made the mistake of doing it in one big change, so the previous StatefulSet was still there with 200 nodes while 2 new StatefulSets were created with 50 nodes each. Only 4 nodes joined the cluster successfully, which was probably the number of certificates that could still fit into the secret. Reducing the first StatefulSet would probably have helped, but the operator presumably couldn't do that, since it first tried to reconcile the cluster by updating the secret with 300 certificates before making any other changes. We didn't have enough patience to try to recover, so we just deleted the cluster and created a new one with 3 NodeSets of 50 nodes each.

@pebrc we were using 32 GB VMs with ~15 GB heap. Increasing RAM could have been an option, but eventually we figured out that for our use case we need fewer than 200 VMs, so it was not an issue anymore.

anyasabo (Contributor) commented:

Summarizing options we discussed out of band today. If I missed any please correct me:

A) Creating one secret per nodeset
This gives us max 150 nodes per nodeset rather than 150 nodes per cluster. We could enforce the limit in the openapi spec. Users can create identical nodesets with different names to work around the limit if they need >150 nodes of a type.

B) Compressing the certificates in the secret.
We would need a sidecar to watch for the file and decompress it. Testing showed ~30% compression, which would get us up to around 200 nodes (a rough way to reproduce that number is sketched after this list). The pod would not need any additional k8s permissions though.

C) Creating one secret per pod, and having a sidecar that watches for the appropriate secret (since the pod knows its hostname and can derive the secret name) and pulls it from the k8s API. The downsides are the complexity, the number of secrets we create, and primarily that we would now need to give Elasticsearch pods permissions to the k8s API. The upside is that there is no limit on the number of nodes in a nodeset (a rough sketch of such a sidecar follows below).
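
The ~30% figure from option B can be roughly double-checked by gzipping the concatenated PEM data of the existing secret (this slightly overstates the savings compared to compressing each entry on its own; assumes jq 1.6+):

% kubectl get secret my-cluster-es-transport-certificates -o json \
    | jq -r '.data[] | @base64d' | wc -c            # uncompressed size
% kubectl get secret my-cluster-es-transport-certificates -o json \
    | jq -r '.data[] | @base64d' | gzip -c | wc -c  # compressed size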

IMO we at least want to do A because it's simple and should cover ~all use cases. The others seem more complex and not necessary with the current usage data we have. If we had information that users wanted to create nodesets with more than 150 nodes, then I could see something like C being worthwhile. When we discussed it in person, our memory was that 150-node clusters were not unheard of, but not common. We did not have data at the time on how that was broken down by nodeset, though. We may be able to find more.
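
For concreteness, a minimal sketch of what the option C sidecar could do, assuming a hypothetical per-pod secret named after the pod and using kubectl for brevity; a real sidecar would more likely talk to the API server directly, and the secret name and target file here are made up:

# hypothetical per-pod secret name derived from the pod hostname
SECRET="${HOSTNAME}-transport-certs"
# wait for the operator to create the secret, then extract the key material
until KEY="$(kubectl get secret "$SECRET" -o jsonpath="{.data['tls\.key']}" 2>/dev/null)" \
      && [ -n "$KEY" ]; do
  sleep 2
done
printf '%s' "$KEY" | base64 -d > /mnt/elastic-internal/transport-certificates/tls.key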


Since we only use certificate-level transport verification and not full verification (yet, see #2833), I was wondering why we could not just use one certificate for all nodes in a cluster. It was brought up that ES does something with the transport cert for identity, but from reading the docs here it wasn't clear what. If anyone can fill in the gaps, that would be useful.


barkbay commented Oct 15, 2020

#3828 should help to mitigate this issue.
