Data Transport Cert Secret Size Overrun With Big Scale Out #6954

Closed
rtluckie opened this issue Jun 23, 2023 · 7 comments
Labels
>enhancement Enhancement of existing functionality v2.14.0

Comments

rtluckie commented Jun 23, 2023

Bug Report

What did you do?

  • Attempted to scale the data replicas to 250.

What did you expect to see?

  • Successful scale-up.

What did you see instead? Under which circumstances?

  • It appears that the ECK operator overflows the maximum k8s Secret size (1 MiB) for the transport certs if you scale the data nodes to >250.
  • The operator gets stuck in a scale-up loop while it tries to reconcile the cert secret. Even after scaling down, the operator does not seem to recover; it keeps reporting the following error:
    "Secret "elasticsearch-XXX-es-data-es-transport-certs" is invalid: data: Too long: must have at most 1048576 bytes"

Failed remediations

  • Issuing transport certs as documented here
  • Issuing wildcard transport certs as documented here

Environment

  • ECK version: 2.8.0

  • Kubernetes information:
    • Cloud: GKE v1.26.3-gke.1000
    • kubectl version: v1.27.2

  • Resource definition:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-myapp
spec:
  version: 8.6.1
  http:
    tls:
      selfSignedCertificate:
        disabled: true
  nodeSets:
  - config:
      action:
        auto_create_index: false
      node.roles:
      - master
    count: 3
    name: election
    podTemplate:
      metadata:
        annotations:
          linkerd.io/inject: enabled
        labels:
          ec.ai/component: elasticsearch
          ec.ai/component_group: myapp-service
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cloud.google.com/gke-spot
                  operator: DoesNotExist
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: elasticsearch.k8s.elastic.co/cluster-name
                    operator: In
                    values:
                    - elasticsearch-myapp
                topologyKey: topology.kubernetes.io/zone
              weight: 100
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: elasticsearch.k8s.elastic.co/cluster-name
                  operator: In
                  values:
                  - elasticsearch-myapp
              topologyKey: kubernetes.io/hostname
        automountServiceAccountToken: true
        containers:
        - name: elasticsearch
          resources:
            limits:
              cpu: "2"
              memory: 5Gi
            requests:
              cpu: "1"
              memory: 5Gi
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          image: busybox:1.28
          name: sysctl
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch analysis-icu
          name: analysis-icu
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch repository-gcs
          name: repository-gcs
        priorityClassName: app-critical-preempting
        serviceAccount: myapp-elasticsearch
        serviceAccountName: myapp-elasticsearch
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        storageClassName: standard-rwo
  - config:
      action:
        auto_create_index: false
      node.roles:
      - data
    count: 200
    name: data
    podTemplate:
      metadata:
        annotations:
          linkerd.io/inject: enabled
        labels:
          ec.ai/component: elasticsearch
          ec.ai/component_group: myapp-service
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: node_pool
                  operator: In
                  values:
                  - n2d-custom-8-65536
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: elasticsearch.k8s.elastic.co/cluster-name
                    operator: In
                    values:
                    - elasticsearch-myapp
                topologyKey: topology.kubernetes.io/zone
              weight: 100
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: elasticsearch.k8s.elastic.co/cluster-name
                  operator: In
                  values:
                  - elasticsearch-myapp
              topologyKey: kubernetes.io/hostname
        automountServiceAccountToken: true
        containers:
        - name: elasticsearch
          resources:
            limits:
              cpu: "7"
              memory: 56Gi
            requests:
              cpu: "7"
              memory: 56Gi
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          image: busybox:1.28
          name: sysctl
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch analysis-icu
          name: analysis-icu
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch repository-gcs
          name: repository-gcs
        priorityClassName: app-high-preempting
        serviceAccount: myapp-elasticsearch
        serviceAccountName: myapp-elasticsearch
        tolerations:
        - effect: NoSchedule
          key: n2d-custom-8-65536
          operator: Equal
          value: "true"
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: standard-rwo
```
  • Logs:

Continuous loop of reconciliation failures and timeouts, accompanied by the following:

 Secret "elasticsearch-myapp-es-data-es-transport-certs.v1" is invalid: data: Too long: must have at most 1048576 character
@botelastic botelastic bot added the triage label Jun 23, 2023
pebrc (Collaborator) commented Jun 24, 2023

One thing you can do to work around this limitation is to create multiple node sets with the data role and scale each of those up until you start running into the size limitation of k8s secrets, which seems to be reached at around 150-200 nodes per node set. You can then keep adding node sets until you reach the desired scale. See this issue for more context on the current model of one secret for transport certificates per node set.
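
A minimal sketch of that workaround, with illustrative node set names and counts (the same data role split across several node sets so that each per-node-set transport cert secret stays well below the limit):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-myapp
spec:
  version: 8.6.1
  nodeSets:
  - name: data-0
    count: 125          # keep each node set comfortably below ~150-200 nodes
    config:
      node.roles: ["data"]
  - name: data-1
    count: 125
    config:
      node.roles: ["data"]
```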

@pebrc pebrc added the >enhancement Enhancement of existing functionality label Jul 7, 2023
@botelastic botelastic bot removed the triage label Jul 7, 2023
nullren commented Mar 20, 2024

@pebrc are there any plans to address this? It's been several years since the workaround was implemented. We run a very large deployment of many ES clusters (for which this operator has been fantastically helpful), so when adding some of our larger clusters, I bumped into this error. Quite a surprise, as you can imagine.

barkbay (Contributor) commented Mar 21, 2024

I'm wondering if we could stop reconciling that Secret if we use a CSI driver to manage the certificates, for example. (Or give the user an option to skip the reconciliation of that Secret?)

pebrc (Collaborator) commented Mar 22, 2024

@barkbay I think that's a good idea.

@nullren we don't have concrete plans to address this right now. Did the workaround, using multiple node sets instead of one big one, have drawbacks for you that made you want to stick with a single node set?

nullren commented Apr 3, 2024

> @barkbay I think that's a good idea.
>
> @nullren we don't have concrete plans to address this right now. Did the workaround, using multiple node sets instead of one big one, have drawbacks for you that made you want to stick with a single node set?

The workaround did "work", but it is a whole lot of unnecessary complexity for something we don't even use (we disable security and don't use the certs at all, as we use our own network framework on k8s). There's just a lot of extra tooling we have to update to ensure that node sets "data-0", "data-1", ..., "data-N" are all found and reconciled correctly. We are still finding bugs due to this.

@pebrc pebrc self-assigned this Jun 27, 2024
pebrc added a commit that referenced this issue Jul 22, 2024
Related to #6954

It offers users a workaround for the problem of too many certificates in the transport certificate secret: they can configure external transport cert provisioning and disable self-signed transport certificates. When using a solution like cert-manager's csi-driver, as [documented here](https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-transport-settings.html#k8s-transport-third-party-tools), this should allow for node sets of more than 250 nodes.

The large cluster scenario is certainly an edge case, but on smaller clusters disabling certificate provisioning might still be attractive, [reducing the amount of work the operator has to do in this area](#1841).

Note the new option to disable the self-signed transport certificates below:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es
spec:
  version: 8.6.2 
  transport:
    tls:
      certificateAuthorities:
        configMapName: trust 
      selfSignedCertificates:
        disabled: true # <<<< new option
  nodeSets:
  - name: mixed
    count: 3
    config:
      xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/cert-manager-certs/tls.key
      xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/cert-manager-certs/tls.crt
      node.store.allow_mmap: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: PRE_STOP_ADDITIONAL_WAIT_SECONDS
            value: "5"
          volumeMounts: 
          - name: transport-certs
            mountPath: /usr/share/elasticsearch/config/cert-manager-certs
        volumes: 
        - name: transport-certs
          csi: 
            driver: csi.cert-manager.io
            readOnly: true
            volumeAttributes: 
              csi.cert-manager.io/issuer-name: ca-cluster-issuer
              csi.cert-manager.io/issuer-kind: ClusterIssuer
              csi.cert-manager.io/dns-names: "${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local"
```

The option does not remove existing certificates from the secret, so that the cluster keeps working during the transition when the option is enabled on an existing cluster.

I also opted to remove the symlinking of certificates into the `emptyDir` config volume. I tried to figure out why we did this in the first place and am not sure. The only reason I could think of was that we wanted static and predictable certificate and key file names across all nodes (`transport.tls.crt` and `transport.tls.key`), but we can just use the `POD_NAME` environment variable to link directly into the mounted certificate secret volume.

The reason to change this behaviour now is, again, to support the transition between externally provisioned certs and self-signed certs provisioned by ECK: if a user flips the switch to disable and then re-enable the self-signed certs, but does so accidentally without also configuring the transport layer settings, there is an edge case where an Elasticsearch pod will crashloop and cannot recover if we use symlinking:
1. disable self-signed transport certs
2. scale the cluster up by one or more nodes
3. new nodes won't come up because certs are missing (user error) 
4. user tries to recover by re-enabling self-signed certs
5. ES keeps bootlooping on the new nodes because the symlink is missing 

By removing the symlinking the node can recover as soon as the certificates appear in the filesystem. 

---------

Co-authored-by: Michael Morello <michael.morello@gmail.com>
Co-authored-by: Michael Montgomery <mmontg1@gmail.com>
pebrc (Collaborator) commented Jul 23, 2024

We have implemented an option to turn off the ECK-managed self-signed certificates in #7925, which is going to ship with the next release of ECK. This should cover the case you mentioned, @nullren. This means we now have two workarounds for large clusters:

Either:

  1. split a node set into multiple node sets, or
  2. disable the transport certs and provision them externally (e.g. with cert-manager; see the issuer sketch below)

My vote would be to close this issue unless there are additional concerns we did not address with these changes.
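
For the second workaround, the `ca-cluster-issuer` referenced in the csi-driver example above could be backed by a simple cert-manager CA issuer. A minimal sketch, assuming cert-manager is installed and a CA key pair has already been stored in a Secret named `transport-ca-key-pair` in cert-manager's cluster resource namespace (both the issuer and Secret names are illustrative):

```yaml
# Cluster-wide CA issuer used by csi.cert-manager.io to mint per-pod
# transport certificates; the referenced Secret must contain the CA's
# tls.crt and tls.key.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ca-cluster-issuer
spec:
  ca:
    secretName: transport-ca-key-pair
```

The same CA certificate would typically also be published in the ConfigMap referenced under `spec.transport.tls.certificateAuthorities.configMapName` (named `trust` in the example above) so that the externally issued certificates are trusted across the cluster.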

nullren commented Jul 24, 2024

@pebrc that works for me. Thank you!

@pebrc pebrc closed this as completed Jul 25, 2024