Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to disable self-signed transport certs #7925

Merged
merged 14 commits into from
Jul 22, 2024

Conversation

pebrc
Copy link
Collaborator

@pebrc pebrc commented Jul 1, 2024

Related to #6954

It offers users a workaround for the problem with too many certificates in the transport certificate secret. They can configure external transport cert provisioning and disable self-signed transport certificates. When using a solution eg. like cert-manager's csi-driver as documented here this should allow for larger node sets of more than 250 nodes.

The large cluster scenario is certainly an an edge case but on smaller clusters the disabling of certificate provisioning might still be attractive reducing the amount of work the operator has to do in this area.

Note the new option to disable the self-signed transport certificates below:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es
spec:
  version: 8.6.2 
  transport:
    tls:
      certificateAuthorities:
        configMapName: trust 
      selfSignedCertificates:
        disabled: true # <<<< new option
  nodeSets:
  - name: mixed
    count: 3
    config:
      xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/cert-manager-certs/tls.key
      xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/cert-manager-certs/tls.crt
      node.store.allow_mmap: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: PRE_STOP_ADDITIONAL_WAIT_SECONDS
            value: "5"
          volumeMounts: 
          - name: transport-certs
            mountPath: /usr/share/elasticsearch/config/cert-manager-certs
        volumes: 
        - name: transport-certs
          csi: 
            driver: csi.cert-manager.io
            readOnly: true
            volumeAttributes: 
              csi.cert-manager.io/issuer-name: ca-cluster-issuer
              csi.cert-manager.io/issuer-kind: ClusterIssuer
              csi.cert-manager.io/dns-names: "${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local"

The option does not remove existing certificates from the secret so that the cluster keeps working during the transition if this option is turned on on an existing cluster.

I also opted to remove the symlinking of certificates into the emptyDir config volume. I tried to figure out why we did this in the first place and am not sure. The reason I could think of was that we wanted to have static and predictable certificate and key file names across all nodes (transport.tls.crt and transport.tls.key) But we can just use the POD_NAME environment variable to link directly into the mounted certificate secret volume.

The reason to change this behaviour now is again to support the transition between externally provisioned certs and self-signed certs provisioned by ECK: if a user flips the switch to disable and then re-enable the self-signed certs, but does this accidentally without also configuring the config settings for the transport layer there is an edge case where an Elasticsearch pod will crashloop and cannot recover if we use symlinking:

  1. disable self-signed transport certs
  2. scale the cluster up by one or more nodes
  3. new nodes won't come up because certs are missing (user error)
  4. user tries to recover by re-enabling self-signed certs
  5. ES keeps bootlooping on the new nodes because the symlink is missing

By removing the symlinking the node can recover as soon as the certificates appear in the filesystem.

@botelastic botelastic bot added the triage label Jul 1, 2024
@pebrc pebrc marked this pull request as draft July 1, 2024 07:18
@pebrc pebrc added the >enhancement Enhancement of existing functionality label Jul 1, 2024
@botelastic botelastic bot removed the triage label Jul 1, 2024
@botelastic botelastic bot removed the triage label Jul 1, 2024
@pebrc pebrc marked this pull request as ready for review July 1, 2024 08:55
@pebrc
Copy link
Collaborator Author

pebrc commented Jul 1, 2024

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

@pebrc
Copy link
Collaborator Author

pebrc commented Jul 1, 2024

I just realised I forgot to update the docs. Will push another commit.

pkg/apis/elasticsearch/v1/elasticsearch_types.go Outdated Show resolved Hide resolved
}
// defensive copy of the current secret so we can check whether we need to update later on
currentTransportCertificatesSecret := secret.DeepCopy()
if es.Spec.Transport.TLS.SelfSignedEnabled() {
if results.WithResults(reconcilePodTransportCertificates(ctx, es, ca, secret, pods, rotationParams)); results.HasError() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option does not remove existing certificates from the secret so that the cluster keeps working during the transition if this option is turned on on an existing cluster.

IIUC it also means that the certificates are not going to be removed or reconciled, which may be unexpected from a user pov (the ones who want to get back some memory for example), and, in some very edge cases, may not allow the cluster to keep working if the certificates have just expired.

That being said I understand it is not trivial to detect when they are no longer required. Maybe by checking that a Pod has a specific annotation (as I would expect the config hash to not always be updated), for example self.signed.disabled: "true", and then:

  • Remove the certificate if the annotation is present
  • Keep reconciling the certificate if it's not there

?

Copy link
Collaborator Author

@pebrc pebrc Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea. If I understand correctly we would:

  1. Add the annotation when the transport certs are disabled: this forces are restart of the pods
  2. once we observe the annotation present on a given pod we can remove its transport cert from the secret

In practice users will probably combine turning off the transports certs with other spec changes that configure an alternative source for the certs but we need the explicit marker.

Would you keep the transport CA around once all certs have been removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Add the annotation when the transport certs are disabled: this forces are restart of the pods
  2. once we observe the annotation present on a given pod we can remove its transport cert from the secret

Yes

Would you keep the transport CA around once all certs have been removed?

I think it is fine to remove it in that case, only the new CA from transport.tls.certificateAuthorities.configMapName is useful once all the Pods are running with the externally provided certs.

Copy link
Contributor

@naemono naemono left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial walk through code, looks good. Couple of minor things. Haven't tested as of yet.

pkg/apis/elasticsearch/v1/elasticsearch_types.go Outdated Show resolved Hide resolved
pkg/apis/elasticsearch/v1/elasticsearch_types.go Outdated Show resolved Hide resolved
pkg/apis/elasticsearch/v1/elasticsearch_types.go Outdated Show resolved Hide resolved
},
assertSecrets: func(t *testing.T, secret map[string][]byte) {
t.Helper()
assert.Contains(t, secret, "ca.crt")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of "ca.crt" in this file; should we just use the certificates.CAFileName in pkg/controller/common/certificates?

pebrc and others added 3 commits July 12, 2024 16:17
Co-authored-by: Michael Montgomery <mmontg1@gmail.com>
…ansport-settings.asciidoc

Co-authored-by: Michael Montgomery <mmontg1@gmail.com>
@pebrc pebrc requested a review from naemono July 17, 2024 07:46
Copy link
Contributor

@naemono naemono left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pebrc Overall I approve this, but I am curious about transitioning between the 2 options of remote-issued transport certificates, and ECK self-signed certificates. I attempted to go from cert-manager issues transport => ECK transport and it just fails with:

exception caught on transport layer:
javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Since you mention in the original message to support the transition between externally provisioned certs and self-signed certs provisioned by ECK, it sounds as though this transition would be supported. Is that indeed the case?

Is there some needed documentation to transition between the 2 states?


The example assumes that a `ClusterIssuer` by the name of `ca-cluster-issuer` exists and a PEM encoded version of the CA certificate is available in a ConfigMap (in the example named `trust`). The CA certificate must be in a file called `ca.crt` inside the ConfigMap in the same namespace as the Elasticsearch resource.
<2> The example assumes that a `ClusterIssuer` by the name of `ca-cluster-issuer` exists and a PEM encoded version of the CA certificate is available in a ConfigMap (in the example named `trust`). The CA certificate must be in a file called `ca.crt` inside the ConfigMap in the same namespace as the Elasticsearch resource.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the below Certificate.spec.issuerRef.name needs to be updated to ca-cluster-issuer, not selfsigned-issuer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea was that we have a self-signed issuer which issues a certificate (root-ca-secret) which is then used in secod issuer the ca-cluster-issuer to work around the restriction of the CSI driver that it cannot work with self-signed issuers

@@ -111,12 +113,13 @@ spec:
driver: csi.cert-manager.io
readOnly: true
volumeAttributes:
csi.cert-manager.io/issuer-name: ca-cluster-issuer
csi.cert-manager.io/issuer-name: ca-cluster-issuer <2>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention that self-signed issuers cannot be used with the CSI driver? (The certificaterequests created by the csi driver do not include cert-manager.io/private-key-secret-name annotation)

@pebrc
Copy link
Collaborator Author

pebrc commented Jul 20, 2024

Is there some needed documentation to transition between the 2 states?

I think the important part when transitioning from one state to the other is to keep the CA ie. the trust bundle around until all nodes have rolled over to the self-signed mode and vice versa.

For the transition to the externally created certificates we have included extra logic to keep the self-signed ECK CA around until there is no longer a pod requiring it.

For the transiition form externally provided certificates to the ECK self-signed ones one has to apply multiple iterations:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es
spec:
  version: 8.6.2 
  transport:
    tls:
      certificateAuthorities:
        configMapName: trust 
      selfSignedCertificates:
        disabled: false
  nodeSets:
  - name: mixed
    count: 3
    config:
      node.store.allow_mmap: false

which switches back to self-signed ECK certs and removes the driver volume and the config attributes and then in a second step also remove the trust in the extra CA in .spec.transport.tls.certificateAuthorities. This should avoid the certificate errors you saw.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not from your PR but the example in the documentation is incomplete, the reason is that the cluster issuer selfsigned-issuer is not provided:

  - lastTransitionTime: "2024-07-22T07:51:15Z"
    message: 'Referenced "ClusterIssuer" not found: clusterissuer.cert-manager.io
      "selfsigned-issuer" not found'
    reason: Pending
    status: "False"
    type: Ready

In order to document a working example should we add that issuer to the snippet?

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}

@pebrc
Copy link
Collaborator Author

pebrc commented Jul 22, 2024

In order to document a working example should we add that issuer to the snippet?

Yes I think that's a good idea.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@pebrc
Copy link
Collaborator Author

pebrc commented Jul 22, 2024

I added a bit to the docs to make the self-signed issuer explicit in ef8386b

@pebrc pebrc enabled auto-merge (squash) July 22, 2024 09:39
@pebrc pebrc merged commit cd161b3 into elastic:main Jul 22, 2024
5 checks passed
@barkbay barkbay added >feature Adds or discusses adding a feature to the product and removed >enhancement Enhancement of existing functionality labels Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature Adds or discusses adding a feature to the product v2.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants