-
Notifications
You must be signed in to change notification settings - Fork 721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to disable self-signed transport certs #7925
Conversation
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT |
I just realised I forgot to update the docs. Will push another commit. |
} | ||
// defensive copy of the current secret so we can check whether we need to update later on | ||
currentTransportCertificatesSecret := secret.DeepCopy() | ||
if es.Spec.Transport.TLS.SelfSignedEnabled() { | ||
if results.WithResults(reconcilePodTransportCertificates(ctx, es, ca, secret, pods, rotationParams)); results.HasError() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The option does not remove existing certificates from the secret so that the cluster keeps working during the transition if this option is turned on on an existing cluster.
IIUC it also means that the certificates are not going to be removed or reconciled, which may be unexpected from a user pov (the ones who want to get back some memory for example), and, in some very edge cases, may not allow the cluster to keep working if the certificates have just expired.
That being said I understand it is not trivial to detect when they are no longer required. Maybe by checking that a Pod has a specific annotation (as I would expect the config hash to not always be updated), for example self.signed.disabled: "true"
, and then:
- Remove the certificate if the annotation is present
- Keep reconciling the certificate if it's not there
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea. If I understand correctly we would:
- Add the annotation when the transport certs are disabled: this forces are restart of the pods
- once we observe the annotation present on a given pod we can remove its transport cert from the secret
In practice users will probably combine turning off the transports certs with other spec changes that configure an alternative source for the certs but we need the explicit marker.
Would you keep the transport CA around once all certs have been removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Add the annotation when the transport certs are disabled: this forces are restart of the pods
- once we observe the annotation present on a given pod we can remove its transport cert from the secret
Yes
Would you keep the transport CA around once all certs have been removed?
I think it is fine to remove it in that case, only the new CA from transport.tls.certificateAuthorities.configMapName
is useful once all the Pods are running with the externally provided certs.
Co-authored-by: Michael Morello <michael.morello@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial walk through code, looks good. Couple of minor things. Haven't tested as of yet.
docs/orchestrating-elastic-stack-applications/elasticsearch/transport-settings.asciidoc
Outdated
Show resolved
Hide resolved
pkg/controller/elasticsearch/certificates/transport/reconcile_test.go
Outdated
Show resolved
Hide resolved
}, | ||
assertSecrets: func(t *testing.T, secret map[string][]byte) { | ||
t.Helper() | ||
assert.Contains(t, secret, "ca.crt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of "ca.crt"
in this file; should we just use the certificates.CAFileName
in pkg/controller/common/certificates
?
Co-authored-by: Michael Montgomery <mmontg1@gmail.com>
…ansport-settings.asciidoc Co-authored-by: Michael Montgomery <mmontg1@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pebrc Overall I approve this, but I am curious about transitioning between the 2 options of remote-issued transport certificates, and ECK self-signed certificates. I attempted to go from cert-manager issues transport => ECK transport and it just fails with:
exception caught on transport layer:
javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Since you mention in the original message to support the transition between externally provisioned certs and self-signed certs provisioned by ECK
, it sounds as though this transition would be supported. Is that indeed the case?
Is there some needed documentation to transition between the 2 states?
|
||
The example assumes that a `ClusterIssuer` by the name of `ca-cluster-issuer` exists and a PEM encoded version of the CA certificate is available in a ConfigMap (in the example named `trust`). The CA certificate must be in a file called `ca.crt` inside the ConfigMap in the same namespace as the Elasticsearch resource. | ||
<2> The example assumes that a `ClusterIssuer` by the name of `ca-cluster-issuer` exists and a PEM encoded version of the CA certificate is available in a ConfigMap (in the example named `trust`). The CA certificate must be in a file called `ca.crt` inside the ConfigMap in the same namespace as the Elasticsearch resource. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the below Certificate.spec.issuerRef.name
needs to be updated to ca-cluster-issuer
, not selfsigned-issuer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea was that we have a self-signed issuer which issues a certificate (root-ca-secret
) which is then used in secod issuer the ca-cluster-issuer
to work around the restriction of the CSI driver that it cannot work with self-signed issuers
@@ -111,12 +113,13 @@ spec: | |||
driver: csi.cert-manager.io | |||
readOnly: true | |||
volumeAttributes: | |||
csi.cert-manager.io/issuer-name: ca-cluster-issuer | |||
csi.cert-manager.io/issuer-name: ca-cluster-issuer <2> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mention that self-signed issuers cannot be used with the CSI driver? (The certificaterequests
created by the csi driver do not include cert-manager.io/private-key-secret-name
annotation)
I think the important part when transitioning from one state to the other is to keep the CA ie. the trust bundle around until all nodes have rolled over to the self-signed mode and vice versa. For the transition to the externally created certificates we have included extra logic to keep the self-signed ECK CA around until there is no longer a pod requiring it. For the transiition form externally provided certificates to the ECK self-signed ones one has to apply multiple iterations: apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: es
spec:
version: 8.6.2
transport:
tls:
certificateAuthorities:
configMapName: trust
selfSignedCertificates:
disabled: false
nodeSets:
- name: mixed
count: 3
config:
node.store.allow_mmap: false which switches back to self-signed ECK certs and removes the driver volume and the config attributes and then in a second step also remove the trust in the extra CA in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not from your PR but the example in the documentation is incomplete, the reason is that the cluster issuer selfsigned-issuer
is not provided:
cloud-on-k8s/docs/orchestrating-elastic-stack-applications/elasticsearch/transport-settings.asciidoc
Line 151 in f913214
name: selfsigned-issuer |
- lastTransitionTime: "2024-07-22T07:51:15Z"
message: 'Referenced "ClusterIssuer" not found: clusterissuer.cert-manager.io
"selfsigned-issuer" not found'
reason: Pending
status: "False"
type: Ready
In order to document a working example should we add that issuer to the snippet?
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: selfsigned-issuer
spec:
selfSigned: {}
Yes I think that's a good idea. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
I added a bit to the docs to make the self-signed issuer explicit in ef8386b |
Related to #6954
It offers users a workaround for the problem with too many certificates in the transport certificate secret. They can configure external transport cert provisioning and disable self-signed transport certificates. When using a solution eg. like cert-manager's csi-driver as documented here this should allow for larger node sets of more than 250 nodes.
The large cluster scenario is certainly an an edge case but on smaller clusters the disabling of certificate provisioning might still be attractive reducing the amount of work the operator has to do in this area.
Note the new option to disable the self-signed transport certificates below:
The option does not remove existing certificates from the secret so that the cluster keeps working during the transition if this option is turned on on an existing cluster.
I also opted to remove the symlinking of certificates into the
emptyDir
config volume. I tried to figure out why we did this in the first place and am not sure. The reason I could think of was that we wanted to have static and predictable certificate and key file names across all nodes (transport.tls.crt
andtransport.tls.key
) But we can just use thePOD_NAME
environment variable to link directly into the mounted certificate secret volume.The reason to change this behaviour now is again to support the transition between externally provisioned certs and self-signed certs provisioned by ECK: if a user flips the switch to disable and then re-enable the self-signed certs, but does this accidentally without also configuring the config settings for the transport layer there is an edge case where an Elasticsearch pod will crashloop and cannot recover if we use symlinking:
By removing the symlinking the node can recover as soon as the certificates appear in the filesystem.