[OSDOCS#10055]: Document manual rotation of etcd signer certificates #77406

Merged
merged 1 commit into from
Jun 21, 2024

tmalove
Contributor

@tmalove tmalove commented Jun 12, 2024

OSDOCS-10055

Note: This PR has been reviewed; however, this version includes updates from the initial peer review. Thanks!

Version(s):
4.16+

Link to docs preview: etcd certificates (Updated 6/21/2024)

QE review:

  • [x] QE has approved this change.

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 12, 2024
@ocpdocs-previewbot

ocpdocs-previewbot commented Jun 12, 2024

@tmalove
Contributor Author

tmalove commented Jun 12, 2024

/retest

$ oc get secret -n openshift-etcd etcd-signer -ojsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
----

. Recreate the signer, if the remaining lifetime is close to the current date, by deleting the signer and wait for the status pod rollout:
Contributor

static pod

$ oc wait --for=condition=Progressing=False --timeout=15m clusteroperator/etcd
----

. After the CA rotates, you can switch the original CA with the new one.
Contributor

the CA doesn't rotate (yet), etcd restarts :)

Contributor Author

:-) Thanks Thomas!


:_mod-docs-content-type: PROCEDURE
[id="rotating-certificate-authority_{context}"]
=== Rotating the etcd certificate authority
Contributor

looks good as a whole! we need to also mention the metrics-signer-ca rotation, which is equally important for metrics and alerting

.Procedure

. Verify the remaining lifetime of the new signer certificate:

Contributor

@tmalove Try adding a + here and see if that helps.

Contributor Author

@lahinson the problem was the file suffix...it is corrected now, thanks!

@tmalove tmalove force-pushed the etcd-osdocs-10055-tlove branch 3 times, most recently from e678190 to ca09778 Compare June 14, 2024 18:53
@tmalove
Contributor Author

tmalove commented Jun 17, 2024

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jun 17, 2024
@tmalove
Contributor Author

tmalove commented Jun 17, 2024

/retest

@geliu2016

/lgtm
I also tried the doc steps on OCP 4.16, and etcd failed to restart after I removed etcd-signer. I will test further to confirm the root cause.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2024
@tjungblu
Contributor

/lgtm

@geliu2016 Make sure that you don't delete the signer in the openshift-config namespace; that would indeed cause failures.
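
A script could guard against deleting the signer in the wrong namespace, which is the failure mode described in this comment. The following is only a minimal sketch and is not part of the documented procedure: `delete_etcd_signer` and the `OC` override variable are hypothetical names introduced for illustration. Only the `oc delete secret -n openshift-etcd etcd-signer` command comes from this PR.

```shell
#!/usr/bin/env bash
# OC is a hypothetical override so that the command can be previewed
# (for example OC=echo) instead of being run against a cluster.
OC="${OC:-oc}"

# delete_etcd_signer NAMESPACE
# Deletes the etcd-signer secret only when NAMESPACE is openshift-etcd.
# Any other namespace (including openshift-config) is refused.
delete_etcd_signer() {
  local ns="$1"
  if [ "$ns" != "openshift-etcd" ]; then
    echo "refusing to delete etcd-signer in namespace '$ns'" >&2
    return 1
  fi
  "$OC" delete secret -n "$ns" etcd-signer
}
```

Running `OC=echo` before calling `delete_etcd_signer openshift-etcd` prints the command that would be run without touching the cluster.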

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 20, 2024
@tmalove
Contributor Author

tmalove commented Jun 20, 2024

/label peer-review-needed

@openshift-ci openshift-ci bot added the peer-review-needed Signifies that the peer review team needs to review this PR label Jun 20, 2024
@jeana-redhat jeana-redhat added the peer-review-in-progress Signifies that the peer review team is reviewing this PR label Jun 20, 2024
Contributor

@jeana-redhat jeana-redhat left a comment

Left some suggestions but overall great addition!

/remove-label peer-review-in-progress
/remove-label peer-review-needed
/label peer-review-done


.Additional resources

* xref:../../security/certificate_types_descriptions/etcd-certificates.adoc#rotating-certificate-authority_cert-types-etcd-certificates[Rotating the etcd certificate].
Contributor

Suggested change
* xref:../../security/certificate_types_descriptions/etcd-certificates.adoc#rotating-certificate-authority_cert-types-etcd-certificates[Rotating the etcd certificate].
* xref:../../security/certificate_types_descriptions/etcd-certificates.adoc#rotating-certificate-authority_cert-types-etcd-certificates[Rotating the etcd certificate]


.Procedure

. Verify the remaining lifetime of the new signer certificate:
Contributor

Suggested change
. Verify the remaining lifetime of the new signer certificate:
. Verify the remaining lifetime of the new signer certificate by running the following command:

Comment on lines 20 to 33
. Re-create the signer, if the remaining lifetime is close to the current date, by deleting the signer and wait for the static pod rollout:
+
[source,terminal]
----
$ oc delete secret -n openshift-etcd etcd-signer
----
+
[source,terminal]
----
$ oc wait --for=condition=Progressing=False --timeout=15m clusteroperator/etcd
----
Contributor

Suggested change
. Re-create the signer, if the remaining lifetime is close to the current date, by deleting the signer and wait for the static pod rollout:
+
[source,terminal]
----
$ oc delete secret -n openshift-etcd etcd-signer
----
+
[source,terminal]
----
$ oc wait --for=condition=Progressing=False --timeout=15m clusteroperator/etcd
----
. If the remaining lifetime is close to the current date, re-create the signer by running the following commands:
.. Delete the signer:
+
[source,terminal]
----
$ oc delete secret -n openshift-etcd etcd-signer
----
.. Wait for the static pod rollout
+
[source,terminal]
----
$ oc wait --for=condition=Progressing=False --timeout=15m clusteroperator/etcd
----

$ oc wait --for=condition=Progressing=False --timeout=15m clusteroperator/etcd
----

. After `etcd` restarts, you can switch the original certificate authority(CA) in the `openshift-config` namespace with the new, rotated one in `openshift-etcd`.
Contributor

Suggested change
. After `etcd` restarts, you can switch the original certificate authority(CA) in the `openshift-config` namespace with the new, rotated one in `openshift-etcd`.
. After `etcd` restarts, switch the original CA in the `openshift-config` namespace with the new, rotated one in `openshift-etcd`.

Comment on lines 32 to 47
. After `etcd` restarts, you can switch the original certificate authority(CA) in the `openshift-config` namespace with the new, rotated one in `openshift-etcd`.
+
[source,terminal]
----
$ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -
----
+
[source,terminal]
----
$ oc wait --for=condition=Progressing=False --timeout=25m clusteroperator/etcd
----
+
[source,terminal]
----
$ oc wait --for=condition=Progressing=False --timeout=25m clusteroperator/kube-apiserver
----

You can also use this single command to switch the CA:

[source,terminal]
----
$ oc adm wait-for-stable-cluster
----
Contributor

We should document a single way to do this, unless there is a specific use case for a second method. In this instance, if the two methods are equivalent, I would lean towards the single-command step for simplicity.

Suggested change
. After `etcd` restarts, you can switch the original certificate authority(CA) in the `openshift-config` namespace with the new, rotated one in `openshift-etcd`.
+
[source,terminal]
----
$ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -
----
+
[source,terminal]
----
$ oc wait --for=condition=Progressing=False --timeout=25m clusteroperator/etcd
----
+
[source,terminal]
----
$ oc wait --for=condition=Progressing=False --timeout=25m clusteroperator/kube-apiserver
----
You can also use this single command to switch the CA:
[source,terminal]
----
$ oc adm wait-for-stable-cluster
----
. After `etcd` restarts, switch the original CA in the `openshift-config` namespace with the new, rotated one in `openshift-etcd` by running the following command:
+
[source,terminal]
----
$ oc adm wait-for-stable-cluster
----

If you need to use the multi-command version, it should be split into substeps (similar to my suggestion for step 2 above) so we have a single command per step.

The oc adm wait-for-stable-cluster cmd performs the same checks as the previous two cmds (and more), but it waits until the required conditions have held for a period of 5 mins (--minimum-stable-period 5m), so in practice that step might take longer than the individual cmds.

$ oc adm wait-for-stable-cluster -h
Wait for all OCP v4 clusteroperators to report Available=true, Progressing=false, Degraded=false.

Examples:
  # Wait for all clusteroperators to become stable
  oc adm clusteroperator wait-for-stable-cluster

  # Consider operators to be stable if they report as such for 5 minutes straight
  oc adm clusteroperator wait-for-stable-cluster --minimum-stable-period 5m

Options:
    --minimum-stable-period=5m0s:
	minimum duration to consider a cluster stable. Defaults to 5 minutes.

    --timeout=1h0m0s:
	duration before the command times out. Defaults to 1 hour.

That may be a long time from the user's perspective, so maybe add a note here that oc adm wait-for-stable-cluster waits for a minimum of 5 mins by default.

@tjungblu Do you think we need to wait that long in practice, or can we shorten it, e.g. oc adm wait-for-stable-cluster --minimum-stable-period 2m?

Contributor

2m is enough for sure

Sounds good, thanks for confirming. Then we can combine those into the following. Note we still need the step to switch out the CA first and then wait with the singular cmd for 2m.

Suggested change
. After `etcd` restarts, you can switch the original certificate authority(CA) in the `openshift-config` namespace with the new, rotated one in `openshift-etcd`.
+
[source,terminal]
----
$ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -
----
+
[source,terminal]
----
$ oc wait --for=condition=Progressing=False --timeout=25m clusteroperator/etcd
----
+
[source,terminal]
----
$ oc wait --for=condition=Progressing=False --timeout=25m clusteroperator/kube-apiserver
----
You can also use this single command to switch the CA:
[source,terminal]
----
$ oc adm wait-for-stable-cluster
----
. After `etcd` restarts, you can switch the original certificate authority(CA) in the `openshift-config` namespace with the new, rotated one in `openshift-etcd`.
+
[source,terminal]
----
$ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -
----
+
You can then wait for the cluster operators to rollout and become stable:
[source,terminal]
----
$ oc adm wait-for-stable-cluster --minimum-stable-period 2m
----
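
For comparison, the two per-operator waits that the single command replaces could be wrapped in a small loop. This is a rough sketch only: `wait_for_operators` and the `OC` override variable are hypothetical names, and the `oc wait` flags are copied from the commands earlier in this thread. Unlike `oc adm wait-for-stable-cluster`, it checks only `Progressing=False` and only for the operators that you name.

```shell
#!/usr/bin/env bash
# OC is a hypothetical override so that the commands can be previewed
# (for example OC=echo) instead of being run against a cluster.
OC="${OC:-oc}"

# wait_for_operators TIMEOUT OPERATOR...
# Waits, one operator at a time, until each named cluster Operator
# reports Progressing=False. Stops at the first wait that fails.
wait_for_operators() {
  local timeout="$1"
  shift
  local op
  for op in "$@"; do
    "$OC" wait --for=condition=Progressing=False \
      --timeout="$timeout" "clusteroperator/$op" || return 1
  done
}
```

For example, `wait_for_operators 25m etcd kube-apiserver` runs the same two waits as the multi-command version of the step.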

== etcd certificate rotation alerts and metrics signer certificates

Two alert types inform users about pending `etcd` certificate expiration:
[horizontal]
Contributor

Do you know for sure if our tooling handles this well? (It looks good on the preview, but I haven't seen it before, so I wonder about the Portal 😅)

Contributor Author

@jeana-redhat I did a quick look at it, but couldn't find an example online. I decided that because it was documented in our doc references, that it would not be an issue. :-) I will make a note to verify this list after it hits the portal! Thanks for that input!

Contributor

Oooh. I do not like that look at all! 😀

image

Contributor

Please validate that this looks ok on docs.redhat after GA.

Comment on lines 14 to 17
You can rotate the certificate:

* When you receive an expiration alert
* When the private key is leaked
Contributor

Suggested change
You can rotate the certificate:
* When you receive an expiration alert
* When the private key is leaked
You can rotate the certificate for the following reasons:
* You receive an expiration alert
* The private key is leaked

* When you receive an expiration alert
* When the private key is leaked

NOTE: When a private key is leaked, you must rotate all of the certificates.
Contributor

Honestly wonder if this is more important than a note 🤔

Suggested change
NOTE: When a private key is leaked, you must rotate all of the certificates.
[NOTE]
====
When a private key is leaked, you must rotate all of the certificates.
====

Contributor Author

@jeana-redhat I can't find where I saw this admonition construct documented (but I will!). However, I will make this change and also change it to 'IMPORTANT', because I agree that it has more impact than a note.

@openshift-ci openshift-ci bot added peer-review-done Signifies that the peer review team has reviewed this PR and removed peer-review-in-progress Signifies that the peer review team is reviewing this PR peer-review-needed Signifies that the peer review team needs to review this PR labels Jun 20, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 21, 2024
Contributor

@jeana-redhat jeana-redhat left a comment

two tiny nits otherwise LGTM

----

. Wait for the cluster Operators to rollout and stabilize by running the following command:

Contributor

Missing a + here to attach the command to the step

Suggested change
+

$ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -
----

. Wait for the cluster Operators to rollout and stabilize by running the following command:
Contributor

Suggested change
. Wait for the cluster Operators to rollout and stabilize by running the following command:
. Wait for the cluster Operators to roll out and stabilize by running the following command:

@tmalove
Contributor Author

tmalove commented Jun 21, 2024

/remove-label peer-review-done
/label peer-review-needed

@openshift-ci openshift-ci bot added peer-review-needed Signifies that the peer review team needs to review this PR and removed peer-review-done Signifies that the peer review team has reviewed this PR labels Jun 21, 2024
@mburke5678 mburke5678 added the peer-review-in-progress Signifies that the peer review team is reviewing this PR label Jun 21, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 21, 2024

openshift-ci bot commented Jun 21, 2024

New changes are detected. LGTM label has been removed.

Comment on lines 16 to 17
* You receive an expiration alert
* The private key is leaked
Contributor

Per ISG: If the list items comprise only complete sentences, include a period after each sentence.

Suggested change
* You receive an expiration alert
* The private key is leaked
* You receive an expiration alert.
* The private key is leaked.


[IMPORTANT]
====
When a private key is leaked, you must rotate all of the certificates.
Contributor

I assume a standard user would know when a key is leaked and what that means....

$ oc get secret -n openshift-etcd etcd-signer -ojsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}'
----

. Re-create the signer, if the remaining lifetime is close to the current date, by deleting the signer and wait for the static pod rollout by running the following commands:
Contributor

I think that putting the lifetime phrase first ties in better with the previous step.

Suggested change
. Re-create the signer, if the remaining lifetime is close to the current date, by deleting the signer and wait for the static pod rollout by running the following commands:
. If the remaining lifetime is close to the current date, re-create the signer by deleting the signer and wait for the static pod rollout by running the following commands:

What should the user do if the remaining lifetime is not close to the current date?
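
One way to make "close to the current date" concrete is to turn the annotation value into a number of remaining days. The following is a rough sketch under stated assumptions: `days_between` is a hypothetical helper, it assumes GNU `date` (which accepts RFC 3339 timestamps with `-d`), and it assumes that the `certificate-not-after` annotation holds a timestamp in that form. The timestamps in the example call are made up and are not taken from a real cluster.

```shell
#!/usr/bin/env bash
# days_between START END
# Prints the number of whole days from START to END.
# Both arguments are RFC 3339 timestamps such as 2024-06-21T00:00:00Z.
# Assumes GNU date; returns non-zero if a timestamp cannot be parsed.
days_between() {
  local start_s end_s
  start_s=$(date -u -d "$1" +%s) || return 1
  end_s=$(date -u -d "$2" +%s) || return 1
  echo $(( (end_s - start_s) / 86400 ))
}

# Example with fixed, made-up timestamps:
days_between "2024-06-21T00:00:00Z" "2024-07-01T00:00:00Z"   # prints 10

# Against a cluster, the end timestamp would come from the command in step 1:
# not_after=$(oc get secret -n openshift-etcd etcd-signer \
#   -ojsonpath='{.metadata.annotations.auth\.openshift\.io/certificate-not-after}')
# days_between "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$not_after"
```

The number of days that counts as "close" is a policy decision that this PR does not state, so the sketch only reports the number and leaves the threshold to the reader.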

@mburke5678
Contributor

@tmalove A few nits. Otherwise LGTM. Don't forget to squash!

@mburke5678 mburke5678 added peer-review-done Signifies that the peer review team has reviewed this PR branch/enterprise-4.16 and removed peer-review-in-progress Signifies that the peer review team is reviewing this PR peer-review-needed Signifies that the peer review team needs to review this PR labels Jun 21, 2024
@mburke5678 mburke5678 added this to the Planned for 4.16 GA milestone Jun 21, 2024
@tmalove
Contributor Author

tmalove commented Jun 21, 2024

/label merge-review-needed

@openshift-ci openshift-ci bot added the merge-review-needed Signifies that the merge review team needs to review this PR label Jun 21, 2024
@kalexand-rh kalexand-rh removed the merge-review-needed Signifies that the merge review team needs to review this PR label Jun 21, 2024

:_mod-docs-content-type: CONCEPT
[id="etcd-cert-alerts-metrics-signer_{context}"]
== etcd certificate rotation alerts and metrics signer certificates
Contributor

Suggested change
== etcd certificate rotation alerts and metrics signer certificates
= etcd certificate rotation alerts and metrics signer certificates

The first heading in a module is always an H1. Adjust the leveloffset in the assembly, if necessary.


:_mod-docs-content-type: PROCEDURE
[id="rotating-certificate-authority_{context}"]
== Rotating the etcd certificate
Contributor

Suggested change
== Rotating the etcd certificate
= Rotating the etcd certificate

The first heading in a module is always an H1. Adjust the leveloffset in the assembly, if necessary.

== Management

These certificates are only managed by the system and are automatically rotated.

== Services

etcd certificates are used for encrypted communication between etcd member peers, as well as encrypted client traffic. The following certificates are generated and used by etcd and other processes that communicate with etcd:
etcd certificates are used for encrypted communication between etcd member peers, and encrypted client traffic. The following certificates are generated and used by etcd and other processes that communicate with etcd:
Contributor

Suggested change
etcd certificates are used for encrypted communication between etcd member peers, and encrypted client traffic. The following certificates are generated and used by etcd and other processes that communicate with etcd:
etcd certificates are used for encrypted communication between etcd member peers and encrypted client traffic. The following certificates are generated and used by etcd and other processes that communicate with etcd:


openshift-ci bot commented Jun 21, 2024

@tmalove: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kalexand-rh kalexand-rh merged commit 29db4af into openshift:main Jun 21, 2024
3 checks passed
@kalexand-rh
Contributor

/cherrypick enterprise-4.16

@openshift-cherrypick-robot

@kalexand-rh: new pull request created: #77935

Labels
branch/enterprise-4.16 ok-to-test Indicates a non-member PR verified by an org member that is safe to test. peer-review-done Signifies that the peer review team has reviewed this PR size/M Denotes a PR that changes 30-99 lines, ignoring generated files.