Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP Implement rotation of service-account signing key (take two) #11204

Closed
wants to merge 11 commits into from

Conversation

johngmyers
Copy link
Member

@johngmyers johngmyers commented Apr 11, 2021

Drops support for downgrading to kops versions before 1.18. Doing so will cause all keypairs to revert to the state they were prior to upgrading kops.

Downgrading to a kops version before 1.21 can cause a disruptive rotation of the service account token signing key.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 11, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from johngmyers after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@johngmyers johngmyers force-pushed the rotate-cmd3 branch 5 times, most recently from 633c721 to 3965d44 Compare April 11, 2021 06:43
@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Apr 11, 2021
@johngmyers johngmyers force-pushed the rotate-cmd3 branch 3 times, most recently from dc77313 to db3fb2d Compare April 11, 2021 18:06
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. area/kops-controller and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 11, 2021
@johngmyers johngmyers force-pushed the rotate-cmd3 branch 4 times, most recently from 6894f49 to 893995e Compare April 11, 2021 20:18
@johngmyers
Copy link
Member Author

One issue with this approach will be the effect on a downgrade of the kops version. Unless I take contortions with the on-disk keyset.yaml format, a rollback from kops 1.21 to a version without this patch will cause a potentially disruptive rotation of the service account key, as the previous versions won't read the new "PrimaryId" field and will incorrectly intuit that the "next" key is the primary one.

Any recommendations @justinsb ?

A related question is when do we want to drop support for downgrading kops to versions before 1.18? If we drop that support we can stop writing the legacy format files into the keystore.

@johngmyers
Copy link
Member Author

I believe this is most of what is needed. I still need to do some manual testing before removing the WIP. I should also add a release note.
/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 12, 2021
@olemarkus
Copy link
Member

I like that with this, one can explicitly move the primary marker. Relying on the highest serial number causes that one issue with premature exporting of admin credentials. I see you also change the pool logic, so this bit looks fully compatible with my rotate command.

One common use case for rotating CA is to deal with leavers that had access to the secret store. Since we preemptively provision next here, wouldn't people then have to do double rotation? A similar issue would be if one has to rotate when the CA expires. Both of these concerns are specific to CAs though.

My PR creates "next" as part of the rotation command. I wonder if not that could be done here? It is nice to do this as part of a fitask, but still ...

I didn't see the logic for deleting the older keys (docs says all keys but the primary will be removed), but based on your warning, I assume running the command twice in short succession would cause disruption. My PR mitigates that scenario by the command itself rotating the cluster, then moving the primary marker.

@johngmyers
Copy link
Member Author

To invalidate all credentials that existed at time T one would have to do three rotations. One to move next into primary, another to move primary to previous, and another to remove previous. It would be possible to have an alternative procedure: delete the "next" key, propagate out a newly created "next" key, wait, rotate, wait, delete the "previous" key, and propagate the deletion.

Creating "next" as part of fitasks simplifies the code. After the rotation command one needs to apply_cluster, so there's little reason to duplicate the logic there.

The older keys are deleted by the rotate command, by dint of not including them in the new Keyset.

I do not like the rotate command doing rolling updates, especially not before making further changes. As I had mentioned before, that risks failures that would then be hard to recover from. The canonical example would be if the machine running kops abruptly dies, for example during one of those rolling updates.

@olemarkus
Copy link
Member

If rolls are aborted mid-run, you have say 1 api server that trusts both new and old cert, 2 that only trust old CA.
A user connects using the old cert and trusts both CA, and can talk to all 3.
Pods trust both, and can talk to all three.

On the second rotate, it is similar, but user connects to with the new cert and can talk to all 3.

I don't think being aborted causes unrecoverable failures with this setup.

if you restart an aborted rotation command, it won't provision any new keys, and then start rotating again. Only inconvenience there is one has to rotate both time regardless. The first rotation could be dropped using your primary ID feature, as the new key would be promoted only after the first run, when we know all nodes trust it.

@hakman
Copy link
Member

hakman commented Apr 13, 2021

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Apr 13, 2021

@johngmyers: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kops-e2e-k8s-containerd 930defa link /test pull-kops-e2e-k8s-containerd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@johngmyers
Copy link
Member Author

@olemarkus I've tried to put the reasons for my discomfort with an all-in-one-go rotation command in my review of #10516. I can see it could be challenging trying to converge on a UI with a discussion that is scattered over multiple PRs. We could try creating an issue covering rotation in general, but that reminds me of a certain comic strip concerning standards.

I see the benefit of deferring the creation and staging of the new key until after the admin decides that rotation is needed. That would suggest separate commands or subcommands for staging the new key, changing the primary, and distrusting the old key. For something as simple as the service-account key, it's roughly the same amount of work to just do three rotations and this PR's approach of pre-staging the new key is simpler when having a regular rotation schedule.

For the cluster CA, on the other hand, having separate stage, rotate, and distrust operations would only require distributing certificate-authority-data to out-of-cluster kubeconfigs two times instead of three.

@olemarkus
Copy link
Member

In short, I am happy with a command for distrusting a key.

I am thinking a new flag on the keyset for that, then the builder/fitask just need to check that in addition when deciding if one should provision next or not.

We document the full recipe for how to rotate the cluster and my full rotate command then just becomes copy pasting that entire recipe.

kops rotate ca distrust-primary 
kops update cluster... (provisions new key) 
kops rolling-update cluster... 
kops rotate ca promote-next (should this also patch ServiceAccounts or is that manual too?) 
... 

How does that sound?

@johngmyers
Copy link
Member Author

johngmyers commented Apr 14, 2021

Originally I was thinking distrusting would be deleting it from the keystore (followed by an apply and rolling update) but if we're to eventually move to more versioned state we might want to keep the old keys in the store marked with a distrusted bit.

I think the sequence would be something like:

kops rotate create-next [ca|service-account]
kops update cluster --yes
kops rolling-update cluster --instance-group-role master,apiserver --yes
something to patch the ServiceAccounts (if ca)
kops rolling-update cluster --force --yes (if ca)
# propagate new certificate-authority-data (if ca)
kops rotate change-primary [ca|service-account]
kops update cluster --yes
kops rolling-update cluster --yes
kops rotate distrust-previous [ca|service-account]
kops update cluster --yes
kops rolling-update cluster --instance-group-role master,apiserver --yes (if ca)
something to patch the ServiceAccounts (if ca)
kops rolling-update cluster --yes
# propagate new certificate-authority-data (if ca)

@olemarkus
Copy link
Member

I think that looks about right.

I guess create-next is the same idea, just more appropriate wording, since my distrust doesn't actually distrust. It would even be deployed as part of the bundle. Maybe deprecate is a better name? At least deprecated sounds like an okay name for that bit.

But yeah, think we are aligned on the approach and we are down to just naming things :)

@johngmyers
Copy link
Member Author

We could probably just go with kops create secret for the first step

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 1, 2021
@k8s-ci-robot
Copy link
Contributor

@johngmyers: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@johngmyers johngmyers closed this Jun 25, 2021
@johngmyers johngmyers deleted the rotate-cmd3 branch June 25, 2021 02:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api area/documentation area/kops-controller area/nodeup area/provider/aws Issues or PRs related to aws provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants