
Conversation

@flavianmissi (Member)

do not review.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2025
@openshift-ci openshift-ci bot requested review from TrilokGeer and hasbro17 October 22, 2025 09:15
@openshift-ci (Contributor)

openshift-ci bot commented Oct 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jaypoulz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 28, 2025
@flavianmissi (Member Author)

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 28, 2025
also remove foundations EP, taken over by Arda
@ardaguclu left a comment (Member)

I will continue on to "Workflow Description". Before that, I wanted to share my opinions.


## Summary

Provide a lightweight KMS shim architecture that enables users to deploy and manage their own KMS plugins (AWS, Vault, Thales, etc.) while OpenShift handles the complexity of Unix socket communication required by the Kubernetes KMS v2 API. OpenShift provides a socket proxy container image that users deploy alongside their KMS plugins to translate between network communication (used by the shim in API server pods) and Unix socket communication (required by standard KMS v2 plugins). This creates a clear support boundary, reduces Red Hat's support burden, and allows users to deploy KMS plugins anywhere (in-cluster, external infrastructure, or separate clusters) and update them independently of OpenShift's release cycle.
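
For illustration, a minimal sketch of what the user-deployed half could look like, assuming the socket proxy takes a listen address and a target socket as flags; the image names, flags, namespace, and port below are placeholders, not part of this proposal:

```yaml
# Hypothetical user-deployed pod: an unmodified upstream KMS v2 plugin plus the
# OpenShift-provided socket proxy, sharing the plugin's Unix socket via emptyDir.
apiVersion: v1
kind: Pod
metadata:
  name: vault-kms-plugin
  namespace: kms-plugins
  labels:
    app: vault-kms-plugin
spec:
  containers:
  - name: kms-plugin
    image: example.com/vault-kms-plugin:latest        # user-supplied plugin image (placeholder)
    args: ["--listen-addr=unix:///var/run/kms/plugin.sock"]
    volumeMounts:
    - name: kms-socket
      mountPath: /var/run/kms
  - name: socket-proxy
    image: quay.io/example/kms-socket-proxy:latest    # OpenShift-provided image (placeholder)
    args: ["--listen=:8080", "--target=unix:///var/run/kms/plugin.sock"]
    ports:
    - containerPort: 8080
    volumeMounts:
    - name: kms-socket
      mountPath: /var/run/kms
  volumes:
  - name: kms-socket
    emptyDir: {}
---
# Service that the shim in the API server pods would call over the network.
apiVersion: v1
kind: Service
metadata:
  name: vault-kms-plugin
  namespace: kms-plugins
spec:
  selector:
    app: vault-kms-plugin
  ports:
  - port: 8080
    targetPort: 8080
```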
Member

I know that this EP is not ready for review. Just consider my comments about storing historical context in this discussion. We can discuss the details offline.

allows users to deploy KMS plugins anywhere (in-cluster, external infrastructure, or separate clusters)

I think we should strongly recommend (in documentation) running KMS plugins as static pods that are NOT managed by the API server, to prevent the chicken-and-egg problem and lower latency.
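
For example, a rough sketch of that recommendation as a static pod manifest on each control plane host, written to /etc/kubernetes/manifests and managed by the kubelet rather than the API server (namespace, image names, and flags are hypothetical):

```yaml
# /etc/kubernetes/manifests/kms-plugin.yaml -- static pod, managed by the kubelet
apiVersion: v1
kind: Pod
metadata:
  name: kms-plugin
  namespace: kube-system                 # placeholder
spec:
  hostNetwork: true                      # the shim reaches it on localhost, no Service in the path
  priorityClassName: system-node-critical
  containers:
  - name: kms-plugin
    image: example.com/vault-kms-plugin:latest        # placeholder
    args: ["--listen-addr=unix:///var/run/kms/plugin.sock"]
    volumeMounts:
    - name: kms-socket
      mountPath: /var/run/kms
  - name: socket-proxy
    image: quay.io/example/kms-socket-proxy:latest    # placeholder
    args: ["--listen=127.0.0.1:8080", "--target=unix:///var/run/kms/plugin.sock"]
    volumeMounts:
    - name: kms-socket
      mountPath: /var/run/kms
  volumes:
  - name: kms-socket
    emptyDir: {}
```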

Member Author

Agreed.

Member Author

EP updated, PTAL

* Providing KMS plugin images or implementations (only socket proxy image)
* Automatic injection or mutation of user plugin pods
* Prescribing how users deploy socket proxy (sidecar, separate pod, external - all valid)
* Authentication between shim and socket proxy (out of scope for Tech Preview)
Member

Is there any need for authentication at all? Won't they be residing in the same pod as separate containers?

Member Author

The [kube|openshift|oauth]-apiserver and its shim will be different containers in the same pod, but the socket proxy will be in a different pod, together with the KMS plugin.
I'm not sure we can get away with no auth 🤔

@ardaguclu (Member) Dec 8, 2025

If we drop support for using external services as KMS plugins and strictly enforce running KMS plugins as static pods on every control plane host, do we still need authentication?

Maybe offering such flexibility is a foot-gun for us? For instance, if the API server gets an error because it doesn't receive a response from an external KMS plugin within 100ms, customers will complain directly to us. In my opinion, we have a chance to enforce that all KMS plugins run as static pods.

Member

but the socket proxy will be in a different pod, together with the KMS plugin.

Services within the cluster can communicate with each other without relying on any authentication. Am I missing something?

@flavianmissi (Member Author) Dec 8, 2025

For instance, if the API server gets an error because it doesn't receive a response from an external KMS plugin within 100ms

Oh, I wasn't aware this was enforced in the code.

Services within the cluster can communicate with each other without relying on any authentication. Am I missing something?

That's correct. Maybe I'm overthinking this, but with the kms-plugin as a sidecar to the API servers, only the API servers themselves can access the plugin to decrypt resources. By placing the kms-plugin in its own pod, we introduce a route that is available to other workloads in the cluster.
I could be wrong, but an unguarded decryption service doesn't sound that much better than plain text 🤔

Member

I could be wrong, but an unguarded decryption service doesn't sound that much better than plain text

That is a good point.

Member

I wasn't aware this was enforced in the code.

I couldn't find the hard-coded timeout value (probably I'm mistaken). There is a timeout field in the KMS API: https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/#encrypting-your-data-with-the-kms-provider-kms-v2. We'll probably set it to some opinionated value that isn't too high (high values can make the cluster unstable).
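
For reference, the timeout is a per-provider field in the upstream EncryptionConfiguration; a minimal KMS v2 entry looks roughly like this (the provider name and socket path are illustrative):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2
          name: my-kms-plugin                        # illustrative
          endpoint: unix:///var/run/kms/plugin.sock
          timeout: 3s                                # upstream default; we would pick an opinionated value
      - identity: {}
```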

1. **KMS Shim** (OpenShift-managed sidecar in API server pods): Translates Unix socket → HTTP/gRPC network calls
2. **Socket Proxy** (OpenShift-provided image, user-deployed): Translates HTTP/gRPC network calls → Unix socket

This architecture solves the **SELinux MCS isolation problem** that prevents different pods from sharing Unix sockets via hostPath, while allowing users to deploy standard upstream KMS v2 plugins without modification. Users deploy the socket proxy alongside their plugin using OpenShift's provided container image, giving them full control over the deployment architecture (in-cluster, external, or hybrid).
Member

full control over the deployment architecture (in-cluster, external, or hybrid).

In my opinion, we should be very restrictive instead of offering full flexibility, because a circular dependency between the KMS plugin and the API server is an unsolvable outage.

Member Author

Good point. I think this mostly applies to in-cluster KMS plugins though, since fully external plugins remove the dependency entirely (but increase latency).
I'll see how I can update this to reflect that.

Member

This solution indeed provides such flexibility, and there is no need to exclude it. But it deserves a mention here and in the documentation that it is strongly not recommended.

Member

Maybe we can get rid of mentioning external services, standard pods, etc. entirely. Customers must (you know I like using must statements :) ) run KMS plugins as static pods.

Member Author

Yeah, that's true. I think sticking to static pods for kms-plugin deployments also gives us more flexibility in what kind of authentication model we choose to use.
I'll make updates to reflect that.

Member

What about this: the kube-apiserver can communicate via the host network or UDS in TP, and after that we support aggregated API servers via a Service resource in TP v2?

**Key Innovation 1: Intelligent Routing in Shim**

The shim maintains multiple endpoint configurations and routes requests intelligently based on operation type:
Member

This intelligent mechanism binds the plugin shim to the API server's encryption configurations, because order matters in encryption configurations. In my opinion, it would be simpler to start one plugin shim container per KMS plugin.

The shim maintains multiple endpoint configurations and routes requests intelligently based on operation type:

- **Encrypt requests**: Always sent to the **primary endpoint** (the `endpoint` field)
- **Decrypt requests**: Try **primary endpoint** first, then fall back to **additional endpoints** (the `additionalEndpoints` field) if decryption fails
Member

This sounds like it is not a duty of the plugin shim; instead it is a duty of the API server, based on the provider order. How can the plugin shim manage this order?

- kms plugin 1
- identity
- kms plugin 2

If the plugin shim falls back to kms plugin 2, that is wrong. The API server should fall back to identity if kms plugin 1 fails.
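
In upstream EncryptionConfiguration terms, the ordering described above maps to roughly the sketch below (names and socket paths are illustrative); which provider handles a given read is decided by the API server from the stored data, not by the shim:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:                    # kms plugin 1: all new writes are encrypted with this provider
          apiVersion: v2
          name: kms-plugin-1
          endpoint: unix:///var/run/kms/plugin-1.sock
          timeout: 3s
      - identity: {}            # lets resources still stored as plaintext be read
      - kms:                    # kms plugin 2: lets resources still encrypted with it be read
          apiVersion: v2
          name: kms-plugin-2
          endpoint: unix:///var/run/kms/plugin-2.sock
          timeout: 3s
```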

Member Author

That's an excellent point. I'm going to remove the "intelligent" bits of the design for now.

**Key Innovation 2: User-Controlled Deployment with OpenShift-Provided Components**

OpenShift provides the socket proxy container image, and users deploy it however they choose:
- **In-cluster as sidecar**: User deploys plugin + socket proxy in same pod (simplest)
Member

Since we are the owners of the platform, we should enforce that customers use this option.


OpenShift provides the socket proxy container image, and users deploy it however they choose:
- **In-cluster as sidecar**: User deploys plugin + socket proxy in same pod (simplest)
- **In-cluster as separate pods**: User deploys plugin and socket proxy in separate pods (if they prefer)
Member

Due to OCP policies, I don't think this will ever work (even we couldn't make it work).

OpenShift provides the socket proxy container image, and users deploy it however they choose:
- **In-cluster as sidecar**: User deploys plugin + socket proxy in same pod (simplest)
- **In-cluster as separate pods**: User deploys plugin and socket proxy in separate pods (if they prefer)
- **External infrastructure**: User deploys plugin + socket proxy on VMs, separate clusters, or cloud provider infrastructure
Member

As far as I recall, every KMS plugin call must finish in under 100ms. I don't think it can work in any case.

@ardaguclu left a comment (Member)

I've completed my review with more comments.

encryption:
  type: KMS
  kms:
    type: External
Member

It is nice to reserve this. It gives us the option to add an internal KMS in the future.

kms:
  type: External
  external:
    endpoint: http://vault-kms-plugin.kms-plugins.svc:8080
Member

I was also thinking about using the service endpoint as a unique identifier.

type: External
external:
  endpoint: http://vault-kms-new.kms-plugins.svc:8080
  additionalEndpoints:
@ardaguclu (Member) Dec 8, 2025

Exposing these additionalEndpoints as configurable can very easily conflict with the encryption configurations. We shouldn't do that. The user can update endpoint to http://vault-kms-new.kms-plugins.svc:8080, and our controllers then need to keep the older KMS plugins active automatically (that leads us to plugin shim lifecycle management; I'm not sure why we avoid that?).

We start migration whenever we see a new encryption configuration; additionalEndpoints breaks that rule too. Old endpoints will in any case be stored in the encryption configurations as a fallback mechanism in the providers, to be used by the encryption controllers. Maybe this extra field can be removed entirely?

EDITED: additionalEndpoints makes almost all encryption controllers redundant.


#### Risk: User Deployment Errors

**Risk:** Users incorrectly deploy plugin + socket proxy (wrong socket path, missing Service, port mismatch, etc.)
Member

If the KMS plugin is not deployed as a static pod, the API server will depend on the KMS plugin to decrypt resources while, at the same time, the KMS plugin will need to be initialized by the API server.


**SELinux MCS (Multi-Category Security) Isolation Problem:**

OpenShift uses SELinux MCS labels to provide pod-to-pod isolation. Each pod gets a unique MCS label (e.g., `s0:c111,c222`). When a container creates a file (including Unix sockets) via hostPath, the file inherits the container's MCS label:
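
A sketch of the hostPath-sharing setup this describes, assuming default (restricted, non-privileged) SCCs; the socket created by one pod carries that pod's MCS categories, so another pod with different categories is denied access even though both mount the same host directory:

```yaml
# Pod A: the KMS plugin creating its socket on the host. Per the description above,
# the socket file ends up labeled with Pod A's MCS categories, e.g. s0:c111,c222.
apiVersion: v1
kind: Pod
metadata:
  name: kms-plugin
spec:
  containers:
  - name: kms-plugin
    image: example.com/vault-kms-plugin:latest   # placeholder
    volumeMounts:
    - name: kms-socket
      mountPath: /var/run/kms
  volumes:
  - name: kms-socket
    hostPath:
      path: /var/run/kms
      type: DirectoryOrCreate
# Any other pod (for example an API server pod) mounting the same hostPath runs with
# different MCS categories (e.g. s0:c333,c444), so SELinux denies connect() on the socket.
```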
Member

Based on some of my investigations, if we configure the SCCs as privileged, the pods can communicate with each other.
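
A rough sketch of the workaround being investigated, assuming the plugin pod's service account is granted the privileged SCC (e.g. via oc adm policy add-scc-to-user privileged); privileged containers are not confined by per-pod MCS categories, so the shared hostPath socket becomes reachable. Namespace, service account, and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kms-plugin
  namespace: kms-plugins
spec:
  serviceAccountName: kms-plugin    # granted: oc adm policy add-scc-to-user privileged -z kms-plugin -n kms-plugins
  containers:
  - name: kms-plugin
    image: example.com/vault-kms-plugin:latest
    securityContext:
      privileged: true              # container runs unconfined by per-pod MCS categories
    volumeMounts:
    - name: kms-socket
      mountPath: /var/run/kms
  volumes:
  - name: kms-socket
    hostPath:
      path: /var/run/kms
      type: DirectoryOrCreate
```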

Member

However, in my opinion the main goal of this EP (i.e. a plugin shim to convert Unix socket <-> HTTP) gives us a more sustainable and flexible solution that can work with any plugin.

Member Author

Thanks for taking the time to investigate 💪🏼 From our discussion with Ben today, it sounds like we can go ahead with your approach for the early TP if nothing else.
I'll update this PR to reflect that.

- API server reads old secrets, shim routes decrypt to **additional endpoints** (old key)
- API server re-encrypts, shim routes encrypt to **primary endpoint** (new key)
7. Once migration completes, cluster admin removes `additionalEndpoints` from config
8. Shim stops routing to **additional endpoints**, old plugin deployment can be deleted
Member

The prune controller is supposed to clear the old provider configurations. The shim should stop routing only when the prune controller deletes the configurations from the encryption configurations.

also clarify flow when kms config changes
also add note on authentication for GA
@openshift-ci (Contributor)

openshift-ci bot commented Dec 8, 2025

@flavianmissi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | b2703fb | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2026
@ardaguclu (Member)

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2026