
KEP-2839: Add KEP for in-use protection #2840

Open
wants to merge 4 commits into base: master

Conversation

mkimuram
Contributor

  • One-line PR description: A generic feature to protect objects from deletion while they are in use
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 27, 2021
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 27, 2021
@bgrant0607
Member

Cross-linking an old issue: kubernetes/kubernetes#10179

@mkimuram
Contributor Author

mkimuram commented Aug 4, 2021

@lavalamp @bgrant0607

Sharing just a rough idea.

For preventing deletion alone, the design below should be enough (we still need to consider how to protect the LienClaim itself from deletion). Update protection could be implemented in the same way, for example by adding "spec-update" to LienClaim.spec.prevents and appending the UUID to *.metadata.spec-update-liens. However, with this approach it would be difficult to restrict updates to specific fields, the way a field manager can.

  1. User creates a LienClaim in their namespace, specifying targets and prevents.
apiVersion: v1beta1
kind: LienClaim
metadata:
  namespace: ns1
  name: my-lien
spec:
  targets:
  - apiVersion: v1
    kind: secret
    namespace: ns2
    name: important-secret
  prevents:
  - delete
  2. The lien controller watches LienClaims and, when a new LienClaim is created, creates a cluster-scoped Lien and appends its name to the target's *-liens field (delete-liens in the example below). Note that the Lien's name is a generated UUID.
apiVersion: v1beta1
kind: Lien
metadata:
  name: 12345678-1234-1234-1234-1234567890ab
spec:
  targets:
  - apiVersion: v1
    kind: secret
    namespace: ns2
    name: important-secret
  source:
    namespace: ns1
    name: my-lien
  prevents:
  - delete
The target then looks like:
apiVersion: v1
kind: Secret
metadata:
  namespace: ns2
  name: important-secret
  delete-liens:
  - 12345678-1234-1234-1234-1234567890ab
type: Opaque
data:
  3. The lien controller may also update the LienClaim status if step 2 succeeds.
apiVersion: v1beta1
kind: LienClaim
metadata:
  namespace: ns1
  name: my-lien
spec:
  targets:
  - apiVersion: v1
    kind: secret
    namespace: ns2
    name: important-secret
  prevents:
  - delete
status:
  lien: 12345678-1234-1234-1234-1234567890ab
  phase: preventing
  4. On deletion of a resource, the lien admission controller checks whether metadata.delete-liens is empty. If it is not, an error is returned to block the deletion (see the sketch after this list).

  5. The lien controller watches LienClaims and, when a LienClaim is deleted, removes the lien from the target's *-liens field and deletes the Lien.
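
A minimal sketch of the step-4 check could look like the following. This assumes the hypothetical delete-liens metadata field from the examples above and uses an unstructured object so it applies to any resource kind; wiring it into an admission plugin or webhook is omitted.

```go
package lien

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// validateDelete blocks deletion while any delete-lien remains on the object.
// "delete-liens" is the hypothetical metadata field used in the examples above;
// the real field name and location would be decided in the KEP.
func validateDelete(obj *unstructured.Unstructured) error {
	liens, found, err := unstructured.NestedStringSlice(obj.Object, "metadata", "delete-liens")
	if err != nil {
		return fmt.Errorf("reading delete-liens: %v", err)
	}
	if found && len(liens) > 0 {
		// The caller would turn this into a Forbidden admission response.
		return fmt.Errorf("deletion not allowed: object is protected by liens %v", liens)
	}
	return nil
}
```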

@mkimuram
Contributor Author

mkimuram commented Aug 4, 2021

From the viewpoint of consuming liens for secret protection, which is my biggest concern: we need to consider how to create the LienClaim before the referencing objects are created, and how to delete the LienClaim after the referencing objects are deleted.
(I'm not sure whether a mutating webhook or ownerReference can achieve this; we can at least leave it to users.)

@lavalamp
Member

lavalamp commented Aug 4, 2021

Why do you want to have lien objects? Why are text strings not sufficient (as used for finalizers)? Adding additional object types adds more places to have races; I'm against it unless you can convince me it's absolutely necessary.

@mkimuram
Contributor Author

mkimuram commented Aug 4, 2021

Why do you want to have lien objects?

It's to make it easier to manage who has control over a specific lien in the delete-liens field.

If multiple liens are set on a single resource, we need to ensure that a client/controller deleting a lien only removes the specific lien it manages. We could add the LienClaim's identity information to delete-liens, but that would tend to be long and difficult to match.

Instead, we might just add the LienClaim's UID. However, with only a UID it is difficult for the owner of the resource to find the LienClaim. Also, I'm not sure an object UID can be regarded as unique; creating an object with that name should ensure uniqueness.

Adding additional object types adds more places to have races, I'm against it unless you can convince me it's absolutely necessary.

Agreed. If the above concern can be solved another way, I agree to remove the cluster-scoped Lien object.

@lavalamp
Member

lavalamp commented Aug 4, 2021

You want to ACL liens? I don't see a need to enforce this any more than we do for finalizers. And I would not solve that problem by adding Lien or LienClaim objects. I don't think a solution here should require any additional object types.

@mkimuram
Contributor Author

mkimuram commented Aug 4, 2021

If we only consider the secret protection use case, the feature needed is to block deletion of "Secret A" while it is in use by "Pod B" and "PersistentVolume C". So, something like the below would be enough.

apiVersion: v1
kind: Secret
metadata:
  namespace: ns2
  name: A
  delete-liens: 
    - "ID to show that Pod B is using"
    - "ID to show that PersistentVolume C is using"
type: Opaque
data:

To generalize it, I was thinking about having the lien system manage such an ID for each reason, to avoid conflicts when deleting liens. However, we can leave that to consumers.

@mkimuram
Contributor Author

mkimuram commented Aug 4, 2021

I'm also thinking about using a lien per controller, not per reason, like below.

apiVersion: v1
kind: Secret
metadata:
  namespace: ns2
  name: A
  delete-liens: 
    - "k8s.io/secret-protection-controller"
type: Opaque
data:

Then I start to wonder which would be easier: "implementing one's own admission hook to block deletion" or "implementing a controller to add a lien"?

(Obviously, for users who would like to add such protection manually, a lien system is useful, as this shows, but for controllers it might not be.)

@lavalamp
Member

lavalamp commented Aug 4, 2021

There is no need for a lien controller. We would add code to kube-apiserver (reject deletion if there are liens, don't permit new liens if deletion timestamp is set).

We might want slightly more structure than raw text.

You can prototype today with e.g. specially formatted annotations and a webhook admission controller.
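
For the annotation route, a minimal client-go sketch of how a controller could set and clear its lien, assuming a hypothetical annotation key liens.example.com/secret-protection (a webhook registered for DELETE would then deny requests while the annotation is present):

```go
package prototype

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// lienAnnotation is a hypothetical, specially formatted annotation key used
// only for prototyping; a webhook would deny DELETE while it is present.
const lienAnnotation = "liens.example.com/secret-protection"

// setLien adds (hold=true) or clears (hold=false) the prototype lien
// annotation on a Secret via a JSON merge patch.
func setLien(ctx context.Context, c kubernetes.Interface, ns, name string, hold bool) error {
	var value interface{}
	if hold {
		value = "true"
	} else {
		value = nil // null removes the annotation in a merge patch
	}
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]interface{}{lienAnnotation: value},
		},
	})
	if err != nil {
		return err
	}
	_, err = c.CoreV1().Secrets(ns).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```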

@lavalamp
Member

lavalamp commented Aug 4, 2021

Sorry, I misread :) Whether there is a lien per controller or per reason doesn't matter from the perspective of the lien mechanism; it's up to controller authors how they want to use it.

@mkimuram
Contributor Author

@lavalamp

You can prototype today with e.g. specially formatted annotations and a webhook admission controller.

I've implemented a prototype of liens as this.
It is split into 4 commits, but only two of them, the first and the last, are the points of interest.

I will update the KEP based on this prototype.

Note that I've tested it in the following way and it works as shown below:

  • Deploy:
ENABLE_ADMISSION_PLUGINS=Lien hack/local-up-cluster.sh 
  • Test:
  1. Without liens, deletion is not blocked
kubectl create secret generic test-secret --from-literal='username=my-app' --from-literal='password=39528$vdg7Jb'
kubectl get secret test-secret -o jsonpath='{.metadata.liens}{"\n"}'

kubectl delete secret test-secret
secret "test-secret" deleted
  2. With liens, deletion is blocked, and once all liens are removed, deletion is not blocked
kubectl create secret generic test-secret --from-literal='username=my-app' --from-literal='password=39528$vdg7Jb'
kubectl patch secret test-secret -p '{"metadata":{"liens":["never delete me"]}}' --type=merge
secret/test-secret patched

kubectl get secret test-secret -o jsonpath='{.metadata.liens}{"\n"}'
[never delete me]

kubectl delete secret test-secret
Error from server (Forbidden): secrets "test-secret" is forbidden: deletion not allowed by liens

kubectl patch secret test-secret -p '{"metadata":{"liens":[]}}' --type=merge
secret/test-secret patched

kubectl delete secret test-secret
secret "test-secret" deleted

@mkimuram mkimuram force-pushed the issue/2839 branch 2 times, most recently from 4ec78cb to 659be2f on August 16, 2021 at 18:38
@mkimuram
Contributor Author

@lavalamp

I've updated the KEP. PTAL

Also, I will update the prototype of secret-protection to rely on this in-use protection mechanism to check feasibility, and also update the KEP for secret-protection.


To minimize issues caused by cross-namespace dependencies, such controllers should consider implementing a mechanism to opt in to (or out of) adding liens across namespaces.
For example, a controller may choose to add `Liens` only to resources that have the `example.com/sample-controller/allow-cross-namespace-liens: true` annotation, if the dependent resource isn't in the same namespace.
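
A rough sketch of that opt-in check (not part of the KEP text; the annotation key is the example one from the excerpt, and mayAddLien is a hypothetical helper in the depending controller):

```go
package controller

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// allowCrossNamespaceLien is the example opt-in annotation key from the KEP
// excerpt above; the exact key would be controller-specific.
const allowCrossNamespaceLien = "example.com/sample-controller/allow-cross-namespace-liens"

// mayAddLien reports whether the controller should add a lien to target,
// given the namespace of the resource that depends on it. Same-namespace
// liens are always allowed; cross-namespace liens require the opt-in annotation.
func mayAddLien(target metav1.Object, dependentNamespace string) bool {
	if target.GetNamespace() == dependentNamespace {
		return true
	}
	return target.GetAnnotations()[allowCrossNamespaceLien] == "true"
}
```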
Member


I would recommend removing this whole section, honestly. I think it's a separate discussion for each controller and I don't think this KEP should recommend a general solution; I'm not convinced there's a problem at all.

Contributor Author


Actually, I would like to confirm with the community that this situation is allowed by design. If it is allowed, I agree to delete such a mechanism entirely.

@fabiand

fabiand commented Sep 8, 2023

I've come across https://github.com/adobe/k8s-shredder#how-it-works

K8s-shredder will periodically run eviction loops, based on configured EvictionLoopInterval, trying to clean up all the pods from the parked nodes. Once all the pods are cleaned up, cluster-autoscaler should chime in and recycle the parked node.

To me this connects with the desire to make applications aware of interruption requests.
Without further effort, PDBs will reject evictions, and apps won't know about it.

If we had, for example, taint-based draining, apps could watch this life-cycling, terminate themselves properly, and eventually free the node.

k8s-shredder would then not need to run (imperative) eviction loops; instead: taint (declare) a node to be drained, and watch for the node to become free.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2024
@porridge
Member

porridge commented Jan 22, 2024 via email

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@turkenh

turkenh commented Feb 15, 2024

We have implemented a new API named Usage in Crossplane, which follows a similar approach to the one proposed here, i.e. blocking deletion of objects in use with an admission webhook that returns a 409.

This works fine in most cases, except for the degraded experience caused by the huge delays from the backoff in the Kubernetes garbage collector. We discussed some possible solutions during the design but couldn't find an elegant one. Basically, we need a way to reset the garbage collector's backoff for a given resource. Is there a way to achieve this? As with most controllers, if making a change to a given resource (e.g. adding an annotation/label) immediately re-queued the resource in the garbage collector's work queue, we could just do that. Apparently this is not possible.

Is there a way to reset the garbage collector's backoff period by patching the resource somehow? Otherwise, I believe the proposal here is subject to the same problem.

@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 15, 2024
@thockin
Member

thockin commented May 15, 2024

I repeat myself: #2840 (comment)

@deads2k any softening of position?

@sftim
Contributor

sftim commented May 16, 2024

You know what, I'd like to repeat #2840 (comment) as well. I still recommend: write viable docs for the thing, then build it if we like what we see.

@porridge
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2024
@porridge
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2024
@porridge
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2025
@thockin thockin removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 17, 2025
@porridge
Member

/remove-lifecycle stale

Labels
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
  • sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
  • sig/architecture - Categorizes an issue or PR as relevant to SIG Architecture.
  • sig/cli - Categorizes an issue or PR as relevant to SIG CLI.
  • size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Triage