diff --git a/keps/prod-readiness/sig-auth/4412.yaml b/keps/prod-readiness/sig-auth/4412.yaml index b8562fca949..54dc8445cf4 100644 --- a/keps/prod-readiness/sig-auth/4412.yaml +++ b/keps/prod-readiness/sig-auth/4412.yaml @@ -1,3 +1,5 @@ kep-number: 4412 alpha: approver: "deads2k" +beta: + approver: "deads2k" diff --git a/keps/sig-auth/4412-projected-service-account-tokens-for-kubelet-image-credential-providers/README.md b/keps/sig-auth/4412-projected-service-account-tokens-for-kubelet-image-credential-providers/README.md index f3dcc8f58d9..a0f79e0a778 100644 --- a/keps/sig-auth/4412-projected-service-account-tokens-for-kubelet-image-credential-providers/README.md +++ b/keps/sig-auth/4412-projected-service-account-tokens-for-kubelet-image-credential-providers/README.md @@ -104,7 +104,6 @@ tags, and then generate with `hack/update-toc.sh`. - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) - [Alpha](#alpha) - - [Post Alpha](#post-alpha) - [Beta](#beta) - [GA](#ga) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) @@ -141,10 +140,10 @@ checklist items _must_ be updated for the enhancement to be released. Items marked with (R) are required *prior to targeting to a milestone / release*. -- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented -- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [x] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) Design details are appropriately documented +- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free @@ -687,6 +686,13 @@ For Beta and GA, add links to added tests together with links to k8s-triage for https://storage.googleapis.com/k8s-triage/index.html --> +This kubelet feature is fully tested with unit and e2e tests. + +For the node audience restriction changes in KAS, integration tests were added as part of the [implementation in v1.32 release](https://github.com/kubernetes/kubernetes/pull/128077). + +- [test/integration/auth/node_test.go](https://github.com/kubernetes/kubernetes/blob/master/test/integration/auth/node_test.go) +- [triage history](https://storage.googleapis.com/k8s-triage/index.html?text=TestNodeRestrictionServiceAccountAudience&test=test%2Fintegration%2Fauth) + ##### e2e tests +There is an existing e2e test for kubelet credential providers using gcp credential provider. + +- test/e2e_node/image_credential_provider.go: https://testgrid.k8s.io/sig-node-kubelet#kubelet-credential-provider + +As part of alpha implementation, the [e2e test has been updated](https://github.com/kubernetes/kubernetes/commit/2090a01e0a495301432276216bbf9af102fc431c) to cover the new credential provider configuration and the new behavior of the kubelet when the `TokenAttributes` field is set. + +We created a symlink to the existing gcp credential provider executable with a different name to use for testing service account token for credential provider. The credential provider has been updated to validate the following when plugin is run in service account token mode: + +1. Check the required annotations are sent as part of the `CredentialProviderRequest.ServiceAccountAnnotations` field. +2. Check the service account token is sent as part of the `CredentialProviderRequest.ServiceAccountToken` field. +3. Extract the claims from the service account token and validate the audience claim matches the `ServiceAccountTokenAudience` field in the kubelet's credential provider configuration. + ### Graduation Criteria - [x] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: `ServiceAccountTokenForKubeletCredentialProviders` + - Feature gate name: `KubeletServiceAccountTokenForCredentialProviders` - Components depending on the feature gate: kubelet -```go -FeatureSpec{ - Default: false, - LockToDefault: false, - PreRelease: featuregate.Alpha, -} -``` - - [x] Feature gate (also fill in values in `kep.yaml`) - Feature gate name: `ServiceAccountNodeAudienceRestriction` - Components depending on the feature gate: kube-apiserver -```go -FeatureSpec{ - Default: true, - LockToDefault: false, - PreRelease: featuregate.Beta, -} -``` +The purpose of the two feature gates is different, which is why they weren't named similarly. + +The `KubeletServiceAccountTokenForCredentialProviders` feature gate is used to enable the kubelet to use service account tokens for image pull in the kubelet credential provider. + +The `ServiceAccountNodeAudienceRestriction` feature gate is used to enable the kube-apiserver to validate the audience of the service account token requested by the kubelet. The feature gate in the Kubernetes API Server (KAS) was introduced to strictly enforce which audiences the kubelet can request tokens for. Before this change, the kubelet could request a token with any audience. With the feature gate enabled, the API server starts validating the requested audience. + +The KAS feature gate doesn't need to be enabled for the kubelet feature to work. It graduated to beta in v1.32 and is enabled by default. The two are unrelated in functionality, but the KAS gate was necessary to ensure strict enforcement of the allowed audiences the kubelet can request tokens for. + +If the KAS feature gate is not enabled, there will be no validation of the audience requested by the kubelet, and the kubelet will be able to request tokens for any audience. This is not recommended. ###### Does enabling the feature change any default behavior? @@ -933,7 +943,8 @@ Steps to disable the feature: 3. Restart the kubelet. These steps need to be performed on all nodes in the cluster. -After restarting the kubelet on all nodes, remove the audiences used by kubelet from the KAS `--allowed-kubelet-audiences` flag. +After restarting the kubelet on all nodes, remove the allowed audiences for which the kubelet is allowed to generate service account tokens for image pulls in KAS by +removing the previous `ClusterRole` or `Role` with the `request-serviceaccounts-token-audience` verb, along with the corresponding `ClusterRoleBinding` or `RoleBinding` that binds the role to the kubelet. ###### What happens if we reenable the feature if it was previously rolled back? @@ -974,6 +985,9 @@ rollout. Similarly, consider large clusters and how enablement/disablement will rollout across nodes. --> +Feature is enabled but exec plugin does not properly fetch and return credentials to the kubelet. +Impact is that kubelet cannot authenticate and pull credentials from those registries. + ###### What specific metrics should inform a rollback? +High error rates from `kubelet_credential_provider_plugin_error` and long durations from `kubelet_credential_provider_plugin_duration`. + ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? +No, upgrade->downgrade->upgrade were not tested. Manual validation will be done prior to promoting this feature to beta in v1.34. + ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? +No. + ### Monitoring Requirements +New metrics: + +- `kubelet_credential_provider_config_hash` indicates the hash of the kubelet credential provider configuration file. This metric can be used by operators to determine if the kubelet credential provider configuration has changed. + ###### How can an operator determine if the feature is in use by workloads? +Operators can use `kubelet_credential_provider_config_hash` metric to determine if the kubelet credential provider configuration has changed. If the hash of the configuration file changes, it indicates that the kubelet credential provider configuration has been updated. + ###### How can someone using this feature know that it is working for their instance? -- [ ] Events - - Event Reason: -- [ ] API .status - - Condition name: - - Other field: -- [ ] Other (treat as last resort) - - Details: +Users can observe events for successful image pulls that use the service account token for image pull. + +- [x] Events + - Event Reason: " Successfully pulled image "xxx" in 11.877s (11.877s including waiting). Image size: xxx bytes." + +For registries or images configured to be pulled using a credential provider with a service account, a successful image pull seems to be the only way to confirm that it's working. If the credential provider is misbehaving, the kubelet will not be able to authenticate to the registry and pull images, which will result in image pull errors. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? @@ -1048,6 +1073,15 @@ These goals will help you determine what you need to measure (SLIs) in the next question. --> +On failure to fetch credentials from an exec plugin, the kubelet will retry after some period and invoke the plugin again. +The kubelet will retry whenever it attempts to pull an image, but until then, kubelet will not be able to authenticate to +the registry and pull images. The SLO for successfully invoking exec plugins should be based on the SLO for successfully +pulling images for the container registry in question. + +The SLOs defined in [Pod startup latency SLI/SLO details](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md) +don't apply to this feature because image pull SLI is explicitly excluded from the pod startup latency SLI/SLO. However, if the kubelet is unable to +pull images due to misconfiguration of the credential provider plugin, it will result in pod startup failures. + ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? +This feature depends on the existence of a credential provider plugin binary on the host and a configuration file for the plugin to be read by the kubelet. + ### Scalability +1.33: Alpha release +1.34: Beta release + ## Drawbacks