KEP-3085: Pod Conditions for Starting and Completion of Sandbox Creation

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Pod sandbox creation is a critical phase of a pod's lifecycle that the kubelet orchestrates across multiple components: in-tree volume plugins (ConfigMap, Secret, EmptyDir, etc.), CSI plugins, the container runtime (which in turn invokes CNI plugins) and the associated runtime handler, among other components. This KEP proposes a SandboxReady condition in pod status to indicate successful completion of pod sandbox creation by the Kubelet. The SandboxReady condition will mark an important milestone in the pod's lifecycle, similar to the ContainersReady and overall Ready conditions in pod status today.

Motivation

Today, the scheduler surfaces a specific pod condition, PodScheduled, that clearly identifies whether a pod got scheduled and when scheduling completed. However, no specific condition around the initialization of successfully scheduled pods (in particular, the completion of pod sandbox creation) is surfaced to cluster administrators in a scoped and consumable fashion.

An existing pod condition, Initialized, tracks the execution of init containers. For pods without init containers, the Initialized condition is set when the Kubelet starts to process a pod, before any sandbox creation activities start. For pods with init containers, the Initialized condition is set when init containers have been pulled and executed to completion. Therefore, the existing Initialized condition is insufficient and inaccurate for tracking completion of sandbox creation for all pods in a cluster. This distinction becomes especially relevant in multi-tenant clusters where individual tenants own the pod specs (including the set of init containers) while the cluster administrators are in charge of storage plugins, networking plugins and container runtime handlers.

A dedicated condition around readiness of the pod sandbox will benefit cluster operators (especially of multi-tenant clusters) who are responsible for configuration and operational aspects of the various components that play a role in pod sandbox creation: CSI plugins, the CRI runtime and associated runtime handlers, CNI plugins, etc. The duration between the lastTransitionTime of the SandboxReady condition (with status set to true for a pod for the first time) and that of the existing PodScheduled condition will allow metrics collection services to surface the total latency of all the components involved in pod sandbox creation as an SLI. Cluster operators can use this to publish SLOs around pod initialization to their customers who launch workloads on the cluster.

Custom pod controllers/operators can use a dedicated condition indicating completion of pod sandbox creation to make better decisions around how to reconcile a pod failing to become ready. As a specific example, a custom controller for managing pods that refer to PVCs associated with node local storage (e.g. Rook-Ceph) may decide to recreate PVCs (based on a specified PVC template) if the sandbox creation is repeatedly failing to complete. Such a controller can leave PVCs intact and only recreate pods if sandbox creation completes successfully but the pod's containers fail to become ready.

When a pod's sandbox no longer exists, the status of the SandboxReady condition will be set to false. The duration between a pod's DeletionTimestamp and the subsequent lastTransitionTime of the SandboxReady condition (with status set to false) will indicate the latency of pod termination. This can also be surfaced by metrics collection services as an SLI. Note that surfacing any dedicated conditions around termination of the pod sandbox is unnecessary and beyond the scope of this KEP.

Individual container creation (including pulling images from a registry) takes place after the successful completion of pod sandbox creation. Updates to pod container status to report latencies associated with creation of individual containers within a pod are beyond the scope of this KEP.

Goals

  • Surface a new pod condition SandboxReady to indicate the successful completion of pod sandbox creation by Kubelet
  • Describe how the new pod condition can be consumed by external services to determine state and duration of pod sandbox creation.

Non-Goals

  • Modify the meaning of the existing Initialized condition
  • Specify metrics collection based on the conditions around pod sandbox creation
  • Specify additional conditions (beyond SandboxReady with status set to false) to indicate sandbox teardown
  • Surface beginning and completion of creation of individual containers

Proposal

This KEP proposes enhancements to the Kubelet to report the completion of pod sandbox creation as a pod condition with type: SandboxReady. Metric collection and monitoring services can use the fields associated with the SandboxReady condition to report sandbox creation state and latency either at a per-pod cardinality or aggregate the data based on various properties of the pod: number of volumes, storage class of PVCs, runtime class, custom annotations for CNI and IPAM plugins, arbitrary labels and annotations on pods, etc. Certain pod controllers can use the pod sandbox conditions to determine an optimal reconciliation strategy for pods and associated resources (like PVCs).

User Stories (Optional)

User Stories For Consuming Pod SandboxReady Condition

Surfacing the completion of pod sandbox creation as a pod condition in pod status can be consumed in different ways:

Story 1: Consuming SandboxReady Condition Per Pod In A Monitoring Service

A cluster operator may already depend on a service like Kube State Metrics for monitoring the state of their Kubernetes clusters. The cluster operator may want such a service to surface pod sandbox creation state and latency at a granular level for each pod (due to the ambiguity around Initialized state as described earlier). For this story, we are assuming the service has been enhanced to [1] consume the new SandboxReady pod condition as described in this KEP and [2] implement informers and state to distinguish between the first time a pod sandbox becomes ready and a subsequent instance of sandbox becoming ready (after sandbox destruction) over the lifetime of a pod.

The operator can use PromQL queries to aggregate and analyze data (around pod sandbox creation) based on custom pod labels and annotations (already surfaced by a service like Kube State Metrics) indicating specific workload types across different namespaces. For example, annotations and labels could be used to differentiate pod sandbox creation state and latencies for "sensitive database" workloads, "sensitive analysis" workloads and "untrusted build" workloads each of which maps to pods mounting PVCs from different storage classes (depending on the level of encryption desired), using a specific runtime class (depending on the level of isolation desired - microvm vs runc based) and specific IPAM characteristics around reachability of the pods. Access to the pod labels and annotations along with the sandbox latency data at a per-pod cardinality is essential to enable the aggregation based on factors that have special/custom meaning for the operator's cluster and tenants. The values associated with such labels and annotations may not map to distinct namespaces, existing pod fields or other API object fields in a Kubernetes cluster.

Depending on the metrics and monitoring pipeline, as the cluster scales up, cardinality of data at a per-pod level (surfaced from a service like Kube State Metrics) may lead to excessive load on a monitoring backend like Prometheus. At that point, the cluster operator may decide to create and deploy their own custom monitoring service that uses a pod informer and aggregates (based on custom pod labels and annotations) state and latency of pod sandbox creation into a histogram which is ultimately reported to Prometheus. As with the previous approach, access to the pod labels and annotations and the sandbox latency data at a per-pod cardinality is essential to enable aggregation based on factors that have special/custom meaning for the operator's cluster and tenants and may not map to distinct namespaces, pod fields or other API object fields in the cluster.
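A minimal sketch (in Go, using client-go informers and the Prometheus client library) of what such a custom monitoring service could look like. The metric name, the example.com/workload-type label key and the helper names are hypothetical, and the condition type assumes the SandboxReady condition proposed by this KEP.

package monitor

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    v1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// Histogram of initial sandbox creation latency, keyed by a hypothetical
// workload-type label on the pod.
var sandboxReadyLatency = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "pod_initial_sandbox_ready_seconds",
        Help: "Latency from PodScheduled to the first SandboxReady=True transition.",
    },
    []string{"workload_type"},
)

// condition returns the pod condition of the given type, if present.
func condition(pod *v1.Pod, t v1.PodConditionType) *v1.PodCondition {
    for i := range pod.Status.Conditions {
        if pod.Status.Conditions[i].Type == t {
            return &pod.Status.Conditions[i]
        }
    }
    return nil
}

// StartMonitor watches pods and records initial sandbox creation latency.
func StartMonitor(client kubernetes.Interface, stop <-chan struct{}) {
    prometheus.MustRegister(sandboxReadyLatency)

    // Track pods whose initial SandboxReady=True transition was already recorded,
    // so later sandbox re-creations are not counted as "initial" latency.
    // (A real service would bound and persist this state.)
    seen := map[string]bool{}

    factory := informers.NewSharedInformerFactory(client, 30*time.Second)
    factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(_, obj interface{}) {
            pod := obj.(*v1.Pod)
            scheduled := condition(pod, v1.PodScheduled)
            // "SandboxReady" is the condition type proposed by this KEP.
            sandbox := condition(pod, v1.PodConditionType("SandboxReady"))
            if seen[string(pod.UID)] || scheduled == nil || sandbox == nil || sandbox.Status != v1.ConditionTrue {
                return
            }
            seen[string(pod.UID)] = true
            latency := sandbox.LastTransitionTime.Sub(scheduled.LastTransitionTime.Time)
            sandboxReadyLatency.WithLabelValues(pod.Labels["example.com/workload-type"]).Observe(latency.Seconds())
        },
    })
    factory.Start(stop)
}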

The data from the above monitoring services can be used as SLIs with associated SLOs configured around sandbox creation state and latency (besides other metrics like scheduling latency) for each specific workload type depending on specific user requirements such as: desired encryption of persistent data (if any), runtime isolation and network reachability (governed by different IPAM plugins).

Story 2: Consuming SandboxReady Condition In A Controller

A controller managing a set of pods along with associated resources like networking configuration, storage or arbitrary dynamic resources (in the future) can evaluate the SandboxReady condition to optimize the set of actions it executes when bringing up pods and encountering failures. Depending on whether the pod sandbox is ready, the controller may decide to destroy and re-create the associated resources that are required for the sandbox creation to complete or simply try to re-create the pod while keeping the resources intact.

A specific example of the above would be a controller for stateful application pods that mount PVCs that bind to node local PVs. Let's assume the stateful application has built-in data replication capabilities and the controller supports PVC templates to dynamically generate PVCs. When trying to bring up fresh pods (after earlier pods got terminated), there could be a problem with the CSI plugin that mounts the node local PV into the pod. In such a situation, the sandbox creation will not complete. Based on the SandboxReady condition, the controller may decide to create a fresh PVC. If sandbox creation does complete successfully but the pod fails to enter a Ready state, the controller will retain the PVC (to avoid any data replication) and only try to recreate the pod. Having access to pod sandbox conditions allows the controller to optimize its reconciliation strategy and realize the desired state more efficiently.
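A minimal sketch (hypothetical types and helper names, in Go) of how such a controller could branch its remediation on the SandboxReady condition; the condition type assumes the name proposed by this KEP.

package reconcile

import (
    v1 "k8s.io/api/core/v1"
)

// Action is the hypothetical remediation a controller picks for a failing pod.
type Action string

const (
    RecreatePVCAndPod Action = "RecreatePVCAndPod" // sandbox never became ready; storage may be at fault
    RecreatePodOnly   Action = "RecreatePodOnly"   // sandbox is ready; keep the PVC and its data intact
    Wait              Action = "Wait"              // nothing conclusive yet
)

func podCondition(pod *v1.Pod, t v1.PodConditionType) *v1.PodCondition {
    for i := range pod.Status.Conditions {
        if pod.Status.Conditions[i].Type == t {
            return &pod.Status.Conditions[i]
        }
    }
    return nil
}

// RemediationFor decides what to do with a pod that has failed to become Ready
// for longer than the controller's threshold.
func RemediationFor(pod *v1.Pod) Action {
    // "SandboxReady" is the condition type proposed by this KEP.
    sandbox := podCondition(pod, v1.PodConditionType("SandboxReady"))
    ready := podCondition(pod, v1.PodReady)

    switch {
    case sandbox == nil || sandbox.Status != v1.ConditionTrue:
        // Sandbox creation is repeatedly failing: recreate the PVC from its
        // template and let the application re-replicate the data.
        return RecreatePVCAndPod
    case ready == nil || ready.Status != v1.ConditionTrue:
        // Sandbox creation succeeded but containers are unhealthy: keep the
        // PVC (and its local data) and only recreate the pod.
        return RecreatePodOnly
    default:
        return Wait
    }
}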

SandboxReady Condition Fields In Different User Scenarios

In each of the scenarios below, nearly identical SandboxReady conditions that would result from different scenarios/problems are grouped together. The unique scenarios are detailed after describing the values associated with the fields of the SandboxReady condition. To make each scenario concrete, a specific set of timestamps in the future is chosen. The PodScheduled condition is mentioned in the stories but conditions after pod sandbox creation (e.g. Initialized and Ready) are skipped. A service monitoring latency of initial pod sandbox creation is assumed to implement a pod informer and appropriate state to distinguish between the first time a pod sandbox becomes ready versus a subsequent instance of readiness over the lifetime of the pod.

Scenario 1: Stateless pod scheduled on a healthy node and cluster

A user launches a simple, stateless runc based pod with no init containers in a healthy cluster. The pod gets successfully scheduled at 2022-12-06T15:33:46Z and pod sandbox is ready after three seconds at 2022-12-06T15:33:49Z.

The pod will report the following conditions in pod status at 2022-12-06T15:33:47Z (right after Kubelet worker starts processing the pod):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:47Z"
    status: "False"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

The pod will report the following conditions in pod status at 2022-12-06T15:33:50Z (after pod sandbox creation is complete):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:49Z"
    status: "True"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

A service monitoring latency of initial pod sandbox creation will record a latency of three seconds in this scenario, based on the delta between the lastTransitionTime timestamps associated with the SandboxReady and PodScheduled conditions.

Scenario 2: Pods with startup delays due to problems with CSI, CNI or Runtime Handler plugins

In each of the scenarios under this section, problems or delays with infrastructural plugins like CSI/CNI/CRI result in a ten second delay for pod sandbox creation to complete. In each scenario, the pod gets successfully scheduled at 2022-12-06T15:33:46Z, pod sandbox is ready after ten seconds at 2022-12-06T15:33:56Z.

For each scenario below, the pod will report the following conditions in pod status at 2022-12-06T15:33:47Z (right after the Kubelet worker starts processing the pod and pod sandbox creation has started but is not yet complete):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:47Z"
    status: "False"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

For each scenario, the pod will report the following conditions in pod status at 2022-12-06T15:34:00Z (after the pod sandbox became ready, ten seconds after scheduling):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:56Z"
    status: "True"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

A service monitoring duration of pod sandbox creation will record a latency of ten seconds in these scenarios based on the delta between lastTransitionTime timestamps associated with SandboxReady and PodScheduled conditions with status set to true. For each observation associated with a scenario below, the monitoring service also associates a label with the metric indicating RuntimeClass of the pods and StorageClass of PVCs referred by the pod. This enables further grouping of the data during analysis.

A cluster-wide SLO around initial pod sandbox creation latencies configured with a threshold of less than ten seconds will record a breach in these scenarios. Further analysis of the metrics based on labels indicating RuntimeClass of the pods and StorageClass of PVCs referred by the pod will enable the cluster administrators to isolate the cause of the breaches to specific infrastructure plugins as detailed below.

Stateful pod encountering sandbox creation delays from attaching a PV backed by a CSI plugin

A stateful pod refers to a PVC bound to a PV backed by a CSI plugin. After the pod is scheduled on a node, the CSI plugin runs into problems in the storage control plane when trying to attach the PV to the node. This results in several retries that ultimately succeed after nine seconds.

Stateless pod encountering sandbox creation delays from allocating IP from a CNI/IPAM plugin

A pod is scheduled on a node in an experimental pre-production cluster where the operator has configured a new CNI plugin using a centralized IP allocation mechanism. Due to a spike of load in the IP allocation service, the CNI plugin times out several times but ultimately succeeds in getting an IP address and configuring the pod network after nine seconds.

Stateless pod encountering sandbox creation delays from microvm based sandbox initialization

A pod configured with a special microvm based runtime class is scheduled on a node. The runtimeclass handler encounters crashes in the guest kernel multiple times but ultimately initializes the virtual machine based sandbox environment successfully after nine seconds.

Scenario 3: Pod unable to start due to problems with CSI, CNI or Runtime Handler plugins

In each of the scenarios under this section, problems or delays with infrastructural plugins like CSI/CNI/CRI result in pod sandbox creation never completing. In each scenario, the pod gets successfully scheduled at 2022-12-06T15:33:46Z, but pod sandbox creation runs into problems that do not eventually resolve, resulting in repeated failures as the kubelet tries to start the pod.

For each scenario below, the pod will report the following conditions in pod status at all times after 2022-12-06T15:33:47Z (after pod sandbox creation started until the pod is deleted manually or by a controller):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:47Z"
    reason: PodSandboxCreationInProgress
    status: "False"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

A service monitoring state of pod sandbox creation will record a metric indicating failure to create pod sandbox beyond a configured duration.

A cluster-wide SLO around success rate of pod sandbox creation may record a breach due to the pod sandbox creation failures. Further analysis of the metrics aggregated based on labels (associated with the metrics) indicating RuntimeClass of the pods and StorageClass of PVCs referred by the pod will enable the cluster administrators to associate the failures to specific infrastructure plugins as detailed below.

Stateful pod encountering sandbox creation failures when attaching a PV backed by a CSI plugin

A stateful pod refers to a PVC bound to a PV backed by a CSI plugin. After the pod is scheduled on a node, the CSI plugin runs into problems in the storage control plane when trying to attach the PV to the node. The failure to attach never resolves, thus blocking pod sandbox creation.

Stateless pod encountering sandbox creation failures when allocating IP from a CNI/IPAM plugin

A pod is scheduled on a node in an experimental pre-production cluster where the operator has configured a new CNI plugin using a centralized IP allocation mechanism. Due to problems in the IP allocation service, the CNI plugin fails to get an IP address and is unable to configure the pod network. This blocks pod sandbox creation.

Stateless pod encountering sandbox creation failures from microvm based sandbox initialization

A pod configured with a special microvm based runtime class is scheduled on a node. The runtimeclass handler encounters crashes in the guest kernel repeatedly and is unable to initialize the virtual machine based sandbox environment.

Scenario 4: Pod Sandbox restart after a successful initial startup and crash

In each of the scenarios under this section, a pod sandbox is successfully created but eventually gets destroyed due to problems in the host or the sandbox environment. As a result, the pod sandbox has to be re-created by the Kubelet. In each scenario, the pod is successfully scheduled at 2022-12-06T15:33:46Z and the pod sandbox is ready after five seconds. The sandbox is destroyed after two hours. Re-creation of the sandbox runs into problems but eventually succeeds after nine seconds.

The pod will report the following conditions in pod status at 2022-12-06T15:34:00Z (a few seconds after the initial pod sandbox is ready):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:52Z"
    status: "True"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

The pod will report the following conditions in pod status at 2022-12-06T17:33:46Z (right after pod sandbox is destroyed):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T17:33:46Z"
    status: "False"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

The pod will report the following conditions in pod status at 2022-12-06T17:34:00Z (a few seconds after the new pod sandbox is ready):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T17:33:52Z"
    status: "True"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

A service monitoring restarts associated with successfully created pod sandboxes will record a restart in these scenarios. A service measuring initial pod sandbox creation latency will need to implement logic (for example, using pod informers and state) to differentiate the initial pod sandbox creation from subsequent pod sandbox creations resulting from node crashes/reboots or sandbox crashes.

Node crash

A regular runc based pod is scheduled on a node whose kernel crashes two hours after the pod sandbox was created successfully. The node restarts quickly (resulting in no pod evictions) and the kubelet has to re-create the pod sandbox.

Sandbox crash

A pod is configured with a microvm based runtime handler. The virtual machine sandbox is created successfully but suffers a crash due to problems with the guest kernel two hours after pod creation. As a result, the kubelet has to re-create the pod sandbox.

Scenario 5: Graceful pod sandbox termination

A user launches a pod that runs successfully but is eventually deleted by a controller after several hours. The pod was scheduled at 2022-12-06T12:33:46Z and the sandbox became ready at 2022-12-06T12:33:48Z. The delete request is invoked at 2022-12-06T15:33:47Z and the pod is terminated by the Kubelet at 2022-12-06T15:33:49Z.

The pod will report the following conditions in pod status at 2022-12-06T15:33:46Z (right before the pod delete request is invoked):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T12:33:48Z"
    status: "True"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T12:33:46Z"
    status: "True"
    type: PodScheduled

The pod will report the following conditions in pod status at 2022-12-06T15:33:49Z (right after the pod termination has been processed by Kubelet but the pod is yet to be completely deleted from API server):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:49Z"
    status: "False"
    type: SandboxReady
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T12:33:46Z"
    status: "True"
    type: PodScheduled

Notes/Constraints/Caveats (Optional)

A monitoring service measuring duration of initial sandbox creation of a pod should differentiate between the initial and subsequent sandbox creations (if any due to node crash/sandbox crash) and track them separately. This can be achieved using a pod informer whose event handler stores (in a persistent store or as custom annotations on the pod) the lastTransitionTime field for SandboxReady condition observed when it had status = true for the first time. Later, if the pod sandbox is recreated, the lastTransitionTime for the pod sandbox creation conditions can be differentiated from the data associated with initial sandbox creation based on whether the initial data exists (either in the persistent store or pod annotations).
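As a minimal sketch, assuming the hypothetical annotation key example.com/initial-sandbox-ready, a monitoring service could persist the first observed SandboxReady=True transition time as a pod annotation via a merge patch with client-go, so later sandbox re-creations can be told apart from the initial one.

package monitor

import (
    "context"
    "fmt"
    "time"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

const initialReadyAnnotation = "example.com/initial-sandbox-ready" // hypothetical key

// RecordInitialReady stores the given transition time on the pod unless an
// earlier observation already recorded it.
func RecordInitialReady(ctx context.Context, client kubernetes.Interface, pod *v1.Pod, transition metav1.Time) error {
    if _, recorded := pod.Annotations[initialReadyAnnotation]; recorded {
        return nil // a later SandboxReady transition: not the initial creation
    }
    patch := []byte(fmt.Sprintf(
        `{"metadata":{"annotations":{"%s":"%s"}}}`,
        initialReadyAnnotation, transition.UTC().Format(time.RFC3339),
    ))
    _, err := client.CoreV1().Pods(pod.Namespace).Patch(
        ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{})
    return err
}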

Measuring the duration of sandbox creation accurately beyond the initial sandbox creation is not possible with the SandboxReady condition alone. This is similar to other ready conditions like ContainersReady and the overall pod Ready condition, which get updated after containers are restarted without a specific marker of when the process of restarting the containers or bringing the pod back into a ready state began following an event like a node crash.

Risks and Mitigations

The main risk associated with SandboxReady is any potential confusion with the existing Initialized condition. The existing Initialized condition and the new pod sandbox condition refer to distinct stages in a pod's overall initialization. Documentation will help mitigate this risk.

Design Details

The Kubelet will set a new condition, SandboxReady, on a pod to surface the successful completion of sandbox creation for that pod. A new PodConditionType corresponding to SandboxReady will be added in api/core/v1/types.go. No changes are required in the Pod Status API for this enhancement.
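A minimal sketch of what the new constant could look like; the PodConditionType alias is repeated here only to keep the snippet self-contained, and the exact name and comment would be settled during implementation.

package v1sketch

// PodConditionType mirrors the type defined in k8s.io/api/core/v1; shown here
// only to make this sketch self-contained.
type PodConditionType string

// SandboxReady is the new condition type this KEP proposes to add alongside the
// existing PodConditionType constants in api/core/v1/types.go.
const SandboxReady PodConditionType = "SandboxReady"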

Determining status of sandbox creation for a pod

Today, syncPod() in Kubelet is invoked with the kubecontainer.PodStatus (distinct from the v1.PodStatus API) associated with a given pod. podSandboxChanged() in kubeGenericRuntimeManager evaluates the SandboxStatuses field in PodStatus to determine whether a new pod sandbox will need to be created for a pod. The same logic will be used to determine whether a sandbox is ready for a pod in the Kubelet status manager.
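A minimal sketch of that readiness check, mirroring the logic podSandboxChanged() applies to kubecontainer.PodStatus; the helper name is hypothetical and the sketch assumes the v1 CRI API types.

package kubeletsketch

import (
    runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
    kubecontainer "k8s.io/kubernetes/pkg/kubelet/container"
)

// sandboxReady reports whether the pod currently has a ready sandbox.
func sandboxReady(podStatus *kubecontainer.PodStatus) bool {
    if len(podStatus.SandboxStatuses) == 0 {
        return false // no sandbox has been created (or it was removed)
    }
    // The kubelet keeps the most recently created sandbox first in this list;
    // only the latest sandbox is relevant for the condition.
    latest := podStatus.SandboxStatuses[0]
    return latest.State == runtimeapi.PodSandboxState_SANDBOX_READY
}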

SandboxReady condition details

Kubelet will initially generate the SandboxReady condition as part of existing calls to generateAPIPodStatus() early during syncPod(). The status field will be set to true if a sandbox is ready (determined by invoking podSandboxChanged() as described above). The status field will be set to false if a sandbox is found to be not ready.

When Kubelet starts creating a sandbox, it will set a temporary PodSandboxCreationStarted annotation in the pod cache. The reason field for SandboxReady condition will be set to PodSandboxCreationInProgress if the PodSandboxCreationStarted annotation exists. The annotations will be cleared (in the pod cache) when sandbox creation is complete and the status field of SandboxReady is set to true. Note that this annotation will not be persisted in the API server.

Kubelet will generate the SandboxReady condition for the final time (in the life of a pod) as part of existing calls to generateAPIPodStatus() early during syncTerminatedPod(). Prior invocations of killPod() (as part of syncTerminatingPod) will result in the absence of a sandbox corresponding to the pod. As a result, the status field of the SandboxReady condition will be set to false (determined by invoking podSandboxChanged() as described above).

During periods of API server or etcd unavailability combined with a Kubelet restart/crash (covered in more detail below), the lastTransitionTime field of the SandboxReady condition that ultimately gets persisted (once the Kubelet restarts and the API server becomes available again) will be as close as possible to the actual change in the condition that could not be persisted.

Changes to the status field will result in the lastTransitionTime field being updated (by the Kubelet Status Manager).

Enhancements in Kubelet Status Manager

Today, the Kubelet Status Manager surfaces APIs for other Kubelet components to issue pod status updates. It caches the pod status and issues patches to the API server when necessary. This infrastructure will be used for managing the new pod conditions as well.

The Kubelet Status Manager will surface a new GenerateSandboxReadyCondition API. This will be invoked by Kubelet's generateAPIPodStatus() to populate the pod status that is passed to setPodStatus. This is similar to the existing pod condition generator functions: GeneratePodReadyCondition and GeneratePodInitializedCondition. If updates through generateAPIPodStatus() are found to be inaccurate (for example, if the Kubelet is very busy), an invocation of GenerateSandboxReadyCondition could also be added right after createSandbox in kubeGenericRuntimeManager returns successfully.
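A minimal sketch of what GenerateSandboxReadyCondition could look like; the signature and constant names are assumptions modeled on the existing generator functions, not the final API.

package statussketch

import (
    v1 "k8s.io/api/core/v1"
)

const (
    // SandboxReady is the condition type proposed by this KEP.
    SandboxReady v1.PodConditionType = "SandboxReady"
    // PodSandboxCreationInProgress is the reason reported while creation is underway.
    PodSandboxCreationInProgress = "PodSandboxCreationInProgress"
)

// GenerateSandboxReadyCondition builds the SandboxReady condition from the
// runtime's view of the sandbox. creationStarted corresponds to the temporary
// PodSandboxCreationStarted marker kept in the pod cache.
func GenerateSandboxReadyCondition(sandboxIsReady, creationStarted bool) v1.PodCondition {
    if sandboxIsReady {
        return v1.PodCondition{
            Type:   SandboxReady,
            Status: v1.ConditionTrue,
        }
    }
    cond := v1.PodCondition{
        Type:   SandboxReady,
        Status: v1.ConditionFalse,
    }
    if creationStarted {
        cond.Reason = PodSandboxCreationInProgress
    }
    return cond
}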

updateStatusInternal() in the Kubelet Status Manager will be enhanced to invoke updateLastTransitionTime for the new SandboxReady condition when a change in the status of the condition is detected.

Unavailability of API Server or etcd along with Kubelet Restart

If pod sandbox creation completed successfully on a node but API server became unavailable, the Kubelet status manager will retry issuing the patches to the API server. However, the Kubelet may get restarted (or crash) while the API server is unavailable with the pod status updates not yet persisted. In such a situation (expected to be quite rare), the timestamp associated with the lastTransitionTime field in the new conditions will not be accurate due to inability to persist or cache them. The lastTransitionTime field will get updated on subsequent generateAPIPodStatus() calls based on the state of the CRI sandbox and the corresponding timestamps will be persisted. This aligns with handling of other Kubelet managed conditions (ContainersReady, (Pod) Ready) when API server is unavailable and Kubelet restarts resulting in the status manager cache getting dropped.

Test Plan

E2E tests will be introduced to cover the user scenarios mentioned above. Tests will involve launching pods with the characteristics mentioned below and verifying that the pod status has the new SandboxReady condition with the status and reason fields populated with the expected values (a sketch of such a check follows the list):

  1. A basic pod that launches successfully without any problems.
  2. A pod with references to a configmap (as a volume) that has not been created causing the pod sandbox creation to not complete until the configmap is created later.
  3. A pod whose node is rebooted leading to the sandbox being recreated.
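A minimal sketch (hypothetical helper, using client-go) of the kind of assertion these tests could make while waiting for the condition.

package e2esketch

import (
    "context"
    "fmt"
    "time"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// waitForSandboxReady polls the pod until its SandboxReady condition is True
// or the timeout expires.
func waitForSandboxReady(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        for _, c := range pod.Status.Conditions {
            if c.Type == v1.PodConditionType("SandboxReady") && c.Status == v1.ConditionTrue {
                return nil
            }
        }
        time.Sleep(2 * time.Second)
    }
    return fmt.Errorf("pod %s/%s did not report SandboxReady=True within %v", namespace, name, timeout)
}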

Tests for pod conditions in the GracefulNodeShutdown e2e_node test will be enhanced to check that the status of the new SandboxReady condition is false after graceful termination of a pod.

Testing of pod condition updates in the conformance test "Pods, completes the lifecycle of a Pod and the PodStatus" will be enhanced to cover resetting the new pod sandbox conditions.

Graduation Criteria

Alpha

  • Kubelet will report pod sandbox conditions if the feature flag SandboxReadyCondition is enabled.
  • Initial e2e tests completed and enabled.

Beta

  • Gather feedback from cluster operators and developers of services or controllers that consume these conditions.
  • Implement suggestions from feedback as feasible.
  • Feature Flag removed.
  • Add more test cases and link to this KEP.

GA

  • All tests are passing with no known flakiness.
  • All feedback addressed around the new pod sandbox conditions.
  • No open decision items around the new pod sandbox conditions.

Upgrade / Downgrade Strategy

The new condition will be managed by the Kubelet. When upgrading a node to a version of the Kubelet that can set the new condition, new pods launched on that node will surface the new condition. If the Kubelet on the node is later downgraded to a version that does not support the new condition, pods that already surfaced it (including evicted pods that have not yet been deleted) will continue to report a stale version of the condition, which the downgraded Kubelet will not update.

Version Skew Strategy

The new condition will be managed by the Kubelet. Since the control plane components are not involved, handling of version skew is not applicable.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: SandboxReadyCondition
    • Components depending on the feature gate: Kubelet
Does enabling the feature change any default behavior?

No changes to any default behavior should result from enabling the feature.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature can be disabled once it has been enabled. However the new pod sandbox condition will get persisted in pods and would continue to be reported after the feature is disabled until those pods are deleted.

What happens if we reenable the feature if it was previously rolled back?

New pods created since re-enablement will report the new pod sandbox condition.

Are there any tests for feature enablement/disablement?

No

Rollout, Upgrade and Rollback Planning

Skipping this section at the Alpha stage and will populate at Beta.

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

Skipping this section at the Alpha stage and will populate at Beta.

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

No, this feature does not have any dependencies. Other metric oriented services in the cluster may depend on this.

Scalability

Will enabling / using this feature result in any new API calls?

Yes, the new pod condition will result in the Kubelet Status Manager making additional PATCH calls on the pod status fields.

The Kubelet Status Manager already has infrastructure to cache pod status updates (including pod conditions) and issue the PATCH calls in a batch.

Will enabling / using this feature result in introducing new API types?

No

Will enabling / using this feature result in any new calls to the cloud provider?

No

Will enabling / using this feature result in increasing size or count of the existing API objects?

Slight increase (a few bytes) of the Pod API object due to persistence of the additional condition in the pod status.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If etcd/API server is unavailable, pod status cannot be updated. So the SandboxReady condition associated with pod status cannot be updated either. The pod status manager already retries the API server requests later (based on data cached in the Kubelet) and that should help.

If pod sandbox creation completes for a pod on a node but API server becomes unavailable (before the sandbox creation condition can be patched) and Kubelet crashes or restarts (shortly after API server becoming and staying unavailable), the lastTransitionTime field may be inaccurate. This is described in the section above.

What are other known failure modes?

None so far

What steps should be taken if SLOs are not being met to determine the problem?

SLOs are not applicable to pod status fields. Overall Kubernetes node level SLOs may leverage this feature.

Implementation History

Drawbacks

The main drawback associated with the new pod sandbox conditions involves a slight potential increase in calls to the API server from the Kubelet to patch the status of the new SandboxReady condition in a pod's status. Typically, this would involve a couple of extra patch calls for pod status over the lifetime of most pods (if the status manager does not batch them with other pod status updates): one when pod sandbox creation completes and another when the pod is terminated. However, there could be a higher number of patch calls to the API server if the pod sandbox environment (like a microvm) starts successfully and then crashes in a restart loop.

Caching of updates to pod status by the pod status manager and batching pod status updates (which is already in place) can help mitigate frequent patch calls to API server.

Alternatives

Dedicated fields or annotations for the pod sandbox creation timestamps

Timestamps around completion of pod sandbox creation may be surfaced as a dedicated field in the pod status rather than a pod condition. However, since the successful creation of the pod sandbox is essentially a "milestone" in the life of a pod (similar to Scheduled, Ready, etc.), pod conditions are the ideal place to surface it, and this aligns well with existing conditions like ContainersReady and the overall Ready.

A dedicated annotation on the pod for surfacing this data is another potential approach. However, usage of annotations for Kubelet managed data is typically discouraged.

Surface pod sandbox creation latency instead of timestamps

Surfacing the amount of time it took to successfully create a pod sandbox is an alternative to surfacing the condition around completion of pod sandbox creation (whose delta from the PodScheduled condition reflects the latency). The latency data would surface the same information from a pod initialization SLI perspective as mentioned in the Motivation section. Implementing this approach would require an API change on the pod status to surface the latency data (as this no longer fits the structure of a pod condition). This data also cannot be consumed by other controllers as mentioned in the User Stories section.

Report sandbox creation latency as an aggregated metric

The duration it took a pod sandbox to become ready can be directly reported as a Prometheus metric aggregated into a histogram. However, aggregating the data at the Kubelet level prevents a metric collection service from classifying the data based on interesting fields on a pod (runtime class, storage class of PVCs, number of PVCs, etc.) or using custom labels and annotations on pods that indicate workload characteristics (that the cluster operator may wish to use as a basis for aggregating the metrics).

This also prevents other controllers from acting on sandbox status as mentioned in the User Stories section.

Report sandbox creation stages using Kubelet tracing

The Kubelet is being instrumented to emit traces based on OpenTelemetry around sandbox creation stages (as well as several other parts of the pod lifecycle).

To implement the pod sandbox creation latency SLI/SLO use cases, the tracing infrastructure needs to be able to:

  • Collect all traces around CRI sandbox creation for all pods with no sampling.
  • Look up pod fields from the API server (associated with a pod's trace), like labels, annotations, storage classes of PVCs referenced by the pod, runtime class, etc., that are of interest to cluster operators and their users for classifying and aggregating the metrics.
  • Look up a pod's Scheduled condition fields to determine the beginning of pod sandbox creation.

Since the lookup of the pod fields and existing conditions is necessary for SLIs around pod sandbox creation latency, surfacing the SandboxReady condition in pod status will allow a metric collection service to directly access the relevant data without requiring the ability to collect and parse OpenTelemetry traces. As mentioned in the User Stories, popular community managed services like Kube State Metrics can consume the SandboxReady condition with a trivial set of changes. Enhancing them to collect and parse OpenTelemetry traces with no sampling, and mapping the trace data to the associated data from the API server, will be complex from an engineering and operational perspective.

For controllers using the pod sandbox conditions to determine reconciliation strategy, access to the pod is typically necessary while collecting and parsing traces would be unusual.

Have CSI/CNI/CRI plugins mark their start and completion timestamps while setting up their respective portions for a pod

Each infrastructural plugin that the Kubelet calls out to (in the process of setting up a pod sandbox) can mark start and completion timestamps on the pod as conditions. This approach would be similar to how readiness gates work today. However, CSI and CRI plugins would need to be enlightened about fields in a pod (like status conditions) and set up a client to the API server (to update the conditions), which they may not implement in order to stay orchestrator agnostic.

Use a dedicated service between Kubelet and CRI runtime to mark sandbox ready condition on a pod

An on-host binary that runs as a service and proxies CRI API calls between the CRI runtime and the Kubelet can intercept the successful creation of a pod sandbox in response to CRI RunPodSandbox. Next, using an API server client, the binary can mark extended conditions on a pod to indicate the state of sandbox creation. While this approach works without requiring any additional changes to the Kubelet, it has a couple of disadvantages. First, this approach requires configuration and management of a separate proxy binary between the Kubelet and the CRI runtime on the cluster nodes. Second, the proxy binary would need to replicate the logic in the Kubelet status manager to efficiently interact with the API server (as well as cache the status and retry in case of API server outages) regarding updates to pod sandbox status. Therefore, isolating the logic around pod sandbox conditions to a separate binary intercepting API calls between the kubelet and the CRI runtime is not preferred.

Have Kubelet mark sandbox ready condition on a pod using extended conditions

Instead of a "native" condition as proposed in this KEP, an "extended" condition maybe used by Kubelet to mark the SandboxReady condition. Such a condition may look like: kubernetes.io/pod-sandbox-ready. However, internal/core Kubernetes components (like Kubelet) do not use "extended" conditions today. So this approach may be unusual.

Infrastructure Needed (Optional)