- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- User Stories (Optional)
- User Stories For Consuming Pod SandboxReady Condition
- SandboxReady Condition Fields In Different User Scenarios
- Scenario 1: Stateless pod scheduled on a healthy node and cluster
- Scenario 2: Pods with startup delays due to problems with CSI, CNI or Runtime Handler plugins
- Scenario 3: Pod unable to start due to problems with CSI, CNI or Runtime Handler plugins
- Scenario 4: Pod Sandbox restart after a successful initial startup and crash
- Scenario 5: Graceful pod sandbox termination
- Notes/Constraints/Caveats (Optional)
- Risks and Mitigations
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Dedicated fields or annotations for the pod sandbox creation timestamps
- Surface pod sandbox creation latency instead of timestamps
- Report sandbox creation latency as an aggregated metric
- Report sandbox creation stages using Kubelet tracing
- Have CSI/CNI/CRI plugins mark their start and completion timestamps while setting up their respective portions for a pod
- Use a dedicated service between Kubelet and CRI runtime to mark sandbox ready condition on a pod
- Have Kubelet mark sandbox ready condition on a pod using extended conditions
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Pod sandbox creation is a critical phase of a pod's lifecycle that the kubelet orchestrates across multiple components: in-tree volume plugins (ConfigMap, Secret, EmptyDir, etc.), CSI plugins, the container runtime (which in turn invokes CNI plugins), the associated runtime handler, and other components. This KEP proposes a SandboxReady condition in pod status to indicate successful completion of pod sandbox creation by the Kubelet. The SandboxReady condition will mark an important milestone in the pod's lifecycle, similar to the ContainersReady and overall Ready conditions in pod status today.
Today, the scheduler surfaces a specific pod condition, PodScheduled, that clearly identifies whether a pod got scheduled by the scheduler and when scheduling completed. However, no specific condition around initialization of successfully scheduled pods, from the perspective of completion of pod sandbox creation, is surfaced to cluster administrators in a scoped and consumable fashion.
There is an existing pod condition, Initialized, that tracks execution of init containers. For pods without init containers, the Initialized condition is set when the Kubelet starts to process a pod, before any sandbox creation activities start. For pods with init containers, the Initialized condition is set when init containers have been pulled and executed to completion. Therefore, the existing Initialized condition is insufficient and inaccurate for tracking completion of sandbox creation for all pods in a cluster. This distinction becomes especially relevant in multi-tenant clusters where individual tenants own the pod specs (including the set of init containers) while the cluster administrators are in charge of storage plugins, networking plugins and container runtime handlers.
A dedicated condition around readiness of the pod sandbox will benefit cluster operators (especially of multi-tenant clusters) who are responsible for configuration and operational aspects of the various components that play a role in pod sandbox creation: CSI plugins, CRI runtime and associated runtime handlers, CNI plugins, etc. The duration between the lastTransitionTime field of the SandboxReady condition (with status set to true for a pod for the first time) and that of the existing PodScheduled condition will allow metrics collection services to surface the total latency of all the components involved in pod sandbox creation as an SLI. Cluster operators can use this to publish SLOs around pod initialization to their customers who launch workloads on the cluster.
Custom pod controllers/operators can use a dedicated condition indicating completion of pod sandbox creation to make better decisions around how to reconcile a pod failing to become ready. As a specific example, a custom controller for managing pods that refer to PVCs associated with node local storage (e.g. Rook-Ceph) may decide to recreate PVCs (based on a specified PVC template) if the sandbox creation is repeatedly failing to complete. Such a controller can leave PVCs intact and only recreate pods if sandbox creation completes successfully but the pod's containers fail to become ready.
When a pod's sandbox no longer exists, the status of the SandboxReady condition will be set to false. The duration between a pod's DeletionTimestamp and the subsequent lastTransitionTime of the SandboxReady condition (with status set to false) will indicate the latency of pod termination. This can also be surfaced by metrics collection services as an SLI. Note that surfacing any dedicated conditions around termination of the pod sandbox is unnecessary and beyond the scope of this KEP.
Individual container creation (including pulling images from a registry) takes place after the successful completion of pod sandbox creation. Updates to pod container status to report latencies associated with creation of individual containers within a pod are beyond the scope of this KEP.
- Surface a new pod condition SandboxReady to indicate the successful completion of pod sandbox creation by Kubelet
- Describe how the new pod condition can be consumed by external services to determine state and duration of pod sandbox creation
- Modify the meaning of the existing Initialized condition
- Specify metrics collection based on the conditions around pod sandbox creation
- Specify additional conditions (beyond SandboxReady with status set to false) to indicate sandbox teardown
- Surface beginning and completion of creation of individual containers
This KEP proposes enhancements to the Kubelet to report the completion of pod sandbox creation as a pod condition with type: SandboxReady. Metrics collection and monitoring services can use the fields associated with the SandboxReady condition to report sandbox creation state and latency either at a per-pod cardinality or aggregated based on various properties of the pod: number of volumes, storage class of PVCs, runtime class, custom annotations for CNI and IPAM plugins, arbitrary labels and annotations on pods, etc. Certain pod controllers can use the pod sandbox conditions to determine an optimal reconciliation strategy for pods and associated resources (like PVCs).
Surfacing the completion of pod sandbox creation as a pod condition in pod status can be consumed in different ways:
A cluster operator may already depend on a service like Kube State Metrics for monitoring the state of their Kubernetes clusters. The cluster operator may want such a service to surface pod sandbox creation state and latency at a granular level for each pod (due to the ambiguity around the Initialized state as described earlier). For this story, we are assuming the service has been enhanced to [1] consume the new SandboxReady pod condition as described in this KEP and [2] implement informers and state to distinguish between the first time a pod sandbox becomes ready and a subsequent instance of the sandbox becoming ready (after sandbox destruction) over the lifetime of a pod.
The operator can use PromQL queries to aggregate and analyze data (around pod sandbox creation) based on custom pod labels and annotations (already surfaced by a service like Kube State Metrics) indicating specific workload types across different namespaces. For example, annotations and labels could be used to differentiate pod sandbox creation state and latencies for "sensitive database" workloads, "sensitive analysis" workloads and "untrusted build" workloads each of which maps to pods mounting PVCs from different storage classes (depending on the level of encryption desired), using a specific runtime class (depending on the level of isolation desired - microvm vs runc based) and specific IPAM characteristics around reachability of the pods. Access to the pod labels and annotations along with the sandbox latency data at a per-pod cardinality is essential to enable the aggregation based on factors that have special/custom meaning for the operator's cluster and tenants. The values associated with such labels and annotations may not map to distinct namespaces, existing pod fields or other API object fields in a Kubernetes cluster.
Depending on the metrics and monitoring pipeline, as the cluster scales up, cardinality of data at a per-pod level (surfaced from a service like Kube State Metrics) may lead to excessive load on the monitoring backend like Prometheus. At such a point, the cluster operator may decide to create and deploy their own custom monitoring service that uses a pod informer and aggregates (based on custom pod labels and annotations) state and latency of pod sandbox creation into a histogram which is ultimately reported to Prometheus. As with the previous approach, access to the pod labels and annotations and the sandbox latency data at a per-pod cardinality is essential to enable aggregation based on factors that have special/custom meaning for the operator's cluster and tenants and may not map to distinct namespaces, pod fields or other API object fields in the cluster.
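As an illustration of this approach, the following is a minimal sketch (in Go) of such a custom monitoring service. It assumes the SandboxReady condition proposed in this KEP is available; the metric name pod_sandbox_creation_seconds, the workload-type label, the bucket layout and the de-duplication strategy are all illustrative choices, not part of the proposal.

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// Histogram of PodScheduled -> SandboxReady latency, partitioned by a custom
// pod label that carries operator-specific workload semantics.
var sandboxLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "pod_sandbox_creation_seconds", // illustrative metric name
		Help:    "Latency from PodScheduled to SandboxReady per workload type.",
		Buckets: prometheus.ExponentialBuckets(0.5, 2, 10),
	},
	[]string{"workload_type"},
)

// conditionTime returns the lastTransitionTime of the condition of the given
// type if it currently has the given status.
func conditionTime(pod *v1.Pod, t v1.PodConditionType, s v1.ConditionStatus) (time.Time, bool) {
	for _, c := range pod.Status.Conditions {
		if c.Type == t && c.Status == s {
			return c.LastTransitionTime.Time, true
		}
	}
	return time.Time{}, false
}

func main() {
	prometheus.MustRegister(sandboxLatency)
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	// Record each pod's initial sandbox creation latency exactly once.
	observed := map[types.UID]bool{}
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			pod := newObj.(*v1.Pod)
			if observed[pod.UID] {
				return
			}
			scheduled, okScheduled := conditionTime(pod, v1.PodScheduled, v1.ConditionTrue)
			// "SandboxReady" is the condition type proposed in this KEP.
			ready, okReady := conditionTime(pod, v1.PodConditionType("SandboxReady"), v1.ConditionTrue)
			if !okScheduled || !okReady {
				return
			}
			observed[pod.UID] = true
			sandboxLatency.WithLabelValues(pod.Labels["workload-type"]).
				Observe(ready.Sub(scheduled).Seconds())
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // a real service would also serve /metrics via promhttp
}

A production service would additionally handle pod deletions (to bound the de-duplication map) and expose the metrics registry over HTTP; the sketch only shows the aggregation path.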
The data from the above monitoring services can be used as SLIs with associated SLOs configured around sandbox creation state and latency (besides other metrics like scheduling latency) for each specific workload type depending on specific user requirements such as: desired encryption of persistent data (if any), runtime isolation and network reachability (governed by different IPAM plugins).
A controller managing a set of pods along with associated resources like
networking configuration, storage or arbitrary dynamic resources (in the future)
can evaluate the SandboxReady
condition to optimize the set of actions it
executes when bringing up pods and encountering failures. Depending on whether
the pod sandbox is ready, the controller may decide to destroy and re-create the
associated resources that are required for the sandbox creation to complete or
simply try to re-create the pod while keeping the resources intact.
A specific example of the above would be a controller for stateful application
pods that mount PVCs that bind to node local PVs. Let's assume the stateful
application has built-in data replication capabilities and the controller
supports PVC templates to dynamically generate PVCs. When trying to bring up
fresh pods (after earlier pods got terminated), there could be a problem with
the CSI plugin that mounts the node local PV into the pod. In such a situation,
the sandbox creation will not complete. Based on the SandboxReady condition, the controller may decide to create a fresh PVC. If sandbox creation does complete successfully but the pod fails to enter a Ready state, the controller will retain the PVC (to avoid unnecessary re-replication of data) and only try to recreate the pod. Having access to pod sandbox conditions allows the controller to optimize its reconciliation strategy and realize the desired state more efficiently.
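A rough sketch of the decision logic such a controller might implement follows. The StatefulPodController type and its recreatePVC/recreatePod helpers are hypothetical stand-ins for the controller's existing machinery; the SandboxReady condition type is the one proposed in this KEP.

package controller

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
)

// StatefulPodController is a hypothetical controller that owns pods and the
// PVCs generated for them from a template.
type StatefulPodController struct {
	// sandboxTimeout is how long the controller waits for the sandbox to
	// become ready before assuming a node-local resource is at fault.
	sandboxTimeout time.Duration
}

// reconcileFailedPod decides how to recover a pod that has not become Ready.
func (c *StatefulPodController) reconcileFailedPod(ctx context.Context, pod *v1.Pod) error {
	if !podSandboxReady(pod) && time.Since(pod.CreationTimestamp.Time) > c.sandboxTimeout {
		// The sandbox never became ready: sandbox-level dependencies (such as
		// the node-local PVC/PV backing the pod) are suspect, so rebuild the
		// PVC from the template before recreating the pod.
		if err := c.recreatePVC(ctx, pod); err != nil {
			return err
		}
	}
	// The sandbox is (or was) ready but containers are failing: keep the PVC
	// and its data intact and only recreate the pod.
	return c.recreatePod(ctx, pod)
}

// podSandboxReady reports whether the proposed SandboxReady condition is True.
func podSandboxReady(pod *v1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodConditionType("SandboxReady") {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}

// recreatePVC and recreatePod are placeholders for the controller's existing
// delete/create logic against the API server.
func (c *StatefulPodController) recreatePVC(ctx context.Context, pod *v1.Pod) error { return nil }
func (c *StatefulPodController) recreatePod(ctx context.Context, pod *v1.Pod) error { return nil }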
In each of the scenarios below, nearly identical SandboxReady conditions that would result from different scenarios/problems are grouped together. The unique scenarios are detailed after describing the values associated with the fields of the SandboxReady condition. To make each scenario concrete, a specific set of timestamps in the future is chosen. The PodScheduled condition is mentioned in the stories but conditions after pod sandbox creation (e.g. Initialized and Ready) are skipped. A service monitoring latency of initial pod sandbox creation is assumed to implement a pod informer and appropriate state to distinguish between the first time a pod sandbox becomes ready versus a subsequent instance of readiness over the lifetime of the pod.
A user launches a simple, stateless runc based pod with no init containers in a healthy cluster. The pod gets successfully scheduled at 2022-12-06T15:33:46Z and pod sandbox is ready after three seconds at 2022-12-06T15:33:49Z.
The pod will report the following conditions in pod status at 2022-12-06T15:33:47Z (right after Kubelet worker starts processing the pod):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:47Z"
status: "False"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:46Z"
status: "True"
type: PodScheduled
The pod will report the following conditions in pod status at 2022-12-06T15:33:50Z (after pod sandbox creation is complete):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:49Z"
status: "True"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:46Z"
status: "True"
type: PodScheduled
A service monitoring latency of initial pod sandbox creation will record a latency of three seconds in this scenario, based on the delta between the lastTransitionTime timestamps associated with the SandboxReady and PodScheduled conditions.
In each of the scenarios under this section, problems or delays with infrastructural plugins like CSI/CNI/CRI result in a ten second delay for pod sandbox creation to complete. In each scenario, the pod gets successfully scheduled at 2022-12-06T15:33:46Z, pod sandbox is ready after ten seconds at 2022-12-06T15:33:56Z.
For each scenario below, the pod will report the following conditions in pod status at 2022-12-06T15:33:47Z (right after the Kubelet worker starts processing the pod and pod sandbox creation has started but is not yet complete):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:47Z"
status: "False"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:46Z"
status: "True"
type: PodScheduled
For each scenario, the pod will report the following conditions in pod status at 2022-12-06T15:34:00Z (after pod sandbox is ready after ten seconds):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:56Z"
status: "True"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:46Z"
status: "True"
type: PodScheduled
A service monitoring duration of pod sandbox creation will record a latency of ten seconds in these scenarios based on the delta between the lastTransitionTime timestamps associated with the SandboxReady and PodScheduled conditions with status set to true. For each observation associated with a scenario below, the monitoring service also associates labels with the metric indicating the RuntimeClass of the pod and the StorageClass of PVCs referred to by the pod. This enables further grouping of the data during analysis.
A cluster-wide SLO around initial pod sandbox creation latencies configured with a threshold of less than ten seconds will record a breach in these scenarios. Further analysis of the metrics based on labels indicating RuntimeClass of the pods and StorageClass of PVCs referred by the pod will enable the cluster administrators to isolate the cause of the breaches to specific infrastructure plugins as detailed below.
A stateful pod refers to a PVC bound to a PV backed by a CSI plugin. After the pod is scheduled on a node, the CSI plugin runs into problems in the storage control plane when trying to attach the PV to the node. This results in several retries that ultimately succeed after nine seconds.
A pod is scheduled on a node in an experimental pre-production cluster where the operator has configured a new CNI plugin using a centralized IP allocation mechanism. Due to a spike of load in the IP allocation service, the CNI plugin times out several times but ultimately succeeds in getting an IP address and configuring the pod network after nine seconds.
A pod configured with a special microvm based runtime class is scheduled on a node. The runtimeclass handler encounters crashes in the guest kernel multiple times but ultimately initializes the virtual machine based sandbox environment successfully after nine seconds.
In each of the scenarios under this section, problems or delays with infrastructural plugins like CSI/CNI/CRI result in pod sandbox creation never completing. In each scenario, the pod gets successfully scheduled at 2022-12-06T15:33:46Z, but pod sandbox creation runs into problems that do not eventually resolve and results in repeated failures as kubelet tries to start the pod.
For each scenario below, the pod will report the following conditions in pod status at all times after 2022-12-06T15:33:47Z (after pod sandbox creation started until the pod is deleted manually or by a controller):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:47Z"
reason: PodSandboxCreationInProgress
status: "False"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:46Z"
status: "True"
type: PodScheduled
A service monitoring state of pod sandbox creation will record a metric indicating failure to create pod sandbox beyond a configured duration.
A cluster-wide SLO around success rate of pod sandbox creation may record a breach due to the pod sandbox creation failures. Further analysis of the metrics aggregated based on labels (associated with the metrics) indicating RuntimeClass of the pods and StorageClass of PVCs referred by the pod will enable the cluster administrators to associate the failures to specific infrastructure plugins as detailed below.
A Stateful pod refers to a PVC bound to a PV backed by a CSI plugin. After the pod is scheduled on a node, the CSI plugin runs into problems in the storage control plane when trying to attach the PV to the node. The failure to attach never resolves thus blocking pod sandbox creation.
A pod is scheduled on a node in an experimental pre-production cluster where the operator has configured a new CNI plugin using a centralized IP allocation mechanism. Due to problems in the IP allocation service, the CNI plugin fails to get an IP address and is unable to configure the pod network. This blocks pod sandbox creation.
A pod configured with a special microvm based runtime class is scheduled on a node. The runtimeclass handler encounters crashes in the guest kernel repeatedly and is unable to initialize the virtual machine based sandbox environment.
In each of the scenarios under this section, a pod sandbox is successfully created but eventually gets destroyed due to problems in the host or the sandbox environment. As a result, the pod sandbox has to be re-created by Kubelet. In each scenario, the pod is successfully scheduled at 2022-12-06T15:33:46Z and the pod sandbox is ready after five seconds. The sandbox is destroyed after two hours. Re-creation of the sandbox runs into problems but eventually succeeds after nine seconds.
The pod will report the following conditions in pod status at 2022-12-06T15:34:00Z (a few seconds after the initial pod sandbox is ready):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:52Z"
status: "True"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:46Z"
status: "True"
type: PodScheduled
The pod will report the following conditions in pod status at 2022-12-06T17:33:46Z (right after pod sandbox is destroyed):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T17:33:46Z"
status: "False"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:46Z"
status: "True"
type: PodScheduled
The pod will report the following conditions in pod status at 2022-12-06T17:34:00Z (a few seconds after the new pod sandbox is ready):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T17:33:52Z"
status: "True"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:46Z"
status: "True"
type: PodScheduled
A service monitoring restarts associated with successfully created pod sandboxes will record a restart in these scenarios. A service measuring initial pod sandbox creation latency will need to implement logic (for example, using pod informers and state) to differentiate the initial pod sandbox creation from later pod sandbox creations resulting from node crashes/reboots or sandbox crashes.
A regular runc based pod is scheduled on a node whose kernel crashes after two hours of the pod sandbox getting created successfully. The node restarts quickly (resulting in no pod evictions) and kubelet has to re-create the pod sandbox.
A pod is configured with a microvm based runtime handler. The virtual machine sandbox is created successfully but suffers a crash due to problems with the guest kernel after two hours of the pod creation. As a result, kubelet has to re-create the pod sandbox.
A user launches a pod that runs successfully but is eventually deleted by a controller after several hours. The pod was scheduled at 2022-12-06T12:33:46Z and the sandbox became ready at 2022-12-06T12:33:48Z. The delete request is invoked at 2022-12-06T15:33:47Z and the pod is terminated by Kubelet at 2022-12-06T15:33:49Z.
The pod will report the following conditions in pod status at 2022-12-06T15:33:46Z (right before the pod delete request is invoked):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T12:33:48Z"
status: "True"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T12:33:46Z"
status: "True"
type: PodScheduled
The pod will report the following conditions in pod status at 2022-12-06T15:33:49Z (right after the pod termination has been processed by Kubelet but the pod is yet to be completely deleted from API server):
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2022-12-06T15:33:49Z"
status: "False"
type: SandboxReady
- lastProbeTime: null
lastTransitionTime: "2022-12-06T12:33:46Z"
status: "True"
type: PodScheduled
A monitoring service measuring duration of initial sandbox creation of a pod should differentiate between the initial and subsequent sandbox creations (if any, due to node crash/sandbox crash) and track them separately. This can be achieved using a pod informer whose event handler stores (in a persistent store or as custom annotations on the pod) the lastTransitionTime field of the SandboxReady condition observed when it had status = true for the first time. Later, if the pod sandbox is recreated, the lastTransitionTime for the pod sandbox creation conditions can be differentiated from the data associated with the initial sandbox creation based on whether the initial data exists (either in the persistent store or pod annotations).
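A minimal sketch of the annotation-based variant follows, assuming the SandboxReady condition proposed here; the annotation key example.com/initial-sandbox-ready and the helper name are illustrative only.

package monitor

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// recordInitialSandboxReady persists the first observed SandboxReady=True
// transition time as a pod annotation. Later SandboxReady transitions (after
// the sandbox is destroyed and re-created) can then be told apart from the
// initial creation by checking whether the annotation already exists.
func recordInitialSandboxReady(ctx context.Context, client kubernetes.Interface, pod *v1.Pod) error {
	const key = "example.com/initial-sandbox-ready" // illustrative key
	if _, recorded := pod.Annotations[key]; recorded {
		return nil // initial readiness already captured; this is a re-creation
	}
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodConditionType("SandboxReady") && c.Status == v1.ConditionTrue {
			patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"%s"}}}`,
				key, c.LastTransitionTime.UTC().Format(time.RFC3339)))
			_, err := client.CoreV1().Pods(pod.Namespace).Patch(
				ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{})
			return err
		}
	}
	return nil
}

The same check works if the first observed transition time is kept in a persistent store instead of a pod annotation.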
Measuring the duration of sandbox creation accurately beyond the initial sandbox creation is not possible with the SandboxReady condition alone. This is similar to other ready conditions like ContainersReady and the overall pod Ready condition, which get updated after containers are restarted without a specific marker of when the process of restarting the containers or bringing the pod back into a ready state began following an event like a node crash.
The main risk associated with SandboxReady is any potential confusion with the existing Initialized condition. Both the existing Initialized condition and the new pod sandbox condition refer to distinct stages in a pod's overall initialization. Documentation will help mitigate this risk.
The Kubelet will set a new condition on a pod, SandboxReady, to surface the successful completion of sandbox creation for a pod. A new PodConditionType corresponding to SandboxReady will be added in api/core/v1/types.go. No changes are required in the Pod Status API for this enhancement.
Today, syncPod()
in Kubelet is invoked with the kubecontainer.PodStatus
(distinct from the v1.PodStatus
API) associated with a given pod.
podSandboxChanged()
in kubeGenericRuntimeManager
evaluates the
SandboxStatuses
field in PodStatus
to determine whether a new pod sandbox
will need to be created for a pod. The same logic will be used to determine whether a sandbox is ready for a pod in the Kubelet status manager.
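For illustration, a helper along these lines could derive sandbox readiness from the same CRI-level data that podSandboxChanged() already inspects. The helper itself is hypothetical; the field and type names follow the current kubecontainer and cri-api packages.

package status

import (
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
	kubecontainer "k8s.io/kubernetes/pkg/kubelet/container"
)

// podSandboxReady reports whether the pod currently has a ready sandbox,
// based on the cached runtime status for the pod.
func podSandboxReady(podStatus *kubecontainer.PodStatus) bool {
	if len(podStatus.SandboxStatuses) == 0 {
		return false
	}
	// SandboxStatuses is ordered newest-first by the runtime manager; the
	// latest sandbox must be in the READY state.
	return podStatus.SandboxStatuses[0].State == runtimeapi.PodSandboxState_SANDBOX_READY
}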
Kubelet will initially generate the SandboxReady condition as part of existing calls to generateAPIPodStatus() early during syncPod(). The status field will be set to true if a sandbox is ready (determined by invoking podSandboxChanged() as described above). The status field will be set to false if a sandbox is found to be not ready.
When Kubelet starts creating a sandbox, it will set a temporary PodSandboxCreationStarted annotation in the pod cache. The reason field for the SandboxReady condition will be set to PodSandboxCreationInProgress if the PodSandboxCreationStarted annotation exists. The annotation will be cleared (in the pod cache) when sandbox creation is complete and the status field of SandboxReady is set to true. Note that this annotation will not be persisted in the API server.
Kubelet will generate the SandboxReady condition for the final time (in the life of a pod) as part of existing calls to generateAPIPodStatus() early during syncTerminatedPod(). Prior invocations of killPod() (as part of syncTerminatingPod) will result in the absence of a sandbox corresponding to the pod. As a result, the status field of the SandboxReady condition will be set to false (determined by invoking podSandboxChanged() as described above).
During periods of API server or etcd unavailability combined with a Kubelet restart/crash (covered in more detail below), the lastTransitionTime field of the SandboxReady condition that ultimately gets persisted (once the Kubelet restarts and the API server becomes available again) will be as close as possible to the time of the actual change in the condition that could not be persisted.
Changes to the status field will result in the lastTransitionTime field getting updated (by the Kubelet Status Manager).
Today, the Kubelet Status Manager surfaces APIs for other Kubelet components to issue pod status updates. It caches the pod status and issues patches to the API server when necessary. This infrastructure will be used for managing the new pod conditions as well.
The Kubelet Status Manager will surface a new GenerateSandboxReadyCondition API. This will be invoked by Kubelet's generateAPIPodStatus() to populate the pod status that is passed to setPodStatus. This is similar to the existing pod condition generator functions: GeneratePodReadyCondition and GeneratePodInitializedCondition. If updates through generateAPIPodStatus() are found to be inaccurate (for example if the Kubelet is very busy), an invocation of GenerateSandboxReadyCondition could also be added right after createSandbox in kubeGenericRuntimeManager returns successfully.
updateStatusInternal() in the Kubelet Status Manager will be enhanced to invoke updateLastTransitionTime for the new SandboxReady condition when changes in the status of the condition are detected.
If pod sandbox creation completed successfully on a node but API server became
unavailable, the Kubelet status manager will retry issuing the patches to the
API server. However, the Kubelet may get restarted (or crash) while the API
server is unavailable with the pod status updates not yet persisted. In such a
situation (expected to be quite rare), the timestamp associated with the
lastTransitionTime
field in the new conditions will not be accurate due to
inability to persist or cache them. The lastTransitionTime
field will get
updated on subsequent generateAPIPodStatus()
calls based on the state of the
CRI sandbox and the corresponding timestamps will be persisted. This aligns with
handling of other Kubelet managed conditions (ContainersReady, (Pod) Ready) when
API server is unavailable and Kubelet restarts resulting in the status manager
cache getting dropped.
E2E tests will be introduced to cover the user scenarios mentioned above. Tests will involve launching pods with the characteristics mentioned below and verifying that the pod status has the new SandboxReady condition with the status and reason fields populated with expected values (a minimal polling helper for this check is sketched after the list):
- A basic pod that launches successfully without any problems.
- A pod with references to a configmap (as a volume) that has not been created causing the pod sandbox creation to not complete until the configmap is created later.
- A pod whose node is rebooted leading to the sandbox being recreated.
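The following is a minimal sketch of the polling check such tests could share; the helper name, interval and timeout are illustrative, and the condition type is the SandboxReady condition proposed in this KEP.

package e2e

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForSandboxReadyStatus polls a pod until its SandboxReady condition
// reports the expected status (or the timeout expires).
func waitForSandboxReadyStatus(ctx context.Context, client kubernetes.Interface,
	namespace, name string, expected v1.ConditionStatus) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			for _, c := range pod.Status.Conditions {
				if c.Type == v1.PodConditionType("SandboxReady") {
					return c.Status == expected, nil
				}
			}
			return false, nil
		})
}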
Tests for pod conditions in the GracefulNodeShutdown e2e_node test will be enhanced to check that the status of the new pod sandbox condition is false after graceful termination of a pod.
The conformance test "Pods, completes the lifecycle of a Pod and the PodStatus", which tests updates of pod conditions, will be enhanced to cover resetting the new pod sandbox condition.
- Kubelet will report pod sandbox conditions if the feature flag SandboxReadyCondition is enabled.
- Initial e2e tests completed and enabled.
- Gather feedback from cluster operators and developers of services or controllers that consume these conditions.
- Implement suggestions from feedback as feasible.
- Feature Flag removed.
- Add more test cases and link to this KEP.
- All tests are passing with no known flakiness.
- All feedback addressed around the new pod sandbox conditions.
- No open decision items around the new pod sandbox conditions.
The new condition will be managed by the Kubelet. When upgrading a node to a version of the Kubelet that can set the new condition, new pods launched on that node will surface the new condition. If the Kubelet on the node is later downgraded, evicted pods that have not yet been deleted may remain on the node; such pods will continue to report the new condition (already persisted in their status) even though the downgraded Kubelet does not support it.
The new condition will be managed by the Kubelet. Since the control plane components are not involved, handling of version skew is not applicable.
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: SandboxReadyCondition
  - Components depending on the feature gate: Kubelet
No changes to any default behavior should result from enabling the feature.
Yes, the feature can be disabled once it has been enabled. However, the new pod sandbox condition will remain persisted in pod status and will continue to be reported after the feature is disabled, until those pods are deleted.
New pods created since re-enablement will report the new pod sandbox condition.
No
Skipping this section at the Alpha stage and will populate at Beta.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
Skipping this section at the Alpha stage and will populate at Beta.
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
No, this feature does not have any dependencies. Other metric oriented services in the cluster may depend on this.
Yes, the new pod condition will result in the Kubelet Status Manager making additional PATCH calls on the pod status fields.
The Kubelet Status Manager already has infrastructure to cache pod status updates (including pod conditions) and issue the PATCH calls in a batch.
No
No
Slight increase (a few bytes) of the Pod API object due to persistence of the additional condition in the pod status.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No
If etcd/API server is unavailable, pod status cannot be updated. So the
SandboxReady
condition associated with pod status cannot be updated either.
The pod status manager already retries the API server requests later (based on
data cached in the Kubelet) and that should help.
If pod sandbox creation completes for a pod on a node but API server becomes
unavailable (before the sandbox creation condition can be patched) and Kubelet
crashes or restarts (shortly after API server becoming and staying unavailable),
the lastTransitionTime
field may be inaccurate. This is described in the
section above.
None so far
SLOs are not applicable to pod status fields. Overall Kubernetes node level SLOs may leverage this feature.
The main drawback associated with the new pod sandbox condition involves a slight potential increase in calls to the API Server from the Kubelet to patch status = true for the new SandboxReady condition in a pod's status. Typically, this would involve extra patch calls for pod status in the lifetime of most pods (if the status manager does not batch them with other pod status updates): one when pod sandbox creation completes and another when the pod is terminated. However, there could be a higher number of patch calls to the API Server if the pod sandbox environment (like a microvm) starts successfully and then crashes in a restart loop.
Caching of updates to pod status by the pod status manager and batching pod status updates (which is already in place) can help mitigate frequent patch calls to API server.
Timestamps around completion of pod sandbox creation may be surfaced as a dedicated field in the pod status rather than a pod condition. However, since the successful creation of the pod sandbox is essentially a "milestone" in the life of a pod (similar to Scheduled, Ready, etc.), pod conditions are the ideal place to surface it and align well with existing conditions like ContainersReady and the overall Ready condition.
A dedicated annotation on the pod for surfacing this data is another potential approach. However, usage of annotations for Kubelet managed data is typically discouraged.
Surfacing the amount of time it took to successfully create a pod sandbox is an alternative to surfacing the condition around completion of pod sandbox creation (whose delta from the PodScheduled condition reflects the latency). The latency data would surface the same information from a pod initialization SLI perspective as mentioned in the Motivation section. Implementing this approach would require an API change on the pod status to surface the latency data (as this no longer fits the structure of a pod condition). This data also cannot be consumed by other controllers as described in the User Stories section.
The duration it took a pod sandbox to become ready can be directly reported as a Prometheus metric aggregated in a histogram. However, aggregating the data at the Kubelet level prevents a metric collection service from classifying the data based on interesting fields on a pod (runtime class, storage class of PVCs, number of PVCs, etc.) or using custom labels and annotations on pods that indicate workload characteristics (which the cluster operator may wish to use as a basis for aggregating the metrics).
This also prevents other controllers from acting on the sandbox status as described in the User Stories section.
The Kubelet is being instrumented to emit traces based on OpenTelemetry around sandbox creation stages (as well as several other parts of the pod lifecycle).
To implement the pod sandbox creation latency SLI/SLO use cases, the tracing infrastructure needs to be able to:
- Collect all traces around CRI sandbox creation for all pods with no sampling.
- Look up pod fields from the API server (associated with a pod's trace), like labels, annotations, storage classes of PVCs referred to by the pod, runtime class, etc., that are of interest to cluster operators and their users for classifying and aggregating the metrics.
- Look up a pod's PodScheduled condition fields to determine the beginning of pod sandbox creation.
Since the lookup of the pod fields and existing conditions is necessary for SLIs around pod sandbox creation latency, surfacing the SandboxReady condition in pod status will allow a metric collection service to directly access the relevant data without requiring the ability to collect and parse OpenTelemetry traces. As mentioned in the User Stories, popular community managed services like Kube State Metrics can consume the SandboxReady condition with a trivial set of changes. Enhancing them to collect and parse OpenTelemetry traces with no sampling, and mapping that data to the associated data from the API server, will be complex from an engineering and operational perspective.
For controllers using the pod sandbox conditions to determine reconciliation strategy, access to the pod is typically necessary while collecting and parsing traces would be unusual.
Have CSI/CNI/CRI plugins mark their start and completion timestamps while setting up their respective portions for a pod
Each infrastructural plugin that Kubelet calls out to (in the process of setting up a pod sandbox) can mark start and completion timestamps on the pod as conditions. This approach would be similar to how readiness gates work today. However, CSI and CRI plugins will need to be enlightened about fields in a pod (like status conditions) and set up a client to the API server (to update the conditions), which they may not implement in order to stay orchestrator agnostic.
An on-host binary that runs as a service and proxies CRI API calls between the CRI runtime and Kubelet can intercept the successful creation of a pod sandbox in response to CRI RunPodSandbox. Next, using an API server client, the binary can mark extended conditions on a pod to indicate the state of sandbox creation. While this approach works without requiring any additional changes to Kubelet, it has a couple of disadvantages. First, this approach requires configuration and management of a separate proxy binary between Kubelet and the CRI runtime on the cluster nodes. Second, the proxy binary would need to replicate the logic in the Kubelet status manager to efficiently interact with the API server (as well as cache the status and retry in case of API server outages) regarding updates to pod sandbox status. Therefore, isolating the logic around pod sandbox conditions to a separate binary intercepting API calls between the kubelet and the CRI runtime is not preferred.
Instead of a "native" condition as proposed in this KEP, an "extended" condition
maybe used by Kubelet to mark the SandboxReady condition. Such a condition may
look like: kubernetes.io/pod-sandbox-ready
. However, internal/core Kubernetes
components (like Kubelet) do not use "extended" conditions today. So this
approach may be unusual.