KEP-3386: Add Beta (enabled by default) criteria for Evented PLEG #3900

Merged 1 commit on Mar 10, 2023
keps/sig-node/3386-kubelet-evented-pleg/README.md: 49 additions, 0 deletions

@@ -16,6 +16,7 @@
- [Timestamp of the Pod Status](#timestamp-of-the-pod-status)
- [Runtime Service Changes](#runtime-service-changes)
- [Pod Status Update in the Cache](#pod-status-update-in-the-cache)
- [Compatibility Check](#compatibility-check)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
@@ -24,6 +25,11 @@
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Beta (enabled by default)](#beta-enabled-by-default)
- [Stress Test](#stress-test)
- [Recovery Test](#recovery-test)
- [Retries with Backoff Logic](#retries-with-backoff-logic)
- [Generic PLEG Continuous Validation](#generic-pleg-continuous-validation)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -272,6 +278,10 @@ func (c *cache) Set(id types.UID, status *PodStatus, err error, timestamp time.Time)

This has no impact on the existing `Generic PLEG` when used without `Evented PLEG`, because it is the only entity that sets the cache and it does so every second (if needed) for a given pod.
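
A simplified, self-contained sketch of a timestamp-guarded cache write illustrating this behaviour (type names such as `UID`, `PodStatus`, and `data` are minimal stand-ins for the kubelet's own types, not the exact implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Minimal stand-ins for the kubelet types involved; illustrative only.
type UID string
type PodStatus struct{ Phase string }

type data struct {
	status   *PodStatus
	modified time.Time
}

type cache struct {
	lock sync.Mutex
	pods map[UID]*data
}

// Set drops an update if the cache already holds a newer status for the pod.
// Generic PLEG is unaffected because, with Evented PLEG disabled, it is the
// only writer and relists (re-setting the cache) roughly every second.
func (c *cache) Set(id UID, status *PodStatus, timestamp time.Time) (updated bool) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if cached, ok := c.pods[id]; ok && cached.modified.After(timestamp) {
		return false
	}
	c.pods[id] = &data{status: status, modified: timestamp}
	return true
}

func main() {
	c := &cache{pods: map[UID]*data{}}
	now := time.Now()
	fmt.Println(c.Set("pod-1", &PodStatus{Phase: "Running"}, now))                   // true
	fmt.Println(c.Set("pod-1", &PodStatus{Phase: "Pending"}, now.Add(-time.Second))) // false: stale update
}
```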

### Compatibility Check

For this feature to work, the Kubelet needs to be used with a compatible CRI Runtime that is capable of generating CRI Events. During Kubelet startup, if it detects that the CRI Runtime doesn't support generating and streaming CRI Events, it should automatically fall back to using `Generic PLEG`.
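
A minimal sketch of that startup decision (the boolean probe is a stand-in for however the Kubelet actually detects runtime support, for example by failing to open the CRI event stream):

```go
package main

import "fmt"

// startPLEG sketches the startup decision: use Evented PLEG only when the
// feature is enabled and the runtime can generate and stream CRI Events;
// otherwise fall back to the relist-based Generic PLEG.
func startPLEG(eventedPLEGEnabled, runtimeStreamsCRIEvents bool) {
	if eventedPLEGEnabled && runtimeStreamsCRIEvents {
		fmt.Println("starting Evented PLEG (CRI event stream available)")
		return
	}
	fmt.Println("falling back to Generic PLEG (periodic relisting)")
}

func main() {
	// Example: feature enabled, but the runtime cannot stream CRI Events.
	startPLEG(true, false)
}
```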


### Test Plan

@@ -348,6 +358,45 @@ We expect no non-infra related flakes in the last month as a GA graduation criteria.
- Add E2E Node Conformance presubmit job in CI
- Add E2E Node Conformance periodic job in CI

#### Beta (enabled by default)
##### Stress Test
To test the performance and scalability of Evented PLEG, it is necessary to generate a large number of CRI Events by creating and deleting a significant number of containers within a short period of time. The stress test is outlined below:

Since this is a disruptive stress test, it should be part of a node e2e `Serial` job. CRI Events are generated per container, and therefore the test should create a substantial number of containers within a single pod. After creation, these containers should run to completion and then be removed by the kubelet. This process will ensure the generation of `CONTAINER_CREATED_EVENT`, `CONTAINER_STARTED_EVENT`, `CONTAINER_STOPPED_EVENT`, and `CONTAINER_DELETED_EVENT`.

The test should continue to create these containers until the histogram metric `evented_pleg_connection_latency_seconds` begins to show distinct latency values in its 1-second bucket. This indicates that it is taking 1 second or longer for an event to be observed by the kubelet after being generated by the runtime. Typical values for this latency are around 0.001 seconds, so a sustained latency of 1 second is a reasonable indication that the system is under stress.

Once `evented_pleg_connection_latency_seconds` is observed to be greater than 1 second, new container creation is halted and the remaining containers are run to completion. At this point, `kubelet_evented_pleg_connection_latency_seconds_count` can be used to determine the total number of CRI Events generated during the test.
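
A sketch of how such a pod could be constructed (the pod name, `busybox` image, and container command are illustrative; the real e2e test may build its pods differently):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stressPod builds a pod with n short-lived containers; each container
// produces CONTAINER_CREATED/STARTED/STOPPED (and eventually DELETED) events.
func stressPod(n int) *v1.Pod {
	containers := make([]v1.Container, 0, n)
	for i := 0; i < n; i++ {
		containers = append(containers, v1.Container{
			Name:    fmt.Sprintf("stress-%d", i),
			Image:   "busybox",
			Command: []string{"sh", "-c", "exit 0"}, // run to completion immediately
		})
	}
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "evented-pleg-stress"},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers:    containers,
		},
	}
}

func main() {
	pod := stressPod(50)
	fmt.Printf("pod %q with %d containers\n", pod.Name, len(pod.Spec.Containers))
}
```

Pods like this would be created repeatedly while watching the 1-second bucket of the `evented_pleg_connection_latency_seconds` histogram.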

##### Recovery Test
To test the ability of the Kubelet to recover the latest state of a container after a restart, a disruption test should be included in the node e2e `Serial` job. The test should involve creating a container with a sufficiently long time to completion (e.g. `sleep 20`), and then immediately stopping the Kubelet once the container enters the `Running` state. The CRI runtime should emit CRI Events indicating the change in container state, but the Kubelet will miss the `CONTAINER_STOPPED_EVENT` for that container.

To validate the Kubelet's ability to recover the latest state of the container, the test should query the CRI endpoint to confirm that the container has run to completion successfully. Once the Kubelet is started again, it should query the CRI runtime and update its cache with the latest state of the container. If the Kubelet accurately reports the state of the container as `Completed`, the test is considered to have passed.
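
An illustrative outline of that flow (assuming a systemd-managed kubelet; the actual node e2e framework stops and restarts the kubelet through its own helpers):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runRecoveryTest sketches the disruption flow; a pod whose container runs
// `sleep 20` is assumed to already be in the Running state.
func runRecoveryTest() error {
	// Stop the kubelet so that it misses the CONTAINER_STOPPED_EVENT.
	if err := exec.Command("systemctl", "stop", "kubelet").Run(); err != nil {
		return fmt.Errorf("stopping kubelet: %w", err)
	}

	// Wait for the container to run to completion while the kubelet is down;
	// the CRI endpoint (e.g. via crictl) can confirm it exited successfully.
	time.Sleep(30 * time.Second)

	// Restart the kubelet; on startup it must query the CRI runtime and
	// refresh its cache with the container's latest (exited) state.
	if err := exec.Command("systemctl", "start", "kubelet").Run(); err != nil {
		return fmt.Errorf("starting kubelet: %w", err)
	}

	// The test passes if the kubelet now reports the container as Completed,
	// e.g. checked through the pod status in the API server.
	return nil
}

func main() {
	if err := runRecoveryTest(); err != nil {
		fmt.Println("recovery test failed:", err)
	}
}
```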

##### Retries with Backoff Logic
Currently, the Kubelet attempts to reconnect five times before falling back to `Generic PLEG` when it encounters errors on the streaming connection with the CRI Runtime. However, in situations where the CRI Runtime is taken down for maintenance, the Kubelet may exhaust all of its reconnection attempts and never try again, resulting in the use of `Generic PLEG` despite the CRI Runtime being compatible with `Evented PLEG`. To address this issue, backoff logic with an exponentially increasing delay and an upper limit should be implemented for re-establishing the connection. Once the upper limit is reached, the Kubelet should keep retrying periodically at that interval. This way, the Kubelet will be able to reconnect to the CRI Runtime even after multiple failed attempts, and it will use `Evented PLEG` whenever possible. For example:

```
Retry immediately
Retry after 1 second
Retry after 2 seconds
Retry after 4 seconds
Retry after 8 seconds
Retry after 16 seconds
Retry after 32 seconds
Retry after 64 seconds
Retry after every 60 seconds indefinitely
```
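
A minimal sketch of such a capped exponential backoff loop (the 60-second cap and log message are illustrative; this is not the actual kubelet implementation):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// connectWithBackoff retries establishing the CRI event stream with an
// exponentially increasing delay, capped at maxDelay, and keeps retrying at
// the cap indefinitely instead of giving up after a fixed number of attempts.
func connectWithBackoff(ctx context.Context, connect func() error) {
	const maxDelay = 60 * time.Second
	delay := time.Duration(0) // first attempt is immediate

	for {
		if err := connect(); err == nil {
			return // stream established; Evented PLEG can be used
		}
		switch {
		case delay == 0:
			delay = time.Second
		case delay < maxDelay:
			delay *= 2
			if delay > maxDelay {
				delay = maxDelay
			}
		}
		fmt.Printf("CRI event stream unavailable, retrying in %v\n", delay)
		select {
		case <-ctx.Done():
			return
		case <-time.After(delay):
		}
	}
}

func main() {
	attempts := 0
	connectWithBackoff(context.Background(), func() error {
		attempts++
		if attempts < 3 {
			return fmt.Errorf("runtime not ready")
		}
		return nil
	})
}
```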

##### Generic PLEG Continuous Validation
Make sure existing jobs in the following TestGrid tabs that use `Generic PLEG` continue to use it, by ensuring that `Evented PLEG` is disabled for them (a configuration sketch follows the list):

- https://testgrid.k8s.io/sig-node-release-blocking
- https://testgrid.k8s.io/sig-node-kubelet
- https://testgrid.k8s.io/sig-node-containerd
- https://testgrid.k8s.io/sig-node-cri-o
- https://testgrid.k8s.io/sig-node-presubmits
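
For reference, a sketch of pinning the legacy behaviour by turning the `EventedPLEG` feature gate off in the kubelet configuration (the jobs themselves may set the gate via flags or test-infra configuration instead):

```go
package main

import (
	"fmt"

	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

func main() {
	// Explicitly keep Generic PLEG by disabling the EventedPLEG feature gate.
	cfg := kubeletconfig.KubeletConfiguration{
		FeatureGates: map[string]bool{"EventedPLEG": false},
	}
	fmt.Println("EventedPLEG enabled:", cfg.FeatureGates["EventedPLEG"]) // false
}
```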

### Upgrade / Downgrade Strategy

N/A