KEP-4191: Split Image Filesystem

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • [ ] (R) Production readiness review completed
  • [ ] (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP is about enhancing the kubelet to be aware of a container runtime that splits the image filesystem.
Aware in this case means that garbage collection of images and containers, and reporting of disk usage, all remain functional.

Motivation

kubelet has two distinct filesystems: Node and Image. In typical deployments, both the Node and Image filesystems are on the same disk. There have been requests to split this storage across separate disks, most commonly to separate the writable layer from the read-only layer: kubelet and container data would be stored on one disk while images would have their own disk. This could be beneficial because images occupy a lot of disk space while the writable layer is typically much smaller.

Container I/O can impact the kubelet, and the ability to use additional disks could improve kubelet performance.

However, it is currently not possible to place image layers and container writable layers on different disks.

In the current implementation of separate disks, containers and images must be stored on the same disk, so garbage collection in case of node pressure (really image disk pressure) removes images and containers on the image filesystem.

If one separates the writable layer (containers) from the read-only layer (images), then garbage collection and statistics must account for this separation. Today, a container runtime that configures storage in this way could break the kubelet.

One downside of the separate-disk configuration is that pod data can be written in multiple locations: the writable layer of a container goes on the image filesystem while volume storage goes to the root filesystem. There is another request to make the root filesystem writable and the image filesystem read-only. This means that pod data can be written on one disk while the other disk remains read-only. Separating the writable layer from the read-only layer achieves this.

Goals

  • kubelet should still work if images/containers are separated into different disks
    • Support writable layer being on same disk as kubelet
    • Images can be on the separate filesystem

Possible Extensions in Post Alpha

kubelet, Images and Containers on all separate disks.

This case is possible with this implementation as ContainerFS will be set up to read file statistics from a separate filesystem. However, this is not in scope for Alpha.
If there is interest in this, this KEP could be extended to support this use case. Main areas to add would be testing.

Non-Goals

  • Sharing the same filesystem across multiple nodes
  • Separating kubelet data into different filesystems
  • Multiple image and/or container filesystems
    • This KEP will start support for this but more work needs to be done to investigate CAdvisor/CRIStats/Eviction to support this

Proposal

User Stories

Story 1

As a user, I would like my node configured with a writable filesystem and a read-only filesystem.
The kubelet will write volume data and the container runtime will write writable layers to the writable filesystem, while the container runtime will write images to the read-only filesystem.

User Deployment Options

Separating the filesystems is not a common pattern in most Kubernetes deployments. We will summarize the configurations that are possible today.

Current Deployment Options

Image File system (ImageFs) and NodeFs (kubelet) same

sda0: [writeable layer, emptyDir, logs, read-only layer, ephemeral storage]

This is the default configuration for Kubernetes. If the container runtime is not configured in any special way, then NodeFs and ImageFs are assumed to be the same.

If the node only has a NodeFs filesystem that meets eviction thresholds, the kubelet frees up disk space in the following order:

  • Garbage collect dead pods and containers
  • Delete unused images

The way that pods are ranked for eviction also changes based on the filesystem.

kubelet sorts pods based on their total disk usage (local volumes + logs & writable layer of all containers)

Node Pressure Eviction lists the possible options for how to reclaim resources based on filesystem configuration.

Ephemeral-Storage explains how ephemeral-storage tracking works with different filesystem configurations.

NodeFs and Image Filesystem (ImageFs) separated

sda0: [emptyDir, logs, ephemeral storage] sda1: [writeable layer, read-only layer]

If the node has a dedicated ImageFs filesystem for container runtimes to use, the kubelet does the following:

  • If the node filesystem meets the eviction thresholds, the kubelet garbage collects dead pods and logs
  • If the ImageFs filesystem meets the eviction thresholds, the kubelet deletes all unused images and containers
  • If ImageFs has disk pressure, the node is marked as unhealthy and new pods are not admitted until the image disk pressure is gone

In case of disk pressure on each filesystem, what is garbage collected/stored on the disk?

Node Filesystem:

  • Logs
  • Pods
  • Ephemeral Storage

Image Filesystem:

  • Images
  • Containers

CAdvisor detects the different disks based on mountpoints, so if a user mounts a separate disk at /var/lib/containers, the kubelet will consider the filesystem split.

The writable layer of a container is stored on the image filesystem, while data written to volumes goes to the node filesystem.

Since this split case has two different filesystems that can have disk pressure, pods are ranked differently based on which filesystem is experiencing disk pressure.

Node Pressure:

  • Local volumes + logs of all containers

Image Pressure:

  • Sorts pods based on the writeable layer usage of all containers

New Deployment Options

Node And Writeable Layer on same disk while Images stored on separate disk

sda0: [writable layer, emptyDir, logs, ephemeral storage] sda1: [read-only layer]

A goal is to allow the kubelet to keep the read-only layer on a separate disk while everything else is stored on the same disk as the kubelet.

In case of disk pressure on each filesystem, what is garbage collected/stored on the disk?

Node Filesystem:

  • Pods
  • Logs
  • Containers
  • Ephemeral Storage

Image Filesystem:

  • Images

Node Filesystem should monitor storage for containers in addition to ephemeral storage.

Comment on Future Extensions

We foresee interest in other use cases in the future, so we want to comment on what work would be required to support them.

One extension could be multiple filesystems for images and containers.
The API allows for a list of filesystem usages for images and containers, but no work has been done to support this in the container runtimes or in the kubelet.

CAdvisor and Stats would need to be enhanced to allow for a configurable number of filesystems.
Currently, the eviction manager is hard-coded to a 1-to-1 relationship between a filesystem and an eviction signal.

The following cases could be configured but we are not targeting these at the moment.

a. Node, writeable layer and images on separate filesystems.
b. Node and images on the same filesystem while the writeable layer is on a separate filesystem.

Risks and Mitigations

By splitting the filesystem we allow more configurations than the kubelet currently supports. To avoid bugs, we will validate the cases the kubelet does not currently support and return an error.

The following cases will be validated, and we will return an error if the container runtime is set up this way:

  • More than one filesystem for images and containers

We will check whether the CRI implementation returns more than one filesystem and log a warning.

A major risk of this feature is increased evictions due to the addition of a new filesystem.
The eviction manager monitors the image filesystem, the node filesystem and now the container filesystem for disk pressure.
Disk pressure can be based on inode or storage limits.
Once the disk exceeds the limits set by EvictionSoft or EvictionHard, the node will eventually be marked as having disk pressure.
Garbage collection of containers, images or pods will be kicked off (depending on which filesystem experiences disk pressure).
New workloads will not be accepted by that node until the disk pressure resolves, either through garbage collection removing enough data or through manual intervention.

A mitigation for this is to initially support the case of the writeable layer being on the node filesystem (ContainerFs same as NodeFs), so we are really only monitoring two filesystems for pressure.

Design Details

CRI

We will switch to using ImageFsInfo but this will be guarded by a feature gate.

CRI-O and containerd return a single element in this case, and the kubelet does not assume that there are multiple values in this array. Regardless, we add the field to ImageFsInfoResponse as an array.

// ImageService defines the public APIs for managing images.
service ImageService {
…
    rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse) {}
}

message ImageFsInfoResponse {
    // Information of image filesystem(s).
    repeated FilesystemUsage image_filesystems = 1;
    + // Information of container filesystem(s).
    + // This is an optional field if container and image
    + // storage are separated.
    + // Default will be to return this as empty.
    + repeated FilesystemUsage container_filesystems = 2;
}

The CRI implementation is expected to return a unique identifier for images and containers so the kubelet can ask the CRI whether these objects are stored on separate disks. With a single dedicated disk for the container runtime, image_filesystems and container_filesystems will be set to the same value.

The CRI implementation can set this as needed. The image and container filesystems are both arrays so this provides some extensibility in case these are stored on multiple disks.
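
For illustration, here is a minimal Go sketch (not the final kubelet code) of how the kubelet could decide whether the runtime reports a dedicated container filesystem by comparing the filesystem identifiers in ImageFsInfoResponse; the accessor for the proposed container_filesystems field is assumed to be generated as usual.

package kuberuntime

import (
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// hasDedicatedContainerFs reports whether images and writable layers are stored
// on separate filesystems, based only on the identifiers returned by ImageFsInfo.
func hasDedicatedContainerFs(resp *runtimeapi.ImageFsInfoResponse) bool {
	// GetContainerFilesystems is the accessor for the proposed container_filesystems field.
	// No container filesystem reported: assume containerfs == imagefs.
	if len(resp.GetImageFilesystems()) == 0 || len(resp.GetContainerFilesystems()) == 0 {
		return false
	}
	imageFs := resp.GetImageFilesystems()[0].GetFsId().GetMountpoint()
	containerFs := resp.GetContainerFilesystems()[0].GetFsId().GetMountpoint()
	return imageFs != containerFs
}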

Container runtimes will need to implement ImageFsInfo

An Alpha to Beta graduation goal would be an implementation of crictl imagefsinfo that allows for more detailed reports of the image filesystem info.

See PR for an example.

Stats Summary

Stats Summary has a field called runtime, and we will add a ContainerFs entry to the runtime field.

// RuntimeStats are stats pertaining to the underlying container runtime.
type RuntimeStats struct {
    // Stats about the underlying filesystem where container images are stored.
    // This filesystem could be the same as the primary (root) filesystem.
    // Usage here refers to the total number of bytes occupied by images on the filesystem.
    // +optional
    ImageFs *FsStats `json:"imageFs,omitempty"`
    + // Stats about the underlying filesystem where container's writeable layer is stored.
    + // This filesystem could be the same as the primary (root) filesystem or the ImageFs.
    + // Usage here refers to the total number of bytes occupied by the writeable layer on the filesystem.
    + // +optional
    + ContainerFs *FsStats `json:"containerFs,omitempty"`
}

In this KEP, ContainerFs can either be the same as ImageFs or NodeFs.

We will add a more detailed function for ImageFsStats in the Provider Interface

type containerStatsProvider interface {
    ...
    ImageFsStats(ctx context.Context) (*statsapi.FsStats, *statsapi.FsStats, error)
}

If we have a single image filesystem, then ImageFs includes both the writable and read-only layers. In this case, ImageFsStats will return an identical object for ImageFs and ContainerFs.

If the container runtime does not return a container filesystem, we will assume that image_filesystem = container_filesystem. This allows the kubelet to support container runtimes that have not yet implemented the new ImageFsInfo field.
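
A minimal sketch of that fallback follows, assuming the proposed container_filesystems field and its generated accessor; toFsStats is an illustrative converter, not the kubelet's actual helper, and real code would also fill capacity and available bytes.

package stats

import (
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
	statsapi "k8s.io/kubelet/pkg/apis/stats/v1alpha1"
)

// toFsStats converts the first reported FilesystemUsage into the summary API shape.
func toFsStats(usages []*runtimeapi.FilesystemUsage) *statsapi.FsStats {
	if len(usages) == 0 {
		return nil
	}
	used := usages[0].GetUsedBytes().GetValue()
	inodes := usages[0].GetInodesUsed().GetValue()
	return &statsapi.FsStats{UsedBytes: &used, InodesUsed: &inodes}
}

// splitImageFsStats returns one FsStats for the image filesystem and one for the
// container (writable layer) filesystem, falling back to imagefs when the runtime
// does not report a container filesystem.
func splitImageFsStats(resp *runtimeapi.ImageFsInfoResponse) (imageFs, containerFs *statsapi.FsStats) {
	imageFs = toFsStats(resp.GetImageFilesystems())
	// GetContainerFilesystems is the accessor for the proposed container_filesystems field.
	if len(resp.GetContainerFilesystems()) == 0 {
		return imageFs, imageFs
	}
	return imageFs, toFsStats(resp.GetContainerFilesystems())
}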

Stats Provider

The CRI Stats Provider uses ImageFsInfo to get information about the filesystems, while the CAdvisor Stats Provider uses ImageStats, which lists the images and computes the overall size from that list.

This switch will be guarded by a feature gate.

CAdvisor Stats Provider

CRI-O uses the CAdvisor Stats provider.

CAdvisor has plugins for each container runtime under containers, including CRI-O.

The CRI-O plugin relies on the info and container/{id} endpoints. info is used to get information about the storage filesystem, and container/{id} gets information about the mount points. CRI-O will add a new field, storage_image, to indicate when the filesystem is split.

This is used to gather file stats.

CAdvisor labels CRI-O images as crio-images, and that is assumed to be the mountpoint of the container. When the filesystem is split, this ends up pointing to the writeable layer of the container.

We will propose a new label in CAdvisor: crio-containers will point to the writeable layer and crio-images will point to the read-only layer.

If the filesystem is not split, crio-images will be used for both layers.
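
As a rough sketch of that label selection (crio-containers is the label proposed above, not an existing CAdvisor constant):

package crio

// Filesystem labels for CRI-O storage; crio-containers is proposed in this KEP.
const (
	labelCrioImages     = "crio-images"     // read-only (image) layer
	labelCrioContainers = "crio-containers" // writable (container) layer, proposed
)

// writableLayerLabel picks the label used to look up the filesystem that holds
// container writable layers. Without a split, crio-images covers both layers.
func writableLayerLabel(splitFs bool) string {
	if splitFs {
		return labelCrioContainers
	}
	return labelCrioImages
}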

We have created a CAdvisor PR to suggest how CAdvisor can be enhanced to support a container filesystem.

CRI Stats Provider

Containerd uses the CRI Stats Provider.

The CRI Stats Provider calls ImageFsInfo and uses the FsId to get the filesystem information from CAdvisor. One could label the FsId for the writeable layer, and this will be used to get the filesystem information for the container filesystem.

No changes should be necessary in CAdvisor for this provider.
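
For illustration, a minimal sketch of filling filesystem-level stats for the writable layer from the mountpoint the runtime reports; GetDirFsInfo exists on the kubelet's cadvisor interface today, while the surrounding names here are illustrative.

package stats

import (
	cadvisorapiv2 "github.com/google/cadvisor/info/v2"
	statsapi "k8s.io/kubelet/pkg/apis/stats/v1alpha1"
)

// fsInfoProvider is the narrow slice of the kubelet cadvisor interface used here.
type fsInfoProvider interface {
	GetDirFsInfo(path string) (cadvisorapiv2.FsInfo, error)
}

// containerFsStats fills filesystem-level stats for the writable-layer
// filesystem from the mountpoint the runtime reported in ImageFsInfo.
func containerFsStats(ca fsInfoProvider, mountpoint string) (*statsapi.FsStats, error) {
	fsInfo, err := ca.GetDirFsInfo(mountpoint)
	if err != nil {
		return nil, err
	}
	return &statsapi.FsStats{
		AvailableBytes: &fsInfo.Available,
		CapacityBytes:  &fsInfo.Capacity,
		Inodes:         fsInfo.Inodes,
		InodesFree:     fsInfo.InodesFree,
	}, nil
}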

Eviction Manager

A new signal will be added to the eviction manager to reflect the filesystem for the writeable layer.
For the first release of this KEP, this will be either NodeFs or ImageFs. With additional separate disks, this could become its own filesystem.

 // SignalContainerFsAvailable is amount of storage available on filesystem that container runtime uses for container writable layers.
 SignalContainerFsAvailable Signal = "containerfs.available"
 // SignalContainerFsInodesFree is amount of inodes available on filesystem that container runtime uses for container writable layers.
 SignalContainerFsInodesFree Signal = "containerfs.inodesFree"

We do need to change the garbage collection based on the split filesystem case.

(Split Filesystem) Writable and root plus ImageFs for images

  • NodeFs monitors ephemeral-storage, logs and writable layer
  • ImageFs monitors the read-only layer

The eviction manager decides the priority of eviction based on which filesystem is experiencing pressure.

If the node filesystem experiences pressure, pods are ranked by local volumes + logs of all containers + writeable layer of all containers.

If the image filesystem experiences pressure, ranking is based on the storage of images.
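
A minimal sketch of the per-pod usage that the node-filesystem ranking above would consider in the split configuration (local volumes + logs + writable layer); the helper name is illustrative and stands in for the eviction manager's internal accounting.

package eviction

import (
	statsapi "k8s.io/kubelet/pkg/apis/stats/v1alpha1"
)

// nodeFsUsage sums, for one pod, the bytes charged to the node filesystem when
// the writable layer shares a disk with the kubelet: local volumes, container
// logs, and each container's writable layer (rootfs in the summary API).
func nodeFsUsage(podStats statsapi.PodStats) uint64 {
	var total uint64
	for _, vol := range podStats.VolumeStats {
		if vol.UsedBytes != nil {
			total += *vol.UsedBytes
		}
	}
	for _, c := range podStats.Containers {
		if c.Logs != nil && c.Logs.UsedBytes != nil {
			total += *c.Logs.UsedBytes
		}
		if c.Rootfs != nil && c.Rootfs.UsedBytes != nil {
			total += *c.Rootfs.UsedBytes
		}
	}
	return total
}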

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • (pkg/kubelet/eviction): Sep 11th 2023 - 69.9%
  • (pkg/kubelet/stats): Sep 11th 2023 - 77.9%
  • (pkg/kubelet/server/stat): Sep 11th 2023 - 55%

This KEP will enhance coverage in the eviction manager by covering the case where dedicatedImageFs is true.
There is currently little test coverage when a separate ImageFs is used. Issue-120061 has been created to help resolve this.

We will also provide test cases for rolling back the changes in the eviction manager.

We will add unit tests to cover using ImageFsInfo and we will have testing around rolling back this feature.

We will add test cases for ImageStats covering both positive and negative usage of the feature. In negative cases, we will assume containerfs=imagefs. In positive test cases, we will allow different configurations of the image filesystem.

Integration tests

Typically, these types of tests are done as e2e tests.

End-to-End tests

This code affects stats, eviction and the summary API.

There should be e2e tests for each of these components with a split disk.
However, there are a few complications with this goal.

  1. The e2e eviction tests with a single disk, CRI-O-eviction and containerd-eviction, are currently failing.

  2. There is zero test coverage around a separate image filesystem. There is an issue to improve this at the unit test level.

Item 1 can be addressed by investigating the eviction tests and figuring out the root cause of these failures.

As part of this KEP, we should add testing around separate disks in upstream Kubernetes. Since this is already a supported use case in kubelet, there should be testing around this.

kubelet/CRI-O should be set up with configuration for a separate disk. Eviction and Summary E2E tests should be added in the case of a separate disk.

Tests for the split image filesystem should also be added.

E2E Test Use Cases addition:

  • E2E tests for summary api with separate disk
    • Separate Disk - ImageFs reports separate disk from root when disk is mounted
    • Split Disk - Writeable layer on Node, read-only layer on ImageFs
  • E2E tests for eviction api with separate disk
    • Replicate existing disk pressure eviction e2e tests with a separate disk

E2E tests for separate disk:

Graduation Criteria

Alpha Milestone #1 [Release the CRI API and kubelet changes]

CRI API changes are consumed by containerd and CRI-O, so the CRI API must be released first.

  • Using ImageFsInfo is guarded with a feature gate
  • Implementation for split image filesystem in Kubernetes
    • Eviction manager modifications in case of split filesystem
    • Summary and Stats Provider implementations
  • CRI API merged
  • Unit tests
  • E2E tests to cover separate image filesystem
    • It is not possible to have e2e tests for split filesystem at this stage

Alpha Milestone #2 [CRI-O, E2E Tests and CRI tools]

Shortly after this release and the new CRI package, projects that consume the CRI API can be updated to use the new API features.

  • At least one CRI implementation supports split filesystem
  • E2E tests supporting the CRI implementation with split image filesystem
  • CRI tool changes for image fs

Alpha To Beta Graduation

  • Gather feedback on other potential use cases
  • Always set KubeletSeparateDiskGC to true so ImageFsInfo is used instead of ImageStats in all cases
  • Always set KubeletSeparateDiskGC to true so that eviction manager will detect split file system and handle it correctly

Stable

  • More than one CRI implementation supports split filesystem

Upgrade / Downgrade Strategy

There are two cases that this feature could impact users.

Case 1: Turning the feature on with no split filesystem. In this case, the main difference is that clusters using the CAdvisor Stats Provider will switch to ImageFsInfo to report image filesystem statistics. Turning off this feature reverts to ImageStats.

Case 2: The feature is turned on and the container runtime is set up to split the filesystem. In this case, rolling back the feature is only supported if one also reconfigures the container runtime to not split the filesystem.

Another case worth highlighting is that some container runtimes may not support a split filesystem. We will guard against a container runtime not returning a container filesystem in ImageFsInfo; in that case we assume that the image filesystem and the container filesystem are identical.

Since older versions of the container runtimes do not have the ability to split the filesystem, we don't foresee much issue with this. kubelet will not behave differently if the container and image filesystems are identical.

Version Skew Strategy

The initial release of this will be the CRI API and changes to kubelet.

We do not require container runtimes to implement this API, so we will assume a single filesystem for images by default.

Once the container runtimes implement this API and the feature gate is enabled, then the feature would be active.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

If a container runtime is configured to split the image filesystem, there is no good way to roll these changes back. Following best practices, we will include a feature gate to guard our code.

  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: KubeletSeparateDiskGC
    • Components depending on the feature gate: kubelet
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane? It depends. If the control plane is run by the kubelet, then yes; if the control plane is not run by the kubelet, then no
    • Will enabling / disabling the feature require downtime or reprovisioning of a node? Yes. One needs to restart the container runtime on the node to turn on support for split image filesystem

Our recommendation to roll this change back:

  1. Configure your container runtime to not split the image filesystem.
  2. Restart the container runtime.
  3. Restart kubelet with feature flag off.
Does enabling the feature change any default behavior?

Yes, we will switch to using ImageFsInfo to compute disk stats rather than call ImageStats.

The eviction manager will monitor the container filesystem if the image filesystem is split.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

There are two possibilities for this feature:

  1. Container runtime is configured for split disk
  2. Container runtime is not configured for split disk

If the feature toggle is disabled in case 1, turning off the feature will tell the eviction manager that containerfs=imagefs.
Container garbage collection will try to delete the writeable layer on the image filesystem, where it may not exist. The kubelet will still run, but the container filesystem could grow unchecked and eventually cause disk pressure.

In case 2, rolling back this feature will be possible because we will use ImageStats to compute the filesystem usage. Since the container runtime is configured to not split the disk, nothing would really be changed in this case.

What happens if we reenable the feature if it was previously rolled back?

Nothing, as long as the container runtime is set up to split the filesystem again.

Are there any tests for feature enablement/disablement?

Yes. Even though rollback is not fully supported, we will be switching to ImageFsInfo for filesystem stats. This is guarded by a feature gate, and we will test both negative and positive cases.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

If the filesystem is not split, this rollout or rollback will be a no-op.

If the filesystem is split and you want to roll back the change, that will require a change to the container runtime configuration.

If one does not change the container runtime configuration, node pressure could occur because garbage collection will not work. The container filesystem would grow unbounded and would require users to clean up their disks to avoid disk pressure.

What specific metrics should inform a rollback?

If a cluster is evicting a lot more pods (node_collector_evictions_total) than normal, this could be caused by this feature.

The eviction manager monitors the image filesystem, node filesystem and the container filesystem for disk pressure. If any of these filesystems is experiencing I/O pressure, pods will start being evicted and the eviction manager will trigger garbage collection. The metric node_collector_evictions_total will inform operators that something is wrong, because pods will be evicted and new workloads cannot run until the disk pressure resolves.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not yet.

For the initial alpha, we are testing without requiring container runtimes to implement this API.

In future releases, we could test this.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

This feature is mostly hidden from users, but if an operator wants to know, they can use crictl.

crictl imagefsinfo can be used to determine if the file systems are split.

crictl imagefsinfo will return a json object of file system usage for the image filesystem and the container filesystem. If the image filesystem is not split, then the image filesystem and container filesystem will have identical statistics.

How can someone using this feature know that it is working for their instance?
  • Other (treat as last resort)

crictl imagefsinfo will give stats information about the different filesystems.

A user could check the filesystem for containers and images.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: node_collector_evictions_total
    • Components exposing the metric: kubelet
Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

The container runtime needs to be able to split the writeable and the read-only layer.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

N/A

Will enabling / using this feature result in any new calls to the cloud provider?

N/A

Will enabling / using this feature result in increasing size or count of the existing API objects?

There is an additional field added to the CRI api for ImageFsInfoResponse.

  • API type: protobuf array of FileSystem Usage
    • Estimated increase in size: 24 bytes and a variable length string for the mount point
    • Estimated amount of new objects: 1 element in the array
  • API type: ContainerFilesystem in Summary Stats
    • Estimated increase in size: 24 bytes plus a variable length string for the mount point
    • Estimated amount of new objects: 1 ContainerFilesystem for Summary Stats
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Yes. We are adding a way to split the image filesystem, so disk space on the additional filesystem will be used.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Yes. We are adding a way to split the image filesystem, so inodes and disk space on the additional filesystem will be used.

We will add new eviction signals for ContainerFs to handle the case where the container filesystem has disk pressure.

The split disk means that we will need to monitor image disk size on the ImageFs and the writeable layer on the rootfs.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

This feature does not interact with the API server and/or etcd as it is isolated to kubelet.

What are other known failure modes?
  • Pods do not start correctly
    • Detection: The user notices that the desired pods are not starting correctly, and their status indicates an error or a failure related to image pull failures, which can then be traced to the Split Image Filesystem feature.
    • Mitigations: The Split Image Filesystem feature can be disabled as a mitigation step. However, it is not without side effects, where any container images downloaded before would have to be downloaded again. Thus, further investigation would be recommended before a decision to disable this feature is made. The user should also ensure that if the feature is disabled, enough disk space will be available at the location where the ContainerFs filesystem is currently pointed against. A restart of kubelet will be required if this feature is to be disabled.
    • Diagnostics: Kubernetes cluster events and specific pod statuses report image pull failures that are related to problems with filesystem access permissions, storage volume issues, mount point issues, etc., where none of the reported issues are related to disk space utilisation, which would otherwise trigger pod eviction. Reviewing CRI and kubelet service logs can help to determine the root cause. Additionally, reviewing operating system logs can be helpful and can be used to correlate events and any errors found in the service logs.
    • Testing: A set of end-to-end tests aims to cover this scenario.
What steps should be taken if SLOs are not being met to determine the problem?

The operator should ensure that:

  • The underlying node is currently not under high load due to high CPU utilisation, memory pressure or storage volume latency (with the focus on I/O wait times)
  • There is sufficient disk space available on the filesystem or volume that is used for the image filesystem to use to store data
  • There are a sufficient number of inodes free and available, especially if the filesystem does not support a dynamic inodes allocation, on the provisioned filesystem where the image filesystem will store data
  • The volume, if backed by a local block device or network-attached storage, has been made available to the image filesystem to be used to store data
  • The CRI, container runtimes and kubelet have access to the location on the filesystem or the volume (block device) where the image filesystem will be storing data
  • The system user, if either CRI, container runtimes or kubelet have been configured to use a system user other than the privileged one such as root, has access to the filesystem location or volume where the image filesystem will store data
  • The node components, such as the CRI, container runtimes and kubelet, are up and running, and service logs are free from errors that might otherwise impact or degrade any of the components mentioned earlier
  • The CRI, container runtimes and kubelet service logs are free from error reports about the configured ContainerFs, ImageFs, and otherwise configured filesystem location or storage volumes

Additionally, the operator should also confirm that the necessary CRI and kubelet configuration has been deployed correctly and points to a correct path to a filesystem location where the image filesystem will be storing data.

While troubleshooting issues potentially related to the Split Image Filesystem feature, it's best to focus on the following areas:

  • Current CPU and memory utilisation on the underlying node
  • Storage volumes, disk space availability, and sufficient inodes capacity
  • I/O wait times, read and write queue depths, and latency for the storage volumes
  • Any expected mount points, whether bind mounts or otherwise
  • Access permission issues
  • SELinux, AppArmor, or POSIX ACLs set up
  • The kernel message buffer (dmesg)
  • Operating system logs
  • Specific services logs, such as CRI, container runtimes and kubelet
  • Kubernetes cluster events with a focus on evictions of pods from affected nodes
  • Any relevant pods or workloads statuses
  • Kubernetes cluster health with a focus on the Control Plane and any affected nodes
  • Monitoring and alerting system or services, with a focus on recent and historic events (past 24 hours or so)

If the Kubernetes cluster sports an observability solution, it would be useful to look at the collected usage metrics so that any problems found could potentially be correlated to events and usage data from the last 24 hours or so.

For cloud-based deployments, it would be prudent to interrogate any available monitoring dashboards for the node and any specific storage volume and to ensure that there is enough IOPS capacity provisioned and available, that the correct storage type has been provisioned, and that metrics such as burst capacity for IOPS and throughput aren't negatively impacted, should the storage volume support such features.

Implementation History

  • Initial Draft (September 12th 2023)
  • KEP Merged (October 5th 2023)
  • Alpha Milestone #1 PRs merged (October 31st 2023)
  • Alpha Milestone #2 PRs merged (December 22nd 2023)

Drawbacks

This increases the number of ways the kubelet can be configured, which could make troubleshooting more difficult.

Alternatives

kubelet Disk Stats in CRI

In this case, we considered bypassing CAdvisor and having CRI return node usage information entirely. This would require container runtimes to report disk usage/total stats in the ImageFsInfo endpoint.

We decided not to go this route because we intend to support only two filesystems, so we do not need to track a separate filesystem. We already have node and image statistics, so we chose to use either node or image statistics in this KEP.

If one wants to support the writable layer as an entirely separate disk, then either extensions to CAdvisor or CRI may be needed as one will need to know information about the writable layer disk.

Add container filesystem usage to image filesystem array

In the internal API, kubelet directly uses the image filesystem array rather than the ImageFsInfoResponse.
To keep API changes minimal, we could have containerd/CRI-O add container filesystems to the image filesystem array. This would work, but it would require adding a label for images/containers to the filesystem usage.

We decided not to go this route because there could be more use cases to add to ImageFsInfoResponse that would not fit the array type.

Infrastructure Needed (Optional)

E2E Test configuration with separate disks. It may be possible to use a tmpfs for this KEP.