Kubectl Checkpoint #5091

Open
6 tasks
adrianreber opened this issue Jan 27, 2025 · 5 comments
Labels
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cli Categorizes an issue or PR as relevant to SIG CLI.

Comments

@adrianreber
Contributor

adrianreber commented Jan 27, 2025

Enhancement Description

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 27, 2025
@kikisdeliveryservice
Member

Hi @adrianreber,

Please fill out the Discussion Link section of the issue which indicates that you've spoken to a SIG about opening this KEP. Also please identify the sponsoring SIG for this KEP.

Thanks

@adrianreber
Contributor Author

/sig api-machinery
/sig cli

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cli Categorizes an issue or PR as relevant to SIG CLI. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 4, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG CLI Feb 4, 2025
@adrianreber
Contributor Author

See #2008 for the corresponding kubelet changes.

@assafWeaversoft

Hello everyone,

We at Weaversoft are developing Grus, a solution focused on container checkpointing, restoration, and migration within Kubernetes environments. Our use case benefits from an API-exposed checkpoint interface, as it enables automated and remote checkpoint management, which is crucial for:

  • Seamless workload migration across clusters without direct pod access.
  • Optimized restart workflows by reducing startup times and restoring application state efficiently.
  • Improved debugging and fault tolerance, where automated checkpointing can assist in forensic analysis and recovery.

The current recommendation to use kubectl debug introduces operational friction for automated solutions like ours. A direct API and CLI interface would significantly improve usability and integration into existing workflows.

We appreciate the ongoing discussions and would love to contribute further. Looking forward to feedback!

@SimonBaeumer
Member

+1 for exposing the checkpoint API at the kube-apiserver

At StackRox I analyzed how we could implement container checkpointing, and I'm sharing my results here. IMHO, exposing the checkpoint API at the kube-apiserver aligns with Kubernetes' architectural goals and enhances its extensibility. Below, I compare two potential approaches to enabling checkpointing functionality and explain why API exposure at the kube-apiserver is the preferred option.

Approaches:

  1. Third-party agent architectures
  2. Direct API exposure in the kube-apiserver (upstream)

If the API endpoint is not exposed, an agent is required on each node to keep the latency low between the checkpoint API request and CRIU triggering the checkpoint on that node.

1) Third-party Agent Architectures

This approach involves deploying a custom checkpointing service that communicates with the kubelet on each node. A typical setup would include:

  • A DaemonSet running an agent on each node.
  • The agent communicates with the kubelet via the host network and interacts with the container runtime to create checkpoints.
  • The checkpoint directory (/var/lib/kubelet/checkpoints) is mounted to the agent on each node.
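
To make the agent's job concrete, here is a minimal sketch of the request target such an agent would use. It assumes the kubelet checkpoint endpoint from the kubelet changes (`POST /checkpoint/{namespace}/{pod}/{container}`) on the default kubelet port 10250. The helper name `kubelet_checkpoint_url` and the example IP are hypothetical, and authentication (client certificates or a bearer token) is left out on purpose.

```python
from urllib.parse import quote

# Default kubelet HTTPS port; clusters may configure a different one.
KUBELET_PORT = 10250


def kubelet_checkpoint_url(node_ip: str, namespace: str, pod: str, container: str) -> str:
    """Build the URL a per-node agent would POST to (hypothetical helper).

    The agent still has to authenticate against the kubelet, which is
    not shown here.
    """
    # Percent-encode each path segment so unusual characters cannot alter the path.
    path = "/".join(quote(part, safe="") for part in (namespace, pod, container))
    return f"https://{node_ip}:{KUBELET_PORT}/checkpoint/{path}"


url = kubelet_checkpoint_url("10.0.0.5", "default", "counters", "counter")
print(url)
```

Because the agent talks to the kubelet on the node's own address, it needs either host networking or a route to that port, which is where the concerns below come from.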

Concerns with Third-party Agent Architectures

a) Agent access to the kubelet API
The agent requires direct access to the kubelet API, either by joining the host network or by exposing the kubelet endpoint on the Kubernetes network. Both options come with security and operational implications:

  • Joining the host network creates broader attack surfaces.
  • Exposing the kubelet endpoint may require additional security controls, which introduces complexity.

Importantly, exposing the checkpointing functionality at the kube-apiserver would not result in any additional security trade-offs compared to these alternatives. Instead, it would consolidate checkpointing access to a central, already-secured API layer.

b) Node discovery and routing complexity
Routing checkpointing requests to the appropriate node requires additional logic in third-party tooling. While the Kubernetes API proxy provides an efficient way to handle such routing, without it, third-party solutions must reimplement this logic. This not only duplicates effort but also increases the risk of errors and inconsistencies.
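
As a rough illustration of the routing logic third-party tooling would otherwise own, the sketch below resolves a pod to its node and builds a request path through the kube-apiserver node proxy (`/api/v1/nodes/{name}/proxy/...`). The function name is hypothetical, the pod is assumed to be the JSON object returned by the API, and whether a given cluster allows reaching the kubelet checkpoint endpoint this way is an assumption.

```python
def node_proxy_checkpoint_path(pod: dict, container: str) -> str:
    """Return an apiserver node-proxy path for checkpointing one container.

    `pod` is assumed to be a Pod object as decoded from the API's JSON.
    """
    meta = pod["metadata"]
    # spec.nodeName is only set once the scheduler has placed the pod.
    node = pod.get("spec", {}).get("nodeName")
    if not node:
        raise ValueError("pod is not scheduled to a node yet")
    return (
        f"/api/v1/nodes/{node}/proxy/checkpoint/"
        f"{meta['namespace']}/{meta['name']}/{container}"
    )


example_pod = {
    "metadata": {"namespace": "default", "name": "counters"},
    "spec": {"nodeName": "worker-1"},
}
print(node_proxy_checkpoint_path(example_pod, "counter"))
```

Every tool that bypasses the apiserver has to repeat the node lookup and handle pods that move or are not yet scheduled.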

c) Resource consumption and operational overhead
Running an agent on every node introduces extra resource consumption and operational complexity:

  • Increased CPU and memory requirements across all nodes.
  • Maintenance overhead for managing the lifecycle of the agent DaemonSet (e.g., upgrades, scaling, etc.).

d) Maintaining upstream alignment
Offloading checkpointing functionality entirely to third-party tools fragments responsibility and misses an opportunity to standardize the feature in Kubernetes. Kubernetes should define a consistent API that lowers the barrier for third-party integrations while limiting the scope of upstream responsibilities to manageable components.


2) Direct API Exposure in kube-apiserver

Exposing the checkpoint API directly at the kube-apiserver provides a cleaner and more scalable solution, with clear benefits:

a) Simplified security model
Centralizing checkpointing functionality in the kube-apiserver ensures that access control can leverage existing Kubernetes RBAC policies. There’s no need to manage additional agents or configure their permissions separately, reducing the attack surface and operational complexity.
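
To sketch what "leveraging existing RBAC" could look like, the snippet below builds a namespaced Role manifest as a plain dictionary. The `pods/checkpoint` subresource name and the `create` verb are assumptions made for illustration; this thread does not define the final API shape. Only the Role structure itself (`rbac.authorization.k8s.io/v1`) is standard.

```python
import json


def checkpoint_role(namespace: str, name: str = "checkpointer") -> dict:
    """Build a Role granting checkpoint access in one namespace.

    ASSUMPTION: "pods/checkpoint" is a hypothetical subresource name,
    used only to show how access could be scoped with RBAC.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],                 # core API group
                "resources": ["pods/checkpoint"],  # hypothetical subresource
                "verbs": ["create"],               # assumed verb for a POST-style action
            }
        ],
    }


print(json.dumps(checkpoint_role("default"), indent=2))
```

A rule like this would let cluster admins grant checkpoint rights per namespace without handing out kubelet credentials.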

b) Unified request routing
The kube-apiserver already has efficient routing mechanisms to direct requests to specific nodes. Exposing the checkpointing API at this level eliminates the need for third-party tooling to reimplement routing logic, enabling seamless integration for external tools.

c) Reduced resource overhead
By eliminating the need for a DaemonSet and per-node agents, clusters can avoid the extra resource consumption and operational burden of managing additional components.

d) Enabling third-party integrations
While Kubernetes would handle API exposure and routing, the responsibility for managing checkpoint retention, encryption, and advanced features could remain with third-party tools. These tools could implement custom logic using Kubernetes Jobs or other constructs, leveraging the upstream API without the need for agents.
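
Retention is one example of logic that could live entirely outside Kubernetes. The sketch below picks which checkpoint archives to delete while keeping the newest few per container. The `Archive` record, its fields, and the function name are all hypothetical; a real tool would fill them in from wherever it stores its checkpoints.

```python
from typing import Iterable, NamedTuple


class Archive(NamedTuple):
    path: str        # location of the checkpoint archive
    container: str   # grouping key, e.g. "default/counters/counter"
    created: float   # creation time as a Unix timestamp


def expired_archives(archives: Iterable[Archive], keep: int) -> list[Archive]:
    """Return the archives to delete, keeping the `keep` newest per container."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    by_container: dict[str, list[Archive]] = {}
    for archive in archives:
        by_container.setdefault(archive.container, []).append(archive)
    expired: list[Archive] = []
    for group in by_container.values():
        # Newest first, so everything past index `keep` is expired.
        group.sort(key=lambda a: a.created, reverse=True)
        expired.extend(group[keep:])
    return expired
```

Such a policy could run as a Kubernetes Job or CronJob, so it needs no per-node agent.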


Conclusion

I would love to see the checkpoint API at the kube-apiserver. It provides us with a robust and standardized way to integrate checkpointing functionality without introducing unnecessary complexity or resource overhead. By addressing the concerns of agent-based architectures, this approach makes checkpointing simpler, safer, and more maintainable for the Kubernetes ecosystem.
