
DeletionTimeStamp not set for some evicted pods #54525

Open
patrickshan opened this issue Oct 24, 2017 · 40 comments
Labels
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@patrickshan
Contributor

patrickshan commented Oct 24, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
When a node starts to evict pods under disk pressure, the DeletionTimestamp for some evicted pods is not set and keeps its zero value. Pods created through a Deployment seem to have this issue, while pods created through a DaemonSet have their DeletionTimestamp set properly.

What you expected to happen:
Pods created through a Deployment should also have their DeletionTimestamp set properly.

How to reproduce it (as minimally and precisely as possible):
Write an app that watches the apiserver for pod-related events. Deploy a Debian toolbox pod on one node using a Deployment. Put that node under disk pressure, e.g. by filling more than 90% of its disk space, then consume additional disk space from inside the toolbox pod (installing packages that use a lot of disk space, such as gnome-core on Debian, works).
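
A minimal sketch of the last two steps, assuming a hypothetical toolbox Deployment labeled app=toolbox; exact sizes depend on the node's disk and the kubelet's eviction thresholds:

# Find the toolbox pod created by the (hypothetical) Deployment.
TOOLBOX_POD=$(kubectl get pods -l app=toolbox -o jsonpath='{.items[0].metadata.name}')

# Consume disk from inside the pod until the node crosses its nodefs eviction threshold.
kubectl exec "$TOOLBOX_POD" -- dd if=/dev/zero of=/tmp/fill bs=1M count=20000

# In another terminal, watch pod phase changes and events while the eviction happens.
kubectl get pods --watch -o wide
kubectl get events --watch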

Anything else we need to know?:
You can find some events related to the pod, but they only have the Phase updated to "Failed"; the DeletionTimestamp is not set and still has its zero value.

Environment:

  • Kubernetes version (use kubectl version): 1.8.1
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Container Linux by CoreOS 1520.4.0
  • Kernel (e.g. uname -a): Linux ip-10-150-64-105 4.13.3-coreos #1 SMP Wed Sep 20 22:17:11 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz GenuineIntel GNU/Linux
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 24, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 24, 2017
@patrickshan
Contributor Author

@kubernetes/sig-api-machinery-misc

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Oct 25, 2017
@k8s-ci-robot
Contributor

@patrickshan: Reiterating the mentions to trigger a notification:
@kubernetes/sig-api-machinery-misc

In response to this:

@kubernetes/sig-api-machinery-misc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 25, 2017
@k82cn
Member

k82cn commented Oct 25, 2017

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 25, 2017
@dixudx
Member

dixudx commented Oct 25, 2017

/assign

@lavalamp
Member

Most likely this is sig-node.

@lavalamp lavalamp removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Oct 26, 2017
@smarterclayton
Contributor

The deletion timestamp is set by the apiserver; it cannot be set by clients.

@liggitt
Member

liggitt commented Oct 26, 2017

Are they actually evicted/deleted or do they just have failed status?

@patrickshan
Contributor Author

patrickshan commented Oct 27, 2017

They are evicted and then marked with "Failed" status. For pods created through a DaemonSet, the DeletionTimeStamp gets set after the "Failed" status is set. But for pods created through a Deployment, the DeletionTimeStamp just keeps its zero value and never gets updated.

@smarterclayton
Contributor

smarterclayton commented Oct 27, 2017 via email

@dixudx
Member

dixudx commented Oct 27, 2017

That means that no one deleted them. The ReplicaSet controller is responsible for performing that deletion.

I reproduced this. But I found the pods were successfully evicted and deleted only from the kubelet, not from the apiserver. The apiserver still kept a copy, with nil DeletionTimeStamp and ContainerStatuses. @liggitt Do you know why? Quite abnormal.

@liggitt
Member

liggitt commented Oct 27, 2017

I see the kubelet sync loop construct a pod status like the one you describe if an internal module decides the pod should be evicted:

func (kl *Kubelet) generateAPIPodStatus(pod *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
	glog.V(3).Infof("Generating status for %q", format.Pod(pod))
	// check if an internal module has requested the pod is evicted.
	for _, podSyncHandler := range kl.PodSyncHandlers {
		if result := podSyncHandler.ShouldEvict(pod); result.Evict {
			return v1.PodStatus{
				Phase:   v1.PodFailed,
				Reason:  result.Reason,
				Message: result.Message,
			}
		}
	}
	// ...

The kubelet then syncs status back to the API server:

func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
	if !m.needsUpdate(uid, status) {
		glog.V(1).Infof("Status for pod %q is up-to-date; skipping", uid)
		return
	}
	// TODO: make me easier to express from client code
	pod, err := m.kubeClient.CoreV1().Pods(status.podNamespace).Get(status.podName, metav1.GetOptions{})
	if errors.IsNotFound(err) {
		glog.V(3).Infof("Pod %q (%s) does not exist on the server", status.podName, uid)
		// If the Pod is deleted the status will be cleared in
		// RemoveOrphanedStatuses, so we just ignore the update here.
		return
	}
	if err != nil {
		glog.Warningf("Failed to get status for pod %q: %v", format.PodDesc(status.podName, status.podNamespace, uid), err)
		return
	}
	translatedUID := m.podManager.TranslatePodUID(pod.UID)
	// Type convert original uid just for the purpose of comparison.
	if len(translatedUID) > 0 && translatedUID != kubetypes.ResolvedPodUID(uid) {
		glog.V(2).Infof("Pod %q was deleted and then recreated, skipping status update; old UID %q, new UID %q", format.Pod(pod), uid, translatedUID)
		m.deletePodStatus(uid)
		return
	}
	pod.Status = status.status
	// TODO: handle conflict as a retry, make that easier too.
	newPod, err := m.kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod)
	if err != nil {
		glog.Warningf("Failed to update status for pod %q: %v", format.Pod(pod), err)
		return
	}
	pod = newPod
	glog.V(3).Infof("Status for pod %q updated successfully: (%d, %+v)", format.Pod(pod), status.version, status.status)
	m.apiStatusVersions[kubetypes.MirrorPodUID(pod.UID)] = status.version
	// We don't handle graceful deletion of mirror pods.
	if m.canBeDeleted(pod, status.status) {
		deleteOptions := metav1.NewDeleteOptions(0)
		// Use the pod UID as the precondition for deletion to prevent deleting a newly created pod with the same name and namespace.
		deleteOptions.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))
		err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)
		if err != nil {
			glog.Warningf("Failed to delete status for pod %q: %v", format.Pod(pod), err)
			return
		}
		glog.V(3).Infof("Pod %q fully terminated and removed from etcd", format.Pod(pod))
		m.deletePodStatus(uid)
	}
}

But unless the pod's deletion timestamp is already set, the kubelet won't delete the pod:

func (m *manager) canBeDeleted(pod *v1.Pod, status v1.PodStatus) bool {
	if pod.DeletionTimestamp == nil || kubepod.IsMirrorPod(pod) {
		return false
	}
	return m.podDeletionSafety.PodResourcesAreReclaimed(pod, status)
}

@kubernetes/sig-node-bugs that doesn't seem like the kubelet does a complete job of evicting the pod from the API's perspective. Would you expect the kubelet to delete a pod directly in that case or to still go through posting a pod eviction (should pod disruption budget be honored in cases where the kubelet is out of resources?)

@yujuhong
Contributor

@kubernetes/sig-node-bugs that doesn't seem like the kubelet does a complete job of evicting the pod from the API's perspective. Would you expect the kubelet to delete a pod directly in that case or to still go through posting a pod eviction (should pod disruption budget be honored in cases where the kubelet is out of resources?)

I think this is intentional.

AFAIK, the kubelet's pod eviction includes failing the pod (i.e., setting the pod status) and reclaiming the resources used by the pod on the node. There is no "deleting the pod from the apiserver" involved in the eviction. Users/controllers can check the pod status to know what happened to the pod if needed.
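
For example (a sketch with a placeholder pod name), the post-eviction state is visible directly on the pod object:

# An evicted pod ends up with phase "Failed" and reason "Evicted".
kubectl get pod <pod-name> -o jsonpath='{.status.phase}{"\t"}{.status.reason}{"\t"}{.status.message}{"\n"}'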

/cc @derekwaynecarr @dashpole

@dashpole
Contributor

Yes, this is intentional. In order for evicted pods to be inspected after eviction, we do not remove the pod API object. Otherwise it would appear that the pod simply disappeared.
We do still stop and remove all containers, clean up cgroups, unmount volumes, etc. to ensure that we reclaim all resources that were in use by the pod.
I don't think we set the deletion timestamp for DaemonSet pods. I suspect that the DaemonSet controller deletes evicted pods.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2018
@janetkuo
Member

janetkuo commented Feb 22, 2018

In order for evicted pods to be inspected after eviction, we do not remove the pod API object.

If the controller that creates the evicted pod is scaled down, it should kill those evicted pods first before killing any others, right? Most workload controllers don't do that today.

I don't think we set the deletion timestamp for DaemonSet pods. I suspect that the DaemonSet controller deletes evicted pods.

DaemonSet controller actively deletes failed pods (#40330), to ensure that DaemonSet can recover from transient errors (#36482). Evicted DaemonSet pods get killed just because they're also failed pods.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2018
@enisoc
Member

enisoc commented Feb 22, 2018

In order for evicted pods to be inspected after eviction, we do not remove the pod API object.

If the controller that creates the evicted pod is scaled down, it should kill those evicted pods first before killing any others, right? Most workload controllers don't do that today.

For something like StatefulSet, it's actually necessary to immediately delete any Pods evicted by kubelet, so the Pod name can be reused. As @janetkuo also mentioned, DaemonSet does this as well. For such controllers, you're thus not gaining anything from kubelet leaving the Pod record.

Even for something like ReplicaSet, it probably makes the most sense for the controller to delete Pods evicted by kubelet (though it doesn't do that now, see #60162) to avoid carrying along Failed Pods indefinitely.

So I would argue that in pretty much all cases, Pods with restartPolicy: Always that go to Failed should be expediently deleted by some controller, so users can't expect such Pods to stick around.

If we can agree that some controller should delete them, the only question left is which controller? I suggest that the Node controller makes the most sense: delete any Failed Pods with restartPolicy: Always that are scheduled to me. Otherwise, we effectively shift the responsibility to "all Pod/workload controllers that exist or ever will exist." Given the explosion of custom controllers that's coming thanks to CRD, I don't think it's prudent to put that responsibility on every controller author.

Otherwise it would appear that the pod simply disappeared

With the /eviction subresource and Node drains, we have already set the precedent that your Pods might simply disappear (if the eviction succeeds, the Pod is deleted from the API server) at any time, without a trace.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 30, 2019
@liggitt liggitt added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Nov 28, 2019
@schollii

schollii commented Mar 5, 2020

@dashpole:

@dashpole
Contributor

dashpole commented Mar 5, 2020

@schollii I suppose we don't document the list of things that do not happen during evictions. The out-of-resource documentation says "it terminates all of its containers and transitions its PodPhase to Failed". It doesn't explicitly call out that it does not set the deletion timestamp.

Some googling says you can reference evicted pods with --field-selector=status.phase=Failed. You should be able to list, delete, etc. with that.
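
For example (a sketch with a placeholder namespace; note the selector matches every Failed pod, not just evicted ones):

# List failed pods (evicted pods are a subset of these).
kubectl get pods -n <namespace> --field-selector=status.phase=Failed

# Delete them; this also removes failed pods that were not evicted.
kubectl delete pods -n <namespace> --field-selector=status.phase=Failed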

@schollii

schollii commented Mar 6, 2020

@dashpole I saw the mentions of --field-selector=status.phase=Failed, but the problem there is that the "reason" is actually what says "Evicted", so there could be failed pods that were not evicted. And you cannot select on status.reason; I tried. So we are left with grepping and awking the output of get pods -o wide. This needs fixing, e.g. make status.reason selectable, or add a phase called Evicted (although I doubt this is acceptable because it's not backwards compatible), or just add a command like kubectl delete pods --evicted-only. If it can be fixed by a newbie to the k8s code base, I'd be happy to do it.

@barney-s
Contributor

Should we be explicit in setting the deletion timestamp?

@lavalamp
Member

grepping and awking the output of get pods -o wide

Use jq and -o json for stuff like this.
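
A sketch of that approach; since field selectors can't match status.reason, the filtering is done client-side with jq:

kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.status.phase == "Failed" and .status.reason == "Evicted")
      | .metadata.namespace + "/" + .metadata.name'

The resulting namespace/name pairs can then be fed into kubectl delete pod.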

There's a "podgc" controller which deletes old pods; is it not triggering for evicted pods? How many do you accumulate? Why is it problematic?

Should we be explicit in setting the deletion timestamp?

I am not sure what the contract between kubelet / scheduler / controller is for evictions. Which entity is supposed to delete the pod? I assume they are not deleted by the kubelet in order to give a signal to the scheduler/controller about the lack of fit?

@andyxning
Member

Should Deployment check and delete Failed pods, as is already done in StatefulSet and DaemonSet?

@andyxning
Member

Or should pod GC come in and cover this for resources other than StatefulSet and DaemonSet?

@andyxning
Member

Just for anyone who is also interested in how failed-pod deletion is done in the StatefulSet controller:

for i := range replicas {
	// delete and recreate failed pods
	if isFailed(replicas[i]) {
		ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
			"StatefulSet %s/%s is recreating failed Pod %s",
			set.Namespace,
			set.Name,
			replicas[i].Name)
		if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
			return &status, err
		}
		// ...

@matthyx
Contributor

matthyx commented Jun 25, 2021

/triage accepted
/help
/priority important-longterm

@k8s-ci-robot
Contributor

@matthyx:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/triage accepted
/help
/priority important-longterm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jun 25, 2021
@bboreham
Contributor

According to StackOverflow:

the evicted pods will hang around until the number of them reaches the terminated-pod-gc-threshold limit (it's an option of kube-controller-manager and is equal to 12500 by default)
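
A sketch of tuning that limit; the flag belongs to kube-controller-manager, and how you pass it depends on how your control plane is deployed (e.g. the static pod manifest on kubeadm clusters):

# Only the relevant flag is shown; terminated (Failed/Succeeded) pods beyond this
# count become eligible for cleanup by the podgc controller.
kube-controller-manager --terminated-pod-gc-threshold=100   # default: 12500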

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 8, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
