KubeJobCompletion Prometheus alert for descheduler jobs #432
/triage support |
/kind support |
@KR411-prog thanks for opening this issue. Please provide the details below and we will try to help:
- The full CronJob YAML, with any sensitive info redacted.
- The k8s version you are using.
- The pod log for the long-running descheduler CronJob pod, with any sensitive info redacted.
|
@KR411-prog it would also be helpful to know if the descheduler pod is maxing out its CPU/memory requests or limits. Also, roughly how many nodes and pods are in the cluster? Thanks! |
Here is the cronjob config,
I am unable to capture logs because I don't have this issue today. As soon as I hit this issue again, I can share the logs here. |
@KR411-prog thanks for providing the additional details. One thing I see is that the descheduler container's requests/limits are not set. This might be a bug in the helm chart, but I did not dig into it to see whether there is an option to set them. Without the logs it will be difficult to determine the root cause. Please add the descheduler pod logs if you see this happen again. Thanks! |
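For illustration, the requests/limits being discussed would normally appear on the container spec of the CronJob's pod template. This is only a sketch with made-up values; whether the helm chart exposes a values key for these is exactly the open question above:

```yaml
# Hypothetical excerpt of the descheduler CronJob pod template.
# Values are illustrative; the chart may or may not surface these via values.yaml.
containers:
- name: descheduler
  resources:
    requests:
      cpu: 500m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 512Mi
```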
Today we got the same issue.
We received a KubeJobCompletion alert for the descheduler-1604089080 job, but the logs in the pod showed no errors, and within a minute or so the pod and job were deleted automatically. Now I see only new jobs.
So the job that triggered the alert did not have any failed status in it.
Checking the CronJob manifest, I see concurrencyPolicy set to Forbid. Would adding "startingDeadlineSeconds: 10" help improve the CronJob behaviour? |
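For context on where that field lives: startingDeadlineSeconds sits at the CronJob spec level, alongside concurrencyPolicy. A minimal sketch, with the schedule, image, and values purely illustrative:

```yaml
apiVersion: batch/v1beta1          # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: descheduler
spec:
  schedule: "*/30 * * * *"         # illustrative schedule
  concurrencyPolicy: Forbid        # skip a run while the previous one is still active
  startingDeadlineSeconds: 10      # count a run as missed if it can't start within 10s
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: descheduler
            image: descheduler/descheduler:v0.19.0   # placeholder image reference
          restartPolicy: Never
```

With Forbid plus a short startingDeadlineSeconds, a run blocked by a still-active predecessor is simply counted as missed rather than queued, so it changes when runs are skipped, not how long a stuck Job may live.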
There is another issue today. The pod logs showed an error.
I think setting activeDeadlineSeconds would resolve this issue, but I don't find the field "activeDeadlineSeconds" in the values file of the descheduler chart. |
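For reference, activeDeadlineSeconds belongs on the Job spec (under the CronJob's jobTemplate), not on the pod or the CronJob itself, which may be why it does not appear in the chart's values. A sketch with an illustrative value:

```yaml
# Sketch only: terminate a descheduler Job that runs too long.
# 1800s matches the "more than 30 mins" tuning asked about in this thread.
spec:
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800   # mark the Job failed and kill its pods after 30 minutes
```

Note that when this deadline fires, the Job is marked failed with reason DeadlineExceeded, which matches the "Job was active longer than specified deadline" error seen later in this thread.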
I found the below error in today's issue:
the pods were deleted, but the job was still in failed status with the error "Job was active longer than specified deadline". |
I see you are running descheduler v0.19, and you also mentioned your cluster is k8s 1.15. Please note that we currently only support a k8s-to-descheduler version skew of N-3 (see https://github.com/kubernetes-sigs/descheduler/#compatibility-matrix). I'm not sure if that relates to your problem, but this seems like more of an issue with the CronJob itself (though any logs you could get from the descheduler pod, possibly at a higher log verbosity, would be the best way to tell). If you can't resolve that, another option is running the descheduler as a regular Deployment instead of a CronJob. |
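To capture the more detailed logs suggested above, the klog verbosity flag can be raised on the descheduler container. Whether the helm chart exposes a values key for extra args is an assumption; the flags shown are the standard descheduler/klog ones, and the chosen level is a guess:

```yaml
# Sketch: run the descheduler container with higher log verbosity.
containers:
- name: descheduler
  image: descheduler/descheduler:v0.19.0   # placeholder image reference
  command:
    - /bin/descheduler
  args:
    - --policy-config-file=/policy-dir/policy.yaml
    - --v=4                                # klog verbosity; 4 is an illustrative level
```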
In my opinion this is a problem in the descheduler helm chart and also the k8s manifests found in the top level of the repo.
Item 1 from above is a bug in my opinion. I suppose item 2 would be a feature enhancement request.
/kind bug |
I think I can get the helm chart and k8s yaml manifests updated to hopefully mitigate this issue.
/assign |
We are receiving a KubeJobCompletion Prometheus alert for descheduler jobs with the below alert message:
kube-system/descheduler-1603424400 is taking more than 12 hours to complete.
The descheduler values file config is as shown below.
I am not sure if there is a way to tune the config to delete the job if it takes more than 30 mins. I don't find such a tuning option in the values file of this helm chart.
chart: descheduler/descheduler-helm-chart
version: "0.19.0"
Any help on how to avoid getting this alert? Is there any tuning that can be done in descheduler config?
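One set of CronJob knobs relevant to the cleanup question above controls how long finished Jobs linger; this matters because a retained failed Job keeps a completion-style alert firing until it is cleaned up. A sketch with illustrative values (these are standard CronJob/Job fields, but whether this chart surfaces them is an assumption):

```yaml
# Sketch: limit how many finished descheduler Jobs are kept around.
spec:
  successfulJobsHistoryLimit: 3   # keep at most 3 completed Jobs
  failedJobsHistoryLimit: 1       # keep at most 1 failed Job
  jobTemplate:
    spec:
      backoffLimit: 0             # illustrative: don't retry a failed descheduler run
```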