(ci|pull)-kubernetes-kind-e2e.* jobs are failing #19080
I think the resource limits we have in place for the presubmit variants of these jobs are what we should be using.
[Dashboard screenshots: CPU and memory limit utilization for pull-kubernetes-e2e-kind and for ci-kubernetes-kind-e2e.*]
If we then find we're experiencing a lot of test flakiness (but fewer jobs in error state), we may want to consider whether it's time to look into the extra complexity of scheduling merge-blocking and release-blocking jobs to separate nodepools (or clusters). Or, we could see if there's some way we can overcome whatever I/O limits we're hitting (ref: kubernetes/k8s.io#1187).
.... ok, https://prow.k8s.io/?job=pull-kubernetes-e2e-kind made me reconsider setting CPU to 4. #19081 matches the memory limit but sets CPU to 7 instead of 4, based on the fact that the integration and verify jobs are scheduling. As for integration/verify scheduling... they are exhibiting weird duplicate/close testgrid entries, but I think that is actually an unrelated issue; I'm seeing it across a bunch of testgrid jobs, so I'm going to open a separate issue. (EDIT: opened #19082, and whatever it was seems to have resolved.)
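For reference, this kind of resource block lives in the job's container spec in the Prow job config. A minimal sketch, assuming a presubmit shaped like pull-kubernetes-e2e-kind; the image tag and the exact request/limit values here are illustrative, not what #19081 actually merged:

```yaml
# Sketch only: the image and the request/limit values are illustrative,
# not the exact numbers merged in #19081.
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-e2e-kind
    decorate: true
    spec:
      containers:
      - image: gcr.io/k8s-testimages/krte:latest  # hypothetical tag
        resources:
          requests:
            cpu: "7"        # bumped from the originally considered 4
            memory: "9Gi"   # request matched to the limit
          limits:
            cpu: "7"
            memory: "9Gi"
```

Matching requests to limits gives the pod the Guaranteed QoS class, which makes its behavior on a shared build cluster more predictable.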
/priority critical-urgent,
@spiffxp: The label(s) `priority/critical-urgent,` cannot be applied, because the repository doesn't have them. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/priority critical-urgent
testgrid alerts have lit up like a christmas tree across multiple branches 😞
I'm not going to have much time to dig into this today.

After #19081 merged at 2020-08-31 19:35 PT, CI jobs started scheduling, but they consistently fail on one of a few different tests. If this were purely due to load, I would have expected to see some green from the CI jobs during quieter hours; based on what load looked like over the past 24h, I would have especially expected it between 10pm and 12am PT (from https://console.cloud.google.com/monitoring/dashboards/custom/10925237040785467832?project=k8s-infra-prow-build&timeDomain=1d).

On top of this, asking for and getting 7 CPU is still basically equivalent to getting an entire n1-highmem-8 node: the node has 8 vCPUs, and no job scheduled to this cluster asks for less than 1 CPU, so nothing else can fit alongside a 7-CPU pod. So I am suspicious of a regression or two that only kind is catching with any frequency. We may also be I/O bound in some way that is exacerbating this, but I would suggest taking a look at what has merged in kubernetes/kubernetes lately.
Recent merges don't explain the similar spike in failures on release branches going back to 1.17. Weren't most of these jobs changed to NOT request that much CPU?

FTR, pull-kubernetes-e2e-gce runs on 1x n1 (control plane) and 3x n2 (workers), in addition to the prow pod running e2e, so it uses 7+ CPU overall, and the equivalent postsubmit jobs do the same. It also relies on dedicated disk and more RAM. So while bin-packing these kind jobs is nice, even beyond the local reproducibility, autoscaling vs. boskos is a win, and a 7+ CPU request is no more expensive than typical cloudy e2e...
/cc
kubelet event related tests went haywire for all currently tested releases, seemingly only within the past day or so. Looking at https://github.com/kubernetes-sigs/kind/commits/master, not a lot changed there recently; in fact nothing changed in the past 3 days. We did start seeing the scheduling issues around the same time these tests flaked, though. Maybe our CI node fixup daemonsets are not running? (e.g. the one that increases inotify watches?)
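For context, that kind of node fixup is typically a small privileged DaemonSet that raises sysctls on every node. A minimal sketch of the pattern, with hypothetical names, image, and values rather than the actual k8s-infra manifest:

```yaml
# Sketch of a sysctl-fixup DaemonSet; name, image, and values are
# illustrative, not the actual k8s-infra manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-sysctl-fixup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-sysctl-fixup
  template:
    metadata:
      labels:
        app: node-sysctl-fixup
    spec:
      containers:
      - name: sysctl
        image: busybox
        securityContext:
          privileged: true   # needed to write host sysctls via /proc/sys
        command:
        - sh
        - -c
        - |
          sysctl -w fs.inotify.max_user_watches=524288
          sysctl -w fs.inotify.max_user_instances=512
          # keep the pod alive so the DaemonSet reports as running
          while true; do sleep 3600; done
```

If a daemonset like this stops running, nodes come up with the default (much lower) inotify limits, which kind-in-CI is known to be sensitive to.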
Is it possible these changes made jobs that previously were serialized run in parallel?
@liggitt crier just reports job results; it doesn't do anything about their execution, that is plank. Do we have any indication that jobs that should be serialized are getting run in parallel?
No, it's a distinct binary. Furthermore, the locking is only done for presubmit jobs, and the jobs here seem to be periodics. Looking at the examples, it seems the cluster is out of capacity: all of them are pod pending timeouts (which, if the timeout is too short, could be caused by scale-up taking too much time) or fail to schedule altogether.
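For context, the pod pending timeout mentioned here is a plank setting in Prow's config; a minimal sketch with illustrative values, not what prow.k8s.io actually runs:

```yaml
# Illustrative values only; not prow.k8s.io's real configuration.
plank:
  pod_pending_timeout: 15m     # abort jobs whose pod stays Pending this long
  pod_unscheduled_timeout: 5m  # abort jobs whose pod never gets scheduled
```

If cluster-autoscaler scale-up regularly takes longer than these timeouts, otherwise-healthy jobs end up reported as errors even though nothing is wrong with the tests themselves.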
presubmits are also badly affected by this... if we're experiencing a scheduling issue, could something letting presubmits swamp nodes affect periodics?
Sure, if they run on the same cluster (which I guess they do?). Do you have access to that cluster? Are there any graphs that show cluster load?
@aojea may have identified the culprit; he noticed lots of entries in the kind kubelet logs of the form
A cloud logging link shows "us-central1-b.c.k8s-infra-prow-build.internal" didn't show up before 2020-08-31 ~10:50am. Digging through audit logs, it turns out I fat-fingered a setting; I deleted it about 15 minutes ago. Checking a random node's /etc/resolv.conf showed the change had taken effect. It may be a bit before jobs reflect this. I suspect we're about to go through this
https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind showed an immediate dramatic improvement
I'll hold this open until I see passes on the release-branch CI jobs
I see seas of green in the recent runs
kubernetes-sigs/kind#1860 filed to track mitigating this in the future
What happened:
Seeing a similar pattern for all other kind release-blocking jobs.
Some, but not all, of the failures are of the variety "There are no nodes that your pod can schedule to", according to spyglass.
I suspect what is happening is the following:
What you expected to happen:
I expected cluster autoscaling to spin up new nodes if none were available with sufficient capacity for these jobs.
How to reproduce it (as minimally and precisely as possible):
Please provide links to example occurrences, if any:
e.g. for https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel&width=20
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-kind-e2e-parallel/1300531215815675904 - There are no nodes that your pod can schedule to - check your requests, tolerations, and node selectors (0/68 nodes are available: 1 Insufficient memory, 3 node(s) had taints that the pod didn't tolerate, 65 Insufficient cpu.)
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-kind-e2e-parallel/1300485096591069186 - Job execution failed: Pod pending timeout.
Anything else we need to know?:
EDIT: turns out the PR jobs were failing a lot too, though on a few specific test cases instead of hitting pod pending timeouts (ref: #19080 (comment))
The increased PR / job traffic caused by v1.20 PRs getting merged is almost certainly exacerbating this