
(ci|pull)-kubernetes-kind-e2e.* jobs are failing #19080

Closed
spiffxp opened this issue Sep 1, 2020 · 23 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@spiffxp
Member

spiffxp commented Sep 1, 2020

What happened:

Seeing a similar pattern for all other kind release-blocking jobs.

Some, but not all, of the failures are of the "There are no nodes that your pod can schedule to" variety, according to spyglass.

I suspect what is happening is the following:

  • these jobs request roughly an entire node's worth of CPU, so any node already running even one job has insufficient CPU left
  • cluster autoscaling spins up new nodes to meet the demand
  • the new nodes become available, but other jobs with smaller resource requests get scheduled onto them first, so they again have insufficient CPU
  • the pod is eventually terminated by plank for remaining in Pending too long
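
For reference, the pending-pod termination in the last bullet comes from plank's timeouts in the Prow config; a minimal sketch of the relevant knobs (the values here are illustrative, not what prow.k8s.io actually uses):

```yaml
plank:
  # How long a ProwJob pod may sit in Pending before plank errors it out.
  pod_pending_timeout: 60m
  # How long a pod may remain unscheduled before plank gives up on it.
  pod_unscheduled_timeout: 60m
```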

What you expected to happen:

I expected cluster autoscaling to spin up new nodes if none were available with sufficient capacity for these jobs.

How to reproduce it (as minimally and precisely as possible):

Please provide links to example occurrences, if any:
e.g. for https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel&width=20

Anything else we need to know?:

EDIT: turns out the PR jobs were failing a lot too, though on a few specific test cases instead of hitting the pod-pending timeout (ref: #19080 (comment))

The increased PR/job traffic caused by v1.20 PRs getting merged is almost certainly exacerbating this

@spiffxp spiffxp added the kind/bug Categorizes issue or PR as related to a bug. label Sep 1, 2020
@spiffxp
Member Author

spiffxp commented Sep 1, 2020

I think the resource limits we have in place for the presubmit variants of these jobs are what we should be using for the CI variants too

  • https://prow.k8s.io/?state=error - shows the presubmit variants of these jobs have not been having this problem
  • Limit utilization graphs show the CI jobs have plenty of headroom compared to the presubmits

CPU limit utilization for pull-kubernetes-e2e-kind [screenshot]

CPU limit utilization for ci-kubernetes-kind-e2e.* [screenshot]

Memory limit utilization for pull-kubernetes-e2e-kind [screenshot]

Memory limit utilization for ci-kubernetes-kind-e2e.* [screenshot]

If we then find we're experiencing a lot of test flakiness (but fewer jobs in error state), we may want to consider whether it's time to look into the extra complexity of scheduling merge-blocking and release-blocking jobs to separate nodepools (or clusters).

Or, we could see if there's some way we can overcome whatever I/O limits we're hitting (ref: kubernetes/k8s.io#1187)
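
For the record, the separate-nodepool idea would look roughly like a dedicated, tainted nodepool plus a matching nodeSelector/toleration on the release-blocking jobs. A sketch only; the label, taint, and image below are hypothetical, not anything configured in k8s-infra today:

```yaml
periodics:
- name: ci-kubernetes-kind-e2e-parallel   # existing job name, used here only as an example
  spec:
    nodeSelector:
      dedicated: release-blocking         # hypothetical nodepool label
    tolerations:
    - key: dedicated                      # hypothetical taint on that nodepool
      operator: Equal
      value: release-blocking
      effect: NoSchedule
    containers:
    - image: gcr.io/k8s-staging-test-infra/krte:latest   # illustrative image
```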

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

.... ok, https://prow.k8s.io/?job=pull-kubernetes-e2e-kind made me reconsider setting cpu to 4

In #19081 I matched the memory limit but set CPU to 7 instead of 4, based on the fact that integration and verify are scheduling.
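
Roughly what that change looks like in the job config (a sketch; only the CPU numbers come from this thread, the memory value and image are placeholders):

```yaml
periodics:
- name: ci-kubernetes-kind-e2e-parallel   # and similarly for the other ci-kubernetes-kind-e2e.* jobs
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/krte:latest   # placeholder image
      resources:
        requests:
          cpu: "7"        # was going to be 4; bumped to 7 per the above
          memory: "9Gi"   # placeholder: matched to the presubmit limit, actual value not quoted here
        limits:
          cpu: "7"
          memory: "9Gi"   # placeholder
```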

As for integration/verify scheduling... they are exhibiting weird duplicate/close testgrid entries, but I think this is actually an unrelated issue; I'm seeing it across a bunch of testgrid jobs, so I'm going to open a separate issue (EDIT: opened #19082, and whatever it was seems to have resolved)

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

/priority critical-urgent,
since these are release-blocking jobs

@k8s-ci-robot
Contributor

@spiffxp: The label(s) priority/critical-urgent, cannot be applied, because the repository doesn't have them

In response to this:

/priority critical-urgent,
since these are release-blocking jobs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Sep 1, 2020
@BenTheElder
Member

testgrid alerts have lit up like a christmas tree across multiple branches 😞

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

I'm not going to have much time to dig into this today.

After #19081 merged at 2020-08-31 19:35 PT, CI jobs started scheduling, but consistently failing on one of a few different tests.

If this were purely due to load, I would have expected to see some green from the CI jobs during quieter hours; based on what load looked like over the past 24h, I would especially have expected it between 10pm and 12am PT. From https://console.cloud.google.com/monitoring/dashboards/custom/10925237040785467832?project=k8s-infra-prow-build&timeDomain=1d:
[screenshot: cluster load over the past 24h]

On top of this, asking for and getting 7 cpu is still basically equivalent to claiming an entire n1-highmem-8 node (8 vCPUs), since there are no jobs scheduled to this cluster that ask for <1 cpu, so nothing else will fit alongside.

So I am suspicious of a regression or two that only kind is catching with any frequency. We may also be I/O bound in some way that is exacerbating this, but I would suggest taking a look at what's merged in kubernetes/kubernetes lately.

@BenTheElder
Member

Recent merges don't explain the similar spike in failures on release branches going back to 1.17

Most of these jobs were changed to NOT request that much CPU?


FTR pull-kubernetes-e2e-gce runs on 1x n1 (control plane) and 3x n2 (workers) in addition to the prow pod running the e2e suite, for 7+ CPU total, and the same goes for the equivalent postsubmit jobs. It also relies on dedicated disks and more RAM.

So while bin-packing these would be nice, that's mostly just about local reproducibility; autoscaling vs. boskos is a win, and a 7+ CPU request is no more expensive than a typical cloud e2e job...

@aojea
Member

aojea commented Sep 1, 2020

/cc

@BenTheElder
Member

kubelet event related tests went haywire for all currently tested releases:
https://testgrid.k8s.io/sig-release-1.17-blocking#kind-1.17-parallel
https://testgrid.k8s.io/sig-release-1.18-blocking#kind-1.18-parallel
https://testgrid.k8s.io/sig-release-1.19-blocking#kind-1.19-parallel
https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel

seemingly only within the past day or so.

Looking at https://github.com/kubernetes-sigs/kind/commits/master, not a lot has changed here recently; in fact nothing changed in the past 3 days. We did start seeing the scheduling issues around the same time these tests flaked, though.

Maybe our CI node fixup daemonsets are not running? (e.g. the one that increases inotify watches?)
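
(For anyone unfamiliar, that fixup is presumably a privileged DaemonSet that raises host sysctls. A rough sketch of the pattern; names, namespace, image, and values are illustrative, not the actual k8s-infra manifests:)

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-sysctl-fixup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-sysctl-fixup
  template:
    metadata:
      labels:
        app: node-sysctl-fixup
    spec:
      containers:
      - name: sysctl
        image: alpine:3.12
        securityContext:
          privileged: true     # needed to write host-level (non-namespaced) sysctls
        command:
        - sh
        - -c
        - |
          # Raise inotify limits so nested kind clusters don't exhaust watches,
          # then stay alive so the DaemonSet pod keeps reporting Running.
          sysctl -w fs.inotify.max_user_watches=524288
          sysctl -w fs.inotify.max_user_instances=8192
          sleep infinity
```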

@liggitt
Member

liggitt commented Sep 1, 2020

is it possible these changes made jobs that previously were serialized run in parallel:

@liggitt
Member

liggitt commented Sep 1, 2020

@alvaroaleman ^

@alvaroaleman
Member

@liggitt crier just reports jobs; it doesn't do anything about their execution, that's plank. Do we have any indications that jobs that should be serialized are getting run in parallel?

@liggitt
Member

liggitt commented Sep 1, 2020

Do we have any indications that jobs that should be serialized are getting run in parallel?

Not in particular, just looking at test-infra changes that merged around the timeframe things went sideways, and the locking changes stood out. I'm not familiar with how the CI components interact, so just wanted to check what the implications of those changes were, especially this comment:

In 0056287 we introduced PR-Level locking for presubmits to the crier github reporter in order to allow running it with multiple workers without having the workers race with each other when creating the GitHub comment.

Is it possible that lock in crier was unintentionally serializing some aspect of the CI queue?

Looking at the triage board and testgrids, scheduling failures appear to have started affecting jobs between ~12:25 ET and 1:25 ET, and the "Events should be sent by kubelets and the scheduler" failures at ~3:30 ET.


@alvaroaleman
Member

Is it possible that lock in crier was unintentionally serializing some aspect of the CI queue?

No, it's a distinct binary. Furthermore, the locking is only done for presubmit jobs, and the jobs here seem to be periodics. Looking at the examples, it seems the cluster is out of capacity: all of them hit podPendingTimeout (which, if too short, could be caused by scale-up taking too much time) or fail to schedule altogether.

@liggitt
Member

liggitt commented Sep 1, 2020

the locking is only done for presubmit jobs and the jobs here seem to be periodics.

presubmits are also badly affected by this... if we're experiencing a scheduling issue, could something letting presubmits swamp nodes affect periodics?

@alvaroaleman
Member

presubmits are also badly affected by this... if we're experiencing a scheduling issue, could something letting presubmits swamp nodes affect periodics?

Sure, if they run on the same cluster (which I guess they do?). Do you have access to that cluster? Are there any graphs that show cluster load?

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

@aojea may have identified the culprit: he noticed lots of entries in the kind kubelet logs of the form

Sep 01 19:07:11 kind-worker kubelet[611]: E0901 19:07:11.320994     611 dns.go:125] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: volume-expand-4184-7003.svc.cluster.local svc.cluster.local cluster.local test-pods.svc.cluster.local us-central1-b.c.k8s-infra-prow-build.internal c.k8s-infra-prow-build.internal

The cloud logging link shows "us-central1-b.c.k8s-infra-prow-build.internal" didn't show up before 2020-08-31 ~10:50am.

Digging through the audit logs, it turns out I fat-fingered setting VmDnsSetting=ZonalPreferred on the wrong GCP project's common instance metadata (I intended to set it on my personal project). This happened at 2020-08-31 10:37am PT, which lines up with when the jobs started failing.

I deleted it about 15 minutes ago. Checking a random node's /etc/resolv.conf showed the change had taken effect. It may take a bit for jobs to reflect this.
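
For anyone following along, the mechanism (a sketch; the exact resolv.conf contents aren't quoted in this issue, this just illustrates the zonal-DNS behavior): ZonalPreferred adds a zonal search domain to each instance's /etc/resolv.conf, and since the kubelet at this point caps a pod's search line at 6 domains, the extra entry pushes things over the limit:

```
# Node /etc/resolv.conf with global DNS (illustrative):
search c.k8s-infra-prow-build.internal google.internal

# With VmDnsSetting=ZonalPreferred, an extra zonal domain appears:
search us-central1-b.c.k8s-infra-prow-build.internal c.k8s-infra-prow-build.internal google.internal

# The kubelet prepends the cluster suffixes (<namespace>.svc.cluster.local,
# svc.cluster.local, cluster.local, plus the test-pods suffix inherited from
# the prow pod), hits its 6-entry cap, and drops the rest -- hence the
# "Search Line limits were exceeded" errors above.
```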

I suspect we're about to go through this

it was DNS

@liggitt
Member

liggitt commented Sep 1, 2020

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind showed an immediate dramatic improvement

@spiffxp spiffxp changed the title from "ci-kubernetes-kind-e2e.* jobs are failing" to "(ci|pull)-kubernetes-kind-e2e.* jobs are failing" on Sep 1, 2020
@BenTheElder
Member

BenTheElder commented Sep 1, 2020

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

I'll hold this open until I see passes on the release-branch CI jobs

@BenTheElder
Member

I see seas of green in the recent runs

@BenTheElder
Member

Opened kubernetes-sigs/kind#1860 to track mitigating this in the future
