
(ci|pull)-kubernetes-kind-e2e.* jobs are failing #19080

Closed
spiffxp opened this issue Sep 1, 2020 · 23 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@spiffxp
Member

spiffxp commented Sep 1, 2020

What happened:

Seeing a similar pattern for all other kind release-blocking jobs.

Some, but not all, of the failures are of the "There are no nodes that your pod can schedule to" variety, according to spyglass.

I suspect what is happening is the following:

  • these jobs request roughly an entire node's worth of CPU, so any node already running even one job has insufficient CPU left
  • cluster autoscaling spins up new nodes to meet the demand
  • the new nodes become available, but other jobs with smaller resource requests get scheduled onto them first, so they again have insufficient CPU
  • the pod is eventually terminated by plank for remaining in Pending too long
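
For reference, the pending-pod termination in the last bullet comes from plank's timeouts in the Prow config; a minimal sketch of the relevant knobs (the values here are illustrative, not what prow.k8s.io actually uses):

```yaml
plank:
  # How long a ProwJob pod may sit in Pending before plank errors it out.
  pod_pending_timeout: 60m
  # How long a pod may remain unscheduled before plank gives up on it.
  pod_unscheduled_timeout: 60m
```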

What you expected to happen:

I expected cluster autoscaling to spin up new nodes if none were available with sufficient capacity for these jobs.

How to reproduce it (as minimally and precisely as possible):

Please provide links to example occurrences, if any:
e.g. for https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel&width=20

Anything else we need to know?:

EDIT: turns out the PR jobs were failing a lot too, though on a few specific test cases instead of hitting the pod-pending timeout (ref: #19080 (comment))

The increased PR/job traffic caused by v1.20 PRs getting merged is almost certainly exacerbating this

@spiffxp spiffxp added the kind/bug Categorizes issue or PR as related to a bug. label Sep 1, 2020
@spiffxp
Member Author

spiffxp commented Sep 1, 2020

I think the resource limits we have in place for the presubmit variants of these jobs are what we should be using for the CI variants too

  • https://prow.k8s.io/?state=error - shows the presubmit variants of these jobs have not been having this problem
  • Limit utilization graphs show the CI jobs have plenty of headroom compared to the presubmits

CPU limit utilization for pull-kubernetes-e2e-kind [screenshot]

CPU limit utilization for ci-kubernetes-kind-e2e.* [screenshot]

Memory limit utilization for pull-kubernetes-e2e-kind [screenshot]

Memory limit utilization for ci-kubernetes-kind-e2e.* [screenshot]

If we then find we're experiencing a lot of test flakiness (but fewer jobs in error state), we may want to consider whether it's time to look into the extra complexity of scheduling merge-blocking and release-blocking jobs to separate nodepools (or clusters).

Or, we could see if there's some way we can overcome whatever I/O limits we're hitting (ref: kubernetes/k8s.io#1187)
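
For the record, the separate-nodepool idea would look roughly like a dedicated, tainted nodepool plus a matching nodeSelector/toleration on the release-blocking jobs. A sketch only; the label, taint, and image below are hypothetical, not anything configured in k8s-infra today:

```yaml
periodics:
- name: ci-kubernetes-kind-e2e-parallel   # existing job name, used here only as an example
  spec:
    nodeSelector:
      dedicated: release-blocking         # hypothetical nodepool label
    tolerations:
    - key: dedicated                      # hypothetical taint on that nodepool
      operator: Equal
      value: release-blocking
      effect: NoSchedule
    containers:
    - image: gcr.io/k8s-staging-test-infra/krte:latest   # illustrative image
```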

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

.... ok, https://prow.k8s.io/?job=pull-kubernetes-e2e-kind made me reconsider setting cpu to 4

In #19081 I matched the memory limit but set CPU to 7 instead of 4, based on the fact that integration and verify are scheduling.
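
Roughly what that change looks like in the job config (a sketch; only the CPU numbers come from this thread, the memory value and image are placeholders):

```yaml
periodics:
- name: ci-kubernetes-kind-e2e-parallel   # and similarly for the other ci-kubernetes-kind-e2e.* jobs
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/krte:latest   # placeholder image
      resources:
        requests:
          cpu: "7"        # was going to be 4; bumped to 7 per the above
          memory: "9Gi"   # placeholder: matched to the presubmit limit, actual value not quoted here
        limits:
          cpu: "7"
          memory: "9Gi"   # placeholder
```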

As for integration/verify scheduling... they are exhibiting weird duplicate/close testgrid entries, but I think this is actually an unrelated issue; I'm seeing it across a bunch of testgrid jobs, so I'm going to open a separate issue (EDIT: opened #19082, and whatever it was seems to have resolved)

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

/priority critical-urgent,
since these are release-blocking jobs

@k8s-ci-robot
Contributor

@spiffxp: The label(s) priority/critical-urgent, cannot be applied, because the repository doesn't have them

In response to this:

/priority critical-urgent,
since these are release-blocking jobs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Sep 1, 2020
@BenTheElder
Member

testgrid alerts have lit up like a christmas tree across multiple branches 😞

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

I'm not going to have much time to dig into this today.

After #19081 merged at 2020-08-31 19:35 PT, CI jobs started scheduling, but consistently failing on one of a few different tests.

If this were purely due to load, I would have expected to see some green from the CI jobs during quieter hours; based on what load looked like over the past 24h, I would especially have expected it between 10pm and 12am PT. From https://console.cloud.google.com/monitoring/dashboards/custom/10925237040785467832?project=k8s-infra-prow-build&timeDomain=1d:
[screenshot: cluster load over the past 24h]

On top of this, asking for and getting 7 cpu is still basically equivalent to claiming an entire n1-highmem-8 node (8 vCPUs), since there are no jobs scheduled to this cluster that ask for <1 cpu, so nothing else will fit alongside.

So I am suspicious of a regression or two that only kind is catching with any frequency. We may also be I/O bound in some way that is exacerbating this, but I would suggest taking a look at what's merged in kubernetes/kubernetes lately.

@BenTheElder
Member

Recent merges don't explain the similar spike in failures on release branches going back to 1.17

Most of these jobs were changed to NOT request that much CPU?


FTR pull-kubernetes-e2e-gce runs on 1x n1 (control plane) and 3x n2 (workers) in addition to the prow pod running the e2e suite, for 7+ CPU total, and the same goes for the equivalent postsubmit jobs. It also relies on dedicated disks and more RAM.

So while bin-packing these would be nice, that's mostly just about local reproducibility; autoscaling vs. boskos is a win, and a 7+ CPU request is no more expensive than a typical cloud e2e job...

@aojea
Member

aojea commented Sep 1, 2020

/cc

@BenTheElder
Member

kubelet event related tests went haywire for all currently tested releases:
https://testgrid.k8s.io/sig-release-1.17-blocking#kind-1.17-parallel
https://testgrid.k8s.io/sig-release-1.18-blocking#kind-1.18-parallel
https://testgrid.k8s.io/sig-release-1.19-blocking#kind-1.19-parallel
https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel

seemingly only within the past day or so.

Looking at https://github.com/kubernetes-sigs/kind/commits/master, not a lot has changed here recently; in fact nothing changed in the past 3 days. We did start seeing the scheduling issues around the same time these tests flaked, though.

Maybe our CI node fixup daemonsets are not running? (e.g. the one that increases inotify watches?)
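
(For anyone unfamiliar, that fixup is presumably a privileged DaemonSet that raises host sysctls. A rough sketch of the pattern; names, namespace, image, and values are illustrative, not the actual k8s-infra manifests:)

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-sysctl-fixup
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-sysctl-fixup
  template:
    metadata:
      labels:
        app: node-sysctl-fixup
    spec:
      containers:
      - name: sysctl
        image: alpine:3.12
        securityContext:
          privileged: true     # needed to write host-level (non-namespaced) sysctls
        command:
        - sh
        - -c
        - |
          # Raise inotify limits so nested kind clusters don't exhaust watches,
          # then stay alive so the DaemonSet pod keeps reporting Running.
          sysctl -w fs.inotify.max_user_watches=524288
          sysctl -w fs.inotify.max_user_instances=8192
          sleep infinity
```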

@liggitt
Member

liggitt commented Sep 1, 2020

is it possible these changes made jobs that previously were serialized run in parallel:

@liggitt
Member

liggitt commented Sep 1, 2020

@alvaroaleman ^

@alvaroaleman
Member

@liggitt crier just reports jobs; it doesn't do anything about their execution, that's plank. Do we have any indications that jobs that should be serialized are getting run in parallel?

@liggitt
Member

liggitt commented Sep 1, 2020

Do we have any indications that jobs that should be serialized are getting run in parallel?

Not in particular, just looking at test-infra changes that merged around the timeframe things went sideways, and the locking changes stood out. I'm not familiar with how the CI components interact, so just wanted to check what the implications of those changes were, especially this comment:

In 0056287 we introduced PR-Level locking for presubmits to the crier github reporter in order to allow running it with multiple workers without having the workers race with each other when creating the GitHub comment.

Is it possible that lock in crier was unintentionally serializing some aspect of the CI queue?

Looking at the triage board and testgrids, scheduling failures appear to have started affecting jobs between ~12:25 ET and 1:25 ET, and the "Events should be sent by kubelets and the scheduler" failures at ~3:30 ET.


@alvaroaleman
Member

Is it possible that lock in crier was unintentionally serializing some aspect of the CI queue?

No, it's a distinct binary. Furthermore, the locking is only done for presubmit jobs, and the jobs here seem to be periodics. Looking at the examples, it seems the cluster is out of capacity: all of them hit podPendingTimeout (which, if too short, could be caused by scale-up taking too much time) or fail to schedule altogether.

@liggitt
Member

liggitt commented Sep 1, 2020

the locking is only done for presubmit jobs and the jobs here seem to be periodics.

presubmits are also badly affected by this... if we're experiencing a scheduling issue, could something letting presubmits swamp nodes affect periodics?

@alvaroaleman
Member

presubmits are also badly affected by this... if we're experiencing a scheduling issue, could something letting presubmits swamp nodes affect periodics?

Sure, if they run on the same cluster (which I guess they do?). Do you have access to that cluster? Are there any graphs that show cluster load?

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

@aojea may have identified the culprit: he noticed lots of entries in the kind kubelet logs of the form

Sep 01 19:07:11 kind-worker kubelet[611]: E0901 19:07:11.320994     611 dns.go:125] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: volume-expand-4184-7003.svc.cluster.local svc.cluster.local cluster.local test-pods.svc.cluster.local us-central1-b.c.k8s-infra-prow-build.internal c.k8s-infra-prow-build.internal

The cloud logging link shows "us-central1-b.c.k8s-infra-prow-build.internal" didn't show up before 2020-08-31 ~10:50am.

Digging through the audit logs, it turns out I fat-fingered setting VmDnsSetting=ZonalPreferred on the wrong GCP project's common instance metadata (I intended to set it on my personal project). This happened at 2020-08-31 10:37am PT, which lines up with when the jobs started failing.

I deleted it about 15 minutes ago. Checking a random node's /etc/resolv.conf showed the change had taken effect. It may take a bit for jobs to reflect this.
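
For anyone following along, the mechanism (a sketch; the exact resolv.conf contents aren't quoted in this issue, this just illustrates the zonal-DNS behavior): ZonalPreferred adds a zonal search domain to each instance's /etc/resolv.conf, and since the kubelet at this point caps a pod's search line at 6 domains, the extra entry pushes things over the limit:

```
# Node /etc/resolv.conf with global DNS (illustrative):
search c.k8s-infra-prow-build.internal google.internal

# With VmDnsSetting=ZonalPreferred, an extra zonal domain appears:
search us-central1-b.c.k8s-infra-prow-build.internal c.k8s-infra-prow-build.internal google.internal

# The kubelet prepends the cluster suffixes (<namespace>.svc.cluster.local,
# svc.cluster.local, cluster.local, plus the test-pods suffix inherited from
# the prow pod), hits its 6-entry cap, and drops the rest -- hence the
# "Search Line limits were exceeded" errors above.
```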

I suspect we're about to go through this

it was DNS

@liggitt
Member

liggitt commented Sep 1, 2020

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind showed an immediate dramatic improvement

@spiffxp spiffxp changed the title from "ci-kubernetes-kind-e2e.* jobs are failing" to "(ci|pull)-kubernetes-kind-e2e.* jobs are failing" on Sep 1, 2020
@BenTheElder
Member

BenTheElder commented Sep 1, 2020

@spiffxp
Member Author

spiffxp commented Sep 1, 2020

I'll hold this open until I see passes on the release-branch CI jobs

@BenTheElder
Member

I see seas of green in the recent runs

@BenTheElder
Member

Opened kubernetes-sigs/kind#1860 to track mitigating this in the future
