
ci-kubernetes-build-canary jobs are failing to be scheduled #20670

Closed
justaugustus opened this issue Jan 29, 2021 · 7 comments · Fixed by #20674
Assignees
justaugustus
Labels
area/jobs, kind/bug (Categorizes issue or PR as related to a bug.), sig/release (Categorizes an issue or PR as relevant to SIG Release.)

Comments

@justaugustus
Member

What happened:

The last six runs of ci-kubernetes-build-canary have had scheduling issues.
(Some of these are triggered reruns.)

What you expected to happen:

Successful scheduling on a Prow node 🙃

How to reproduce it (as minimally and precisely as possible):

Reproduces on any current run of the job.

Please provide links to example occurrences, if any:

Job: https://prow.k8s.io/?job=ci-kubernetes-build-canary

Failed runs:

Anything else we need to know?:

The job was recently migrated from using bootstrap to krel ci-build in #20663.
ref: kubernetes/release#1711 (comment)

cc: @kubernetes/release-engineering @spiffxp

@justaugustus added the kind/bug label Jan 29, 2021
@justaugustus
Member Author

Slack thread in #testing-ops: https://kubernetes.slack.com/archives/C7J9RP96G/p1611953363009500

@chaodaiG
Contributor

The latest successful run is 9f603e50-624b-11eb-bd55-a20b0ecfb997.

The first failed run is 2395e2c8-6254-11eb-bd55-a20b0ecfb997.

One noticeable difference in the prowjob is that the failed run includes:

    resources:
      clonerefs:
        requests:
          cpu: 100m
      initupload:
        requests:
          cpu: 100m
      place_entrypoint:
        requests:
          cpu: 100m
      sidecar:
        requests:
          cpu: 100m

which is not present in the successful run. In both runs, the test container requested:

    requests:
      cpu: 7300m
      memory: 34Gi

It's possible that each node in the k8s-infra-prow-build cluster can only allocate around 7.3 CPU or 34Gi of memory. Could someone with access check the node size?
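
For what it's worth, the number to compare against is the node's allocatable CPU (capacity minus kube/system reservations), less whatever daemonset pods already request on each node. The allocatable values appear in the node object roughly like this; the figures below are placeholders, not measurements from the cluster:

    # hypothetical node status fragment; actual values need to be read off a
    # k8s-infra-prow-build node by someone with access
    status:
      capacity:
        cpu: "8"          # placeholder machine size
        memory: 52Gi      # placeholder
      allocatable:
        cpu: 7910m        # placeholder; must cover the sum of all container requests in a pod
        memory: 48Gi      # placeholder

If the allocatable CPU lands only slightly above the test container's 7300m request, the extra utility-container requests would push the pod past what any node can offer.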

@chaodaiG
Contributor

To make it clearer: the difference in the prowjobs raises the total requested CPU from 7.3 to 7.7, and it is probably not related to memory.
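
Spelling out the arithmetic from the prowjob excerpt above:

    # CPU requests in the failed run's pod
    #   test container:                                       7300m
    #   clonerefs + initupload + place_entrypoint + sidecar:  4 x 100m = 400m
    #   total pod CPU request:                                7700m  (vs 7300m in the last successful run)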

@spiffxp
Member

spiffxp commented Jan 29, 2021

Thanks @chaodaiG, that is most likely the problem.

A job requesting all available CPU as a proxy for hogging a node all to itself is a risky strategy for exactly this reason. The k8s-infra-prow-build nodes have (among other daemonsets) calico running on them for network policy enforcement, meaning they have slightly more overhead / less allocatable CPU than the google.com default nodes.

/area jobs
/sig release
/assign @justaugustus
The kind release-blocking jobs also use pod-utils and try to hog nodes; they use 7 cpu (e.g. https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-release/release-branch-jobs/1.20.yaml#L602-L650). Try that out.
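
A minimal sketch of what following that suggestion could look like for the canary job's test container; this is only an illustration of the approach, not necessarily what the follow-up PR does, and the memory figure is just carried over from the current request:

    # hypothetical excerpt of the ci-kubernetes-build-canary test container spec,
    # mirroring the kind release-blocking jobs' approach of requesting 7 CPU so the
    # pod-utility containers and node daemonsets still fit within allocatable CPU
    resources:
      requests:
        cpu: "7"
        memory: 34Gi    # unchanged from the current job's request

With the four 100m utility-container requests on top, that keeps the pod's total at 7.4 CPU, comfortably under the 7.7 the failed runs asked for.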

@k8s-ci-robot added the area/jobs and sig/release labels Jan 29, 2021
@BenTheElder
Member

The kind release-blocking jobs also use pod-utils and try to hog nodes; they use 7 cpu (e.g. https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-release/release-branch-jobs/1.20.yaml#L602-L650). Try that out.

I/etcd just need those sweet sweet IOPs 🦆, someday we'll have scheduling for this ...

@justaugustus
Member Author

Thanks everyone for taking a peek.
Opened #20674.

@spiffxp
Member

spiffxp commented Jan 29, 2021

I/etcd just need those sweet sweet IOPs 🦆, someday we'll have scheduling for this ...

If we can make progress on moving k8s-infra-prow-build up to 1.18, we can try out nodepools with local-ssd (waaaaaay more iops): kubernetes/k8s.io#1187 (comment)
