
ci-kubernetes-build-canary jobs are failing to be scheduled #20670

Closed
justaugustus opened this issue Jan 29, 2021 · 7 comments · Fixed by #20674
Assignees
justaugustus
Labels
area/jobs, kind/bug (Categorizes issue or PR as related to a bug.), sig/release (Categorizes an issue or PR as relevant to SIG Release.)

Comments

@justaugustus
Member

What happened:

The last six runs of ci-kubernetes-build-canary have had scheduling issues.
(Some of these are triggered reruns.)

What you expected to happen:

Successful scheduling on a Prow node 🙃

How to reproduce it (as minimally and precisely as possible):

Reproduces on any current run of the job.

Please provide links to example occurrences, if any:

Job: https://prow.k8s.io/?job=ci-kubernetes-build-canary

Failed runs:

Anything else we need to know?:

The job was recently migrated from using bootstrap to krel ci-build in #20663.
ref: kubernetes/release#1711 (comment)

cc: @kubernetes/release-engineering @spiffxp

@justaugustus added the kind/bug label Jan 29, 2021
@justaugustus
Member Author

Slack thread in #testing-ops: https://kubernetes.slack.com/archives/C7J9RP96G/p1611953363009500

@chaodaiG
Contributor

The latest successful run is 9f603e50-624b-11eb-bd55-a20b0ecfb997.

The first failed run is 2395e2c8-6254-11eb-bd55-a20b0ecfb997.

One noticeable difference in the prowjob is that the failed run includes:

    resources:
      clonerefs:
        requests:
          cpu: 100m
      initupload:
        requests:
          cpu: 100m
      place_entrypoint:
        requests:
          cpu: 100m
      sidecar:
        requests:
          cpu: 100m

which is not present in the successful run. In both runs, the test container requested:

    requests:
      cpu: 7300m
      memory: 34Gi

It's possible that each node in the k8s-infra-prow-build cluster can only allocate around 7.3 CPU or 34Gi of memory. Could someone with access check the node size?
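
For what it's worth, the number to compare against is the node's allocatable CPU (capacity minus kube/system reservations), less whatever daemonset pods already request on each node. The allocatable values appear in the node object roughly like this; the figures below are placeholders, not measurements from the cluster:

    # hypothetical node status fragment; actual values need to be read off a
    # k8s-infra-prow-build node by someone with access
    status:
      capacity:
        cpu: "8"          # placeholder machine size
        memory: 52Gi      # placeholder
      allocatable:
        cpu: 7910m        # placeholder; must cover the sum of all container requests in a pod
        memory: 48Gi      # placeholder

If the allocatable CPU lands only slightly above the test container's 7300m request, the extra utility-container requests would push the pod past what any node can offer.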

@chaodaiG
Contributor

To make it clearer: the difference in the prowjobs raises the total requested CPU from 7.3 to 7.7, and it is probably not related to memory.
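
Spelling out the arithmetic from the prowjob excerpt above:

    # CPU requests in the failed run's pod
    #   test container:                                       7300m
    #   clonerefs + initupload + place_entrypoint + sidecar:  4 x 100m = 400m
    #   total pod CPU request:                                7700m  (vs 7300m in the last successful run)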

@spiffxp
Member

spiffxp commented Jan 29, 2021

Thanks @chaodaiG, that is most likely the problem.

A job requesting all available CPU as a proxy for hogging a node all to itself is a risky strategy for exactly this reason. The k8s-infra-prow-build nodes have (among other daemonsets) calico running on them for network policy enforcement, meaning they have slightly more overhead / less allocatable CPU than the google.com default nodes.

/area jobs
/sig release
/assign @justaugustus
The kind release-blocking jobs also use pod-utils and try to hog nodes; they use 7 cpu (e.g. https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-release/release-branch-jobs/1.20.yaml#L602-L650). Try that out.
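
A minimal sketch of what following that suggestion could look like for the canary job's test container; this is only an illustration of the approach, not necessarily what the follow-up PR does, and the memory figure is just carried over from the current request:

    # hypothetical excerpt of the ci-kubernetes-build-canary test container spec,
    # mirroring the kind release-blocking jobs' approach of requesting 7 CPU so the
    # pod-utility containers and node daemonsets still fit within allocatable CPU
    resources:
      requests:
        cpu: "7"
        memory: 34Gi    # unchanged from the current job's request

With the four 100m utility-container requests on top, that keeps the pod's total at 7.4 CPU, comfortably under the 7.7 the failed runs asked for.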

@k8s-ci-robot added the area/jobs and sig/release labels Jan 29, 2021
@BenTheElder
Member

The kind release-blocking jobs also use pod-utils and try to hog nodes; they use 7 cpu (e.g. https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-release/release-branch-jobs/1.20.yaml#L602-L650). Try that out.

I/etcd just need those sweet sweet IOPs 🦆, someday we'll have scheduling for this ...

@justaugustus
Member Author

Thanks everyone for taking a peek.
Opened #20674.

@spiffxp
Member

spiffxp commented Jan 29, 2021

I/etcd just need those sweet sweet IOPs 🦆, someday we'll have scheduling for this ...

If we can make progress on moving k8s-infra-prow-build up to 1.18, we can try out nodepools with local-ssd (waaaaaay more iops): kubernetes/k8s.io#1187 (comment)
