Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-conformance-kind-ga-only-parallel #18850

Closed
spiffxp opened this issue Aug 14, 2020 · 7 comments
Labels: area/jobs, kind/cleanup, sig/testing

Comments

@spiffxp (Member) commented Aug 14, 2020

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.

Migrate pull-kubernetes-conformance-kind-ga-only-parallel to k8s-infra-prow-build by adding a cluster: k8s-infra-prow-build field to the job (see the sketch below).
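A minimal sketch of what that change looks like in the presubmit's YAML, assuming the standard Prow presubmit layout (the surrounding fields are illustrative; the only real change is the added cluster: line):

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-conformance-kind-ga-only-parallel
    cluster: k8s-infra-prow-build  # the added field: schedule this job on the dedicated build cluster
    # ...all other fields (triggers, decoration, pod spec) stay exactly as they were
```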

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.

Things to watch for the job

Things to watch for the build cluster

  • prow-build dashboard 1w
    • is the build cluster scaling as needed? (e.g. maybe it can't scale because we've hit some kind of quota)
    • (it will probably be helpful to look at different time resolutions like 1h, 6h, 1d, 1w)
  • prowjobs-experiment 1w
    • (shows resource consumption of all job runs, pretty noisy but putting this here for completeness)

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs
/help

@spiffxp added the kind/cleanup label Aug 14, 2020
@k8s-ci-robot added the wg/k8s-infra, sig/testing, help wanted, and area/jobs labels Aug 14, 2020
@neolit123 (Member) commented:

/assign
/remove-help

@k8s-ci-robot removed the help wanted label Aug 15, 2020
@neolit123 (Member) commented:

PR: #18872

@neolit123 (Member) commented:

observations:

does the job start failing more often?

the failure rate does not seem to have decreased. given that this job tests GA-only features and we're in code freeze, i'd attribute the failures to flakes rather than to feature changes.

does the job duration look worse than before? spikier than before?

the overall duration seems to have decreased to ~20 minutes on average, from ~26-30 minutes before the migration.

do more failures show up than before?

doesn't seem like it; possibly more runs on the new cluster are needed to say for sure.

is the job wildly underutilizing its memory limit? if so, perhaps tune down (if uncertain, post evidence in this issue and ask)

i'd assume this kind cluster has 1 control-plane and 2 worker nodes (1CP2W).
https://github.com/kubernetes/test-infra/pull/18872/files#diff-e4b92f7fa3467cd10631b29b58d683daR290-R298
the job requests 4 CPUs and 9Gi of memory. my estimate is that it is not badly underutilized, but the resource requests could possibly be reduced.
a potential experiment is to cut the requests in half but keep the limits, as sketched below.
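a sketch of that experiment in the job's container spec (the 4 CPU / 9Gi request values come from the estimate above; the limit values are placeholders for whatever the job currently sets):

```yaml
resources:
  requests:
    cpu: "2"        # halved from 4 as an experiment
    memory: 4608Mi  # roughly half of the current 9Gi request
  limits:
    cpu: "4"        # limits kept unchanged (placeholder values:
    memory: 9Gi     # keep whatever the job sets today) so peak usage is still allowed
```

if utilization then regularly pushes against the limits, revert; the requests only affect scheduling headroom on the build cluster's nodes, while the limits cap what the job can actually use.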

@neolit123 (Member) commented Aug 20, 2020

looks like i need to PR myself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 to see the rest of the links.

EDIT: kubernetes/k8s.io#1165

@neolit123 (Member) commented Aug 22, 2020

metrics explorer

i hope i'm reading the data correctly. it seems the memory and CPU limit utilization peak around ~0.5-0.6, with no spikes above ~0.8 (1.0 == the limit; e.g. at a 9Gi memory limit, 0.5-0.6 utilization is roughly 4.5-5.4Gi actually in use). this seems fine - minor adjustments are possible, but the resources are not badly underutilized.

[screenshots: cpu and memory limit-utilization graphs]

@spiffxp (Member, author) commented Aug 28, 2020

/close
Agreed, this looks good. Thanks for your help!

@k8s-ci-robot (Contributor) commented:

@spiffxp: Closing this issue.

In response to this:

/close
Agreed, this looks good. Thanks for your help!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
