
Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-e2e-gce #18852

Closed
spiffxp opened this issue Aug 14, 2020 · 13 comments
Assignees: spiffxp
Labels: area/jobs, kind/cleanup, sig/testing

Comments

spiffxp (Member) commented Aug 14, 2020

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.

Migrate pull-kubernetes-e2e-gce to k8s-infra-prow-build by adding a cluster: k8s-infra-prow-build field to the job:
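A rough sketch of the change, assuming the usual Prow presubmit layout in kubernetes/test-infra (surrounding fields are illustrative; only the cluster line is new):

presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-e2e-gce
    cluster: k8s-infra-prow-build  # new field: run this job on the dedicated build cluster
    # all other existing fields (branches, decorate, spec, etc.) stay unchanged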

NOTE: migrating this job is not as straightforward as some of the other #18550 issues, because:

  • it's being demoted from merge-blocking on release-1.19 and the main branch (as of "pull-kubernetes-e2e-gce always_run: false", #18832)
  • so to see an appropriate amount of PR traffic, either manually trigger this job (see the example below) or look at release-1.18 (and earlier) PRs
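For reference, the Prow trigger plugin lets you run the job manually by commenting on an open kubernetes/kubernetes PR:

/test pull-kubernetes-e2e-gce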

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.

Things to watch for the job

Things to watch for the build cluster

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs
/help

@spiffxp added the kind/cleanup label Aug 14, 2020
@k8s-ci-robot added the wg/k8s-infra, sig/testing, help wanted, and area/jobs labels Aug 14, 2020
spiffxp (Member, Author) commented Aug 19, 2020

/remove-help
/assign

@k8s-ci-robot removed the help wanted label Aug 19, 2020
spiffxp (Member, Author) commented Aug 19, 2020

Tangentially related, it would be nice to know whether we even need to use --stage=gs://kubernetes-release-pull (ref #18789). I already migrated pull-kubernetes-e2e-gce-ubuntu-containerd, which uses it, so I'll do the same here. But I would then like to remove the flag if it's not needed, or migrate to the kubernetes.io-owned gs://k8s-release-pull if it is.
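For context, the flag in question is passed in the job's kubetest args, roughly like this (a sketch; other args elided):

    - --stage=gs://kubernetes-release-pull

and the proposed alternative, if staging turns out to be needed at all, would be:

    - --stage=gs://k8s-release-pull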

spiffxp (Member, Author) commented Aug 19, 2020

Opened #18916

The main branch and 1.19 variants aren't merge-blocking anymore, but earlier branches are. Moving them all over

spiffxp (Member, Author) commented Sep 1, 2020

#18916 merged 2020-08-19 16:40 PT

https://prow.k8s.io/?job=pull-kubernetes-e2e-gce - shows a reasonable amount of traffic, since there is now a push to get PRs landed in time for the final cut of kubernetes v1.16. The only failures appear to be flakes.

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&graph-metrics=test-duration-minutes - overall the job duration is less spiky and has maybe gone slightly down over time

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24 - no real change in errors

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce - seeing #19034; would like to understand whether this job caused that issue or whether something else did

CPU utilization - big spikes at the beginning for the build, then nothing
[screenshot taken 2020-08-31 5:56 PM]

Memory utilization - looks about right
[screenshot taken 2020-08-31 5:58 PM]

So if it turns out #19034 is unrelated to this change, we're good. But we need to dig into that a little more first.

RobertKielty (Member) commented:

@spiffxp I see this has moved to In Progress. Will have a look at #19034 ...

snowmanstark commented:

@RobertKielty @spiffxp I would like to work on this issue

snowmanstark commented:

@spiffxp can you help me understand what the following means and how it affects the changes to be made for this issue?

it's being demoted from merge-blocking on release-1.19 and the main branch (as of #18832)

spiffxp (Member, Author) commented Sep 14, 2020

@snowmanstark

So, the changes have already been made via #18916 (see #18852 (comment))

The reason this is still open is because #19034 is unexplained, and maybe happened around the same time #18916 merged? If we can either prove that #18916 didn't cause it (see #19034 (comment)), or if we can fix #19034, then this issue can be closed.


To answer your question:

// A job is merge-blocking if it:
// - is not optional
// - reports (aka does not skip reporting)
// - always runs OR runs if some path changed
func isMergeBlocking(job cfg.Presubmit) bool {
	return !job.Optional && !job.SkipReport && (job.AlwaysRun || job.RunIfChanged != "")
}

#18832 set always_run to false for the main branch (where v1.19 was under development at the time) and for the release-1.19 branch. There is no run_if_changed for the job, so it's not considered merge-blocking for those branches.
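Concretely, a minimal sketch of the relevant fields for the main-branch variant after #18832 (other fields elided; values besides always_run are just the defaults spelled out):

- name: pull-kubernetes-e2e-gce
  always_run: false   # set by #18832
  # no run_if_changed, and optional/skip_report default to false,
  # so isMergeBlocking above returns false for this variant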

It is still merge-blocking for older branches (release-1.18, release-1.17), as we generally don't backport policy or test changes to already-released versions of kubernetes except under special circumstances.

This complicates things because the job doesn't see as much traffic as jobs that always run on all branches, so the smaller sample size means more variance, which makes it tougher to make a judgment call on "does everything still look OK."

However, I saw enough traffic in #18852 (comment) when cherry picks were being swept through ahead of the upcoming patch releases. So aside from the question of #19034, I think this looks good.

snowmanstark commented:

Thanks @spiffxp for that explanation. It makes total sense to me now. I'll look into #19034 too to get this closed.

snowmanstark commented:

@spiffxp I looked into #19034 and nothing seems to be off there.

RobertKielty (Member) commented Nov 3, 2020

Hi @spiffxp, can this issue be closed now?

I've updated #19034

We need to review #19034 for sure, but I'm confused as to how these two issues are related.

spiffxp (Member, Author) commented Nov 5, 2020

Per #18852 (comment) the reason I held this open is because I'm still not certain that migration of this job did not cause #19034. But we've lived with it unresolved for about 90d now, so I guess we can live with it unexplained for longer.

/close

k8s-ci-robot (Contributor) commented:

@spiffxp: Closing this issue.

In response to this:

Per #18852 (comment) the reason I held this open is because I'm still not certain that migration of this job did not cause #19034. But we've lived with it unresolved for about 90d now, so I guess we can live with it unexplained for longer.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
