
ci-operator/config/openshift/cluster-version-operator: Generic e2e-gcp -> e2e for 4.4+ #10152



wking
Member

@wking wking commented Jul 9, 2020

In 7ee21db (#10046), I'd mentioned generic names as a way for component teams to say "we don't care which platform, other folks can pick whatever they want to balance platform volume vs. capacity". @sdodson pushed back based on lack of precedent. And we currently do rely on platforms in job-names for monitored success rates. But these are presubmits, and presubmits are noisy (e.g. if you pull-request some buggy code, your presubmits will fail, but the delivered product is not affected by your in-flight bugs). So in the presubmit case we are not constrained by job-name-based health reporting. And we run a lot of presubmit volume, so even if this approach only works in the presubmit case, it gives us a lot of rebalancing flexibility.

Benefits for component teams (the CVO maintainers in the case of this commit) include not having to care about where their jobs get scheduled. And with the job names not changing during rebalances, they don't have to remember to /skip or /refresh or whatever, because Prow would understand that it's the same effective test regardless of the underlying platform.

The benefit to rebalancing admins is that they don't have to ask component teams to opt in at rebalance time. Teams can opt in with this pattern, and rebalancing admins can just search for e2e steps that do not include a platform slug in their `as` config name.
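
As a rough illustration of that search, here is a minimal Python sketch. It assumes the `tests` entries of a ci-operator config have already been parsed into dictionaries with an `as` key. The slug list and the helper names (`platform_slug`, `agnostic_e2e_tests`) are hypothetical and not part of any existing tooling.

```python
# Assumption: these slugs are taken from job names mentioned in this thread;
# a real search would need the full list of supported platforms.
PLATFORM_SLUGS = ("aws", "gcp", "azure", "metal-ipi")


def platform_slug(test_name):
    """Return the platform slug embedded in a test name, or None."""
    # Pad with dashes so slugs only match whole dash-separated segments.
    padded = f"-{test_name}-"
    for slug in PLATFORM_SLUGS:
        if f"-{slug}-" in padded:
            return slug
    return None


def agnostic_e2e_tests(tests):
    """Return the 'as' names of e2e tests that carry no platform slug."""
    names = []
    for test in tests:
        name = test["as"]
        is_e2e = name == "e2e" or name.startswith("e2e-")
        if is_e2e and platform_slug(name) is None:
            names.append(name)
    return names


# Hypothetical stand-in for the parsed `tests` list of one config file.
tests = [
    {"as": "e2e"},
    {"as": "e2e-upgrade"},
    {"as": "e2e-aws-operator"},
    {"as": "e2e-metal-ipi"},
    {"as": "unit"},
]

print(agnostic_e2e_tests(tests))
```

Matching on whole dash-separated segments avoids false positives from names that merely contain a slug as a substring.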

Generated by manually changing ci-operator/config/... and then running:

$ make update

…p -> e2e for 4.4+

In 7ee21db (ci-operator/jobs/openshift/cluster-version-operator:
Move 4.4 and later presubmits to GCP, 2020-07-02, openshift#10046), I'd
mentioned generic names as a way for component teams to say "we don't
care which platform, other folks can pick whatever they want to
balance platform volume vs. capacity".  Scott pushed back based on
lack of precedent [1,2].  And we currently do rely on platforms in
job-names for monitored success rates.  But these are presubmits, and
presubmits are noisy (e.g. if you pull-request some buggy code, your
presubmits will fail, but the delivered product is not affected by
your in-flight bugs).  So in the presubmit case we are not constrained
by job-name-based health reporting.  And we run a lot of presubmit
volume, so even if this approach only works in the presubmit case, it
gives us a lot of rebalancing flexibility.

Benefits for component teams (the CVO maintainers in the case of this
commit) include not having to care about where their jobs get
scheduled.  And with the job names not changing during rebalances,
they don't have to remember to /skip or /refresh or whatever, because
Prow would understand that it's the same effective test regardless of
the underlying platform.

The benefit to rebalancing admins is that they don't have to ask
component teams to opt in at rebalance time.  Teams can opt in with
this pattern, and rebalancing admins can just search for e2e steps
that do not include a platform slug in their 'as' config name.

Generated by manually changing ci-operator/config/... and then running:

  $ make update

[1]: openshift#10046 (comment)
[2]: openshift#10046 (comment)

@openshift-ci-robot openshift-ci-robot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jul 9, 2020
@openshift-ci-robot openshift-ci-robot added the "lgtm" label (Indicates that a PR is ready to be merged.) Jul 9, 2020

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

@wking: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/openshift/cluster-version-operator/master/e2e | e7bb102 | link | /test pj-rehearse |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Member Author

wking commented Jul 9, 2020

e2e presubmit:

error: unable to connect to image repository registry.svc.ci.openshift.org/ci-op-gg96rlwy/stable@sha256:d3da4a0e896cf5fcf1302a8edd52264fd4a76cd8b4e17976c7cd314bb9524e03: endpoint "https://registry.svc.ci.openshift.org" does not support v2 API (got 503 Service Unavailable)

Seems unrelated to my change.

@openshift-merge-robot openshift-merge-robot merged commit be73683 into openshift:master Jul 9, 2020
@openshift-ci-robot
Contributor

@wking: Updated the following 15 configmaps:

  • ci-operator-master-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-master.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-master.yaml
  • ci-operator-4.6-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-release-4.6.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.6.yaml
  • job-config-master configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-master-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-master-presubmits.yaml
  • job-config-4.5 configmap in namespace ci at cluster api.ci using the following files:
    • key openshift-cluster-version-operator-release-4.5-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.5-presubmits.yaml
  • job-config-4.6 configmap in namespace ci at cluster api.ci using the following files:
    • key openshift-cluster-version-operator-release-4.6-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.6-presubmits.yaml
  • job-config-4.7 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-release-4.7-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.7-presubmits.yaml
  • job-config-4.4 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-release-4.4-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.4-presubmits.yaml
  • job-config-4.5 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-release-4.5-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.5-presubmits.yaml
  • job-config-4.7 configmap in namespace ci at cluster api.ci using the following files:
    • key openshift-cluster-version-operator-release-4.7-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.7-presubmits.yaml
  • ci-operator-4.5-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-release-4.5.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.5.yaml
  • ci-operator-4.4-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-release-4.4.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.4.yaml
  • ci-operator-4.7-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-release-4.7.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.7.yaml
  • job-config-master configmap in namespace ci at cluster api.ci using the following files:
    • key openshift-cluster-version-operator-master-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-master-presubmits.yaml
  • job-config-4.4 configmap in namespace ci at cluster api.ci using the following files:
    • key openshift-cluster-version-operator-release-4.4-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.4-presubmits.yaml
  • job-config-4.6 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-release-4.6-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.6-presubmits.yaml

@wking wking deleted the platform-agnostic-cvo-preflights branch July 9, 2020 17:41

wking added a commit to wking/openshift-release that referenced this pull request Jul 10, 2020

Following the pattern from e7bb102
(ci-operator/config/openshift/cluster-version-operator: Generic
e2e-gcp -> e2e for 4.4+, 2020-07-09, openshift#10152), the "e2e-aws" -> "e2e"
change declares the router presubmits to be platform-agnostic.

The AWS -> GCP change takes advantage of the platform-agnosticism to
shift CI load from AWS (where we're currently pegging Boskos lease
capacity) to GCP (where we have some spare lease capacity).

wking added a commit to wking/openshift-release that referenced this pull request Jul 17, 2020

… e2e-aws -> e2e for 4.6+

Following the pattern from e7bb102
(ci-operator/config/openshift/cluster-version-operator: Generic
e2e-gcp -> e2e for 4.4+, 2020-07-09, openshift#10152), the "e2e-aws" -> "e2e"
change declares the router presubmits to be platform-agnostic
(although David has specifically requested to be kept off Azure based
on CI-success stability concerns).

The AWS -> GCP change takes advantage of the platform-agnosticism to
shift CI load from AWS (where we're currently pegging Boskos lease
capacity) to GCP (where we have some spare lease capacity).
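
The rebalancing decision described here (move load to whichever allowed platform has spare lease capacity) can be sketched as below. This is only an illustration: the lease numbers are made up, since this thread gives no real Boskos figures, and `pick_platform` is a hypothetical helper rather than anything Boskos or ci-operator provides.

```python
def pick_platform(capacity, in_use, excluded=()):
    """Return the platform with the most free leases, skipping exclusions."""
    spare = {
        platform: capacity[platform] - in_use.get(platform, 0)
        for platform in capacity
        if platform not in excluded
    }
    if not spare:
        raise ValueError("no eligible platform")
    return max(spare, key=spare.get)


# Made-up numbers for illustration only; they are not real lease counts.
capacity = {"aws": 100, "gcp": 80, "azure": 60}
in_use = {"aws": 100, "gcp": 50, "azure": 10}

# A team that has asked to stay off Azure still gets a platform with headroom.
print(pick_platform(capacity, in_use, excluded={"azure"}))
```

Keeping the exclusion list separate from the capacity data lets a team's platform preferences survive later rebalances.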

wking added a commit to wking/openshift-release that referenced this pull request Jul 24, 2020

…-aws -> e2e for 4.6+

Following the pattern from e7bb102
(ci-operator/config/openshift/cluster-version-operator: Generic
e2e-gcp -> e2e for 4.4+, 2020-07-09, openshift#10152), the "e2e-aws" -> "e2e"
change declares the monitoring-operator presubmits to be
platform-agnostic.

I'm leaving e2e-aws-operator alone, because it relies on AWS-specific
gp2 storage configuration [1,2].

The AWS -> GCP change takes advantage of the platform-agnosticism to
shift CI load from AWS (where we're currently pegging Boskos lease
capacity) to GCP (where we have some spare lease capacity).

Generated by manually changing ci-operator/config/... and then running:

  $ make update

[1]: openshift#10377 (comment)
[2]: https://github.com/openshift/cluster-monitoring-operator/blob/e1caabda745caba6e4784095aebd75f561e1244d/test/e2e/alertmanager_test.go#L51-L61

wking added a commit to wking/openshift-release that referenced this pull request Aug 27, 2020

Following the pattern from e7bb102
(ci-operator/config/openshift/cluster-version-operator: Generic
e2e-gcp -> e2e for 4.4+, 2020-07-09, openshift#10152), this commit is using
platform-agnostic names for the required tests.  The role of
platform-agnostic tests is discussed in 4212d7d
(ci-operator/README: Discuss platform job rebalancing, 2020-07-09, openshift#10166).

The platform-specific jobs are retained, in case folks want to ask for
them explicitly with '/test e2e-aws', etc. while performing any
platform-specific tuning logic.  Making them on-demand reduces our
load in platforms where we are near capacity.

With this change:

* We grow a new, platform-agnostic e2e that is always_run=true and optional=false.
* e2e-gcp-upgrade becomes the platform-agnostic e2e-upgrade.
* e2e-aws-operator becomes the platform-agnostic e2e-operator.
* e2e-aws-disruptive becomes the platform-agnostic e2e-disruptive.
* e2e-aws and e2e-gcp become always_run=false and optional=true.
* e2e-azure and e2e-metal-ipi become always_run=false (they were already optional=true).
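
To make the resulting trigger behavior concrete, the flags stated in the list above can be modeled in a few lines of Python. This is only a sketch: the dictionary is transcribed by hand from this commit message rather than read from the generated Prow job files, the jobs whose flags the message does not state (e2e-upgrade, e2e-operator, e2e-disruptive) are left out, and the helper names are made up for illustration.

```python
# Flags after the change, transcribed from the commit message above.
presubmits = {
    "e2e": {"always_run": True, "optional": False},
    "e2e-aws": {"always_run": False, "optional": True},
    "e2e-gcp": {"always_run": False, "optional": True},
    "e2e-azure": {"always_run": False, "optional": True},
    "e2e-metal-ipi": {"always_run": False, "optional": True},
}


def runs_by_default(jobs):
    """Jobs triggered automatically on every pull request."""
    return sorted(name for name, cfg in jobs.items() if cfg["always_run"])


def on_demand(jobs):
    """Jobs that only run when someone asks for them explicitly."""
    return sorted(name for name, cfg in jobs.items() if not cfg["always_run"])


def required(jobs):
    """Jobs whose result is required rather than optional."""
    return sorted(name for name, cfg in jobs.items() if not cfg["optional"])


print("default:", runs_by_default(presubmits))
print("required:", required(presubmits))
for name in on_demand(presubmits):
    # The '/test <name>' form matches the '/test e2e-aws' example above.
    print(f"/test {name}")
```
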

I've also ordered the configs to place the platform-agnostic stuff
first and shunt the platform-specific stuff towards the end.

Generated by manually changing `ci-operator/config/...`, running:

  $ make update

and tuning the configurable [1] always_run and optional.

[1]: https://github.com/openshift/ci-tools/blob/7fb6fd8b3802e47162442c9a5e10807952ba12eb/GENERATOR.md#hand-edited-prow-configuration

wking added a commit to wking/openshift-release that referenced this pull request Aug 31, 2020

Following the pattern from e7bb102
(ci-operator/config/openshift/cluster-version-operator: Generic
e2e-gcp -> e2e for 4.4+, 2020-07-09, openshift#10152), this commit is using
platform-agnostic names.  The role of platform-agnostic tests is
discussed in ci-operator/platform-balance.  The now-agnostic tests use
a GCP implementation because we currently have free GCP capacity,
while AWS is maxed out.

I've left the benchmark test on AWS, because I'm not sure what that's
about.  It might be AWS-specific.

Generated by manually changing `ci-operator/config/...`, running:

  $ make update

wking added a commit to wking/openshift-release that referenced this pull request Aug 31, 2020

Following the pattern from e7bb102
(ci-operator/config/openshift/cluster-version-operator: Generic
e2e-gcp -> e2e for 4.4+, 2020-07-09, openshift#10152), this commit is using
platform-agnostic names.  The role of platform-agnostic tests is
discussed in ci-operator/platform-balance.  The now-agnostic tests use
a GCP implementation because we currently have free GCP capacity,
while AWS is maxed out.

Generated by manually changing `ci-operator/config/...`, running:

  $ make update

4 participants