step-registry: add configure and install IPI steps #6708

bbguimaraes · 2020-01-15T16:09:35Z

Essentially a copy of the setup container from the templates broken into
a configuration and an install step. Communication between the two is done
by writing/reading install-config.yaml to the shared secret.

The configuration step checks the cluster type and special files in the
shared secret to customize the installation.

Open questions to be solved before merging:

~~Why does ipi-deprovision-artifacts-bootstrap-commands.sh fail?~~ ci-operator/step-registry/ipi/deprovision/artifacts/bootstrap: Drop gather #6336
Which variables are necessary in ipi-deprovision-deprovision-commands.sh?

bbguimaraes · 2020-01-15T16:17:45Z

/cc @openshift/openshift-team-developer-productivity-platform

wking · 2020-01-17T20:26:23Z

Why does ipi-deprovision-artifacts-bootstrap-commands.sh fail?

Are there logs I can look at? The rehearsal job just has:

time="2020-01-15T17:21:33Z" level=info msg="Submitted rehearsal prowjob" job=rehearse-6708-pull-ci-openshift-installer-master-e2e-steps name=7a3d3ceb-37bb-11ea-a24b-0a58ac10e066 org=openshift pr=6708 repo=release type=presubmit
time="2020-01-15T18:02:08Z" level=error msg="Job failed" job=rehearse-6708-pull-ci-openshift-installer-master-e2e-steps name=7a3d3ceb-37bb-11ea-a24b-0a58ac10e066 org=openshift pr=6708 repo=release state=failure type=presubmit

which doesn't go into detail about the failure. Are there logs somewhere else I can look at?

bbguimaraes · 2020-01-20T10:26:03Z

@wking: that was specifically about 40f2371. That python program has always failed in my tests and seems to fail in production also, e.g.:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/286/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/798/artifacts/e2e-aws-operator/container-logs/teardown.log

Traceback (most recent call last):
  File "<string>", line 1, in <module>
KeyError: 'modules'

I still need to make some adjustments to how the tests are executed in CI for rehearsals to work.

bbguimaraes · 2020-01-21T15:10:36Z

Here's a summary of the RBAC problems (which I did not anticipate). Template tests are "allowed" to create any kind of resource because they are evaluated by the ci-operator ServiceAccount, an admin in the test namespace.

Multi-stage tests, however, use the same account as other tests (default), so the set of allowed operations is much more limited. The actual failures seen in the rehearsal are due to the commands here:

https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/rbac/ipi-install-rbac-commands.sh

They are:

Assign system:image-puller to system:unauthenticated and system:authenticated.
- required for all E2E tests
Assign admin to system:serviceaccount:ci:ci-chat-bot.
- 17dd954
- 1c9bb1b
Create a role that can edit ImageStreams, ISTags, and Layers.
- required for the mirror variant of tests (not yet implemented in the step registry)
- 246118f

stevekuznetsov · 2020-01-21T17:37:33Z

Let's run the test Pods with the ci-operator ServiceAccount for now to unblock you and take this down as something we might need to rethink later.

bbguimaraes · 2020-01-22T18:08:31Z

First, an additional requirement I did not mention: the secret-wrapper requires create and update for Secrets*.

@stevekuznetsov: since this will inevitably require changes to ci-operator, how about this:

optional: create a test-specific ServiceAccount (or just use default)
assign system:image-puller to it
create and assign a Role that can list/create RoleBindings and update Secrets (resourceNames can be used to to limit access to the test-specific Secret)
change the secret-wrapper so it always uses update and doesn't try to create the Secret

With these changes, I was able to execute the ipi-conf and ipi-install-rbac steps in this PR with the modified ServiceAccount. I am currently executing the full install/deprovision test to be sure, but don't expect it to fail.

* The RBAC API is doubly inconvenient here. The ServiceAccount requires create with the current code (even though we never actually create the Secret) because calls fail even if they would result in an "already exists" error. We also cannot limit create to a single secret (kubernetes/kubernetes#80295).

bbguimaraes · 2020-01-22T18:19:32Z

As for the other RBAC rules that are not part of the basic test flow:

The implementation of the mirror variant is already suspicious, but we could allow tests to edit ImageStreams/ISTags/Layers by extending the Role in my previous message.
Giving admin to the test is undoubtedly not a good idea. My guess is it was done using the template for convenience, but we can add the capability to give admin rights to a service account to ci-operator and eliminate the need for it.

stevekuznetsov · 2020-01-22T22:39:38Z

I agree - that sounds like a good approach. Given the current separation of concerns and the amount of release-specific logic that exists already in CI Operator I wonder why the mirror variant exists in the template instead. Let's punt on that (it can continue using the template for now). If we can get the normal e2e flows migrated for one job we can crowd-source the rest IMO.

bbguimaraes · 2020-01-23T12:03:30Z

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/6708/rehearse-6708-pull-ci-openshift-installer-master-e2e-steps/5

\o/

I'll open PRs with the changes made to the images.

wking · 2020-01-25T21:17:28Z

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/6708/rehearse-6708-pull-ci-openshift-installer-master-e2e-steps/5

This job should get adjusted to show up in GitHub, right? Currently you need to drop down into pj-rehearse to find it?

wking · 2020-01-25T21:28:24Z

A few more notes on the successful job:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/6708/rehearse-6708-pull-ci-openshift-installer-master-e2e-steps/5/build-log.txt | grep 'Pod.*steps'
2020/01/23 10:51:56 Pod e2e-steps-ipi-conf succeeded after 54s
2020/01/23 10:52:03 Pod e2e-steps-ipi-install-rbac succeeded after 5s
2020/01/23 11:36:02 Pod e2e-steps-ipi-install-install succeeded after 43m56s
2020/01/23 11:40:38 Pod e2e-steps-ipi-deprovision-artifacts-artifacts succeeded after 4m34s
2020/01/23 11:40:47 Pod e2e-steps-ipi-deprovision-artifacts-bootstrap succeeded after 6s
2020/01/23 11:44:52 Pod e2e-steps-ipi-deprovision-artifacts-must-gather succeeded after 4m4s
2020/01/23 11:48:28 Pod e2e-steps-ipi-deprovision-deprovision succeeded after 3m30s

Looks like it's missing actual tests? Goes straight from install to deprovision.

And we may want to adjust container asset collection, now that there's no longer a single shared volume? I'm less clear on how steps share assets, but currently you're pushing up the test-clusters admin kubeconfig and password. I don't have a problem with that (by the time they get pushed, we've torn down the cluster they granted access to), but avoiding leaking them was the motivation behind #6692. And I don't see assets from the ipi-deprovision-deprovision container? Or logs from any of the step containers?

wking · 2020-01-25T21:32:24Z

ci-operator/jobs/openshift/installer/openshift-installer-master-presubmits.yaml

+    name: pull-ci-openshift-installer-master-e2e-steps
+    optional: true
+    rerun_command: /test e2e-steps
+    skip_report: true


I guess this is why it's not showing up in GitHub. Personally, I don't see why anyone would ever want to set this. Folks who aren't interested can just not click through.

The idea is to avoid noise in the PRs in openshift/installer while we work on the job. The final one will not have it.

wking · 2020-01-25T21:33:29Z

ci-operator/config/openshift/installer/openshift-installer-master.yaml

@@ -197,3 +197,7 @@ tests:
  commands: TEST_SUITE=openshift/conformance/parallel run-tests
  openshift_installer_upi:
    cluster_profile: vsphere
+- as: e2e-steps


This should probably include aws and ipi somewhere in its name, since the installer is also going to want to run other flavors.

The ultimate goal is to replace the existing job, so this is just a temporary name (see also the previous response).

wking · 2020-01-25T21:36:11Z

ci-operator/step-registry/ipi/conf/ipi-conf-commands.sh

+trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi' TERM
+
+cluster_profile=/var/run/secrets/ci.openshift.io/cluster-profile
+cluster_name=${NAMESPACE}-${JOB_NAME_HASH}


nit: Neither of these is likely to contain shell-sensitive characters, but I like quoting variables just in case:

cluster_name="${NAMESPACE}-${JOB_NAME_HASH}"

It's not clear to me how ShellCheck passed without that. Is the ShellCheck job currently an empty shim?

There's no need to quote assignments, there's no ambiguity (as opposed to expansion in commands):

$ a='a b'; b='c d'; c=$a-$b; echo "==$c==" ==a b-c d==

wking · 2020-01-25T21:40:33Z

ci-operator/step-registry/ipi/conf/ipi-conf-commands.sh

+    aws) base_domain=origin-ci-int-aws.dev.rhcloud.com;;
+    azure) base_domain=ci.azure.devcluster.openshift.com;;
+    gcp) base_domain=origin-ci-int-gce.dev.openshift.com;;
+    esac


do we want a:

*) echo "no base domain configured for cluster type ${CLUSTER_TYPE}" >&2; exit 1;;

To guard against folks adding new types and forgetting to bump here?

Good as follow-ups when this is in prod.

There is a check in the final case statement, but I can see how a failure here could also be useful. I'll add it.

wking · 2020-01-25T21:43:55Z

ci-operator/step-registry/ipi/conf/ipi-conf-commands.sh

+        1) subnets="['subnet-0170ee5ccdd7e7823','subnet-0d50cac95bebb5a6e','subnet-0094864467fc2e737','subnet-0daa3919d85296eb6','subnet-0ab1e11d3ed63cc97','subnet-07681ad7ce2b6c281']";;
+        2) subnets="['subnet-00de9462cf29cd3d3','subnet-06595d2851257b4df','subnet-04bbfdd9ca1b67e74','subnet-096992ef7d807f6b4','subnet-0b3d7ba41fc6278b2','subnet-0b99293450e2edb13']";;
+        3) subnets="['subnet-047f6294332aa3c1c','subnet-0c3bce80bbc2c8f1c','subnet-038c38c7d96364d7f','subnet-027a025e9d9db95ce','subnet-04d9008469025b101','subnet-02f75024b00b20a75']";;
+        *) echo >&2 "invalid subnets index"; exit 1;;


this should expand to pick up #6833 and #6845, but I can do that in follow-up work if you don't want to pick that up here. You may also want to pull in has_variant, e.g. from here.

Good as follow-ups when this is in prod.

I'm updating my local branch after every rebase, the final version of the PR will have those.

wking · 2020-01-25T21:47:11Z

ci-operator/step-registry/ipi/conf/ipi-conf-commands.sh

+    workers=0
+fi
+
+case "${CLUSTER_TYPE}" in


there's really not a lot of shared logic between the various platforms. It may make more sense to have separate steps e.g. steps-registry/conf/aws that sets up a basic AWS install-config.yaml, etc.

Also, then FIPS and other adjustments that are platform agnostic would come in as an additional config step after the platform-specific steps had laid in the groundwork.

Good as follow-ups when this is in prod.

We still need to think about how to support an IaaS-agnostic job. My initial idea was having a single set of steps that can support all cases based on $CLUSTER_TYPE, but now that I've implemented a test (and considering future expansions), this seems too restrictive.

We can split the script once we have a clearer resolution.

openshift-ci-robot · 2020-01-30T16:54:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes, stevekuznetsov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~ci-operator/config/openshift/installer/OWNERS~~ [bbguimaraes,stevekuznetsov]
~~ci-operator/jobs/openshift/installer/OWNERS~~ [bbguimaraes,stevekuznetsov]
~~ci-operator/step-registry/ipi/OWNERS~~ [bbguimaraes,stevekuznetsov]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2020-01-30T18:36:55Z

/retest

Please review the full test history for this PR and help us cut down flakes.

Add an annotation to pods created by multi-stage tests to write the logs of all containers into the artifact directory, as is done for template tests. The `prow.k8s.io/id` label is removed so logs are not collected once more by the `artifact-uploader` controller. Detailed discussion: openshift/release#6708 (comment)

It's been 1000m since it landed in de3de20 (step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708), but it's a pretty simple container (just write a config file), so Steve suggestes 10m as more appropriate [1]. [1]: openshift#7625 (comment)

A few changes here: * Background the 'destroy cluster' call and add a trap, so we can gracefully handle TERM. More on this in 4472ace (ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster', 2019-01-23, openshift#2680). * The 'set +e' and related wrapping around the 'wait' follows de3de20 (step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708), and ensures we gather logs and other assets in the event of a failed openshift-install invocation. More on this below. * We considered piping the installer's stderr into /dev/null. The same information is going to show up in .openshift_install.log, and .openshift_install.log includes timestamps which are not present in the container's captured standard streams. By using /dev/null, we could DRY up our password redaction, but we really want installer output to end up in the build log [1], so keeping the grep business there (even though that means we end up with largely duplicated assets between the container stderr and .openshift_install.log). * Quote $?. It's never going to contain shell-sensitive characters, but neither is $! and we quote that. * Make the log copy the first thing that happens after the installer exits. This ensures we capture the logs even if the installer fails before creating a kubeconfig or metadata.json. * Shift the log-bundle copy earlier, because it cannot fail, while the SHARED_DIR copy might fail. Although I'm not sure we have a case where we'd generate a log bundle but not generate the kubeconfig and metadata.json. Testing the 'set +e' approach, just to make sure it works as expected: $ echo $BASH_VERSION 5.0.11(1)-release $ cat test.sh #!/bin/bash set -o nounset set -o errexit set -o pipefail trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi; echo "done cleanup"' TERM "${@}" & set +e wait "$!" ret="$?" set -e echo gather logs exit "$ret" $ ./test.sh sleep 600 & [2] 19397 $ kill 19397 done cleanup gather logs [2]- Exit 143 ./test.sh sleep 20 $ ./test.sh true gather logs $ echo $? 0 $ ./test.sh false gather logs $ echo $? 1 So that all looks good. [1]: openshift#7936 (comment)

History of this logic: * Initially landed for nodes in e102a16 (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Gather node console logs on AWS, 2019-12-02, openshift#6189). * Grew --text in 6ec5bf3 (installer artifacts: keep text version of instance output, 2020-01-02, openshift#6536). * Grew machine handling in a469f53 (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Gather console logs from Machine providerID too, 2020-01-29, openshift#6906). * The node-provider-IDs creation was ported to steps in e2fd5c7 (step-registry: update deprovision step, 2020-01-30, openshift#6708), but without any consumer for the collected file. The aws-instance-ids.txt injection allows install-time steps to register additional instances for later console collection (for a proxy instance, bootstrap instance, etc.). Approvers are: * Myself and Vadim, who have touched this logic in the past. * Alberto, representing the machine-API space that needs console logs to debug failed-to-boot issues. * Colin, representing the RHCOS/machine-config space that needs console logs to debug RHCOS issues. TMPDIR documentation is based on POSIX [1]. [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03

Like 7aa198b (ci-operator/step-registry/ipi/conf/azure: Get region from Boskos lease, 2020-10-14, openshift#12584), but for AWS. I'm keeping a switch for AWS to give folks a pattern for selecting zones, if AWS breaks a zone in a particular region. We should probably distribute that (and the shared subnets, for shared-subnet tests?) via leases as well, but baby steps. I'm leaving ci-operator/templates alone; hopefully those will be gone soon. I've already updated ci-tools with openshift/ci-tools@00ebab17e1 (pkg/steps/clusterinstall/template: Get region from Boskos lease, 2020-12-11, openshift/ci-tools#1527). I'm also normalizing to uppercase shell variables, now that we are no longer constrained by Go template expansion. Hmm, at least that's why I thought the variables used to be lowercase, see 43e08e7 (ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Push AWS-specific default base domain down into the template, 2019-09-23, openshift#5151). But looking at the templates when de3de20 (step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708), I'm now not sure why these step commands were using lowercase variable names.

openshift-ci-robot requested review from ewolinetz and staebler January 15, 2020 16:10

bbguimaraes force-pushed the steps branch 3 times, most recently from c5af223 to 02fad4b Compare January 15, 2020 17:20

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 17, 2020

bbguimaraes force-pushed the steps branch from 02fad4b to fcffe73 Compare January 23, 2020 10:06

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 23, 2020

bbguimaraes force-pushed the steps branch from 5faf039 to 8c0c930 Compare January 23, 2020 10:34

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 25, 2020

wking reviewed Jan 25, 2020

View reviewed changes

stevekuznetsov approved these changes Jan 30, 2020

View reviewed changes

openshift-ci-robot assigned stevekuznetsov Jan 30, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 30, 2020

openshift-merge-robot merged commit b32b8d3 into openshift:master Jan 30, 2020

bbguimaraes deleted the steps branch January 31, 2020 09:25

bbguimaraes mentioned this pull request Jan 31, 2020

ci-operator multi-stage: save container logs openshift/ci-tools#410

Merged

wking mentioned this pull request Mar 13, 2020

ci-operator/step-registry/ipi/conf: Drop to 10m #7650

Merged

wking mentioned this pull request Mar 25, 2020

ci-operator/step-registry/ipi: Shuffle installer log gathering #7936

Merged

wking mentioned this pull request Jun 15, 2020

config: add periodic job that runs upgrades using multistage tests #9489

Closed

wking mentioned this pull request Jun 17, 2020

gather/aws/console: Add a step to gather console logs on AWS #9743

Merged

wking mentioned this pull request Dec 14, 2020

ci-operator/step-registry/ipi/conf/aws: Get region from Boskos lease #14322

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

step-registry: add configure and install IPI steps #6708

step-registry: add configure and install IPI steps #6708

bbguimaraes commented Jan 15, 2020 •

edited

Loading

bbguimaraes commented Jan 15, 2020

wking commented Jan 17, 2020

bbguimaraes commented Jan 20, 2020

bbguimaraes commented Jan 21, 2020

stevekuznetsov commented Jan 21, 2020

bbguimaraes commented Jan 22, 2020

bbguimaraes commented Jan 22, 2020

stevekuznetsov commented Jan 22, 2020

bbguimaraes commented Jan 23, 2020

wking commented Jan 25, 2020 •

edited

Loading

wking commented Jan 25, 2020

wking Jan 25, 2020

bbguimaraes Jan 28, 2020

wking Jan 25, 2020

bbguimaraes Jan 28, 2020

wking Jan 25, 2020

bbguimaraes Jan 28, 2020

wking Jan 25, 2020

stevekuznetsov Jan 27, 2020

bbguimaraes Jan 28, 2020

wking Jan 25, 2020

stevekuznetsov Jan 27, 2020

bbguimaraes Jan 28, 2020

wking Jan 25, 2020

wking Jan 25, 2020

stevekuznetsov Jan 27, 2020

bbguimaraes Jan 28, 2020

openshift-ci-robot commented Jan 30, 2020

openshift-bot commented Jan 30, 2020

step-registry: add configure and install IPI steps #6708

step-registry: add configure and install IPI steps #6708

Conversation

bbguimaraes commented Jan 15, 2020 • edited Loading

bbguimaraes commented Jan 15, 2020

wking commented Jan 17, 2020

bbguimaraes commented Jan 20, 2020

bbguimaraes commented Jan 21, 2020

stevekuznetsov commented Jan 21, 2020

bbguimaraes commented Jan 22, 2020

bbguimaraes commented Jan 22, 2020

stevekuznetsov commented Jan 22, 2020

bbguimaraes commented Jan 23, 2020

wking commented Jan 25, 2020 • edited Loading

wking commented Jan 25, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

openshift-ci-robot commented Jan 30, 2020

openshift-bot commented Jan 30, 2020

bbguimaraes commented Jan 15, 2020 •

edited

Loading

wking commented Jan 25, 2020 •

edited

Loading