Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

step-registry: add configure and install IPI steps #6708

Merged
merged 4 commits into from
Jan 30, 2020

Conversation

bbguimaraes
Copy link
Contributor

@bbguimaraes bbguimaraes commented Jan 15, 2020

Essentially a copy of the setup container from the templates broken into
a configuration and an install step. Communication between the two is done
by writing/reading install-config.yaml to the shared secret.

The configuration step checks the cluster type and special files in the
shared secret to customize the installation.

Open questions to be solved before merging:

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 15, 2020
@bbguimaraes
Copy link
Contributor Author

/cc @openshift/openshift-team-developer-productivity-platform

@wking
Copy link
Member

wking commented Jan 17, 2020

Why does ipi-deprovision-artifacts-bootstrap-commands.sh fail?

Are there logs I can look at? The rehearsal job just has:

time="2020-01-15T17:21:33Z" level=info msg="Submitted rehearsal prowjob" job=rehearse-6708-pull-ci-openshift-installer-master-e2e-steps name=7a3d3ceb-37bb-11ea-a24b-0a58ac10e066 org=openshift pr=6708 repo=release type=presubmit
time="2020-01-15T18:02:08Z" level=error msg="Job failed" job=rehearse-6708-pull-ci-openshift-installer-master-e2e-steps name=7a3d3ceb-37bb-11ea-a24b-0a58ac10e066 org=openshift pr=6708 repo=release state=failure type=presubmit

which doesn't go into detail about the failure. Are there logs somewhere else I can look at?

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 17, 2020
@bbguimaraes
Copy link
Contributor Author

@wking: that was specifically about 40f2371. That python program has always failed in my tests and seems to fail in production also, e.g.:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/286/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/798/artifacts/e2e-aws-operator/container-logs/teardown.log

Traceback (most recent call last):
  File "<string>", line 1, in <module>
KeyError: 'modules'

I still need to make some adjustments to how the tests are executed in CI for rehearsals to work.

@bbguimaraes
Copy link
Contributor Author

Here's a summary of the RBAC problems (which I did not anticipate). Template tests are "allowed" to create any kind of resource because they are evaluated by the ci-operator ServiceAccount, an admin in the test namespace.

Multi-stage tests, however, use the same account as other tests (default), so the set of allowed operations is much more limited. The actual failures seen in the rehearsal are due to the commands here:

https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/rbac/ipi-install-rbac-commands.sh

They are:

  • Assign system:image-puller to system:unauthenticated and system:authenticated.
    • required for all E2E tests
  • Assign admin to system:serviceaccount:ci:ci-chat-bot.
  • Create a role that can edit ImageStreams, ISTags, and Layers.
    • required for the mirror variant of tests (not yet implemented in the step registry)
    • 246118f

@stevekuznetsov
Copy link
Contributor

Let's run the test Pods with the ci-operator ServiceAccount for now to unblock you and take this down as something we might need to rethink later.

@bbguimaraes
Copy link
Contributor Author

First, an additional requirement I did not mention: the secret-wrapper requires create and update for Secrets*.

@stevekuznetsov: since this will inevitably require changes to ci-operator, how about this:

  • optional: create a test-specific ServiceAccount (or just use default)
  • assign system:image-puller to it
  • create and assign a Role that can list/create RoleBindings and update Secrets (resourceNames can be used to to limit access to the test-specific Secret)
  • change the secret-wrapper so it always uses update and doesn't try to create the Secret

With these changes, I was able to execute the ipi-conf and ipi-install-rbac steps in this PR with the modified ServiceAccount. I am currently executing the full install/deprovision test to be sure, but don't expect it to fail.

* The RBAC API is doubly inconvenient here. The ServiceAccount requires create with the current code (even though we never actually create the Secret) because calls fail even if they would result in an "already exists" error. We also cannot limit create to a single secret (kubernetes/kubernetes#80295).

@bbguimaraes
Copy link
Contributor Author

As for the other RBAC rules that are not part of the basic test flow:

  • The implementation of the mirror variant is already suspicious, but we could allow tests to edit ImageStreams/ISTags/Layers by extending the Role in my previous message.
  • Giving admin to the test is undoubtedly not a good idea. My guess is it was done using the template for convenience, but we can add the capability to give admin rights to a service account to ci-operator and eliminate the need for it.

@stevekuznetsov
Copy link
Contributor

I agree - that sounds like a good approach. Given the current separation of concerns and the amount of release-specific logic that exists already in CI Operator I wonder why the mirror variant exists in the template instead. Let's punt on that (it can continue using the template for now). If we can get the normal e2e flows migrated for one job we can crowd-source the rest IMO.

@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 23, 2020
@bbguimaraes
Copy link
Contributor Author

@wking
Copy link
Member

wking commented Jan 25, 2020

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 25, 2020
@wking
Copy link
Member

wking commented Jan 25, 2020

A few more notes on the successful job:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/6708/rehearse-6708-pull-ci-openshift-installer-master-e2e-steps/5/build-log.txt | grep 'Pod.*steps'
2020/01/23 10:51:56 Pod e2e-steps-ipi-conf succeeded after 54s
2020/01/23 10:52:03 Pod e2e-steps-ipi-install-rbac succeeded after 5s
2020/01/23 11:36:02 Pod e2e-steps-ipi-install-install succeeded after 43m56s
2020/01/23 11:40:38 Pod e2e-steps-ipi-deprovision-artifacts-artifacts succeeded after 4m34s
2020/01/23 11:40:47 Pod e2e-steps-ipi-deprovision-artifacts-bootstrap succeeded after 6s
2020/01/23 11:44:52 Pod e2e-steps-ipi-deprovision-artifacts-must-gather succeeded after 4m4s
2020/01/23 11:48:28 Pod e2e-steps-ipi-deprovision-deprovision succeeded after 3m30s

Looks like it's missing actual tests? Goes straight from install to deprovision.

And we may want to adjust container asset collection, now that there's no longer a single shared volume? I'm less clear on how steps share assets, but currently you're pushing up the test-clusters admin kubeconfig and password. I don't have a problem with that (by the time they get pushed, we've torn down the cluster they granted access to), but avoiding leaking them was the motivation behind #6692. And I don't see assets from the ipi-deprovision-deprovision container? Or logs from any of the step containers?

name: pull-ci-openshift-installer-master-e2e-steps
optional: true
rerun_command: /test e2e-steps
skip_report: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is why it's not showing up in GitHub. Personally, I don't see why anyone would ever want to set this. Folks who aren't interested can just not click through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to avoid noise in the PRs in openshift/installer while we work on the job. The final one will not have it.

@@ -197,3 +197,7 @@ tests:
commands: TEST_SUITE=openshift/conformance/parallel run-tests
openshift_installer_upi:
cluster_profile: vsphere
- as: e2e-steps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably include aws and ipi somewhere in its name, since the installer is also going to want to run other flavors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ultimate goal is to replace the existing job, so this is just a temporary name (see also the previous response).

trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi' TERM

cluster_profile=/var/run/secrets/ci.openshift.io/cluster-profile
cluster_name=${NAMESPACE}-${JOB_NAME_HASH}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Neither of these is likely to contain shell-sensitive characters, but I like quoting variables just in case:

cluster_name="${NAMESPACE}-${JOB_NAME_HASH}"

It's not clear to me how ShellCheck passed without that. Is the ShellCheck job currently an empty shim?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to quote assignments, there's no ambiguity (as opposed to expansion in commands):

$ a='a b'; b='c d'; c=$a-$b; echo "==$c=="
==a b-c d==

aws) base_domain=origin-ci-int-aws.dev.rhcloud.com;;
azure) base_domain=ci.azure.devcluster.openshift.com;;
gcp) base_domain=origin-ci-int-gce.dev.openshift.com;;
esac
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want a:

*) echo "no base domain configured for cluster type ${CLUSTER_TYPE}" >&2; exit 1;;

To guard against folks adding new types and forgetting to bump here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good as follow-ups when this is in prod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a check in the final case statement, but I can see how a failure here could also be useful. I'll add it.

1) subnets="['subnet-0170ee5ccdd7e7823','subnet-0d50cac95bebb5a6e','subnet-0094864467fc2e737','subnet-0daa3919d85296eb6','subnet-0ab1e11d3ed63cc97','subnet-07681ad7ce2b6c281']";;
2) subnets="['subnet-00de9462cf29cd3d3','subnet-06595d2851257b4df','subnet-04bbfdd9ca1b67e74','subnet-096992ef7d807f6b4','subnet-0b3d7ba41fc6278b2','subnet-0b99293450e2edb13']";;
3) subnets="['subnet-047f6294332aa3c1c','subnet-0c3bce80bbc2c8f1c','subnet-038c38c7d96364d7f','subnet-027a025e9d9db95ce','subnet-04d9008469025b101','subnet-02f75024b00b20a75']";;
*) echo >&2 "invalid subnets index"; exit 1;;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should expand to pick up #6833 and #6845, but I can do that in follow-up work if you don't want to pick that up here. You may also want to pull in has_variant, e.g. from here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good as follow-ups when this is in prod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm updating my local branch after every rebase, the final version of the PR will have those.

workers=0
fi

case "${CLUSTER_TYPE}" in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's really not a lot of shared logic between the various platforms. It may make more sense to have separate steps e.g. steps-registry/conf/aws that sets up a basic AWS install-config.yaml, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, then FIPS and other adjustments that are platform agnostic would come in as an additional config step after the platform-specific steps had laid in the groundwork.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good as follow-ups when this is in prod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to think about how to support an IaaS-agnostic job. My initial idea was having a single set of steps that can support all cases based on $CLUSTER_TYPE, but now that I've implemented a test (and considering future expansions), this seems too restrictive.

We can split the script once we have a clearer resolution.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 30, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes, stevekuznetsov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit b32b8d3 into openshift:master Jan 30, 2020
@bbguimaraes bbguimaraes deleted the steps branch January 31, 2020 09:25
bbguimaraes added a commit to bbguimaraes/ci-tools that referenced this pull request Jan 31, 2020
Add an annotation to pods created by multi-stage tests to write the logs
of all containers into the artifact directory, as is done for template
tests.  The `prow.k8s.io/id` label is removed so logs are not collected
once more by the `artifact-uploader` controller.  Detailed discussion:

openshift/release#6708 (comment)
wking added a commit to wking/openshift-release that referenced this pull request Mar 13, 2020
It's been 1000m since it landed in de3de20 (step-registry: add
configure and install IPI steps, 2020-01-14, openshift#6708), but it's a pretty
simple container (just write a config file), so Steve suggestes 10m as
more appropriate [1].

[1]: openshift#7625 (comment)
wking added a commit to wking/openshift-release that referenced this pull request Mar 26, 2020
A few changes here:

* Background the 'destroy cluster' call and add a trap, so we can
  gracefully handle TERM.  More on this in 4472ace
  (ci-operator/templates/openshift/installer: Restore backgrounded
  'create cluster', 2019-01-23, openshift#2680).

* The 'set +e' and related wrapping around the 'wait' follows
  de3de20 (step-registry: add configure and install IPI steps,
  2020-01-14, openshift#6708), and ensures we gather logs and other assets in
  the event of a failed openshift-install invocation.  More on this
  below.

* We considered piping the installer's stderr into /dev/null.  The
  same information is going to show up in .openshift_install.log, and
  .openshift_install.log includes timestamps which are not present in
  the container's captured standard streams.  By using /dev/null, we
  could DRY up our password redaction, but we really want installer
  output to end up in the build log [1], so keeping the grep business
  there (even though that means we end up with largely duplicated
  assets between the container stderr and .openshift_install.log).

* Quote $?.  It's never going to contain shell-sensitive characters,
  but neither is $! and we quote that.

* Make the log copy the first thing that happens after the installer
  exits.  This ensures we capture the logs even if the installer fails
  before creating a kubeconfig or metadata.json.

* Shift the log-bundle copy earlier, because it cannot fail, while the
  SHARED_DIR copy might fail.  Although I'm not sure we have a case
  where we'd generate a log bundle but not generate the kubeconfig and
  metadata.json.

Testing the 'set +e' approach, just to make sure it works as expected:

  $ echo $BASH_VERSION
  5.0.11(1)-release
  $ cat test.sh
  #!/bin/bash

  set -o nounset
  set -o errexit
  set -o pipefail

  trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi; echo "done cleanup"' TERM

  "${@}" &

  set +e
  wait "$!"
  ret="$?"
  set -e

  echo gather logs
  exit "$ret"
  $ ./test.sh sleep 600 &
  [2] 19397
  $ kill 19397
  done cleanup
  gather logs

  [2]-  Exit 143                ./test.sh sleep 20
  $ ./test.sh true
  gather logs
  $ echo $?
  0
  $ ./test.sh false
  gather logs
  $ echo $?
  1

So that all looks good.

[1]: openshift#7936 (comment)
wking added a commit to wking/openshift-release that referenced this pull request Jun 18, 2020
History of this logic:

* Initially landed for nodes in e102a16
  (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e:
  Gather node console logs on AWS, 2019-12-02, openshift#6189).

* Grew --text in 6ec5bf3 (installer artifacts: keep text version of
  instance output, 2020-01-02, openshift#6536).

* Grew machine handling in a469f53
  (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e:
  Gather console logs from Machine providerID too, 2020-01-29, openshift#6906).

* The node-provider-IDs creation was ported to steps in e2fd5c7
  (step-registry: update deprovision step, 2020-01-30, openshift#6708), but
  without any consumer for the collected file.

The aws-instance-ids.txt injection allows install-time steps to
register additional instances for later console collection (for a
proxy instance, bootstrap instance, etc.).

Approvers are:

* Myself and Vadim, who have touched this logic in the past.
* Alberto, representing the machine-API space that needs console logs
  to debug failed-to-boot issues.
* Colin, representing the RHCOS/machine-config space that needs
  console logs to debug RHCOS issues.

TMPDIR documentation is based on POSIX [1].

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03
wking added a commit to wking/openshift-release that referenced this pull request Dec 15, 2020
Like 7aa198b (ci-operator/step-registry/ipi/conf/azure: Get region
from Boskos lease, 2020-10-14, openshift#12584), but for AWS.

I'm keeping a switch for AWS to give folks a pattern for selecting
zones, if AWS breaks a zone in a particular region.  We should
probably distribute that (and the shared subnets, for shared-subnet
tests?) via leases as well, but baby steps.

I'm leaving ci-operator/templates alone; hopefully those will be gone
soon.  I've already updated ci-tools with
openshift/ci-tools@00ebab17e1 (pkg/steps/clusterinstall/template: Get
region from Boskos lease, 2020-12-11, openshift/ci-tools#1527).

I'm also normalizing to uppercase shell variables, now that we are no
longer constrained by Go template expansion.  Hmm, at least that's why
I thought the variables used to be lowercase, see 43e08e7
(ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e:
Push AWS-specific default base domain down into the template,
2019-09-23, openshift#5151).  But looking at the templates when de3de20
(step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708),
I'm now not sure why these step commands were using lowercase
variable names.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants