step-registry: add configure and install IPI steps #6708
Conversation
/cc @openshift/openshift-team-developer-productivity-platform
Force-pushed from c5af223 to 02fad4b.
Are there logs I can look at? The rehearsal job just has:

which doesn't go into detail about the failure. Are there logs somewhere else I can look at?
@wking: that was specifically about 40f2371. That Python program has always failed in my tests and seems to fail in production also, e.g.:

I still need to make some adjustments to how the tests are executed in CI for rehearsals to work.
Here's a summary of the RBAC problems (which I did not anticipate). Template tests are "allowed" to create any kind of resource because they are evaluated by the

Multi-stage tests, however, use the same account as other tests. They are:
Let's run the test Pods with the
First, an additional requirement I did not mention: the

@stevekuznetsov: since this will inevitably require changes to
With these changes, I was able to execute the

* The RBAC API is doubly inconvenient here. The ServiceAccount requires
As for the other RBAC rules that are not part of the basic test flow:
I agree - that sounds like a good approach. Given the current separation of concerns and the amount of release-specific logic that already exists in CI Operator, I wonder why the
\o/ I'll open PRs with the changes made to the images.
This job should get adjusted to show up in GitHub, right? Currently you need to drop down into pj-rehearse to find it?
A few more notes on the successful job:

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/6708/rehearse-6708-pull-ci-openshift-installer-master-e2e-steps/5/build-log.txt | grep 'Pod.*steps'
2020/01/23 10:51:56 Pod e2e-steps-ipi-conf succeeded after 54s
2020/01/23 10:52:03 Pod e2e-steps-ipi-install-rbac succeeded after 5s
2020/01/23 11:36:02 Pod e2e-steps-ipi-install-install succeeded after 43m56s
2020/01/23 11:40:38 Pod e2e-steps-ipi-deprovision-artifacts-artifacts succeeded after 4m34s
2020/01/23 11:40:47 Pod e2e-steps-ipi-deprovision-artifacts-bootstrap succeeded after 6s
2020/01/23 11:44:52 Pod e2e-steps-ipi-deprovision-artifacts-must-gather succeeded after 4m4s
2020/01/23 11:48:28 Pod e2e-steps-ipi-deprovision-deprovision succeeded after 3m30s
```

Looks like it's missing actual tests? It goes straight from install to deprovision. And we may want to adjust container asset collection, now that there's no longer a single shared volume? I'm less clear on how steps share assets, but currently you're pushing up the test cluster's admin kubeconfig and password. I don't have a problem with that (by the time they get pushed, we've torn down the cluster they granted access to), but avoiding leaking them was the motivation behind #6692. And I don't see assets from the ipi-deprovision-deprovision container? Or logs from any of the step containers?
```yaml
name: pull-ci-openshift-installer-master-e2e-steps
optional: true
rerun_command: /test e2e-steps
skip_report: true
```
I guess this is why it's not showing up in GitHub. Personally, I don't see why anyone would ever want to set this. Folks who aren't interested can just not click through.
The idea is to avoid noise in the PRs in openshift/installer while we work on the job. The final one will not have it.
```
@@ -197,3 +197,7 @@ tests:
    commands: TEST_SUITE=openshift/conformance/parallel run-tests
  openshift_installer_upi:
    cluster_profile: vsphere
  - as: e2e-steps
```
This should probably include `aws` and `ipi` somewhere in its name, since the installer is also going to want to run other flavors.
The ultimate goal is to replace the existing job, so this is just a temporary name (see also the previous response).
```sh
trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi' TERM

cluster_profile=/var/run/secrets/ci.openshift.io/cluster-profile
cluster_name=${NAMESPACE}-${JOB_NAME_HASH}
```
nit: Neither of these is likely to contain shell-sensitive characters, but I like quoting variables just in case:

```sh
cluster_name="${NAMESPACE}-${JOB_NAME_HASH}"
```

It's not clear to me how ShellCheck passed without that. Is the ShellCheck job currently an empty shim?
There's no need to quote assignments; there's no ambiguity (as opposed to expansion in commands):

```console
$ a='a b'; b='c d'; c=$a-$b; echo "==$c=="
==a b-c d==
```
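To make the distinction concrete, here is a small POSIX-shell sketch (variable names illustrative, not from the PR) showing that assignments never field-split, while unquoted expansions in command position do:

```shell
#!/bin/sh
# Assignments: no field splitting, so quotes on the right-hand side are optional.
a='a b'
c=$a-$a               # c is the single string "a b-a b"
printf '%s\n' "$c"    # quoted expansion: one argument, printed as one line
printf '%s\n' $c      # unquoted in command position: splits into three arguments
```

This is why ShellCheck's SC2086 warning applies to expansions used as command arguments but not to plain assignments.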
```sh
aws) base_domain=origin-ci-int-aws.dev.rhcloud.com;;
azure) base_domain=ci.azure.devcluster.openshift.com;;
gcp) base_domain=origin-ci-int-gce.dev.openshift.com;;
esac
```
Do we want a:

```sh
*) echo "no base domain configured for cluster type ${CLUSTER_TYPE}" >&2; exit 1;;
```

to guard against folks adding new types and forgetting to bump here?
Good as follow-ups when this is in prod.
There is a check in the final `case` statement, but I can see how a failure here could also be useful. I'll add it.
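Putting the suggested guard together with the mapping above gives a minimal runnable sketch (the domain values are copied from the diff; defaulting `CLUSTER_TYPE` to `aws` here is only for demonstration):

```shell
#!/bin/sh
# Sketch: base-domain lookup with a default arm that fails loudly
# instead of silently producing an empty base_domain.
CLUSTER_TYPE="${CLUSTER_TYPE:-aws}"
case "${CLUSTER_TYPE}" in
aws) base_domain=origin-ci-int-aws.dev.rhcloud.com;;
azure) base_domain=ci.azure.devcluster.openshift.com;;
gcp) base_domain=origin-ci-int-gce.dev.openshift.com;;
*) echo "no base domain configured for cluster type ${CLUSTER_TYPE}" >&2; exit 1;;
esac
echo "base_domain=${base_domain}"
```

With the default arm in place, a newly added cluster type fails the configure step immediately rather than producing a broken install-config later on.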
```sh
1) subnets="['subnet-0170ee5ccdd7e7823','subnet-0d50cac95bebb5a6e','subnet-0094864467fc2e737','subnet-0daa3919d85296eb6','subnet-0ab1e11d3ed63cc97','subnet-07681ad7ce2b6c281']";;
2) subnets="['subnet-00de9462cf29cd3d3','subnet-06595d2851257b4df','subnet-04bbfdd9ca1b67e74','subnet-096992ef7d807f6b4','subnet-0b3d7ba41fc6278b2','subnet-0b99293450e2edb13']";;
3) subnets="['subnet-047f6294332aa3c1c','subnet-0c3bce80bbc2c8f1c','subnet-038c38c7d96364d7f','subnet-027a025e9d9db95ce','subnet-04d9008469025b101','subnet-02f75024b00b20a75']";;
*) echo >&2 "invalid subnets index"; exit 1;;
```
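The excerpt doesn't show where the index feeding this `case` comes from. One hypothetical way to derive it, spreading concurrent jobs across the three pre-provisioned subnet sets (the hash source and modulus are assumptions, not the PR's actual logic):

```shell
#!/bin/sh
# Hypothetical: map a job-unique string onto an index in 1..3 so that
# concurrent jobs tend to pick different subnet sets. cksum is POSIX.
JOB_NAME_HASH="${JOB_NAME_HASH:-abc123}"
crc=$(printf '%s' "${JOB_NAME_HASH}" | cksum | cut -d' ' -f1)
idx=$((crc % 3 + 1))
case "${idx}" in
1|2|3) echo "using subnet set ${idx}";;
*) echo >&2 "invalid subnets index"; exit 1;;
esac
```

The trailing default arm mirrors the one in the diff: any index outside the provisioned range is a hard error rather than a silent misconfiguration.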
Good as follow-ups when this is in prod.
I'm updating my local branch after every rebase; the final version of the PR will have those.
```sh
    workers=0
fi

case "${CLUSTER_TYPE}" in
```
There's really not a lot of shared logic between the various platforms. It may make more sense to have separate steps, e.g. `steps-registry/conf/aws`, that set up a basic AWS `install-config.yaml`, etc.
Also, then FIPS and other adjustments that are platform-agnostic would come in as an additional config step after the platform-specific steps had laid the groundwork.
Good as follow-ups when this is in prod.
We still need to think about how to support an IaaS-agnostic job. My initial idea was having a single set of steps that can support all cases based on `$CLUSTER_TYPE`, but now that I've implemented a test (and considering future expansions), this seems too restrictive. We can split the script once we have a clearer resolution.
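For reference, the single-set-of-steps approach under discussion could look roughly like this. Everything here is illustrative: the `SHARED_DIR` path, the region values, and the install-config fragment are assumptions, not the PR's actual script:

```shell
#!/bin/sh
# Hypothetical sketch: one configure step branching on CLUSTER_TYPE,
# writing install-config.yaml into a shared location for the install step.
CLUSTER_TYPE="${CLUSTER_TYPE:-aws}"
SHARED_DIR="${SHARED_DIR:-/tmp/shared}"
mkdir -p "${SHARED_DIR}"
case "${CLUSTER_TYPE}" in
aws) platform="aws: {region: us-east-1}";;
gcp) platform="gcp: {region: us-east1}";;
*) echo >&2 "unsupported CLUSTER_TYPE: ${CLUSTER_TYPE}"; exit 1;;
esac
cat > "${SHARED_DIR}/install-config.yaml" <<EOF
apiVersion: v1
platform:
  ${platform}
EOF
echo "wrote ${SHARED_DIR}/install-config.yaml for ${CLUSTER_TYPE}"
```

The per-platform alternative would move each `case` arm into its own step (e.g. a hypothetical `ipi/conf/aws`), with platform-agnostic adjustments like FIPS layered on as later config steps.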
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bbguimaraes, stevekuznetsov. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/retest

Please review the full test history for this PR and help us cut down flakes.
Add an annotation to pods created by multi-stage tests to write the logs of all containers into the artifact directory, as is done for template tests. The `prow.k8s.io/id` label is removed so logs are not collected once more by the `artifact-uploader` controller. Detailed discussion: openshift/release#6708 (comment)
It's been 1000m since it landed in de3de20 (step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708), but it's a pretty simple container (just write a config file), so Steve suggested 10m as more appropriate [1].

[1]: openshift#7625 (comment)
A few changes here:

* Background the 'destroy cluster' call and add a trap, so we can gracefully handle TERM. More on this in 4472ace (ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster', 2019-01-23, openshift#2680).
* The 'set +e' and related wrapping around the 'wait' follows de3de20 (step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708), and ensures we gather logs and other assets in the event of a failed openshift-install invocation. More on this below.
* We considered piping the installer's stderr into /dev/null. The same information is going to show up in .openshift_install.log, and .openshift_install.log includes timestamps which are not present in the container's captured standard streams. By using /dev/null, we could DRY up our password redaction, but we really want installer output to end up in the build log [1], so keeping the grep business there (even though that means we end up with largely duplicated assets between the container stderr and .openshift_install.log).
* Quote $?. It's never going to contain shell-sensitive characters, but neither is $! and we quote that.
* Make the log copy the first thing that happens after the installer exits. This ensures we capture the logs even if the installer fails before creating a kubeconfig or metadata.json.
* Shift the log-bundle copy earlier, because it cannot fail, while the SHARED_DIR copy might fail. Although I'm not sure we have a case where we'd generate a log bundle but not generate the kubeconfig and metadata.json.

Testing the 'set +e' approach, just to make sure it works as expected:

```console
$ echo $BASH_VERSION
5.0.11(1)-release
$ cat test.sh
#!/bin/bash

set -o nounset
set -o errexit
set -o pipefail

trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi; echo "done cleanup"' TERM

"${@}" &

set +e
wait "$!"
ret="$?"
set -e

echo gather logs
exit "$ret"
$ ./test.sh sleep 600 &
[2] 19397
$ kill 19397
done cleanup
gather logs
[2]- Exit 143 ./test.sh sleep 20
$ ./test.sh true
gather logs
$ echo $?
0
$ ./test.sh false
gather logs
$ echo $?
1
```

So that all looks good.

[1]: openshift#7936 (comment)
History of this logic: * Initially landed for nodes in e102a16 (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Gather node console logs on AWS, 2019-12-02, openshift#6189). * Grew --text in 6ec5bf3 (installer artifacts: keep text version of instance output, 2020-01-02, openshift#6536). * Grew machine handling in a469f53 (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Gather console logs from Machine providerID too, 2020-01-29, openshift#6906). * The node-provider-IDs creation was ported to steps in e2fd5c7 (step-registry: update deprovision step, 2020-01-30, openshift#6708), but without any consumer for the collected file. The aws-instance-ids.txt injection allows install-time steps to register additional instances for later console collection (for a proxy instance, bootstrap instance, etc.). Approvers are: * Myself and Vadim, who have touched this logic in the past. * Alberto, representing the machine-API space that needs console logs to debug failed-to-boot issues. * Colin, representing the RHCOS/machine-config space that needs console logs to debug RHCOS issues. TMPDIR documentation is based on POSIX [1]. [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03
Like 7aa198b (ci-operator/step-registry/ipi/conf/azure: Get region from Boskos lease, 2020-10-14, openshift#12584), but for AWS. I'm keeping a switch for AWS to give folks a pattern for selecting zones, if AWS breaks a zone in a particular region. We should probably distribute that (and the shared subnets, for shared-subnet tests?) via leases as well, but baby steps. I'm leaving ci-operator/templates alone; hopefully those will be gone soon. I've already updated ci-tools with openshift/ci-tools@00ebab17e1 (pkg/steps/clusterinstall/template: Get region from Boskos lease, 2020-12-11, openshift/ci-tools#1527). I'm also normalizing to uppercase shell variables, now that we are no longer constrained by Go template expansion. Hmm, at least that's why I thought the variables used to be lowercase, see 43e08e7 (ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Push AWS-specific default base domain down into the template, 2019-09-23, openshift#5151). But looking at the templates when de3de20 (step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708) landed, I'm now not sure why these step commands were using lowercase variable names.
Essentially a copy of the `setup` container from the templates, broken into a configuration and an install step. Communication between the two is done by writing/reading `install-config.yaml` to the shared secret. The configuration step checks the cluster type and special files in the shared secret to customize the installation.
Open questions to be solved before merging:

* Why does ipi-deprovision-artifacts-bootstrap-commands.sh fail? (ci-operator/step-registry/ipi/deprovision/artifacts/bootstrap: Drop gather #6336)