ci-operator/step-registry/ipi: Shuffle installer log gathering #7936

wking · 2020-03-25T23:07:37Z

Builds on #7935; review that first.

A few changes here:

Background the destroy cluster call and add a trap, so we can gracefully handle TERM. More on this in 4472ace (ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster' #2680).
The set +e and related wrapping around the wait follows de3de20 (step-registry: add configure and install IPI steps #6708), and ensures we gather logs and other assets in the event of a failed openshift-install invocation.
Only attempt to gather .openshift_install.log if it exists. This way we do not have a failed cp kicking us out of the script and blocking the execution of subsequent gathers.
Pipe the installer's stderr into /dev/null. The same information is going to show up in .openshift_install.log, and .openshift_install.log includes timestamps which are not present in the container's captured standard streams. By using /dev/null, we can DRY up our password redaction, and the opened window for "installer dies before it sets up .openshift_install.log" is small.
Quote $?. It's never going to contain shell-sensitive characters, but neither is $! and we quote that.
Make the log copy the first thing that happens after the installer exits. This ensures we capture the logs even if the installer fails before creating a kubeconfig or metadata.json.
Condition the log-bundle copy on $ret. No need to fail a glob cp attempt if the installer succeeded. We still need to ignore failed copies, because there will be no log bundle if the installer dies early (e.g. during asset generation) or late (after bootstrap teardown).

Also shift this copy earlier, because it cannot fail, while the SHARED_DIR copy might fail. Although I'm not sure we have a case where we'd generate a log bundle but not generate the kubeconfig and metadata.json.

ci-operator/step-registry/ipi/deprovision/deprovision/ipi-deprovision-deprovision-commands.sh

ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh

abhinavdahiya · 2020-03-25T23:48:21Z

one of the nice things about steps was the pre, test , post stages so that the steps can be simpler and failing a stage ensure the post stage is always executed.

So I would like to no add these file exists checks, like we did previously with templates. if a step fails in stage i think it's okay to bail out...

wking · 2020-03-26T02:11:09Z

Rebased with 1044cf8d1 -> 1a7b314f9, so this is just one commit on master now that #7935 has landed.

wking · 2020-03-26T04:51:11Z

No new vSphere rehearsal for the most recent commit, but AWS has:

Container ipi-install-install exited with code 1, reason Error
---
/bin/bash: line 42: $!: unbound variable

But ShellCheck was happy, and that line in 1a7b314f91 looks correctly quoted to me. I dunno.

/retest

bbguimaraes · 2020-03-26T13:21:05Z

ci-operator/step-registry/ipi/deprovision/deprovision/ipi-deprovision-deprovision-commands.sh

-openshift-install --dir /tmp/installer destroy cluster
-cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}/"
+openshift-install --dir /tmp/installer destroy cluster &
+
+set +e
+wait "$!"
+ret="$?"
+set -e
+
+cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}"
+
+exit "$ret"


I was thinking about this the other day, would creating symlinks before executing the installer work in these cases? We could then drop this type of logic from this and other steps.

I was thinking about this the other day, would creating symlinks before executing the installer work in these cases?

Might work for destroy cluster, but would not work for install because we need to grep or sed to drop the kubeadmin password. I'll try if I can sort out the unbounded-variable thing...

ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh

wking · 2020-03-26T15:31:53Z

AWS rehearsal still failing on the nominally-unbound variable. And I'm still not sure what's going on with that.

A few changes here: * Background the 'destroy cluster' call and add a trap, so we can gracefully handle TERM. More on this in 4472ace (ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster', 2019-01-23, openshift#2680). * The 'set +e' and related wrapping around the 'wait' follows de3de20 (step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708), and ensures we gather logs and other assets in the event of a failed openshift-install invocation. More on this below. * We considered piping the installer's stderr into /dev/null. The same information is going to show up in .openshift_install.log, and .openshift_install.log includes timestamps which are not present in the container's captured standard streams. By using /dev/null, we could DRY up our password redaction, but we really want installer output to end up in the build log [1], so keeping the grep business there (even though that means we end up with largely duplicated assets between the container stderr and .openshift_install.log). * Quote $?. It's never going to contain shell-sensitive characters, but neither is $! and we quote that. * Make the log copy the first thing that happens after the installer exits. This ensures we capture the logs even if the installer fails before creating a kubeconfig or metadata.json. * Shift the log-bundle copy earlier, because it cannot fail, while the SHARED_DIR copy might fail. Although I'm not sure we have a case where we'd generate a log bundle but not generate the kubeconfig and metadata.json. Testing the 'set +e' approach, just to make sure it works as expected: $ echo $BASH_VERSION 5.0.11(1)-release $ cat test.sh #!/bin/bash set -o nounset set -o errexit set -o pipefail trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi; echo "done cleanup"' TERM "${@}" & set +e wait "$!" ret="$?" set -e echo gather logs exit "$ret" $ ./test.sh sleep 600 & [2] 19397 $ kill 19397 done cleanup gather logs [2]- Exit 143 ./test.sh sleep 20 $ ./test.sh true gather logs $ echo $? 0 $ ./test.sh false gather logs $ echo $? 1 So that all looks good. [1]: openshift#7936 (comment)

stevekuznetsov · 2020-03-26T17:13:02Z

ci-operator/step-registry/ipi/deprovision/deprovision/ipi-deprovision-deprovision-commands.sh

-cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}/"
+openshift-install --dir /tmp/installer destroy cluster &
+
+set +e


So is the intent of this to continue even if it fails? If so, it's racy -- you can have the background fail before you set +e. Why are we doing this?

I don't think errexit, even with pipefail, considers background processes. I can't find a positive reference, but:

$ time bash -ec 'false & sleep 2; wait "$!"' real 0m2.010s user 0m0.004s sys 0m0.007s

So is the intent of this to continue even if it fails?

Yes. And capture the backgrounded command's exit status so we can return it later on.

If so, it's racy -- you can have the background fail before you set +e.

Indeed. As described in the PR topic comment, this approach is recycled from de3de20 (#6708). But testing locally with false &, I can produce your race. Will reroll to handle log collection via trap to avoid the race.

Why not just

if ! openshift-install --dir /tmp/installer destroy cluster; then # failure else # not fi

we need to background it to get trap TERM handling; see 4472ace (#2680).

Hmm. I thought I'd reproduced the race in a shell earlier, but trying to nail this down for a commit message I'm having trouble:

$ echo $BASH_VERSION 5.0.11(1)-release $ cat test.sh #!/bin/bash set -o nounset set -o errexit set -o pipefail "${@}" & sleep 1 set +e wait "$!" ret="$?" set -e echo "post-wait gather" exit "${ret}" $ ./test.sh false post-wait gather $ echo $? 1 $ ./test.sh sleep 2 post-wait gather $ echo $? 0

Can you provide an example that exposes the race?

openshift-ci-robot · 2020-03-26T18:16:57Z

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/rehearse/openshift/installer/master/e2e-vsphere	1044cf8d18101ee387b8616d937a8a4c6078b60a	link	`/test pj-rehearse`
ci/rehearse/openshift/installer/master/e2e-aws	`068c5de`	link	`/test pj-rehearse`
ci/prow/pj-rehearse	`068c5de`	link	`/test pj-rehearse`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

wking · 2020-03-28T04:01:30Z

Mysterious unbound-variable thing is gone in the most recent rehearsal, where AWS died on the multus TargetDown bug. We collected install std-stream logs (with the kubeadmin password redacted), debug install logs (also redacted), and debug destroy logs. So I think this is a solid improvement over where we are now, and we can merge this PR.

There are two outstanding threads, but both are considering further pivots away from where master stands today, so I don't think they need to block this PR from landing.

ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh

Steve found this opaque [1], so shift the cp into the 'set +e' block instead. The '|| :' syntax was originally from 20157f6 (Capture log bundle in install step, 2020-03-24, openshift#7899). [1]: https://github.com/openshift/release/pull/7936/files#r400512879

openshift-ci-robot · 2020-03-30T22:26:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/ipi/deprovision/deprovision/OWNERS~~ [stevekuznetsov,wking]
~~ci-operator/step-registry/ipi/install/install/OWNERS~~ [stevekuznetsov,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2020-03-30T22:30:43Z

@wking: Updated the following 2 configmaps:

step-registry configmap in namespace ci at cluster default using the following files:
- key ipi-deprovision-deprovision-commands.sh using file ci-operator/step-registry/ipi/deprovision/deprovision/ipi-deprovision-deprovision-commands.sh
- key ipi-install-install-commands.sh using file ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh
step-registry configmap in namespace ci at cluster app.ci using the following files:
- key ipi-deprovision-deprovision-commands.sh using file ci-operator/step-registry/ipi/deprovision/deprovision/ipi-deprovision-deprovision-commands.sh
- key ipi-install-install-commands.sh using file ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh

In response to this:

Builds on #7935; review that first.

A few changes here:

Background the destroy cluster call and add a trap, so we can gracefully handle TERM. More on this in 4472ace (ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster' #2680).

The set +e and related wrapping around the wait follows de3de20 (step-registry: add configure and install IPI steps #6708), and ensures we gather logs and other assets in the event of a failed openshift-install invocation.

Only attempt to gather .openshift_install.log if it exists. This way we do not have a failed cp kicking us out of the script and blocking the execution of subsequent gathers.

Pipe the installer's stderr into /dev/null. The same information is going to show up in .openshift_install.log, and .openshift_install.log includes timestamps which are not present in the container's captured standard streams. By using /dev/null, we can DRY up our password redaction, and the opened window for "installer dies before it sets up .openshift_install.log" is small.

Quote $?. It's never going to contain shell-sensitive characters, but neither is $! and we quote that.

Make the log copy the first thing that happens after the installer exits. This ensures we capture the logs even if the installer fails before creating a kubeconfig or metadata.json.

Condition the log-bundle copy on $ret. No need to fail a glob cp attempt if the installer succeeded. We still need to ignore failed copies, because there will be no log bundle if the installer dies early (e.g. during asset generation) or late (after bootstrap teardown).

Also shift this copy earlier, because it cannot fail, while the SHARED_DIR copy might fail. Although I'm not sure we have a case where we'd generate a log bundle but not generate the kubeconfig and metadata.json.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot requested review from abhinavdahiya and crawford March 25, 2020 23:07

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2020

wking force-pushed the multi-step-installer-logs branch 2 times, most recently from c84a2fb to 85d0d39 Compare March 25, 2020 23:18

abhinavdahiya reviewed Mar 25, 2020

View reviewed changes

ci-operator/step-registry/ipi/deprovision/deprovision/ipi-deprovision-deprovision-commands.sh Outdated Show resolved Hide resolved

abhinavdahiya reviewed Mar 25, 2020

View reviewed changes

ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh Outdated Show resolved Hide resolved

abhinavdahiya reviewed Mar 25, 2020

View reviewed changes

ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh Outdated Show resolved Hide resolved

wking force-pushed the multi-step-installer-logs branch 2 times, most recently from b5e4c9a to 5c1f20f Compare March 25, 2020 23:47

wking force-pushed the multi-step-installer-logs branch from 5c1f20f to 1044cf8 Compare March 25, 2020 23:52

wking mentioned this pull request Mar 26, 2020

ci-operator/step-registry/ipi/deprovision/vsphere: Decompose into sub-steps #7935

Merged

wking force-pushed the multi-step-installer-logs branch from 1044cf8 to 1a7b314 Compare March 26, 2020 02:09

bbguimaraes reviewed Mar 26, 2020

View reviewed changes

abhinavdahiya reviewed Mar 26, 2020

View reviewed changes

ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh Outdated Show resolved Hide resolved

wking force-pushed the multi-step-installer-logs branch from 1a7b314 to 068c5de Compare March 26, 2020 16:34

stevekuznetsov reviewed Mar 26, 2020

View reviewed changes

stevekuznetsov reviewed Mar 30, 2020

View reviewed changes

ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh Outdated Show resolved Hide resolved

stevekuznetsov approved these changes Mar 30, 2020

View reviewed changes

openshift-ci-robot assigned stevekuznetsov Mar 30, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 30, 2020

openshift-merge-robot merged commit ec885bd into openshift:master Mar 30, 2020

wking deleted the multi-step-installer-logs branch March 30, 2020 22:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci-operator/step-registry/ipi: Shuffle installer log gathering #7936

ci-operator/step-registry/ipi: Shuffle installer log gathering #7936

wking commented Mar 25, 2020

abhinavdahiya commented Mar 25, 2020

wking commented Mar 26, 2020

wking commented Mar 26, 2020

bbguimaraes Mar 26, 2020

wking Mar 26, 2020

wking commented Mar 26, 2020

stevekuznetsov Mar 26, 2020

bbguimaraes Mar 27, 2020

wking Mar 27, 2020

stevekuznetsov Mar 27, 2020

wking Mar 27, 2020

wking Mar 27, 2020

openshift-ci-robot commented Mar 26, 2020

wking commented Mar 28, 2020 •

edited

Loading

openshift-ci-robot commented Mar 30, 2020

openshift-ci-robot commented Mar 30, 2020

ci-operator/step-registry/ipi: Shuffle installer log gathering #7936

ci-operator/step-registry/ipi: Shuffle installer log gathering #7936

Conversation

wking commented Mar 25, 2020

abhinavdahiya commented Mar 25, 2020

wking commented Mar 26, 2020

wking commented Mar 26, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wking commented Mar 26, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

openshift-ci-robot commented Mar 26, 2020

wking commented Mar 28, 2020 • edited Loading

openshift-ci-robot commented Mar 30, 2020

openshift-ci-robot commented Mar 30, 2020

wking commented Mar 28, 2020 •

edited

Loading