-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci-operator/step-registry/ipi: Shuffle installer log gathering #7936
ci-operator/step-registry/ipi: Shuffle installer log gathering #7936
Conversation
c84a2fb
to
85d0d39
Compare
ci-operator/step-registry/ipi/deprovision/deprovision/ipi-deprovision-deprovision-commands.sh
Outdated
Show resolved
Hide resolved
ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh
Outdated
Show resolved
Hide resolved
ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh
Outdated
Show resolved
Hide resolved
b5e4c9a
to
5c1f20f
Compare
one of the nice things about steps was the pre, test , post stages so that the steps can be simpler and failing a stage ensure the post stage is always executed. So I would like to no add these file exists checks, like we did previously with templates. if a step fails in stage i think it's okay to bail out... |
5c1f20f
to
1044cf8
Compare
1044cf8
to
1a7b314
Compare
Rebased with 1044cf8d1 -> 1a7b314f9, so this is just one commit on master now that #7935 has landed. |
No new vSphere rehearsal for the most recent commit, but AWS has:
But ShellCheck was happy, and that line in 1a7b314f91 looks correctly quoted to me. I dunno. /retest |
openshift-install --dir /tmp/installer destroy cluster | ||
cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}/" | ||
openshift-install --dir /tmp/installer destroy cluster & | ||
|
||
set +e | ||
wait "$!" | ||
ret="$?" | ||
set -e | ||
|
||
cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}" | ||
|
||
exit "$ret" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this the other day, would creating symlinks before executing the installer work in these cases? We could then drop this type of logic from this and other steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this the other day, would creating symlinks before executing the installer work in these cases?
Might work for destroy cluster
, but would not work for install because we need to grep
or sed
to drop the kubeadmin password. I'll try if I can sort out the unbounded-variable thing...
ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh
Outdated
Show resolved
Hide resolved
AWS rehearsal still failing on the nominally-unbound variable. And I'm still not sure what's going on with that. |
A few changes here: * Background the 'destroy cluster' call and add a trap, so we can gracefully handle TERM. More on this in 4472ace (ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster', 2019-01-23, openshift#2680). * The 'set +e' and related wrapping around the 'wait' follows de3de20 (step-registry: add configure and install IPI steps, 2020-01-14, openshift#6708), and ensures we gather logs and other assets in the event of a failed openshift-install invocation. More on this below. * We considered piping the installer's stderr into /dev/null. The same information is going to show up in .openshift_install.log, and .openshift_install.log includes timestamps which are not present in the container's captured standard streams. By using /dev/null, we could DRY up our password redaction, but we really want installer output to end up in the build log [1], so keeping the grep business there (even though that means we end up with largely duplicated assets between the container stderr and .openshift_install.log). * Quote $?. It's never going to contain shell-sensitive characters, but neither is $! and we quote that. * Make the log copy the first thing that happens after the installer exits. This ensures we capture the logs even if the installer fails before creating a kubeconfig or metadata.json. * Shift the log-bundle copy earlier, because it cannot fail, while the SHARED_DIR copy might fail. Although I'm not sure we have a case where we'd generate a log bundle but not generate the kubeconfig and metadata.json. Testing the 'set +e' approach, just to make sure it works as expected: $ echo $BASH_VERSION 5.0.11(1)-release $ cat test.sh #!/bin/bash set -o nounset set -o errexit set -o pipefail trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi; echo "done cleanup"' TERM "${@}" & set +e wait "$!" ret="$?" set -e echo gather logs exit "$ret" $ ./test.sh sleep 600 & [2] 19397 $ kill 19397 done cleanup gather logs [2]- Exit 143 ./test.sh sleep 20 $ ./test.sh true gather logs $ echo $? 0 $ ./test.sh false gather logs $ echo $? 1 So that all looks good. [1]: openshift#7936 (comment)
1a7b314
to
068c5de
Compare
cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}/" | ||
openshift-install --dir /tmp/installer destroy cluster & | ||
|
||
set +e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is the intent of this to continue even if it fails? If so, it's racy -- you can have the background fail before you set +e
. Why are we doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think errexit
, even with pipefail
, considers background processes. I can't find a positive reference, but:
$ time bash -ec 'false & sleep 2; wait "$!"'
real 0m2.010s
user 0m0.004s
sys 0m0.007s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is the intent of this to continue even if it fails?
Yes. And capture the backgrounded command's exit status so we can return it later on.
If so, it's racy -- you can have the background fail before you
set +e
.
Indeed. As described in the PR topic comment, this approach is recycled from de3de20 (#6708). But testing locally with false &
, I can produce your race. Will reroll to handle log collection via trap
to avoid the race.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just
if ! openshift-install --dir /tmp/installer destroy cluster; then
# failure
else
# not
fi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I thought I'd reproduced the race in a shell earlier, but trying to nail this down for a commit message I'm having trouble:
$ echo $BASH_VERSION
5.0.11(1)-release
$ cat test.sh
#!/bin/bash
set -o nounset
set -o errexit
set -o pipefail
"${@}" &
sleep 1
set +e
wait "$!"
ret="$?"
set -e
echo "post-wait gather"
exit "${ret}"
$ ./test.sh false
post-wait gather
$ echo $?
1
$ ./test.sh sleep 2
post-wait gather
$ echo $?
0
Can you provide an example that exposes the race?
@wking: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Mysterious unbound-variable thing is gone in the most recent rehearsal, where AWS died on the multus TargetDown bug. We collected install std-stream logs (with the kubeadmin password redacted), debug install logs (also redacted), and debug destroy logs. So I think this is a solid improvement over where we are now, and we can merge this PR. There are two outstanding threads, but both are considering further pivots away from where master stands today, so I don't think they need to block this PR from landing. |
ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh
Outdated
Show resolved
Hide resolved
Steve found this opaque [1], so shift the cp into the 'set +e' block instead. The '|| :' syntax was originally from 20157f6 (Capture log bundle in install step, 2020-03-24, openshift#7899). [1]: https://github.com/openshift/release/pull/7936/files#r400512879
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: stevekuznetsov, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@wking: Updated the following 2 configmaps:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Builds on #7935; review that first.
A few changes here:
Background the
destroy cluster
call and add atrap
, so we can gracefully handleTERM
. More on this in 4472ace (ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster' #2680).The
set +e
and related wrapping around thewait
follows de3de20 (step-registry: add configure and install IPI steps #6708), and ensures we gather logs and other assets in the event of a failed openshift-install invocation.Only attempt to gather
.openshift_install.log
if it exists. This way we do not have a failedcp
kicking us out of the script and blocking the execution of subsequent gathers.Pipe the installer's stderr into
/dev/null
. The same information is going to show up in.openshift_install.log
, and.openshift_install.log
includes timestamps which are not present in the container's captured standard streams. By using/dev/null
, we can DRY up our password redaction, and the opened window for "installer dies before it sets up.openshift_install.log
" is small.Quote
$?
. It's never going to contain shell-sensitive characters, but neither is$!
and we quote that.Make the log copy the first thing that happens after the installer exits. This ensures we capture the logs even if the installer fails before creating a
kubeconfig
ormetadata.json
.Condition the log-bundle copy on
$ret
. No need to fail a globcp
attempt if the installer succeeded. We still need to ignore failed copies, because there will be no log bundle if the installer dies early (e.g. during asset generation) or late (after bootstrap teardown).Also shift this copy earlier, because it cannot fail, while the
SHARED_DIR
copy might fail. Although I'm not sure we have a case where we'd generate a log bundle but not generate thekubeconfig
andmetadata.json
.