Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci-operator/step-registry/ipi: Shuffle installer log gathering #7936

Merged

Conversation

wking
Copy link
Member

@wking wking commented Mar 25, 2020

Builds on #7935; review that first.

A few changes here:

  • Background the destroy cluster call and add a trap, so we can gracefully handle TERM. More on this in 4472ace (ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster' #2680).

  • The set +e and related wrapping around the wait follows de3de20 (step-registry: add configure and install IPI steps #6708), and ensures we gather logs and other assets in the event of a failed openshift-install invocation.

  • Only attempt to gather .openshift_install.log if it exists. This way we do not have a failed cp kicking us out of the script and blocking the execution of subsequent gathers.

  • Pipe the installer's stderr into /dev/null. The same information is going to show up in .openshift_install.log, and .openshift_install.log includes timestamps which are not present in the container's captured standard streams. By using /dev/null, we can DRY up our password redaction, and the opened window for "installer dies before it sets up .openshift_install.log" is small.

  • Quote $?. It's never going to contain shell-sensitive characters, but neither is $! and we quote that.

  • Make the log copy the first thing that happens after the installer exits. This ensures we capture the logs even if the installer fails before creating a kubeconfig or metadata.json.

  • Condition the log-bundle copy on $ret. No need to fail a glob cp attempt if the installer succeeded. We still need to ignore failed copies, because there will be no log bundle if the installer dies early (e.g. during asset generation) or late (after bootstrap teardown).

    Also shift this copy earlier, because it cannot fail, while the SHARED_DIR copy might fail. Although I'm not sure we have a case where we'd generate a log bundle but not generate the kubeconfig and metadata.json.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2020
@wking wking force-pushed the multi-step-installer-logs branch 2 times, most recently from c84a2fb to 85d0d39 Compare March 25, 2020 23:18
@wking wking force-pushed the multi-step-installer-logs branch 2 times, most recently from b5e4c9a to 5c1f20f Compare March 25, 2020 23:47
@abhinavdahiya
Copy link
Contributor

one of the nice things about steps was the pre, test , post stages so that the steps can be simpler and failing a stage ensure the post stage is always executed.

So I would like to no add these file exists checks, like we did previously with templates. if a step fails in stage i think it's okay to bail out...

@wking
Copy link
Member Author

wking commented Mar 26, 2020

Rebased with 1044cf8d1 -> 1a7b314f9, so this is just one commit on master now that #7935 has landed.

@wking
Copy link
Member Author

wking commented Mar 26, 2020

No new vSphere rehearsal for the most recent commit, but AWS has:

Container ipi-install-install exited with code 1, reason Error
---
/bin/bash: line 42: $!: unbound variable

But ShellCheck was happy, and that line in 1a7b314f91 looks correctly quoted to me. I dunno.

/retest

Comment on lines -13 to +24
openshift-install --dir /tmp/installer destroy cluster
cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}/"
openshift-install --dir /tmp/installer destroy cluster &

set +e
wait "$!"
ret="$?"
set -e

cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}"

exit "$ret"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this the other day, would creating symlinks before executing the installer work in these cases? We could then drop this type of logic from this and other steps.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this the other day, would creating symlinks before executing the installer work in these cases?

Might work for destroy cluster, but would not work for install because we need to grep or sed to drop the kubeadmin password. I'll try if I can sort out the unbounded-variable thing...

@wking
Copy link
Member Author

wking commented Mar 26, 2020

AWS rehearsal still failing on the nominally-unbound variable. And I'm still not sure what's going on with that.

A few changes here:

* Background the 'destroy cluster' call and add a trap, so we can
  gracefully handle TERM.  More on this in 4472ace
  (ci-operator/templates/openshift/installer: Restore backgrounded
  'create cluster', 2019-01-23, openshift#2680).

* The 'set +e' and related wrapping around the 'wait' follows
  de3de20 (step-registry: add configure and install IPI steps,
  2020-01-14, openshift#6708), and ensures we gather logs and other assets in
  the event of a failed openshift-install invocation.  More on this
  below.

* We considered piping the installer's stderr into /dev/null.  The
  same information is going to show up in .openshift_install.log, and
  .openshift_install.log includes timestamps which are not present in
  the container's captured standard streams.  By using /dev/null, we
  could DRY up our password redaction, but we really want installer
  output to end up in the build log [1], so keeping the grep business
  there (even though that means we end up with largely duplicated
  assets between the container stderr and .openshift_install.log).

* Quote $?.  It's never going to contain shell-sensitive characters,
  but neither is $! and we quote that.

* Make the log copy the first thing that happens after the installer
  exits.  This ensures we capture the logs even if the installer fails
  before creating a kubeconfig or metadata.json.

* Shift the log-bundle copy earlier, because it cannot fail, while the
  SHARED_DIR copy might fail.  Although I'm not sure we have a case
  where we'd generate a log bundle but not generate the kubeconfig and
  metadata.json.

Testing the 'set +e' approach, just to make sure it works as expected:

  $ echo $BASH_VERSION
  5.0.11(1)-release
  $ cat test.sh
  #!/bin/bash

  set -o nounset
  set -o errexit
  set -o pipefail

  trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi; echo "done cleanup"' TERM

  "${@}" &

  set +e
  wait "$!"
  ret="$?"
  set -e

  echo gather logs
  exit "$ret"
  $ ./test.sh sleep 600 &
  [2] 19397
  $ kill 19397
  done cleanup
  gather logs

  [2]-  Exit 143                ./test.sh sleep 20
  $ ./test.sh true
  gather logs
  $ echo $?
  0
  $ ./test.sh false
  gather logs
  $ echo $?
  1

So that all looks good.

[1]: openshift#7936 (comment)
@wking wking force-pushed the multi-step-installer-logs branch from 1a7b314 to 068c5de Compare March 26, 2020 16:34
cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}/"
openshift-install --dir /tmp/installer destroy cluster &

set +e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the intent of this to continue even if it fails? If so, it's racy -- you can have the background fail before you set +e. Why are we doing this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think errexit, even with pipefail, considers background processes. I can't find a positive reference, but:

$ time bash -ec 'false & sleep 2; wait "$!"'

real    0m2.010s
user    0m0.004s
sys     0m0.007s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the intent of this to continue even if it fails?

Yes. And capture the backgrounded command's exit status so we can return it later on.

If so, it's racy -- you can have the background fail before you set +e.

Indeed. As described in the PR topic comment, this approach is recycled from de3de20 (#6708). But testing locally with false &, I can produce your race. Will reroll to handle log collection via trap to avoid the race.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just

if ! openshift-install --dir /tmp/installer destroy cluster; then
    # failure 
else
    # not
fi

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to background it to get trap TERM handling; see 4472ace (#2680).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I thought I'd reproduced the race in a shell earlier, but trying to nail this down for a commit message I'm having trouble:

$ echo $BASH_VERSION
5.0.11(1)-release
$ cat test.sh
#!/bin/bash

set -o nounset
set -o errexit
set -o pipefail

"${@}" &
sleep 1
set +e
wait "$!"
ret="$?"
set -e
echo "post-wait gather"
exit "${ret}"
$ ./test.sh false
post-wait gather
$ echo $?
1
$ ./test.sh sleep 2
post-wait gather
$ echo $?
0

Can you provide an example that exposes the race?

@openshift-ci-robot
Copy link
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/rehearse/openshift/installer/master/e2e-vsphere 1044cf8d18101ee387b8616d937a8a4c6078b60a link /test pj-rehearse
ci/rehearse/openshift/installer/master/e2e-aws 068c5de link /test pj-rehearse
ci/prow/pj-rehearse 068c5de link /test pj-rehearse

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Copy link
Member Author

wking commented Mar 28, 2020

Mysterious unbound-variable thing is gone in the most recent rehearsal, where AWS died on the multus TargetDown bug. We collected install std-stream logs (with the kubeadmin password redacted), debug install logs (also redacted), and debug destroy logs. So I think this is a solid improvement over where we are now, and we can merge this PR.

There are two outstanding threads, but both are considering further pivots away from where master stands today, so I don't think they need to block this PR from landing.

Steve found this opaque [1], so shift the cp into the 'set +e' block
instead.  The '|| :' syntax was originally from 20157f6 (Capture
log bundle in install step, 2020-03-24, openshift#7899).

[1]: https://github.com/openshift/release/pull/7936/files#r400512879
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 30, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit ec885bd into openshift:master Mar 30, 2020
@openshift-ci-robot
Copy link
Contributor

@wking: Updated the following 2 configmaps:

  • step-registry configmap in namespace ci at cluster default using the following files:
    • key ipi-deprovision-deprovision-commands.sh using file ci-operator/step-registry/ipi/deprovision/deprovision/ipi-deprovision-deprovision-commands.sh
    • key ipi-install-install-commands.sh using file ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh
  • step-registry configmap in namespace ci at cluster app.ci using the following files:
    • key ipi-deprovision-deprovision-commands.sh using file ci-operator/step-registry/ipi/deprovision/deprovision/ipi-deprovision-deprovision-commands.sh
    • key ipi-install-install-commands.sh using file ci-operator/step-registry/ipi/install/install/ipi-install-install-commands.sh

In response to this:

Builds on #7935; review that first.

A few changes here:

  • Background the destroy cluster call and add a trap, so we can gracefully handle TERM. More on this in 4472ace (ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster' #2680).

  • The set +e and related wrapping around the wait follows de3de20 (step-registry: add configure and install IPI steps #6708), and ensures we gather logs and other assets in the event of a failed openshift-install invocation.

  • Only attempt to gather .openshift_install.log if it exists. This way we do not have a failed cp kicking us out of the script and blocking the execution of subsequent gathers.

  • Pipe the installer's stderr into /dev/null. The same information is going to show up in .openshift_install.log, and .openshift_install.log includes timestamps which are not present in the container's captured standard streams. By using /dev/null, we can DRY up our password redaction, and the opened window for "installer dies before it sets up .openshift_install.log" is small.

  • Quote $?. It's never going to contain shell-sensitive characters, but neither is $! and we quote that.

  • Make the log copy the first thing that happens after the installer exits. This ensures we capture the logs even if the installer fails before creating a kubeconfig or metadata.json.

  • Condition the log-bundle copy on $ret. No need to fail a glob cp attempt if the installer succeeded. We still need to ignore failed copies, because there will be no log bundle if the installer dies early (e.g. during asset generation) or late (after bootstrap teardown).

    Also shift this copy earlier, because it cannot fail, while the SHARED_DIR copy might fail. Although I'm not sure we have a case where we'd generate a log bundle but not generate the kubeconfig and metadata.json.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the multi-step-installer-logs branch March 30, 2020 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants