Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci-operator/step-registry/ipi: Shuffle installer log gathering #7936

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,21 @@ set -o nounset
set -o errexit
set -o pipefail

trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi' TERM

cluster_profile=/var/run/secrets/ci.openshift.io/cluster-profile
export AWS_SHARED_CREDENTIALS_FILE=$cluster_profile/.awscred
export AZURE_AUTH_LOCATION=$cluster_profile/osServicePrincipal.json

echo "Deprovisioning cluster ..."
cp -ar "${SHARED_DIR}" /tmp/installer
openshift-install --dir /tmp/installer destroy cluster
cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}/"
openshift-install --dir /tmp/installer destroy cluster &

set +e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the intent of this to continue even if it fails? If so, it's racy -- you can have the background fail before you set +e. Why are we doing this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think errexit, even with pipefail, considers background processes. I can't find a positive reference, but:

$ time bash -ec 'false & sleep 2; wait "$!"'

real    0m2.010s
user    0m0.004s
sys     0m0.007s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the intent of this to continue even if it fails?

Yes. And capture the backgrounded command's exit status so we can return it later on.

If so, it's racy -- you can have the background fail before you set +e.

Indeed. As described in the PR topic comment, this approach is recycled from de3de20 (#6708). But testing locally with false &, I can produce your race. Will reroll to handle log collection via trap to avoid the race.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just

if ! openshift-install --dir /tmp/installer destroy cluster; then
    # failure 
else
    # not
fi

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to background it to get trap TERM handling; see 4472ace (#2680).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I thought I'd reproduced the race in a shell earlier, but trying to nail this down for a commit message I'm having trouble:

$ echo $BASH_VERSION
5.0.11(1)-release
$ cat test.sh
#!/bin/bash

set -o nounset
set -o errexit
set -o pipefail

"${@}" &
sleep 1
set +e
wait "$!"
ret="$?"
set -e
echo "post-wait gather"
exit "${ret}"
$ ./test.sh false
post-wait gather
$ echo $?
1
$ ./test.sh sleep 2
post-wait gather
$ echo $?
0

Can you provide an example that exposes the race?

wait "$!"
ret="$?"
set -e

cp /tmp/installer/.openshift_install.log "${ARTIFACT_DIR}"

exit "$ret"
Comment on lines -13 to +24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this the other day, would creating symlinks before executing the installer work in these cases? We could then drop this type of logic from this and other steps.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this the other day, would creating symlinks before executing the installer work in these cases?

Might work for destroy cluster, but would not work for install because we need to grep or sed to drop the kubeadmin password. I'll try if I can sort out the unbounded-variable thing...

Original file line number Diff line number Diff line change
Expand Up @@ -39,13 +39,13 @@ TF_LOG=debug openshift-install --dir="${dir}" create cluster 2>&1 | grep --line-

set +e
wait "$!"
ret=$?
ret="$?"
cp "${dir}"/log-bundle-*.tar.gz "${ARTIFACT_DIR}/" 2>/dev/null
set -e

sed 's/password: .*/password: REDACTED/' "${dir}/.openshift_install.log" >"${ARTIFACT_DIR}/.openshift_install.log"
cp \
-t "${SHARED_DIR}" \
"${dir}/auth/kubeconfig" \
"${dir}/metadata.json"
cp "${dir}/.openshift_install.log" "${ARTIFACT_DIR}/"
cp "${dir}"/log-bundle-*.tar.gz "${ARTIFACT_DIR}/" 2>/dev/null || :
exit "$ret"