templates/openshift: grab bootstrap log on failure #2633

crawford · 2019-01-19T01:45:38Z

This uses the Terraform state to discover the IP address of the
bootstrap node (ideally, the installer will provide this information in
a form which easier to consume in the future). It then connects to the
gatewayd instance on that machine and pulls the logs for various
services. Hopefully, these logs will be useful when diagnosing
installation failures.

Continuing from #2624. Thanks @sttts!

This still requires a hole to be punched in the security group for gatewayd. I'm going to follow up next week.
/hold

This uses the Terraform state to discover the IP address of the bootstrap node (ideally, the installer will provide this information in a form which easier to consume in the future). It then connects to the gatewayd instance on that machine and pulls the logs for various services. Hopefully, these logs will be useful when diagnosing installation failures.

soltysh

lgtm

abhinavdahiya · 2019-01-22T01:11:34Z

@crawford this allows us to read container logs from bootstrap node kubelet as we opened 10250 here

abhinavdahiya@525c911

also @smarterclayton do you think this is sane?

smarterclayton · 2019-01-22T02:19:14Z

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

@@ -373,8 +373,7 @@ objects:
            exit 1
        fi

-        /bin/openshift-install --dir=/tmp/artifacts/installer create cluster &
-        wait "$!"
+        /bin/openshift-install --dir=/tmp/artifacts/installer create cluster


Why this change?

That was an artifact of some previous refactoring. Backgrounding the process and then immediately waiting on it should be equivalent to just running it in the foreground.

I've filed #2680, rolling this back and adding an additional wait to our TERM handlers. As discussed in that PR, we need the backgrounded create cluster so the shell will fire the TERM handler which TERMs the child processes. As master stands (with this PR but without #2680), aborting an e2e job will TERM the setup container's shell process, but that process will not forward the TERM to openshift-install and will wait until it exits normally. I haven't actually tested with a wrapping container, but my guess is that this means that jobs being aborted during openshift-install are mostly hitting their gentle-shutdown timeout and getting KILLed. With #2680, the forwarded TERM gives the installer a chance at a gentle shutdown (although at the moment, I don't think the installer takes advantage of that opportunity).

smarterclayton · 2019-01-22T02:21:25Z

Not unreasonable, although obv in future it would be better to have integrated. Maybe add a comment about that.

crawford · 2019-01-22T19:50:05Z

Verified that this now works with the latest installer. Service logs get written into /tmp/artifacts/bootstrap/.

/hold cancel

abhinavdahiya · 2019-01-22T19:51:21Z

/approve

abhinavdahiya · 2019-01-22T20:43:09Z

/lgtm

openshift-ci-robot · 2019-01-22T20:43:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~ci-operator/templates/openshift/installer/OWNERS~~ [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2019-01-22T20:50:43Z

@crawford: Updated the following 4 configmaps:

prow-job-cluster-launch-installer-src configmap in namespace ci-stg using the following files:
- key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
prow-job-cluster-launch-installer-e2e configmap using the following files:
- key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
- key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
prow-job-cluster-launch-installer-src configmap using the following files:
- key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml

In response to this:

This uses the Terraform state to discover the IP address of the
bootstrap node (ideally, the installer will provide this information in
a form which easier to consume in the future). It then connects to the
gatewayd instance on that machine and pulls the logs for various
services. Hopefully, these logs will be useful when diagnosing
installation failures.

Continuing from #2624. Thanks @sttts!

This still requires a hole to be punched in the security group for gatewayd. I'm going to follow up next week.
/hold

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking · 2019-01-24T00:45:04Z

ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml

@@ -338,8 +338,7 @@ objects:
            exit 1
        fi

-        /bin/openshift-install --dir=/tmp/artifacts/installer create cluster &
-        wait "$!"
+        /bin/openshift-install --dir=/tmp/artifacts/installer create cluster


Why did you touch this? Won't it break the /tmp/exit-setting trap we set at the beginning of this script? Previous discussion in d862f6f (#1957).

ah, previous discussion here. I still think this change breaks the trap.

…te cluster' Unwinding part of 7e56379 (templates/openshift: grab bootstrap log on failure, 2019-01-14, openshift#2633). I'd initially added the backgrounding in d862f6f (cluster-launch-installer-e2e: Restore EXIT and TERM signal handlers, 2018-10-17, openshift#1957). If you exec 'create cluster', it kills the traps, but I was less clear on whether the background/wait dance was required vs. the approach taken by 7e56379. Here are scripts comparing the two approaches vs. SIGTERM (e.g. if a test run is aborted because new updates have been pushed to a pull request): $ cat test-sleep.sh #!/bin/sh trap 'rc=$?; if test "${rc}" -eq 0; then echo touch /tmp/setup-success; else echo touch /tmp/exit; fi; exit "${rc}"' EXIT trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then echo kill ${CHILDREN} && kill ${CHILDREN}; fi' TERM echo $$ python -c 'import time; time.sleep(20); print("done sleeping")' $ ./test-sleep.sh # while this is running, in another shell: kill 17125 17125 done sleeping touch /tmp/setup-success That's not good, the Python process should have been killed before it got far enough in to print "done sleeping". Here's with a background/wait: $ cat test-sleep-wait.sh #!/bin/sh trap 'rc=$?; if test "${rc}" -eq 0; then echo touch /tmp/setup-success; else echo touch /tmp/exit; fi; exit "${rc}"' EXIT trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then echo kill ${CHILDREN} && kill ${CHILDREN}; fi' TERM echo $$ python -c 'import time; time.sleep(20); print("done sleeping")' & wait "$!" $ ./test-sleep-wait.sh # kill 17242 17242 kill 17243 touch /tmp/exit that looks better. Stracing to put a finger on the differences (e.g. 'strace -tfo /tmp/trace ./test-sleep.sh'), gives: 21243 21:59:18 execve("./test-sleep.sh", ["./test-sleep.sh"], [/* 74 vars */]) = 0 ... 21243 21:59:18 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe60e216a10) = 21244 ... 21244 21:59:18 execve("/usr/bin/python", ["python", "-c", "import time; time.sleep(20); pri"...], [/* 73 vars */]) = 0 ... 21244 21:59:18 select(0, NULL, NULL, NULL, {20, 0} <unfinished ...> 21243 21:59:25 <... wait4 resumed> 0x7ffe9feef350, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 21243 21:59:25 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=8603, si_uid=114032} --- ... 21244 21:59:38 write(1, "done sleeping\n", 14) = 14 21244 21:59:38 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f20f7b88680}, {0x7f20f7eab680, [], SA_RESTORER, 0x7f20f7b88680}, 8) = 0 21244 21:59:38 exit_group(0) = ? 21244 21:59:38 +++ exited with 0 +++ 21243 21:59:38 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 21244 ... 21243 21:59:38 pipe([3, 4]) = 0 ... 21243 21:59:38 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe60e216a10) = 21262 21243 21:59:38 rt_sigprocmask(SIG_SETMASK, [TERM], NULL, 8) = 0 21243 21:59:38 rt_sigaction(SIGCHLD, {0x441390, [], SA_RESTORER|SA_RESTART, 0x7fe60d850280}, <unfinished ...> 21262 21:59:38 close(255 <unfinished ...> ... 21243 21:59:38 read(3, <unfinished ...> ... 21262 21:59:38 close(4) = 0 ... 21262 21:59:38 exit_group(0) = ? 21243 21:59:38 <... read resumed> "", 128) = 0 21262 21:59:38 +++ exited with 0 +++ 21243 21:59:38 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21262, si_uid=114032, si_status=0, si_utime=0, si_stime=0} --- 21243 21:59:38 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 21262 ... 21243 21:59:38 write(1, "touch /tmp/setup-success\n", 25) = 25 21243 21:59:38 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 21243 21:59:38 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 21243 21:59:38 exit_group(0) = ? 21243 21:59:38 +++ exited with 0 +++ so it looks like the shell (21243) just swallows the TERM, waits out the sleeping Python process, and exits zero. 21262 is the 'jobs' process, which doesn't fire until Python finishes its sleep and returns. The trap is a no-op, because test-sleep.sh never spawns backgrounded children, so 'jobs' returns an empty string. And here's an strace on test-sleep-wait.sh: 21638 22:04:48 execve("./test-sleep-wait.sh", ["./test-sleep-wait.sh"], [/* 74 vars */]) = 0 ... 21638 22:04:48 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f78dc7c4a10) = 21639 ... 21639 22:04:48 execve("/usr/bin/python", ["python", "-c", "import time; time.sleep(20); pri"...], [/* 73 vars */]) = 0 ... 21639 22:04:48 select(0, NULL, NULL, NULL, {20, 0} <unfinished ...> 21638 22:04:54 <... wait4 resumed> 0x7ffc07d607a0, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 21638 22:04:54 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=8603, si_uid=114032} --- ... 21638 22:04:54 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f78dc7c4a10) = 21645 ... 21645 22:04:54 write(1, "21639\n", 6) = 6 ... 21645 22:04:54 exit_group(0) = ? 21645 22:04:54 +++ exited with 0 +++ 21638 22:04:54 <... read resumed> "", 128) = 0 21638 22:04:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21645, si_uid=114032, si_status=0, si_utime=0, si_stime=0} --- 21638 22:04:54 rt_sigreturn({mask=[TERM]}) = 0 ... 21638 22:04:54 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 21645 ... 21638 22:04:54 write(1, "kill 21639\n", 11) = 11 21638 22:04:54 kill(21639, SIGTERM) = 0 ... 21638 22:04:54 write(1, "touch /tmp/exit\n", 16 <unfinished ...> 21639 22:04:54 +++ killed by SIGTERM +++ 21638 22:04:54 <... write resumed> ) = 16 21638 22:04:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=21639, si_uid=114032, si_status=SIGTERM, si_utime=0, si_stime=0} --- 21638 22:04:54 wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 21639 21638 22:04:54 wait4(-1, 0x7ffc07d60010, WNOHANG, NULL) = -1 ECHILD (No child processes) 21638 22:04:54 rt_sigreturn({mask=[]}) = 16 21638 22:04:54 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 21638 22:04:54 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 21638 22:04:54 exit_group(143) = ? 21638 22:04:54 +++ exited with 143 +++ With this approach, we see the shell (21638) catch the TERM and immediately launch the trap handler which launches 'jobs' (21645). The trap handler sends a new TERM to the Python process (21639), which exits non-zero. The shell reaps the 'jobs' process, echos "touch /tmp/exit", reaps the killed Python process, and exits 143. From the ordering there it looks like the shell isn't actually waiting for the Python process, although we could add a wait to the TERM handler if we wanted. Let's do that, and also drop some of the debugging echos now that we're committed to strace: $ cat test-sleep-wait-in-handler.sh #!/bin/sh trap 'rc=$?; if test "${rc}" -eq 0; then touch /tmp/success; else touch /tmp/exit; fi; exit "${rc}"' EXIT trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi' TERM echo $$ python -c 'import time; time.sleep(20); print("done sleeping")' & wait "$!" $ strace -tfo /tmp/trace ./test-sleep-wait-in-handler.sh 23095 As above, I then 'kill 23095' from another terminal. Excerpts from the trace: 23095 22:27:03 execve("./test-sleep-wait-in-handler.sh", ["./test-sleep-wait-in-handler.sh"], [/* 74 vars */]) = 0 ... 23095 22:27:03 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e464f2a10) = 23096 ... 23096 22:27:03 execve("/usr/bin/python", ["python", "-c", "import time; time.sleep(20); pri"...], [/* 73 vars */]) = 0 ... 23096 22:27:03 select(0, NULL, NULL, NULL, {20, 0} <unfinished ...> 23095 22:27:09 <... wait4 resumed> 0x7ffe54e18750, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 23095 22:27:09 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=8603, si_uid=114032} --- ... 23095 22:27:09 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e464f2a10) = 23110 ... 23110 22:27:09 write(1, "23096\n", 6) = 6 23095 22:27:09 <... read resumed> "23096\n", 128) = 6 ... 23110 22:27:09 +++ exited with 0 +++ 23095 22:27:09 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23110, si_uid=114032, si_status=0, si_utime=0, si_stime=0} --- ... 23095 22:27:09 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 23110 ... 23095 22:27:09 kill(23096, SIGTERM) = 0 23096 22:27:09 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted if no handler) 23095 22:27:09 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...> 23096 22:27:09 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=23095, si_uid=114032} --- ... 23095 22:27:09 wait4(-1, <unfinished ...> 23096 22:27:09 +++ killed by SIGTERM +++ 23095 22:27:09 <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 23096 ... 23095 22:27:09 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=23096, si_uid=114032, si_status=SIGTERM, si_utime=1, si_stime=0} --- 23095 22:27:09 wait4(-1, 0x7ffe54e17810, WNOHANG, NULL) = -1 ECHILD (No child processes) ... 23095 22:27:09 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e464f2a10) = 23111 ... 23111 22:27:09 execve("/usr/bin/touch", ["touch", "/tmp/exit"], [/* 73 vars */]) = 0 ... 23111 22:27:09 open("/tmp/exit", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3 ... 23111 22:27:09 exit_group(0) = ? 23111 22:27:09 +++ exited with 0 +++ 23095 22:27:09 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23111 23095 22:27:09 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 23095 22:27:09 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23111, si_uid=114032, si_status=0, si_utime=0, si_stime=0} --- 23095 22:27:09 wait4(-1, 0x7ffe54e18050, WNOHANG, NULL) = -1 ECHILD (No child processes) ... 23095 22:27:09 exit_group(143) = ? 23095 22:27:09 +++ exited with 143 +++ With this approach, we see the shell (23095) catch the TERM and immediately launch the trap handler which launches 'jobs' (23110) and reaps it. The trap handler sends a new TERM to the Python process (23096), which exits non-zero. The shell reaps the killed Python process, waits but gets no children, launches and reaps the touch process (23111), and exits 143. That looks like what we want, so I've gone that way with this commit.

awscli will allow us to add CI for user-provided infrastructure, as documented by openshift/installer@39a926a918 (Adding initial user doc/guide & materials for UPI AWS installation, 2019-03-12, openshift/installer#1408). jq will allow us to simplify the templating which landed in openshift/release@7e56379d0b (templates/openshift: grab bootstrap log on failure, 2019-01-14, openshift/release#2633).

jq will allow us to simplify the templating which landed in openshift/release@7e56379d0b (templates/openshift: grab bootstrap log on failure, 2019-01-14, openshift/release#2633).

openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 19, 2019

openshift-ci-robot requested review from russellb and staebler January 19, 2019 01:45

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 19, 2019

crawford mentioned this pull request Jan 19, 2019

templates/openshift: grab bootstrap log on failure #2624

Closed

soltysh approved these changes Jan 21, 2019

View reviewed changes

abhinavdahiya mentioned this pull request Jan 21, 2019

aws/boostrap: add boostrap node only security group rules openshift/installer#1105

Merged

smarterclayton reviewed Jan 22, 2019

View reviewed changes

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 22, 2019

openshift-ci-robot assigned abhinavdahiya Jan 22, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2019

openshift-merge-robot merged commit ff49929 into openshift:master Jan 22, 2019

crawford deleted the gatewayd branch January 22, 2019 20:52

wking mentioned this pull request Jan 24, 2019

e2e-aws: stream bootkube and 'kubectl get all' to artifacts #2064

Closed

wking reviewed Jan 24, 2019

View reviewed changes

wking mentioned this pull request Jan 24, 2019

ci-operator/templates/openshift/installer: Restore backgrounded 'create cluster' #2680

Merged

sttts mentioned this pull request Jan 24, 2019

Copy static pod logs to systemd via --send-static-pod-logs-to-systemd openshift/cluster-bootstrap#11

Closed

wking mentioned this pull request Mar 20, 2019

images/tests/Dockerfile.rhel: Install jq openshift/origin#22378

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

templates/openshift: grab bootstrap log on failure #2633

templates/openshift: grab bootstrap log on failure #2633

crawford commented Jan 19, 2019

soltysh left a comment

abhinavdahiya commented Jan 22, 2019

smarterclayton Jan 22, 2019

crawford Jan 22, 2019

wking Jan 24, 2019

smarterclayton commented Jan 22, 2019

crawford commented Jan 22, 2019

abhinavdahiya commented Jan 22, 2019

abhinavdahiya commented Jan 22, 2019

openshift-ci-robot commented Jan 22, 2019

openshift-ci-robot commented Jan 22, 2019

wking Jan 24, 2019

wking Jan 24, 2019

templates/openshift: grab bootstrap log on failure #2633

templates/openshift: grab bootstrap log on failure #2633

Conversation

crawford commented Jan 19, 2019

soltysh left a comment

Choose a reason for hiding this comment

abhinavdahiya commented Jan 22, 2019

smarterclayton Jan 22, 2019

Choose a reason for hiding this comment

crawford Jan 22, 2019

Choose a reason for hiding this comment

wking Jan 24, 2019

Choose a reason for hiding this comment

smarterclayton commented Jan 22, 2019

crawford commented Jan 22, 2019

abhinavdahiya commented Jan 22, 2019

abhinavdahiya commented Jan 22, 2019

openshift-ci-robot commented Jan 22, 2019

openshift-ci-robot commented Jan 22, 2019

wking Jan 24, 2019

Choose a reason for hiding this comment

wking Jan 24, 2019

Choose a reason for hiding this comment