Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

templates/openshift: grab bootstrap log on failure #2633

Merged
merged 1 commit into from
Jan 22, 2019
Merged

templates/openshift: grab bootstrap log on failure #2633

merged 1 commit into from
Jan 22, 2019

Conversation

crawford
Copy link
Contributor

This uses the Terraform state to discover the IP address of the
bootstrap node (ideally, the installer will provide this information in
a form which easier to consume in the future). It then connects to the
gatewayd instance on that machine and pulls the logs for various
services. Hopefully, these logs will be useful when diagnosing
installation failures.

Continuing from #2624. Thanks @sttts!

This still requires a hole to be punched in the security group for gatewayd. I'm going to follow up next week.
/hold

This uses the Terraform state to discover the IP address of the
bootstrap node (ideally, the installer will provide this information in
a form which easier to consume in the future). It then connects to the
gatewayd instance on that machine and pulls the logs for various
services. Hopefully, these logs will be useful when diagnosing
installation failures.
@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 19, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 19, 2019
Copy link

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@abhinavdahiya
Copy link
Contributor

@crawford this allows us to read container logs from bootstrap node kubelet as we opened 10250 here

abhinavdahiya@525c911

also @smarterclayton do you think this is sane?

@@ -373,8 +373,7 @@ objects:
exit 1
fi

/bin/openshift-install --dir=/tmp/artifacts/installer create cluster &
wait "$!"
/bin/openshift-install --dir=/tmp/artifacts/installer create cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was an artifact of some previous refactoring. Backgrounding the process and then immediately waiting on it should be equivalent to just running it in the foreground.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed #2680, rolling this back and adding an additional wait to our TERM handlers. As discussed in that PR, we need the backgrounded create cluster so the shell will fire the TERM handler which TERMs the child processes. As master stands (with this PR but without #2680), aborting an e2e job will TERM the setup container's shell process, but that process will not forward the TERM to openshift-install and will wait until it exits normally. I haven't actually tested with a wrapping container, but my guess is that this means that jobs being aborted during openshift-install are mostly hitting their gentle-shutdown timeout and getting KILLed. With #2680, the forwarded TERM gives the installer a chance at a gentle shutdown (although at the moment, I don't think the installer takes advantage of that opportunity).

@smarterclayton
Copy link
Contributor

Not unreasonable, although obv in future it would be better to have integrated. Maybe add a comment about that.

@crawford
Copy link
Contributor Author

Verified that this now works with the latest installer. Service logs get written into /tmp/artifacts/bootstrap/.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 22, 2019
@abhinavdahiya
Copy link
Contributor

/approve

@abhinavdahiya
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit ff49929 into openshift:master Jan 22, 2019
@openshift-ci-robot
Copy link
Contributor

@crawford: Updated the following 4 configmaps:

  • prow-job-cluster-launch-installer-src configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
  • prow-job-cluster-launch-installer-e2e configmap using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-src configmap using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml

In response to this:

This uses the Terraform state to discover the IP address of the
bootstrap node (ideally, the installer will provide this information in
a form which easier to consume in the future). It then connects to the
gatewayd instance on that machine and pulls the logs for various
services. Hopefully, these logs will be useful when diagnosing
installation failures.

Continuing from #2624. Thanks @sttts!

This still requires a hole to be punched in the security group for gatewayd. I'm going to follow up next week.
/hold

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -338,8 +338,7 @@ objects:
exit 1
fi

/bin/openshift-install --dir=/tmp/artifacts/installer create cluster &
wait "$!"
/bin/openshift-install --dir=/tmp/artifacts/installer create cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you touch this? Won't it break the /tmp/exit-setting trap we set at the beginning of this script? Previous discussion in d862f6f (#1957).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, previous discussion here. I still think this change breaks the trap.

wking added a commit to wking/openshift-release that referenced this pull request Jan 24, 2019
…te cluster'

Unwinding part of 7e56379 (templates/openshift: grab bootstrap log on
failure, 2019-01-14, openshift#2633).  I'd initially added the backgrounding in
d862f6f (cluster-launch-installer-e2e: Restore EXIT and TERM signal
handlers, 2018-10-17, openshift#1957).  If you exec 'create cluster', it kills
the traps, but I was less clear on whether the background/wait dance
was required vs. the approach taken by 7e56379.  Here are scripts
comparing the two approaches vs. SIGTERM (e.g. if a test run is
aborted because new updates have been pushed to a pull request):

  $ cat test-sleep.sh
  #!/bin/sh
  trap 'rc=$?; if test "${rc}" -eq 0; then echo touch /tmp/setup-success; else echo touch /tmp/exit; fi; exit "${rc}"' EXIT
  trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then echo kill ${CHILDREN} && kill ${CHILDREN}; fi' TERM
  echo $$
  python -c 'import time; time.sleep(20); print("done sleeping")'
  $ ./test-sleep.sh  # while this is running, in another shell: kill 17125
  17125
  done sleeping
  touch /tmp/setup-success

That's not good, the Python process should have been killed before it
got far enough in to print "done sleeping".  Here's with a
background/wait:

  $ cat test-sleep-wait.sh
  #!/bin/sh
  trap 'rc=$?; if test "${rc}" -eq 0; then echo touch /tmp/setup-success; else echo touch /tmp/exit; fi; exit "${rc}"' EXIT
  trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then echo kill ${CHILDREN} && kill ${CHILDREN}; fi' TERM
  echo $$
  python -c 'import time; time.sleep(20); print("done sleeping")' &
  wait "$!"
  $ ./test-sleep-wait.sh  # kill 17242
  17242
  kill 17243
  touch /tmp/exit

that looks better.  Stracing to put a finger on the differences
(e.g. 'strace -tfo /tmp/trace ./test-sleep.sh'), gives:

  21243 21:59:18 execve("./test-sleep.sh", ["./test-sleep.sh"], [/* 74 vars */]) = 0
  ...
  21243 21:59:18 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe60e216a10) = 21244
  ...
  21244 21:59:18 execve("/usr/bin/python", ["python", "-c", "import time; time.sleep(20); pri"...], [/* 73 vars */]) = 0
  ...
  21244 21:59:18 select(0, NULL, NULL, NULL, {20, 0} <unfinished ...>
  21243 21:59:25 <... wait4 resumed> 0x7ffe9feef350, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  21243 21:59:25 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=8603, si_uid=114032} ---
  ...
  21244 21:59:38 write(1, "done sleeping\n", 14) = 14
  21244 21:59:38 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f20f7b88680}, {0x7f20f7eab680, [], SA_RESTORER, 0x7f20f7b88680}, 8) = 0
  21244 21:59:38 exit_group(0)            = ?
  21244 21:59:38 +++ exited with 0 +++
  21243 21:59:38 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 21244
  ...
  21243 21:59:38 pipe([3, 4])             = 0
  ...
  21243 21:59:38 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe60e216a10) = 21262
  21243 21:59:38 rt_sigprocmask(SIG_SETMASK, [TERM], NULL, 8) = 0
  21243 21:59:38 rt_sigaction(SIGCHLD, {0x441390, [], SA_RESTORER|SA_RESTART, 0x7fe60d850280},  <unfinished ...>
  21262 21:59:38 close(255 <unfinished ...>
  ...
  21243 21:59:38 read(3,  <unfinished ...>
  ...
  21262 21:59:38 close(4)                 = 0
  ...
  21262 21:59:38 exit_group(0)            = ?
  21243 21:59:38 <... read resumed> "", 128) = 0
  21262 21:59:38 +++ exited with 0 +++
  21243 21:59:38 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21262, si_uid=114032, si_status=0, si_utime=0, si_stime=0} ---
  21243 21:59:38 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 21262
  ...
  21243 21:59:38 write(1, "touch /tmp/setup-success\n", 25) = 25
  21243 21:59:38 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  21243 21:59:38 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  21243 21:59:38 exit_group(0)            = ?
  21243 21:59:38 +++ exited with 0 +++

so it looks like the shell (21243) just swallows the TERM, waits out
the sleeping Python process, and exits zero.  21262 is the 'jobs'
process, which doesn't fire until Python finishes its sleep and
returns.  The trap is a no-op, because test-sleep.sh never spawns
backgrounded children, so 'jobs' returns an empty string.

And here's an strace on test-sleep-wait.sh:

  21638 22:04:48 execve("./test-sleep-wait.sh", ["./test-sleep-wait.sh"], [/* 74 vars */]) = 0
  ...
  21638 22:04:48 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f78dc7c4a10) = 21639
  ...
  21639 22:04:48 execve("/usr/bin/python", ["python", "-c", "import time; time.sleep(20); pri"...], [/* 73 vars */]) = 0
  ...
  21639 22:04:48 select(0, NULL, NULL, NULL, {20, 0} <unfinished ...>
  21638 22:04:54 <... wait4 resumed> 0x7ffc07d607a0, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  21638 22:04:54 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=8603, si_uid=114032} ---
  ...
  21638 22:04:54 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f78dc7c4a10) = 21645
  ...
  21645 22:04:54 write(1, "21639\n", 6)   = 6
  ...
  21645 22:04:54 exit_group(0)            = ?
  21645 22:04:54 +++ exited with 0 +++
  21638 22:04:54 <... read resumed> "", 128) = 0
  21638 22:04:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21645, si_uid=114032, si_status=0, si_utime=0, si_stime=0} ---
  21638 22:04:54 rt_sigreturn({mask=[TERM]}) = 0
  ...
  21638 22:04:54 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 21645
  ...
  21638 22:04:54 write(1, "kill 21639\n", 11) = 11
  21638 22:04:54 kill(21639, SIGTERM)     = 0
  ...
  21638 22:04:54 write(1, "touch /tmp/exit\n", 16 <unfinished ...>
  21639 22:04:54 +++ killed by SIGTERM +++
  21638 22:04:54 <... write resumed> )    = 16
  21638 22:04:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=21639, si_uid=114032, si_status=SIGTERM, si_utime=0, si_stime=0} ---
  21638 22:04:54 wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 21639
  21638 22:04:54 wait4(-1, 0x7ffc07d60010, WNOHANG, NULL) = -1 ECHILD (No child processes)
  21638 22:04:54 rt_sigreturn({mask=[]})  = 16
  21638 22:04:54 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  21638 22:04:54 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  21638 22:04:54 exit_group(143)          = ?
  21638 22:04:54 +++ exited with 143 +++

With this approach, we see the shell (21638) catch the TERM and
immediately launch the trap handler which launches 'jobs' (21645).
The trap handler sends a new TERM to the Python process (21639), which
exits non-zero.  The shell reaps the 'jobs' process, echos "touch
/tmp/exit", reaps the killed Python process, and exits 143.  From the
ordering there it looks like the shell isn't actually waiting for the
Python process, although we could add a wait to the TERM handler if we
wanted.  Let's do that, and also drop some of the debugging echos now
that we're committed to strace:

  $ cat test-sleep-wait-in-handler.sh
  #!/bin/sh
  trap 'rc=$?; if test "${rc}" -eq 0; then touch /tmp/success; else touch /tmp/exit; fi; exit "${rc}"' EXIT
  trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi' TERM
  echo $$
  python -c 'import time; time.sleep(20); print("done sleeping")' &
  wait "$!"
  $ strace -tfo /tmp/trace ./test-sleep-wait-in-handler.sh
  23095

As above, I then 'kill 23095' from another terminal.  Excerpts from
the trace:

  23095 22:27:03 execve("./test-sleep-wait-in-handler.sh", ["./test-sleep-wait-in-handler.sh"], [/* 74 vars */]) = 0
  ...
  23095 22:27:03 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e464f2a10) = 23096
  ...
  23096 22:27:03 execve("/usr/bin/python", ["python", "-c", "import time; time.sleep(20); pri"...], [/* 73 vars */]) = 0
  ...
  23096 22:27:03 select(0, NULL, NULL, NULL, {20, 0} <unfinished ...>
  23095 22:27:09 <... wait4 resumed> 0x7ffe54e18750, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  23095 22:27:09 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=8603, si_uid=114032} ---
  ...
  23095 22:27:09 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e464f2a10) = 23110
  ...
  23110 22:27:09 write(1, "23096\n", 6)   = 6
  23095 22:27:09 <... read resumed> "23096\n", 128) = 6
  ...
  23110 22:27:09 +++ exited with 0 +++
  23095 22:27:09 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23110, si_uid=114032, si_status=0, si_utime=0, si_stime=0} ---
  ...
  23095 22:27:09 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 23110
  ...
  23095 22:27:09 kill(23096, SIGTERM)     = 0
  23096 22:27:09 <... select resumed> )   = ? ERESTARTNOHAND (To be restarted if no handler)
  23095 22:27:09 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
  23096 22:27:09 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=23095, si_uid=114032} ---
  ...
  23095 22:27:09 wait4(-1,  <unfinished ...>
  23096 22:27:09 +++ killed by SIGTERM +++
  23095 22:27:09 <... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], 0, NULL) = 23096
  ...
  23095 22:27:09 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=23096, si_uid=114032, si_status=SIGTERM, si_utime=1, si_stime=0} ---
  23095 22:27:09 wait4(-1, 0x7ffe54e17810, WNOHANG, NULL) = -1 ECHILD (No child processes)
  ...
  23095 22:27:09 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e464f2a10) = 23111
  ...
  23111 22:27:09 execve("/usr/bin/touch", ["touch", "/tmp/exit"], [/* 73 vars */]) = 0
  ...
  23111 22:27:09 open("/tmp/exit", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
  ...
  23111 22:27:09 exit_group(0)            = ?
  23111 22:27:09 +++ exited with 0 +++
  23095 22:27:09 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 23111
  23095 22:27:09 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  23095 22:27:09 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=23111, si_uid=114032, si_status=0, si_utime=0, si_stime=0} ---
  23095 22:27:09 wait4(-1, 0x7ffe54e18050, WNOHANG, NULL) = -1 ECHILD (No child processes)
  ...
  23095 22:27:09 exit_group(143)          = ?
  23095 22:27:09 +++ exited with 143 +++

With this approach, we see the shell (23095) catch the TERM and
immediately launch the trap handler which launches 'jobs' (23110) and
reaps it.  The trap handler sends a new TERM to the Python process
(23096), which exits non-zero.  The shell reaps the killed Python
process, waits but gets no children, launches and reaps the touch
process (23111), and exits 143.  That looks like what we want, so I've
gone that way with this commit.
wking added a commit to wking/origin that referenced this pull request Mar 20, 2019
awscli will allow us to add CI for user-provided infrastructure, as
documented by openshift/installer@39a926a918 (Adding initial user
doc/guide & materials for UPI AWS installation, 2019-03-12,
openshift/installer#1408).

jq will allow us to simplify the templating which landed in
openshift/release@7e56379d0b (templates/openshift: grab bootstrap log
on failure, 2019-01-14, openshift/release#2633).
wking added a commit to wking/origin that referenced this pull request Mar 22, 2019
jq will allow us to simplify the templating which landed in
openshift/release@7e56379d0b (templates/openshift: grab bootstrap log
on failure, 2019-01-14, openshift/release#2633).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants