
pkg/start/start: Drop bootstrapPodsRunningTimeout #24

Closed
wants to merge 1 commit

Conversation


@wking (Member) commented Apr 17, 2019

And plumb through contexts from runCmdStart so we can drop the
context.TODO() calls.

bootstrapPodsRunningTimeout was added in d07548e (Add
--tear-down-event flag to delay tear down, 2019-01-24, openshift#9),
although Stefan (@sttts) had no strong opinion on it at the time [1].
But as it stands, a hung pod creates loops like [2]:

  $ tar xf log-bundle.tar.gz
  $ cd bootstrap/journals
  $ grep 'Started Bootstrap\|Error: error while checking pod status' bootkube.log
  Apr 16 17:46:23 ip-10-0-4-87 systemd[1]: Started Bootstrap a Kubernetes cluster.
  Apr 16 18:12:41 ip-10-0-4-87 bootkube.sh[1510]: Error: error while checking pod status: timed out waiting for the condition
  Apr 16 18:12:41 ip-10-0-4-87 bootkube.sh[1510]: Error: error while checking pod status: timed out waiting for the condition
  Apr 16 18:12:46 ip-10-0-4-87 systemd[1]: Started Bootstrap a Kubernetes cluster.
  Apr 16 18:33:02 ip-10-0-4-87 bootkube.sh[11418]: Error: error while checking pod status: timed out waiting for the condition
  Apr 16 18:33:02 ip-10-0-4-87 bootkube.sh[11418]: Error: error while checking pod status: timed out waiting for the condition
  Apr 16 18:33:07 ip-10-0-4-87 systemd[1]: Started Bootstrap a Kubernetes cluster.

Instead of having systemd keep kicking bootkube.sh (which in turn
keeps launching cluster-bootstrap), removing this timeout will just
leave cluster-bootstrap running while folks gather logs from the
broken cluster.  And the less spurious-restart noise there is in those
logs, the easier it will be to find what actually broke.

[1]: openshift#9 (comment)
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1700504#c14
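
For illustration, here is a minimal, hedged sketch of the kind of context plumbing the commit message describes. runCmdStart, waitUntilPodsRunning, and bootstrapPodsRunningTimeout are names taken from this PR; the signatures and everything else below are assumptions for the sketch, not the actual pkg/start code:

```go
package main

import (
	"context"
	"fmt"
)

// runCmdStart owns the root context; no internal deadline is added,
// so a hung pod leaves cluster-bootstrap waiting rather than timing
// out and being restarted by systemd.
func runCmdStart() error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return startBootstrap(ctx)
}

// startBootstrap uses the plumbed-in context directly.
// Before: ctx, cancel := context.WithTimeout(context.TODO(), bootstrapPodsRunningTimeout)
// After:  just use the caller's ctx; no artificial timeout.
func startBootstrap(ctx context.Context) error {
	return waitUntilPodsRunning(ctx)
}

// waitUntilPodsRunning is a stand-in for the real helper, which polls
// pod status via a Kubernetes client until the required pods run.
func waitUntilPodsRunning(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}

func main() {
	fmt.Println("runCmdStart:", runCmdStart())
}
```
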
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mfojtik

If they are not already assigned, you can assign the PR to them by writing /assign @mfojtik in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 17, 2019
if err = waitUntilPodsRunning(ctx, client, b.requiredPodPrefixes); err != nil {
	return err
}
cancel()
Contributor

This doesn't work: it breaks the switch-over logic. We have to force createAssetsInBackground to stop creating assets. We did that with this cancel() call, and then restarted asset creation afterwards, potentially with the ELB, so that we can shut down the bootstrap control plane.

Contributor

To fix this: wrap assetContext in another context, cancel that wrapper here, and then use the inner context below.
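
For illustration only, a self-contained Go sketch of that suggestion. assetContext, createAssetsInBackground, and waitUntilPodsRunning are names from this discussion; the signatures and every other name here are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// createAssetsInBackground stands in for the real helper (hypothetical
// signature): it runs until its context is canceled and signals
// completion on the returned channel.
func createAssetsInBackground(ctx context.Context, label string) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ctx.Done()
		fmt.Println(label, "asset creation stopped:", ctx.Err())
	}()
	return done
}

func main() {
	// assetContext stays alive across the switch-over.
	assetContext, cancelAll := context.WithCancel(context.Background())
	defer cancelAll()

	// Wrap assetContext in a cancelable context used only for the
	// pre-switch-over asset creation.
	preSwitchCtx, cancelPreSwitch := context.WithCancel(assetContext)
	firstRound := createAssetsInBackground(preSwitchCtx, "first-round")

	// ... waitUntilPodsRunning(...) would run here; simulated delay:
	time.Sleep(10 * time.Millisecond)

	// Required pods are running: cancel only the wrapper to stop the
	// first round, then restart asset creation under the still-live
	// assetContext (e.g. pointed at the load balancer).
	cancelPreSwitch()
	<-firstRound
	secondRound := createAssetsInBackground(assetContext, "second-round")

	// Eventually the whole bootstrap tears down.
	cancelAll()
	<-secondRound
}
```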

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 9, 2020
@openshift-ci-robot

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name                Commit   Details  Rerun command
ci/prow/e2e-aws-upgrade  440dac9  link     /test e2e-aws-upgrade
ci/prow/e2e-upgrade      440dac9  link     /test e2e-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 20, 2020
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
