Bug 1717604: data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m #2279

Merged
bparees merged 1 commit into openshift:master from the bump-load-balancer-timeouts branch on Aug 28, 2019

Conversation

@wking (Member) commented Aug 28, 2019

Up from their default 10 minutes, using the knob that dates back to the original network load balancer support. This should help us avoid:

Error: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)

that cropped up again this week, while still generally keeping us under the 30m timeout we set for the whole infrastructure-provisioning step. Sometimes even 20m will not be enough, but it should make us a bit more resilient anyway.

While I was comparing the docs with our config, I also noticed that we should have dropped idle_timeout (which does not apply to network load balancers) when we made the shift from classic to network load balancers in 16dfbb3 (#594), so I'm doing that too.
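
For reference, the knob in question is the resource-level timeouts block that the Terraform AWS provider exposes on aws_lb, and idle_timeout is an aws_lb argument that only applies to application load balancers. A minimal sketch of the kind of change described above, using a hypothetical resource name and placeholder arguments rather than the installer's actual data/data/aws/vpc/master-elb.tf contents:

  # Illustrative sketch only; the resource name, subnets, and other arguments
  # are placeholders, not the installer's real configuration.
  resource "aws_lb" "api_internal" {
    name               = "example-api-int"
    load_balancer_type = "network"
    internal           = true
    subnets            = ["subnet-0example"]

    # idle_timeout is omitted: it only applies to application load balancers,
    # so it has no effect on this network load balancer.

    timeouts {
      create = "20m" # up from the provider default of 10m
    }
  }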

@openshift-ci-robot added the bugzilla/invalid-bug label (indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) on Aug 28, 2019
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1717604, which is invalid:

  • expected the bug to target the "4.2.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1717604: data/data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the size/S (denotes a PR that changes 10-29 lines, ignoring generated files) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels on Aug 28, 2019
@wking (Member, Author) commented Aug 28, 2019

/bugzilla refresh

@openshift-ci-robot added the bugzilla/valid-bug label (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) and removed the bugzilla/invalid-bug label on Aug 28, 2019
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1717604, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya (Contributor)

Can you drop a301036? It doesn't belong to the attached BZ.

@abhinavdahiya (Contributor)

while still generally keeping us under the 30m timeout we set for the whole infrastructure-provisioning step.

We don't have a timeout for infra provisioning, AFAIK.

@praveenkumar (Contributor)

/test e2e-libvirt

@wking force-pushed the bump-load-balancer-timeouts branch from a301036 to ad26356 on August 28, 2019 15:04
@openshift-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) and removed the size/S label on Aug 28, 2019
@wking (Member, Author) commented Aug 28, 2019

Can you drop a301036? It doesn't belong to the attached BZ.

Spun off into #2283.

we don't have a timeout for infra provisioning.

Oops, I was misremembering my motivation for 20m from #1682. Updated the commit message to just lean on that previous commit with a301036 -> ad26356.

@wking changed the title from "Bug 1717604: data/data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m" to "Bug 1717604: data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m" on Aug 28, 2019
The updated commit message reads:

Up from their default 10 minutes, using the knob that dates back to
the original network load balancer support [1].  This should help us
avoid the [2]:

  Error: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)

that cropped up again this week.  20m matches the timeout we set for
routes and security groups in 246f4a1 (data/aws: 20-minute create
timeouts for routes and security groups, 2019-04-26, openshift#1682).
Sometimes even 20m will not be enough [3], but should make us a bit
more resilient anyway.

[1]: hashicorp/terraform-provider-aws@1af53b1#diff-f4b0dbdc7e3eede6ba70cd286c834f37R92
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604#c15
@wking force-pushed the bump-load-balancer-timeouts branch from ad26356 to 1ec2758 on August 28, 2019 16:06
@abhinavdahiya (Contributor)

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Aug 28, 2019
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhinavdahiya (Contributor)

hmm, it doesn't look like it is working..
see https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029#0:build-log.txt%3A49

/hold

@openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Aug 28, 2019
@wking (Member, Author) commented Aug 28, 2019

hmm, it doesn't look like it is working..
see https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029#0:build-log.txt%3A49

Ah, is this "upgrade uses the tip release image as the source and then upgrades to tip+PR-change as the target"? In which case, do we care about running upgrade tests in the installer repo?

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Built from commit' | head -n1
time="2019-08-28T16:21:09Z" level=debug msg="Built from commit 6318c72cd078fb26f835e28117d7d02bc50d81a5"

6318c72 is the parent of 1ec2758 and does not include my bump.

@wking (Member, Author) commented Aug 28, 2019

e2e-aws:

level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

Checking ./bootstrap/journals/bootkube.log in the logs:

Aug 28 16:40:45 ip-10-0-6-7 bootkube.sh[1603]: Error: unhealthy cluster
Aug 28 16:40:45 ip-10-0-6-7 bootkube.sh[1603]: etcd cluster up. Killing etcd certificate signer...

so this got bit by the podman bug mentioned in #2274.

/retest

@abhinavdahiya (Contributor)

hmm, it doesn't look like it is working..
see https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029#0:build-log.txt%3A49

Ah, is this "upgrade uses the tip release image as the source and then upgrades to tip+PR-change as the target"? In which case, do we care about running upgrade tests in the installer repo?

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Built from commit' | head -n1
time="2019-08-28T16:21:09Z" level=debug msg="Built from commit 6318c72cd078fb26f835e28117d7d02bc50d81a5"

6318c72 is the parent of 1ec2758 and does not include my bump.

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label on Aug 28, 2019
@wking (Member, Author) commented Aug 28, 2019

e2e-aws-upgrade:

Cluster did not complete upgrade: timed out waiting for the condition: Working towards 0.0.1-2019-08-28-180928: 86% complete

But upgrade CI is unaffected by in-flight installer PRs, so this must be a flake.

/retest

@bparees (Contributor) commented Aug 28, 2019

Manually merging to hopefully unjam our CI system, which is suffering from hitting these timeouts in a lot of jobs. The PR has passed e2e-aws.

@bparees merged commit 35b73c0 into openshift:master on Aug 28, 2019
@openshift-ci-robot (Contributor)

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1717604 has been moved to the MODIFIED state.

In response to this:

Bug 1717604: data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking deleted the bump-load-balancer-timeouts branch on August 28, 2019 20:57
@openshift-ci-robot (Contributor)

@wking: The following tests failed, say /retest to rerun them all:

Test name                Commit   Rerun command
ci/prow/e2e-libvirt      1ec2758  /test e2e-libvirt
ci/prow/e2e-aws-upgrade  1ec2758  /test e2e-aws-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking (Member, Author) commented Sep 6, 2019

/cherrypick release-4.1

@openshift-cherrypick-robot

@wking: #2279 failed to apply on top of branch "release-4.1":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	data/data/aws/vpc/master-elb.tf
Falling back to patching base and 3-way merge...
Auto-merging data/data/aws/vpc/master-elb.tf
CONFLICT (content): Merge conflict in data/data/aws/vpc/master-elb.tf
Patch failed at 0001 data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m

In response to this:

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking added a commit to wking/openshift-installer that referenced this pull request on Sep 13, 2019
We've been hitting the default 10m timeout recently, starting around
[1].  Use the creation timeout knob [2] to wait a bit longer before
giving up on AWS.  20m matches the value we've used bumping this
timeout for other resources in 1ec2758 (data/aws/vpc/master-elb:
Bump load-balancer timeouts to 20m, 2019-08-27, openshift#2279) and previous.
There's no guarantee that 20m will be sufficient; the issue could be
due to internal AWS issues like a shortage of on-demand instances of
the requested type in the requested availability zone.  But it gives
AWS an even easier target to hit ;).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console/2708/pull-ci-openshift-console-master-e2e-aws-console-olm/8584
[2]: https://www.terraform.io/docs/providers/aws/r/instance.html#timeouts
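
The creation-timeout knob in [2] is the analogous timeouts block on the Terraform aws_instance resource. A minimal sketch of the kind of change that commit describes, with a placeholder AMI, instance type, and resource name rather than the installer's actual values:

  # Illustrative sketch only; the AMI, instance type, and resource name are
  # placeholders, not the installer's real master-instance configuration.
  resource "aws_instance" "master" {
    ami           = "ami-0example"
    instance_type = "m4.xlarge"

    timeouts {
      create = "20m" # up from the aws_instance provider default of 10m
    }
  }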