Bug 1717604: data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m #2279

Merged
bparees merged 1 commit into openshift:master from the bump-load-balancer-timeouts branch on Aug 28, 2019

Conversation

@wking (Member) commented Aug 28, 2019

Up from their default 10 minutes, using the knob that dates back to the original network load balancer support. This should help us avoid:

Error: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)

that cropped up again this week, while still generally keeping us under the 30m timeout we set for the whole infrastructure-provisioning step. Sometimes even 20m will not be enough, but it should make us a bit more resilient anyway.

While I was comparing the docs with our config, I also noticed that we should have dropped idle_timeout (which does not apply to network load balancers) when we made the shift from classic to network load balancers in 16dfbb3 (#594), so I'm doing that too.
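
For reference, the knob in question is the resource-level timeouts block that the Terraform AWS provider exposes on aws_lb, and idle_timeout is an aws_lb argument that only applies to application load balancers. A minimal sketch of the kind of change described above, using a hypothetical resource name and placeholder arguments rather than the installer's actual data/data/aws/vpc/master-elb.tf contents:

  # Illustrative sketch only; the resource name, subnets, and other arguments
  # are placeholders, not the installer's real configuration.
  resource "aws_lb" "api_internal" {
    name               = "example-api-int"
    load_balancer_type = "network"
    internal           = true
    subnets            = ["subnet-0example"]

    # idle_timeout is omitted: it only applies to application load balancers,
    # so it has no effect on this network load balancer.

    timeouts {
      create = "20m" # up from the provider default of 10m
    }
  }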

@openshift-ci-robot added the bugzilla/invalid-bug label (indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) on Aug 28, 2019
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1717604, which is invalid:

  • expected the bug to target the "4.2.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1717604: data/data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the size/S (denotes a PR that changes 10-29 lines, ignoring generated files) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels on Aug 28, 2019
@wking (Member, Author) commented Aug 28, 2019

/bugzilla refresh

@openshift-ci-robot added the bugzilla/valid-bug label (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) and removed the bugzilla/invalid-bug label on Aug 28, 2019
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1717604, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya (Contributor)

Can you drop a301036? It doesn't belong to the attached BZ.

@abhinavdahiya (Contributor)

while still generally keeping us under the 30m timeout we set for the whole infrastructure-provisioning step.

We don't have a timeout for infra provisioning, AFAIK.

@praveenkumar (Contributor)

/test e2e-libvirt

@wking force-pushed the bump-load-balancer-timeouts branch from a301036 to ad26356 on August 28, 2019 15:04
@openshift-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) and removed the size/S label on Aug 28, 2019
@wking (Member, Author) commented Aug 28, 2019

Can you drop a301036? It doesn't belong to the attached BZ.

Spun off into #2283.

we don't have a timeout for infra provisioning.

Oops, I was misremembering my motivation for 20m from #1682. Updated the commit message to just lean on that previous commit with a301036 -> ad26356.

@wking changed the title from "Bug 1717604: data/data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m" to "Bug 1717604: data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m" on Aug 28, 2019
The updated commit message reads:

Up from their default 10 minutes, using the knob that dates back to
the original network load balancer support [1].  This should help us
avoid the [2]:

  Error: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)

that cropped up again this week.  20m matches the timeout we set for
routes and security groups in 246f4a1 (data/aws: 20-minute create
timeouts for routes and security groups, 2019-04-26, openshift#1682).
Sometimes even 20m will not be enough [3], but should make us a bit
more resilient anyway.

[1]: hashicorp/terraform-provider-aws@1af53b1#diff-f4b0dbdc7e3eede6ba70cd286c834f37R92
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604#c15
@wking force-pushed the bump-load-balancer-timeouts branch from ad26356 to 1ec2758 on August 28, 2019 16:06
@abhinavdahiya (Contributor)

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Aug 28, 2019
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhinavdahiya (Contributor)

hmm, it doesn't look like it is working..
see https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029#0:build-log.txt%3A49

/hold

@openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Aug 28, 2019
@wking (Member, Author) commented Aug 28, 2019

hmm, it doesn't look like it is working..
see https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029#0:build-log.txt%3A49

Ah, is this "upgrade uses the tip release image as the source and then upgrades to tip+PR-change as the target"? In which case, do we care about running upgrade tests in the installer repo?

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Built from commit' | head -n1
time="2019-08-28T16:21:09Z" level=debug msg="Built from commit 6318c72cd078fb26f835e28117d7d02bc50d81a5"

6318c72 is the parent of 1ec2758 and does not include my bump.

@wking (Member, Author) commented Aug 28, 2019

e2e-aws:

level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

Checking ./bootstrap/journals/bootkube.log in the logs:

Aug 28 16:40:45 ip-10-0-6-7 bootkube.sh[1603]: Error: unhealthy cluster
Aug 28 16:40:45 ip-10-0-6-7 bootkube.sh[1603]: etcd cluster up. Killing etcd certificate signer...

so this got bit by the podman bug mentioned in #2274.

/retest

@abhinavdahiya (Contributor)

hmm, it doesn't look like it is working..
see https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029#0:build-log.txt%3A49

Ah, is this "upgrade uses the tip release image as the source and then upgrades to tip+PR-change as the target"? In which case, do we care about running upgrade tests in the installer repo?

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Built from commit' | head -n1
time="2019-08-28T16:21:09Z" level=debug msg="Built from commit 6318c72cd078fb26f835e28117d7d02bc50d81a5"

6318c72 is the parent of 1ec2758 and does not include my bump.

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label on Aug 28, 2019
@wking (Member, Author) commented Aug 28, 2019

e2e-aws-upgrade:

Cluster did not complete upgrade: timed out waiting for the condition: Working towards 0.0.1-2019-08-28-180928: 86% complete

But upgrade CI is unaffected by in-flight installer PRs, so this must be a flake.

/retest

@bparees (Contributor) commented Aug 28, 2019

Manually merging to hopefully unjam our CI system, which is suffering from hitting these timeouts in a lot of jobs. The PR has passed e2e-aws.

@bparees merged commit 35b73c0 into openshift:master on Aug 28, 2019
@openshift-ci-robot (Contributor)

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1717604 has been moved to the MODIFIED state.

In response to this:

Bug 1717604: data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking deleted the bump-load-balancer-timeouts branch on August 28, 2019 20:57
@openshift-ci-robot (Contributor)

@wking: The following tests failed, say /retest to rerun them all:

Test name                Commit   Rerun command
ci/prow/e2e-libvirt      1ec2758  /test e2e-libvirt
ci/prow/e2e-aws-upgrade  1ec2758  /test e2e-aws-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking (Member, Author) commented Sep 6, 2019

/cherrypick release-4.1

@openshift-cherrypick-robot

@wking: #2279 failed to apply on top of branch "release-4.1":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	data/data/aws/vpc/master-elb.tf
Falling back to patching base and 3-way merge...
Auto-merging data/data/aws/vpc/master-elb.tf
CONFLICT (content): Merge conflict in data/data/aws/vpc/master-elb.tf
Patch failed at 0001 data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m

In response to this:

/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking added a commit to wking/openshift-installer that referenced this pull request on Sep 13, 2019
We've been hitting the default 10m timeout recently, starting around
[1].  Use the creation timeout knob [2] to wait a bit longer before
giving up on AWS.  20m matches the value we've used bumping this
timeout for other resources in 1ec2758 (data/aws/vpc/master-elb:
Bump load-balancer timeouts to 20m, 2019-08-27, openshift#2279) and previous.
There's no guarantee that 20m will be sufficient; the issue could be
due to internal AWS issues like a shortage of on-demand instances of
the requested type in the requested availability zone.  But it gives
AWS an even easier target to hit ;).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console/2708/pull-ci-openshift-console-master-e2e-aws-console-olm/8584
[2]: https://www.terraform.io/docs/providers/aws/r/instance.html#timeouts
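
The creation-timeout knob in [2] is the analogous timeouts block on the Terraform aws_instance resource. A minimal sketch of the kind of change that commit describes, with a placeholder AMI, instance type, and resource name rather than the installer's actual values:

  # Illustrative sketch only; the AMI, instance type, and resource name are
  # placeholders, not the installer's real master-instance configuration.
  resource "aws_instance" "master" {
    ami           = "ami-0example"
    instance_type = "m4.xlarge"

    timeouts {
      create = "20m" # up from the aws_instance provider default of 10m
    }
  }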