cvo: Release our leader lease when we are gracefully terminated #87

smarterclayton · 2019-01-17T03:43:35Z

On an upgrade the CVO is always restarted but we don't release our
lease, which causes the new CVO pod to have to wait. We should
release our lease on exit.

Kube 1.14 (pr 71490) contains a new flag on leader lease that allows
the caller to step down gracefully. Because backporting that change
to client-go is complicated, we instead emulate the logic. When that
code is available we can simplify down.

This also makes install about 90s faster.

abhinavdahiya · 2019-01-17T17:50:35Z

cmd/start.go

+					time.Sleep(100 * time.Millisecond)
+					// if we still hold the leader lease, clear the owner identity (other lease watchers
+					// still have to wait for expiration) like the new ReleaseOnCancel code will do.
+					if err := lock.Update(resourcelock.LeaderElectionRecord{}); err == nil {


Resetting to empty loses the https://github.com/kubernetes/client-go/blob/v9.0.0/tools/leaderelection/resourcelock/interface.go#L42; is there a way we can step down without completely losing that information...?

other lease watchers still have to wait for expiration

if we reset the https://github.com/kubernetes/client-go/blob/v9.0.0/tools/leaderelection/resourcelock/interface.go#L39-L41 wouldn't the next candidate immediately try to reelect?

No, see the upstream PR - clients wait until the time is clear. They don't step down today until my PR lands in 1.14.

abhinavdahiya · 2019-01-17T17:52:41Z

cmd/start.go

+					// still have to wait for expiration) like the new ReleaseOnCancel code will do.
+					if err := lock.Update(resourcelock.LeaderElectionRecord{}); err == nil {
+						// if we successfully clear the owner identity, we can safely delete the record
+						if err := lock.Client.ConfigMaps(lock.ConfigMapMeta.Namespace).Delete(lock.ConfigMapMeta.Name, nil); err != nil {


Why delete the config map at all?

Deleting the config map is the only way to trigger a new election today.
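
For illustration, the clear-then-delete step-down discussed in this thread could look roughly like the following. This is a minimal sketch against a hypothetical in-memory lock: `record`, `memoryLock`, `errNotFound`, and `stepDown` are invented names, not client-go or CVO code, and the real change operates on a `resourcelock.ConfigMapLock`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// record is a simplified stand-in for a leader election record.
type record struct {
	HolderIdentity       string
	LeaseDurationSeconds int
}

var errNotFound = errors.New("lock record not found")

// memoryLock is a hypothetical in-memory lock used only for this sketch.
type memoryLock struct {
	mu     sync.Mutex
	exists bool
	rec    record
}

func (l *memoryLock) Get() (record, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.exists {
		return record{}, errNotFound
	}
	return l.rec, nil
}

func (l *memoryLock) Update(r record) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.exists {
		return errNotFound
	}
	l.rec = r
	return nil
}

func (l *memoryLock) Delete() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.exists {
		return errNotFound
	}
	l.exists = false
	return nil
}

// stepDown releases a lease held by identity. The separate Get and Update
// calls are not atomic here; a real API server would reject stale writes.
func stepDown(l *memoryLock, identity string) error {
	current, err := l.Get()
	if err != nil {
		return err
	}
	if current.HolderIdentity != identity {
		return fmt.Errorf("lease is held by %q, not %q", current.HolderIdentity, identity)
	}
	// Clear the owner identity first, as the reviewed diff does.
	if err := l.Update(record{}); err != nil {
		return err
	}
	// Only after a successful clear do we delete the record itself.
	return l.Delete()
}

func main() {
	lock := &memoryLock{exists: true, rec: record{HolderIdentity: "old-cvo", LeaseDurationSeconds: 90}}
	if err := stepDown(lock, "old-cvo"); err != nil {
		fmt.Println("step down failed:", err)
		return
	}
	_, err := lock.Get()
	fmt.Println("record deleted:", errors.Is(err, errNotFound))
}
```

The delete only happens after the clear succeeds, so a process that has already lost the lease never removes a record that now belongs to someone else.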

smarterclayton · 2019-01-17T19:13:54Z

/hold

I need to refactor this to be testable in an integration test (cmd/ is too full anyway)

smarterclayton · 2019-01-17T22:09:48Z

I took your feedback from previous PR and refactored the start command to have reusable code, and then had the integration test use it and cut the internal dependencies. Should be cleaner, and start is noticeably simpler (well, to me).

smarterclayton · 2019-01-17T22:10:02Z

I also added an integration test for step down.

abhinavdahiya · 2019-01-17T22:24:31Z

pkg/cvo/cvo.go

@@ -452,3 +452,8 @@ func (optr *Operator) currentVersion() configv1.Update {
 		Payload: optr.releaseImage,
 	}
 }
+
+// SetSyncWorkerForTesting updates the sync worker for whitebox testing.
+func (optr *Operator) SetSyncWorkerForTesting(worker ConfigSyncWorker) {


I contemplated having start lazily configure the worker, but this is a reasonable taint.

abhinavdahiya · 2019-01-17T22:26:38Z

Will take a look. 👍

abhinavdahiya · 2019-01-17T23:51:57Z

pkg/start/start.go

+
+// createResourceLock initializes the lock.
+func createResourceLock(cb *ClientBuilder, namespace, name string) (*resourcelock.ConfigMapLock, error) {
+	recorder := record.NewBroadcaster().NewRecorder(runtime.NewScheme(), v1.EventSource{Component: namespace})


ques: shouldn't runtime.NewScheme() be kubernetes.Scheme ?

I’m not 100% sure, it doesn’t actually matter for events (we fall back to a generic path if an object isn’t in the scheme)

Actually, we didn't have a sink, so we weren't actually recording metrics. I fixed that here, and made sure that everywhere the CVO logs events, we send them to our namespace rather than the default namespace (which is where they go if you pass ""). Added an integration test to verify we receive a leader event, which should catch us accidentally breaking it in the future.

abhinavdahiya · 2019-01-17T23:56:59Z

pkg/start/start_integration_test.go

+	}
+	t.Logf("lock deleted in %s", endTime.Sub(startTime))
+
+	select {


I like this ;)

abhinavdahiya · 2019-01-17T23:57:57Z

/approve

i'll rebase #78 on this and add a test for rotation..

abhinavdahiya · 2019-01-18T00:00:05Z

Also, can you squash this to a single commit, or 2 commits: (move to pkg/start) and (leader step down on exit)?

smarterclayton · 2019-01-18T00:12:04Z

Will squash - teasing them apart is harder now

On an upgrade the CVO is always restarted but we don't release our lease, which causes the new CVO pod to have to wait. We should release our lease on exit.

Kube 1.14 (pr 71490) contains a new flag on leader lease that allows the caller to step down gracefully. Because backporting that change to client-go is complicated, we instead emulate the logic. When that code is available we can simplify down.

cmd: Refactor how the CVO is started so the integration test is consistent

Move the integration test logic into a new package and reuse startup logic so that we have a much cleaner start command than before and so that we are testing what we run in the command. Remove rootOpts and startOpts and replace them with nested operations. Add a test that verifies we send leader election events.

smarterclayton · 2019-01-18T01:21:05Z

Rebased, squashed, and fixed the event thing.

abhinavdahiya · 2019-01-18T01:29:52Z

/lgtm
/hold cancel

openshift-ci-robot · 2019-01-18T01:29:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

smarterclayton · 2019-01-18T02:38:58Z

/retest

smarterclayton · 2019-01-18T03:44:53Z

/retest

We've had it since 2b81f47 (cvo: Release our leader lease when we are gracefully terminated, 2019-01-16, openshift#87), but it's redundant vs. "ListenAddr is not an empty string". I'm also switching to:

  o.ListenAddr != ""

instead of:

  len(o.ListenAddr) > 0

because it seems slightly easier to understand, but obviously either will work.
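
As a quick check of the "either will work" claim, the sketch below compares the two spellings. Only the `ListenAddr` field name comes from the commit message; the `options` type, the helper names, and the sample address are assumptions made for the example.

```go
package main

import "fmt"

// options is a hypothetical stand-in for the start options.
type options struct {
	ListenAddr string
}

// hasListenAddr uses the comparison the commit switches to.
func hasListenAddr(o options) bool { return o.ListenAddr != "" }

// hasListenAddrLen uses the older length-based spelling.
func hasListenAddrLen(o options) bool { return len(o.ListenAddr) > 0 }

func main() {
	// The second address is illustrative only.
	for _, addr := range []string{"", "0.0.0.0:9099"} {
		o := options{ListenAddr: addr}
		fmt.Printf("%q: %v %v\n", addr, hasListenAddr(o), hasListenAddrLen(o))
	}
}
```

Both helpers agree for every string, since a Go string is empty exactly when its length is zero.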

So the incoming cluster-version operator doesn't need to wait for the outgoing operator's lease to expire, which can take a while [1]:

  I0802 10:06:01.056591 1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
  ...
  I0802 10:07:42.632719 1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the:

  Cluster did not acknowledge request to upgrade in a reasonable time

testcase [2]. Using ReleaseOnCancel has been the plan since 2b81f47 (cvo: Release our leader lease when we are gracefully terminated, 2019-01-16, openshift#87). I'm not clear on why it (sometimes?) doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and the 5s After dates back to dbedb7a (cvo: When the CVO restarts, perform one final sync to write status, 2019-04-27, openshift#179), which bumped the After from 2s to 5s, but forgot to bump the comment. I'm removing that code here in favor of the two-minute timeout from b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, openshift#349). We still exit immediately on a second TERM, for folks who get impatient waiting for the graceful timeout.

Decouple shutdownContext from the context passed into Options.run, to allow TestIntegrationCVO_gracefulStepDown to request a graceful shutdown. And remove Context.Start(), inlining the logic in Options.run so we can count and reap the goroutines it used to launch. This also allows us to be more targeted with the context for each goroutine:

* Informers are now launched before the lease controller, so they're up and running by the time we acquire the lease. They remain running until the main operator CVO.Run() exits, after which we shut them down. Having informers running before we have a lease is somewhat expensive in terms of API traffic, but we should rarely have two CVO pods competing for leadership since we transitioned to the Recreate Deployment strategy in 078686d (install/0000_00_cluster-version-operator_03_deployment: Set 'strategy: Recreate', 2019-03-20, openshift#140) and 5d8a527 (install/0000_00_cluster-version-operator_03_deployment: Fix Recreate strategy, 2019-04-03, openshift#155). I don't see a way to block on their internal goroutine's completion, but maybe informers will grow an API for that in the future.

* The metrics server also continues to run until CVO.Run() exits, where previously we began gracefully shutting it down at the same time we started shutting down CVO.Run(). This ensures we are around and publishing any last-minute CVO.Run() changes.

* Leader election also continues to run until CVO.Run() exits. We don't want to release the lease while we're still controlling things.

* CVO.Run() and AutoUpdate.Run() both stop immediately when the passed-in context is canceled or we call runCancel internally (because of a TERM, error from a goroutine, or loss of leadership). These are the only two goroutines that are actually writing to the API servers, so we want to shut them down as quickly as possible.

Drop an unnecessary runCancel() from the "shutting down" branch of the error collector. I'd added it in b30aa0e, but you can only ever get into the "shutting down" branch if runCancel has already been called. And fix the scoping for the shutdownTimer variable so we don't clear it on each for-loop iteration (oops :p, bug from b30aa0e).

Add some logging to the error collector, so it's easier to see where we are in the collection process from the operator logs. Also start logging collected goroutines by name, so we can figure out which may still be outstanding.

Set terminationGracePeriodSeconds 130 to extend the default 30s [3], to give the container the full two-minute graceful timeout window before the kubelet steps in with a KILL.

Push the Background() initialization all the way up to the command-line handler, to make it more obvious that the context is scoped to the whole 'start' invocation.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core

squash! pkg/start: Release leader lease on graceful shutdown
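
The collector behavior described in that commit message (cancel on the first TERM, a graceful timeout, immediate exit on a second TERM, and a shutdownTimer that survives loop iterations) might be sketched as follows. This is not the CVO's Options.run: `run`, `demo`, the worker helpers, and the `term` channel (standing in for one fed by os/signal.Notify) are assumptions for illustration, and the 50ms grace period replaces the two-minute timeout so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// result pairs a goroutine's name with the error it returned.
type result struct {
	name string
	err  error
}

var (
	errForced  = errors.New("second TERM: exiting immediately")
	errTimeout = errors.New("graceful shutdown timed out")
)

// run launches the named workers and collects them by name.
func run(ctx context.Context, term <-chan struct{}, grace time.Duration, workers map[string]func(context.Context) error) error {
	runCtx, runCancel := context.WithCancel(ctx)
	defer runCancel()

	results := make(chan result, len(workers))
	for name, work := range workers {
		go func(name string, work func(context.Context) error) {
			results <- result{name: name, err: work(runCtx)}
		}(name, work)
	}

	// Shutdown state lives outside the loop so an iteration cannot reset it.
	var shutdownTimer *time.Timer
	var timeout <-chan time.Time
	runDone := runCtx.Done()
	gotTerm := false
	var firstErr error

	for pending := len(workers); pending > 0; {
		select {
		case <-runDone:
			// Shutdown began (TERM, worker error, or parent cancel): arm the timer once.
			runDone = nil
			shutdownTimer = time.NewTimer(grace)
			defer shutdownTimer.Stop()
			timeout = shutdownTimer.C
		case <-term:
			if gotTerm {
				return errForced
			}
			gotTerm = true
			runCancel()
		case r := <-results:
			pending--
			fmt.Printf("collected %s (%d outstanding)\n", r.name, pending)
			if r.err != nil {
				if firstErr == nil {
					firstErr = r.err
				}
				runCancel()
			}
		case <-timeout:
			return errTimeout
		}
	}
	return firstErr
}

// untilCanceled is a well-behaved worker: it returns once its context ends.
func untilCanceled(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

// neverReturns ignores cancellation, modeling a goroutine that hangs.
func neverReturns(context.Context) error { select {} }

// demo queues the given number of TERMs for a run with a single worker.
func demo(terms int, work func(context.Context) error) error {
	term := make(chan struct{}, terms)
	for i := 0; i < terms; i++ {
		term <- struct{}{}
	}
	workers := map[string]func(context.Context) error{"worker": work}
	return run(context.Background(), term, 50*time.Millisecond, workers)
}

func main() {
	fmt.Println("graceful:", demo(1, untilCanceled))
	fmt.Println("stuck:", demo(1, neverReturns))
	fmt.Println("forced:", demo(2, neverReturns))
}
```

Keeping the timer and its channel in variables declared before the `for` loop is the point of the scoping fix: if they were declared inside the loop body, each iteration would start with a fresh, unarmed timer and the timeout could never fire.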

…ap lock release

From the godocs:

  $ grep -A5 '// HolderIdentity' vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go
  	// HolderIdentity is the ID that owns the lease. If empty, no one owns this lease and
  	// all callers may acquire. Versions of this library prior to Kubernetes 1.14 will not
  	// attempt to acquire leases with empty identities and will wait for the full lease
  	// interval to expire before attempting to reacquire. This value is set to empty when
  	// a client voluntarily steps down.
  	HolderIdentity string `json:"holderIdentity"`

The previous assumption that the release would involve ConfigMap deletion was born with the test in 2b81f47 (cvo: Release our leader lease when we are gracefully terminated, 2019-01-16, openshift#87).
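
The difference that godoc comment describes can be modeled with a small decision function. This is a simplified sketch, not the client-go implementation: `record`, `canAcquire`, and the `honorsEmptyHolder` switch are invented for the example, times are plain Unix seconds, and the 90-second lease is an illustrative value. The real library also measures expiry from when it last observed a change to the record rather than from a stored renew time.

```go
package main

import "fmt"

// record is a simplified stand-in for a leader election record.
type record struct {
	HolderIdentity       string
	LeaseDurationSeconds int64
	RenewTimeUnix        int64
}

// canAcquire reports whether candidate may try to take the lease at nowUnix.
// honorsEmptyHolder models the Kubernetes 1.14+ behavior from the godoc;
// when false, an empty holder is treated like any other unexpired lease.
func canAcquire(rec record, candidate string, nowUnix int64, honorsEmptyHolder bool) bool {
	if rec.HolderIdentity == candidate {
		return true // renewing our own lease
	}
	if honorsEmptyHolder && rec.HolderIdentity == "" {
		return true // the previous leader voluntarily stepped down
	}
	// Otherwise wait for the full lease interval to expire.
	return nowUnix >= rec.RenewTimeUnix+rec.LeaseDurationSeconds
}

func main() {
	released := record{HolderIdentity: "", LeaseDurationSeconds: 90, RenewTimeUnix: 1000}
	fmt.Println(canAcquire(released, "new-cvo", 1010, false)) // false
	fmt.Println(canAcquire(released, "new-cvo", 1010, true))  // true
	fmt.Println(canAcquire(released, "new-cvo", 1090, false)) // true
}
```

Under this model, clearing the holder identity only speeds up handover for candidates that honor empty holders; older candidates still wait out the interval.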