fix(controller): handle race when starting metrics server. Fixes #10807 #13731

Merged 6 commits into argoproj:main on Oct 10, 2024

Conversation

@MasonM (Contributor) commented Oct 9, 2024

Partial fix for #10807

Motivation

While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test, I noticed `listen tcp :9090: bind: address already in use` errors in the controller logs (example):

controller: time="2024-10-07T18:22:14.793Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
server: time="2024-10-07T18:22:14.793Z" level=info msg="Creating event controller" asyncDispatch=false operationQueueSize=16 workerCount=4
server: time="2024-10-07T18:22:14.800Z" level=info msg="GRPC Server Max Message Size, MaxGRPCMessageSize, is set" GRPC_MESSAGE_SIZE=104857600
server: time="2024-10-07T18:22:14.800Z" level=info msg="Argo Server started successfully on http://localhost:2746" url="http://localhost:2746"
controller: I1007 18:22:14.800947   25045 leaderelection.go:260] successfully acquired lease argo/workflow-controller
controller: time="2024-10-07T18:22:14.801Z" level=info msg="new leader" leader=local
controller: time="2024-10-07T18:22:14.801Z" level=info msg="Generating Self Signed TLS Certificates for Telemetry Servers"
controller: time="2024-10-07T18:22:14.802Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
controller: panic: listen tcp :9090: bind: address already in use
controller:
controller: goroutine 37 [running]:
controller: github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer.func2()
controller: 	/home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:94 +0x16a
controller: created by github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer in goroutine 36
controller: 	/home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:91 +0x53c
2024/10/07 18:22:14 controller: process exited 25045: exit status 2
controller: exit status 2
2024/10/07 18:22:14 controller: backing off 4s

The controller gracefully handles this by automatically restarting, but it definitely shouldn't be happening. I believe this is a race condition introduced in #11295. Here's the sequence of events that triggers it:

  1. Controller starts
  2. Dummy metrics server started on port 9090:
     go wfController.RunPrometheusServer(dummyCtx, true)
  3. Leader election takes place and controller starts leading:
     go leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
         Lock: &resourcelock.LeaseLock{
             LeaseMeta:  metav1.ObjectMeta{Name: leaderName, Namespace: namespace}, Client: kubeclientset.CoordinationV1(),
             LockConfig: resourcelock.ResourceLockConfig{Identity: nodeID, EventRecorder: events.NewEventRecorderManager(kubeclientset).Get(namespace)},
         },
         ReleaseOnCancel: false,
         LeaseDuration:   env.LookupEnvDurationOr("LEADER_ELECTION_LEASE_DURATION", 15*time.Second),
         RenewDeadline:   env.LookupEnvDurationOr("LEADER_ELECTION_RENEW_DEADLINE", 10*time.Second),
         RetryPeriod:     env.LookupEnvDurationOr("LEADER_ELECTION_RETRY_PERIOD", 5*time.Second),
         Callbacks: leaderelection.LeaderCallbacks{
             OnStartedLeading: func(ctx context.Context) {
  4. Context for dummy metrics server cancelled
  5. Dummy metrics server shuts down:
     // Waiting for stop signal
     <-ctx.Done()
     // Shutdown the server gracefully with a 1 second timeout
     ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
     defer cancel()
     if err := srv.Shutdown(ctx); err != nil {
  6. Prometheus metrics server started on 9090:
     go wfController.RunPrometheusServer(ctx, false)

The problem is that steps 5 and 6 can happen out of order, because the shutdown happens after the context is cancelled. Per the docs (https://pkg.go.dev/context#CancelFunc), "a CancelFunc does not wait for the work to stop".
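To make the ordering concrete, here is a minimal, self-contained sketch of the pattern (hypothetical `runServer` function and hard-coded port, not the actual controller code): the goroutine that calls `srv.Shutdown` only starts running after the context is cancelled, so whoever called `cancel()` can try to bind the same port before the old listener is released, which is exactly the panic in the log above.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// runServer mimics the shape of RunPrometheusServer: serve on addr until ctx
// is cancelled, then shut down from a separate goroutine.
func runServer(ctx context.Context, addr string) {
	srv := &http.Server{Addr: addr}
	go func() {
		// Waiting for stop signal
		<-ctx.Done()
		// Shutdown starts only after cancellation; nothing makes the caller
		// of cancel() wait for it to finish releasing the port.
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
		defer cancel()
		_ = srv.Shutdown(shutdownCtx)
	}()
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		panic(err) // this is where "bind: address already in use" surfaces
	}
}

func main() {
	dummyCtx, cancelDummy := context.WithCancel(context.Background())
	go runServer(dummyCtx, ":9090") // dummy metrics server
	time.Sleep(100 * time.Millisecond)

	cancelDummy() // returns immediately; the dummy server may still hold :9090
	go runServer(context.Background(), ":9090") // real metrics server: may panic

	time.Sleep(2 * time.Second)
}
```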

Modifications

The controller needs to explicitly wait for the dummy metrics server to shut down properly before starting the Prometheus metrics server. There are many ways of doing that; this PR uses a `WaitGroup`, since that's the simplest approach I could think of.
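A minimal sketch of the `WaitGroup` idea, assuming illustrative names and signatures rather than the exact ones in this PR: the shutdown goroutine calls `wg.Done()` only after `Shutdown` returns, and the caller blocks on `wg.Wait()` before binding the same port again.

```go
package main

import (
	"context"
	"net/http"
	"sync"
	"time"
)

// runPrometheusServer is a stand-in for the real method: it serves on addr
// until ctx is cancelled and marks the WaitGroup done once the listener has
// actually been shut down.
func runPrometheusServer(ctx context.Context, addr string, wg *sync.WaitGroup) {
	srv := &http.Server{Addr: addr}
	go func() {
		<-ctx.Done()
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
		defer cancel()
		_ = srv.Shutdown(shutdownCtx)
		wg.Done() // signalled only after Shutdown returns and the port is free
	}()
	_ = srv.ListenAndServe()
}

func main() {
	var wg sync.WaitGroup
	dummyCtx, cancelDummy := context.WithCancel(context.Background())

	wg.Add(1)
	go runPrometheusServer(dummyCtx, ":9090", &wg) // dummy metrics server

	// ... leader election happens; OnStartedLeading fires ...
	time.Sleep(100 * time.Millisecond)

	cancelDummy()
	wg.Wait() // explicitly wait until the dummy server has released :9090

	wg.Add(1)
	go runPrometheusServer(context.Background(), ":9090", &wg) // real server
	time.Sleep(2 * time.Second)                                // the controller keeps running here
}
```

Per the review threads below, the actual changes touch util/telemetry/exporter_prometheus.go, cmd/workflow-controller/main.go, and workflow/controller/controller.go; the sketch only mirrors the shape of the coordination.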

Verification

Ran tests locally

@agilgur5 (Member) commented Oct 9, 2024

> I believe this is a race condition introduced in #11295.

I guess some changes from #13265 or related made this happen more frequently? I was trying to figure out why it recently started 🤔

@agilgur5 agilgur5 requested a review from Joibel October 9, 2024 16:04
@MasonM MasonM marked this pull request as ready for review October 10, 2024 02:13
@agilgur5 agilgur5 added area/controller Controller issues, panics area/metrics labels Oct 10, 2024
@agilgur5 (Member) left a comment

some ideas below

Review threads (outdated, now resolved) on:
- util/telemetry/exporter_prometheus.go
- cmd/workflow-controller/main.go
- workflow/controller/controller.go
@agilgur5 agilgur5 changed the title fix(controller): race condition when starting metrics server. Fixes #10807 fix(controller): handle race when starting metrics server. Fixes #10807 Oct 10, 2024
@agilgur5 (Member) left a comment

LGTM, thanks for investigating the test flake and adding all the details here!

I am still curious about why it started happening more frequently (per above), though, if you've already uncovered that. If not, then I wouldn't spend more time on it, since this is a root-cause fix anyway.

@agilgur5 agilgur5 merged commit 8100f99 into argoproj:main Oct 10, 2024
29 checks passed
@Joibel (Member) commented Oct 10, 2024

Thank you for fixing this.

I am not sure why #13265 made this worse, but I suspect it did. Whatever is happening under the hood between otel and the Prometheus Go library will be radically different.

I would probably like a fix where the waitgroup was encapsulated on the metrics side so the tests could test the same thing as the actual running code - a redesign of the interface to switch prometheus metrics on and off. This is good for now though, so thank you.
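A purely hypothetical sketch of what that encapsulation could look like (none of these types or methods exist in argo-workflows; this is only to illustrate the interface shape): the metrics server owns both the shutdown trigger and the wait, so production code and tests go through the same path.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// metricsServer is a hypothetical wrapper that owns its own shutdown
// bookkeeping, so callers (and tests) never touch a WaitGroup directly.
type metricsServer struct {
	srv    *http.Server
	cancel context.CancelFunc
	done   chan struct{}
}

func newMetricsServer(addr string) *metricsServer {
	return &metricsServer{srv: &http.Server{Addr: addr}, done: make(chan struct{})}
}

// Run serves in the background until Stop is called or ctx is cancelled.
func (m *metricsServer) Run(ctx context.Context) {
	ctx, m.cancel = context.WithCancel(ctx)
	go func() {
		<-ctx.Done()
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
		defer cancel()
		_ = m.srv.Shutdown(shutdownCtx)
		close(m.done)
	}()
	go func() { _ = m.srv.ListenAndServe() }()
}

// Stop shuts the server down and blocks until the listener has been
// released, so a successor can bind the same port without racing.
func (m *metricsServer) Stop() {
	m.cancel()
	<-m.done
}

func main() {
	dummy := newMetricsServer(":9090")
	dummy.Run(context.Background())
	time.Sleep(100 * time.Millisecond)

	dummy.Stop() // returns only once :9090 is free

	primary := newMetricsServer(":9090")
	primary.Run(context.Background())
	time.Sleep(time.Second)
}
```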

I was going to revisit shutdown of telemetry at some point because it becomes more fun when you're trying to ensure traces have all been dispatched before you terminate - so it might come out of that.
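On that last point, a hedged sketch using the standard OpenTelemetry Go SDK (not any existing argo-workflows code) of why shutdown ordering matters for traces: the batch processor buffers spans in memory, so the provider's `Shutdown` has to be given a chance to flush them before the process exits.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	exporter, err := stdouttrace.New()
	if err != nil {
		log.Fatal(err)
	}
	// The batcher buffers spans and exports them periodically.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))

	_, span := tp.Tracer("example").Start(context.Background(), "do-work")
	time.Sleep(10 * time.Millisecond)
	span.End()

	// Without this, buffered spans can be lost when the process terminates.
	// Shutdown flushes pending spans and stops the exporter.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := tp.Shutdown(ctx); err != nil {
		log.Printf("telemetry shutdown: %v", err)
	}
}
```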
