
v3.3.1: only the leader workflow-controller pod can expose metrics. #8283

Open · wshi5985 opened this issue Mar 31, 2022 · 38 comments · Fixed by #8285

Labels: area/controller (Controller issues, panics), type/bug, type/regression (Regression from previous behavior, a specific type of bug)

Comments
@wshi5985 commented Mar 31, 2022

Checklist

  • Double-checked my configuration.
  • Tested using the latest version.
  • Used the Emissary executor.

Summary

We upgraded Argo from 3.1.3 to 3.3.1. There are 2 workflow-controller replica pods; somehow, after the upgrade, only the leader workflow-controller pod exposes metrics.

What version are you running? v3.3.1

Diagnostics

Our Prometheus server UI shows one of the workflow-controller pods' metrics targets as DOWN, with the error: Get "http://10.127.217.135:9090/metrics": dial tcp 10.127.217.135:9090: connect: connection refused
The other pod (the leader) shows UP; its status is OK.

We tried running "wget" against the service endpoint workflow-controller-metrics on port 9090 (from a testing pod). It intermittently got metrics back, and about half the time it returned the error "wget: can't connect to remote host (172.20.116.209): Connection refused".

No related errors were found in the pod logs.

# sometimes it can get result back
 $ wget http://workflow-controller-metrics.argo:9090/metrics
Connecting to workflow-controller-metrics.argo:9090 (172.20.116.209:9090)
saving to 'metrics'
metrics              100% |*************************************************************************************************************************| 23326  0:00:00 ETA
'metrics' saved

# sometimes connection refused
$ wget http://workflow-controller-metrics.argo:9090/metrics
Connecting to workflow-controller-metrics.argo:9090 (172.20.116.209:9090)
wget: can't connect to remote host (172.20.116.209): Connection refused

Argo runs in multiple EKS clusters for us. We upgraded Argo to v3.3.1 in two clusters, and both have the same issue; the remaining clusters on Argo v3.1.3 do not have this problem.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@github-actions (Contributor) commented:

@wshi5985: There are no area labels on this issue. Adding a label will help expedite your issue. If you are unsure what to do, make your best guess. We can change it later.

  • /area api
  • /area artifacts
  • /area build
  • /area cli
  • /area controller
  • /area cron-workflows
  • /area daemon-steps
  • /area docs
  • /area executor
  • /area exit-handler
  • /area hooks
  • /area looping
  • /area manifests
  • /area memoization
  • /area metrics
  • /area multi-cluster
  • /area mutex-semaphore
  • /area plugins
  • /area sdks
  • /area spec
  • /area sso-rbac
  • /area suspend-resume
  • /area templates/container
  • /area templates/container-set
  • /area templates/dag
  • /area templates/data
  • /area templates/http
  • /area templates/resource
  • /area templates/script
  • /area templates/steps

I am a bot created to help the argoproj developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the DeFiCh/oss-governance-bot repository.

@wshi5985 (Author):

/area metrics

@alexec added the type/regression (Regression from previous behavior, a specific type of bug) label Mar 31, 2022
@alexec (Contributor) commented Mar 31, 2022

@whynowy we only have one pod serving metrics in v3.3. The service should always route to that pod. Is that something we can do, do you know?

alexec added a commit to alexec/argo-workflows that referenced this issue Mar 31, 2022
Signed-off-by: Alex Collins <alex_collins@intuit.com>
@alexec (Contributor) commented Mar 31, 2022

@wshi5985 could you please try adding the readiness check in the attached PR?

@whynowy (Member) commented Mar 31, 2022

Why would adding a readiness check help? @alexec

@alexec (Contributor) commented Mar 31, 2022

The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

You can put a service in front of the pods for the metrics (we should vend that). It should work.
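For illustration, such a readiness probe might look roughly like the snippet below. This is a sketch only: the path and port match the metrics endpoint described in this issue (/metrics on 9090), but the exact probe added in the linked PR may differ.

  # Sketch: mark the pod ready only when its metrics server answers,
  # so the Service routes traffic solely to the replica serving metrics (the leader).
  readinessProbe:
    httpGet:
      path: /metrics
      port: 9090
    initialDelaySeconds: 5
    periodSeconds: 10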

@alexec (Contributor) commented Mar 31, 2022

Non-leader pods should not expose metrics. They don't have anything interesting.

@whynowy (Member) commented Mar 31, 2022

This means the standby pod will be periodically restarted.

@alexec (Contributor) commented Mar 31, 2022

I think that is a liveness probe. If the readiness probe fails, then the pod is just not ready. Correct me if I'm wrong?

@wshi5985 (Author) commented Mar 31, 2022

workflow-controller-869578f6c7-7jvhj   0/1     Running   0          3m27s
workflow-controller-869578f6c7-sc9kx   1/1     Running   0          12m

As expected, one pod is not ready.

  Warning  Unhealthy         24s    kubelet            Readiness probe failed: Get "http://10.127.212.119:9090/metrics": dial tcp 10.127.212.119:9090: connect: connection refused

This means there will always be one pod in NotReady mode, which I think will trigger our monitoring alerts.
Can we just let both pods expose metrics as before?

@alexec (Contributor) commented Mar 31, 2022

Removing the metrics endpoint was not intentional; it was a side effect of other changes. Do you think it provides useful data?

@wshi5985 (Author):

Plus, it is tricky for updates/deploys, since the new pod cannot come up while the old pod is the leader.
I had to delete the old ReplicaSet to let the new pod come up.

@wshi5985 (Author):

@alexec we are not using those metrics right now; we only noticed this when the Prometheus server showed errors.

@alexec (Contributor) commented Mar 31, 2022

I'm unclear. Is there another problem here? You should not have to delete any pods for the stand-by leader to come up.

@wshi5985 (Author) commented Apr 1, 2022

I don't see other problems.

There is a rolling update strategy:

    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%

When the new deployment YAML was applied, one new pod tried to come up, and the two old pods stayed up waiting for the new one to become ready. But the new pod remained NotReady, so all three pods hung there and the deploy was stuck until I deleted the old ReplicaSet, which removed the two old pods. The first new pod then became leader and ready, and the second new pod came up NotReady as well.

@alexec (Contributor) commented Apr 1, 2022

Are you saying we can't use readiness because it prevents a rolling update?

@wshi5985 (Author) commented Apr 1, 2022

It looks like a pod cannot be picked as leader if it is not ready.

I just did another test: I set the rolling update strategy to 50%, with the readinessProbe configured. When the change was applied, 2 new pods tried to come up; the old leader pod stayed up and the old stand-by pod terminated. Both new pods failed the readiness probe, so neither was ready and neither could be picked as leader, until I deleted the old ReplicaSet (old pods gone). Then one of the new pods became ready and was picked as leader, and the other stayed NotReady as expected.

If I remove the readinessProbe, both pods become ready and one is picked as leader; there is no need to delete the old ReplicaSet.

@wshi5985 (Author) commented Apr 1, 2022

BTW, if we remove the workflow-controller-metrics service, what would the impact be (besides no metrics being exposed)? Does anything depend on the exposed metrics?

@alexec (Contributor) commented Apr 1, 2022

Leader election is unrelated to readiness. I don't believe readiness (or liveness) will affect it.

@whynowy (Member) commented Apr 1, 2022

Why don't we abandon using a Service to expose metrics, and instead recommend configuring pod discovery in Prometheus?
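For illustration only, pod discovery with the Prometheus Operator could look roughly like the hypothetical PodMonitor below; the label selector and port name are assumptions, not manifests shipped with Argo (and, as noted later in this thread, this still scrapes non-leader pods unless they are labelled or filtered out).

  # Hypothetical PodMonitor scraping the controller pods directly, bypassing the Service.
  apiVersion: monitoring.coreos.com/v1
  kind: PodMonitor
  metadata:
    name: workflow-controller
    namespace: argo
  spec:
    selector:
      matchLabels:
        app: workflow-controller        # assumed pod label
    podMetricsEndpoints:
      - port: metrics                   # assumed name of the 9090 container port
        path: /metrics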

@wshi5985 (Author) commented Apr 1, 2022

During deployment, when the readinessProbe failed, the pod's "ready" status showed "0/1" (instead of "1/1") and the deployment got stuck; I am not sure why.
Also, our setup uses prometheus-operator, which uses a ServiceMonitor -> Service for metrics collection.
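As a rough sketch of the setup described here (names, namespace, and port name are assumptions, not the actual manifests), a ServiceMonitor pointing at the metrics Service would look something like this; with only the leader answering on the port, one of its targets shows as DOWN:

  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: workflow-controller-metrics
    namespace: argo
  spec:
    selector:
      matchLabels:
        app: workflow-controller-metrics   # assumed Service label
    endpoints:
      - port: metrics                      # assumed name of the 9090 Service port
        path: /metrics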

@alexec (Contributor) commented Apr 1, 2022

I don't think you can use rolling update at all.

Consider this. You have 3 replicas. If you want to use the service, you must have a readiness check; otherwise you get errors.

Only one is the leader, so only one is ready.

When a rolling update occurs, the leader will not be deleted until another replica becomes ready. This will never happen, because replicas only become ready when they become leader. Catch-22!

Instead, you could use:

  strategy:
    type: Recreate

Could you please try that?

@sarabala1979 assigned dpadhiar and unassigned sarabala1979 Apr 6, 2022
alexec added a commit that referenced this issue Apr 13, 2022
Signed-off-by: Alex Collins <alex_collins@intuit.com>
@sarabala1979 mentioned this issue Apr 14, 2022
sarabala1979 pushed a commit that referenced this issue Apr 18, 2022
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexec added a commit to alexec/argo-workflows that referenced this issue Apr 22, 2022
…rgoproj#8285)"

This reverts commit 283f6b5.

Signed-off-by: Alex Collins <alex_collins@intuit.com>
@MatthewHou:

@whynowy Can you please elaborate a bit more on configuring pod discovery? At the moment we use the ServiceMonitor CRD to collect metrics from the workflow-controller. We also noticed that one of the scrape targets is failing due to the change in non-leader controller metrics. How should we work around it? It looks like a PodMonitor won't work unless some additional label can be added to the leader pod.

Why don't abandon using Service to expose metrics, instead, recommend to configure pod discovery in prometheus?

@tyrken commented Jun 14, 2022

FWIW, adding the readiness check (and switching to the Recreate strategy) didn't actually fix the kube-prometheus-stack TargetDown alert: the endpoints were still there (in our EKS 1.22 cluster) and so were still used by the ServiceMonitor to configure Prometheus. Not sure why, as your discussion above sounds reasonable.

...but even if that worked, it doesn't seem like a good solution, as we'd then have a pod continually sitting "unready", which would trigger another alert for containers staying "unready" too long. Could we not have the metrics endpoint back, but serving no metric values? This is just an alternative to @MatthewHou's suggestion above, if you don't want to edit your own metadata on the fly (not sure which is easiest/cleanest).

@sherifabdlnaby:

@tyrken Totally agree! Deliberately making replicas unhealthy is a hacky workaround (and I don't think it solves the Prometheus issue).

An "unhealthy" pod will trigger other alerts for pods staying unhealthy for a long time, because you shouldn't have unhealthy pods in the cluster for too long. Also, the pod is not really unhealthy... it's just idle 🤷🏻

Other solutions, like adding a per-pod monitor in Prometheus so it ignores non-leader pods, are also hacky, as you'd need to re-adjust Prometheus whenever there is a new leader.

I believe having /metrics on the non-leader replicas serve no metrics is the best option.

@yevhen-harmonizehr:

@alexec sorry to bother you, but is there any chance this will be reopened? I can see it is still a very annoying issue.

@sherifabdlnaby:
This issue shouldn't have been closed :/

@terrytangyuan (Member):

#11295 should fix this. Could you try the latest tag?

@stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

The stale bot added the problem/stale (This has not had a response in some time) label Sep 17, 2023
@terrytangyuan removed the problem/stale label Sep 20, 2023
@srinivin:

Still seeing the issue, using version 3.4.11.

I have the replica count set to 3, and when trying to access metrics through the service I see an empty response; it works if I explicitly port-forward from the leader pod.

@terrytangyuan (Member) commented Sep 21, 2023

@sakai-ast Would you like to help take a look to see what's still missing?

@sakai-ast (Contributor):

@srinivin @terrytangyuan
I think this is the correct behavior.
You can sometimes get metrics through the service, but not always, because the service can't detect the leader pod; it randomly routes to the non-leader pods, and then you get an empty response.

Before #11295, you would get an error like "Connection refused" when the service accessed a non-leader pod.

I also tested the behavior of v3.4.11 and v3.4.9 (before the fix) by curling the service locally.

v3.4.11

When accessing a non-leader pod:

~ $ curl -v http://argo-argo-workflows-workflow-controller.argo.svc.cluster.local:8080/metrics
*   Trying 192.168.194.156:8080...
* Connected to argo-argo-workflows-workflow-controller.argo.svc.cluster.local (192.168.194.156) port 8080
> GET /metrics HTTP/1.1
> Host: argo-argo-workflows-workflow-controller.argo.svc.cluster.local:8080
> User-Agent: curl/8.3.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sun, 24 Sep 2023 04:10:59 GMT
< Content-Length: 0

When accessing the leader pod:

~ $ curl -v http://argo-argo-workflows-workflow-controller.argo.svc.cluster.local:8080/metrics
*   Trying 192.168.194.156:8080...
* Connected to argo-argo-workflows-workflow-controller.argo.svc.cluster.local (192.168.194.156) port 8080
> GET /metrics HTTP/1.1
> Host: argo-argo-workflows-workflow-controller.argo.svc.cluster.local:8080
> User-Agent: curl/8.3.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain; version=0.0.4; charset=utf-8
< Date: Sun, 24 Sep 2023 04:11:30 GMT
< Transfer-Encoding: chunked
<
# HELP argo_workflows_count Number of Workflows currently accessible by the controller by status (refreshed every 15s)
# TYPE argo_workflows_count gauge
argo_workflows_count{status="Error"} 0
argo_workflows_count{status="Failed"} 0
argo_workflows_count{status="Pending"} 0
...

v3.4.9

When accessing a non-leader pod:

~ $ curl -v http://argo-argo-workflows-workflow-controller.argo.svc.cluster.local:8080/metrics
*   Trying 192.168.194.156:8080...
* connect to 192.168.194.156 port 8080 failed: Connection refused
* Failed to connect to argo-argo-workflows-workflow-controller.argo.svc.cluster.local port 8080 after 2 ms: Couldn't connect to server
* Closing connection
curl: (7) Failed to connect to argo-argo-workflows-workflow-controller.argo.svc.cluster.local port 8080 after 2 ms: Couldn't connect to server

When accessing the leader pod:

~ $ curl -v http://argo-argo-workflows-workflow-controller.argo.svc.cluster.local:8080/metrics
*   Trying 192.168.194.156:8080...
* Connected to argo-argo-workflows-workflow-controller.argo.svc.cluster.local (192.168.194.156) port 8080
> GET /metrics HTTP/1.1
> Host: argo-argo-workflows-workflow-controller.argo.svc.cluster.local:8080
> User-Agent: curl/8.3.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain; version=0.0.4; charset=utf-8
< Date: Sun, 24 Sep 2023 04:14:19 GMT
< Transfer-Encoding: chunked
<
# HELP argo_workflows_count Number of Workflows currently accessible by the controller by status (refreshed every 15s)
# TYPE argo_workflows_count gauge
argo_workflows_count{status="Error"} 0
argo_workflows_count{status="Failed"} 0
argo_workflows_count{status="Pending"} 0
argo_workflows_count{status="Running"} 0
...

If your service always gets an empty response, that is a different problem.

@srinivin:

Thanks for the update.

In these cases, shouldn't it behave the same as SQL write and read replicas?

I mean, although we direct writes only to the primary replica, reads can happen from any replica. In the same way, although the leader pod is responsible for triggering workflows and other capabilities, shouldn't the non-leader pods also serve the metrics metadata? I assume the metrics state could live in a shareable persistence layer between the leader and non-leader pods.

@Joibel self-assigned this Aug 30, 2024