Workflow controller metrics server only starts after leadership election is won #10037

Closed
2 of 3 tasks
avestuk opened this issue Nov 15, 2022 · 5 comments · Fixed by #11295
Labels
area/controller Controller issues, panics area/metrics type/feature Feature request

Comments

@avestuk

avestuk commented Nov 15, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I'd expect every workflow controller to start its metrics server, even if only the leader reports proper metrics. Instead, only the leader starts the metrics server.

Version

v3.4.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Run the workflow controllers with leadership election enabled.

Logs from the workflow controller

Leader logs:
argo-workflows-workflow-controller-7799fc99f4-hczpd controller time="2022-11-15T14:40:50.012Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"

The non-leader doesn't start the metrics server.

Logs from your workflow's wait container

n/a

It looks like the metrics server is started here: https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/controller.go#L301

but it could (I think) be started earlier when the controller is instantiated here: https://github.com/argoproj/argo-workflows/blob/master/cmd/workflow-controller/main.go#LL112

@sarabala1979 sarabala1979 added type/feature Feature request and removed type/bug labels Nov 21, 2022
@sarabala1979
Member

@avestuk This is expected behavior. Can you describe your use case for why all the other controllers need to start the metrics server?

@RenePinnow

@sarabala1979
How should monitoring work in this case? When I create a ServiceMonitor, there will always be a pod that sends no metrics, and on the monitoring side I can't tell that this is because the pod is currently not the leader. What is the recommended way to monitor only the leader?

@avestuk
Author

avestuk commented Mar 13, 2023

@sarabala1979 Sorry for the delayed response. The issue is that if Prometheus scrapes both pods for metrics, an alert is raised because it cannot get metrics from one of them. There's no way for Prometheus to know which pod is the leader.

We noticed that the behavior changed between v3.2.6 and v3.4.3.

I'd expect that the metrics server gets started even if no metrics are served. That way Prometheus can scrape all pods without errors.

@yevhen-harmonizehr

I can see that a related issue, #8283, was raised already but closed without a fix.
I cannot see any reasonable way to solve the ServiceMonitor auto-discovery problem. I think the best solution would be to enable the metrics server on all pods, as mentioned in this comment: #8283 (comment)

@sakai-ast
Contributor

sakai-ast commented Jun 30, 2023

Hello, I'm interested in making my first contribution to Argo Workflows. May I address this issue?
I have almost completed the implementation of the suggestions in this comment: #8283 (comment)
I have created a pull request.
