
Readiness probe failed on version 3.3.2 #8441

Closed
omarkalloush opened this issue Apr 21, 2022 · 4 comments · Fixed by #8454
Labels
area/controller · area/manifests · solution/workaround · type/bug

Comments

@omarkalloush

Summary

What happened/what you expected to happen?

Upgraded Argo Workflows from v3.3.1 to v3.3.2. I have 3 workflow-controller pods in the ReplicaSet, and 2 of them stay stuck in an unhealthy state with the following error: Readiness probe failed: Get "http://10.57.3.31:9090/metrics": dial tcp 10.57.3.31:9090: connect: connection refused

What version are you running?

v3.3.2
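
In case it helps to reproduce, this is roughly how the failure shows up (a sketch; the argo namespace and the label selector are assumptions based on the default install):

```bash
# The non-leader controller replicas show 0/1 READY
kubectl -n argo get pods -l app=workflow-controller

# The Events section of a not-ready pod shows the probe failure, e.g.
# Readiness probe failed: Get "http://<pod-ip>:9090/metrics": ... connection refused
kubectl -n argo describe pod <workflow-controller-pod-name>
```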

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec
Contributor

alexec commented Apr 21, 2022

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

This changed in v3.3. Services only route traffic to ready pods. Metrics requests were being routed from the metrics service to the non-leader controller pods, but those pods cannot field them - their metrics are worse than useless. So this is semantically correct - the non-leader pods are not ready to accept traffic.

These pods will pass their liveness probe.

Could you explain in more detail what you mean by "unhealthy"?

If you need a work-around, and you do not use the metrics service, then you should remove these readiness probes.
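
For example, a minimal sketch of that workaround (assuming the default argo namespace and that the readiness probe sits on the first container of the workflow-controller Deployment):

```bash
# Remove the readiness probe from the workflow-controller Deployment.
# Namespace and container index are assumptions - adjust for your install.
kubectl -n argo patch deployment workflow-controller --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
```

Only do this if you are not scraping the metrics Service, since the probe is what keeps that Service pointed at the leader pod.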

@alexec alexec added the solution/workaround label Apr 21, 2022
@acj

acj commented Apr 22, 2022

I ran into this when upgrading from 3.2.x to 3.3.2 and wanted to share a few notes. My previous deployment used a rolling deploy strategy, which then caused the following error when I tried to apply the upgrade:

error: Deployment.apps "workflow-controller" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy type is 'Recreate'

My initial plan was to continue using the rolling deploy (instead of Recreate), which is when I ran into this issue with the readiness probe. The failing probe blocked the rolling deploy indefinitely because the first new pod never became ready, which might be what OP means by "unhealthy". The discussion in #8283 helped me understand why the new strategy is needed. (Not scraping the non-leader pods fixed several confusing quirks in our metrics, so thanks for the improvement!)

To perform the upgrade, I manually edited the old deployment to use strategy type Recreate, gave it a moment to cycle, and then applied the 3.3.2 manifest. Then the new controller pods started up correctly.
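
A rough sketch of those steps (the argo namespace and the release-manifest URL are assumptions - substitute whatever matches your install):

```bash
# Switch the existing Deployment to the Recreate strategy and clear the
# rollingUpdate block, which may not be set when the type is 'Recreate'.
kubectl -n argo patch deployment workflow-controller \
  -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'

# After the old pods have cycled, apply the v3.3.2 manifest.
kubectl -n argo apply -f https://github.com/argoproj/argo-workflows/releases/download/v3.3.2/install.yaml
```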

@alexec alexec removed the triage label Apr 22, 2022
@alexec
Contributor

alexec commented Apr 22, 2022

We've now seen this issue ourselves. Argo CD treats un-ready pods as un-healthy. While I think that is semantically wrong, it is correct in practice.

alexec added a commit to alexec/argo-workflows that referenced this issue Apr 22, 2022
Signed-off-by: Alex Collins <alex_collins@intuit.com>
@alexec
Contributor

alexec commented Apr 22, 2022

I'm reverting that fix.

alexec added a commit that referenced this issue Apr 22, 2022
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexec added a commit that referenced this issue Apr 25, 2022
Signed-off-by: Alex Collins <alex_collins@intuit.com>