
Flagger will not progress if 'primary' is down #639

Closed
ewarwick opened this issue Jun 30, 2020 · 9 comments

@ewarwick
I've got Flagger set up to do a blue/green deployment of our service, using the Contour mesh provider and ingress. Under normal circumstances it works exactly as expected: the canary comes up, runs my testing hook, promotes if it passes, and does not promote if it fails.

The issue happens when the existing 'primary' pods become unhealthy. Flagger hangs and refuses to progress the deployment, repeating messages like:
{"level":"info","ts":"2020-06-30T14:19:09.297Z","caller":"controller/events.go:28","msg":"myapp-primary.mynamespace not ready: waiting for rollout to finish: 0 of 3 updated replicas are available","canary":"myapp.mynamespace"}

forever, regardless of the progressDeadlineSeconds setting.

This makes it impossible to roll out a fix to a broken application, because Flagger waits until the 'primary' pods are healthy before progressing. I think we could get around this by setting the field to skip canary analysis ( #380 ), but then we would not be able to make use of the analysis features.

Canary YAML:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: mynamespace
spec:
  provider: contour
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 180
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: myapp
  service:
    port: 3000
    portDiscovery: true
  analysis:
    interval: 10s
    threshold: 5
    iterations: 10
    metrics: []
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://flagger-loadtester.fluxcd/
        timeout: 15s
        metadata:
          type: bash
          cmd: "curl -sd 'anon' http://myapp-canary.test:3000/healthcheck"
@Imaskar

Imaskar commented Jun 30, 2020

I'm having the same issue with Flagger on Linkerd.
Steps to reproduce:

  • install some app
  • delete its image from the registry you use, then delete the pod so it gets stuck in Pending
  • create a "fixed" image
  • try to deploy it

As a workaround, I think it's possible to deploy with skipAnalysis: true once and then continue with skipAnalysis: false. Are there any blockers to that?
For now I'm working around such problems by uninstalling, since the app is not yet in prod.
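
For reference, the workaround amounts to toggling a single field on the Canary spec, roughly like this (a sketch based on the spec above; skipAnalysis is the existing Flagger field, set back to false once the fix has been promoted):

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: mynamespace
spec:
  # bypass canary analysis for this one rollout, then revert to false
  skipAnalysis: true
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp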

@stefanprodan
Member

@ewarwick I don't see why you would want to run an analysis if the primary is down; my guess is that you want to deliver a fix as soon as possible, and that's achievable with skipAnalysis: true.

@ewarwick
Author

@stefanprodan We're using Flagger in concert with Flux monitoring an image repo, so the goal is to simply push the updated image and have it roll out as expected, without first having to alter the canary definition and then come back later to un-alter it. Our checks are currently relatively brief and are intended to ensure that the thing we are rolling out is functioning correctly. At least with our organization's change management processes, it would likely be faster to let the canary do its thing than to do the change, push, un-change dance.

It may be that we are misunderstanding something about the intended use case, but my team is confused that rolling out a new version of the application is blocked on the currently running version being 'healthy', especially for the blue/green option. Is there a reason that it behaves this way?

@stefanprodan
Member

Is there a reason that it behaves this way?

Yes, the primary check is needed for Flagger to take over an existing deployment without downtime. I'm considering adding a global flag to Flagger to disable the primary check.

@ewarwick
Author

That would be really helpful, regardless of whether it's global or on the individual Canary resource. Thanks for the explanation!

@Imaskar

Imaskar commented Jun 30, 2020

@stefanprodan could you please elaborate on why the primary has to be healthy? And do you mean that all pods should be healthy, or at least one?

I was thinking that if the primary is not healthy, Flagger could skip the analysis automatically. As @ewarwick said, the CD tool is unaware of the primary's state.

@mdibaiee
Contributor

mdibaiee commented Nov 10, 2021

Hello,
We are facing a similar issue. In our case the primary is not completely unhealthy, but we run deployments with a very high number of pods on Spot instances, so they are volatile (as things can be expected to be in a Kubernetes environment). As a result, we rarely have 100% of the desired primary replicas: pods get killed and re-created continuously. We consider that healthy, but Flagger does not, so it takes much longer to roll out a canary; Flagger waits until a brief stable state is reached before it starts, and during the rollout the primary stays volatile and pods keep getting killed and re-created, which slows the rollout down further.

I would suggest, instead of a global flag to "disable" the primary check, making the primary check configurable: i.e. what is the minimum number of pods that must be considered "healthy" before Flagger proceeds to deploy the canary? In our case the primary has 150 pods, but 100 ready pods is considered healthy and we are happy to proceed with a deployment in that state. This could be a percentage or a static number.

cc @stefanprodan we would be happy to contribute this to Flagger if we get the green light that this is something you would review and want to support.

Thank you!
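
To put numbers on the proposal above: 100 ready pods out of 150 desired is roughly a 67% readiness floor. Expressed as a percentage under analysis, the setting might look something like this (a sketch; the field name anticipates the one that was eventually merged, see the last comment below):

analysis:
  # proceed with the canary once ~67% of primary pods (100 of 150) are ready
  primaryReadyThreshold: 67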

@stefanprodan
Member

@mdibaiee I think it would be OK to have an optional field under analysis that specifies the percentage of primary pods required to be healthy; if this is set to zero, the whole primary status will be ignored.

mdibaiee added commits to personio/flagger that referenced this issue (Nov 10–11, 2021): see fluxcd#639
@mdibaiee
Contributor

mdibaiee commented Nov 23, 2021

It is now possible (since 1.16.0) to configure the threshold of ready pods necessary to consider the primary healthy and proceed with the canary. So in the case of a completely down primary, you can specify a primaryReadyThreshold of zero:

analysis:
  primaryReadyThreshold: 0

You can then revert it to 100, or your own value, after you are done deploying.

See the updated documentation on Canary Analysis.
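
Applied to the Canary spec from the original report, the new field sits alongside the other analysis settings, for example (a sketch; 0 disables the primary readiness check entirely, and the default is 100):

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: mynamespace
spec:
  provider: contour
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 180
  service:
    port: 3000
    portDiscovery: true
  analysis:
    interval: 10s
    threshold: 5
    iterations: 10
    # ignore primary readiness so a fix can roll out while the primary is down
    primaryReadyThreshold: 0
    metrics: []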
