
Flagger will not progress if 'primary' is down #639

Closed
ewarwick opened this issue Jun 30, 2020 · 9 comments

@ewarwick
I've got Flagger set up to do a blue/green deployment of our service, using the Contour mesh provider and ingress. Under normal circumstances it works exactly as expected: the canary comes up, runs my testing hook, promotes if it passes, and does not promote if it fails.

The issue happens when the existing 'primary' pods become unhealthy. Flagger hangs and refuses to progress the deployment, repeating messages like:
{"level":"info","ts":"2020-06-30T14:19:09.297Z","caller":"controller/events.go:28","msg":"myapp-primary.mynamespace not ready: waiting for rollout to finish: 0 of 3 updated replicas are available","canary":"myapp.mynamespace"}

forever, regardless of the progressDeadlineSeconds setting.

This makes it impossible to roll out a fix to a broken application, because Flagger waits until the 'primary' pods are healthy before progressing. I think we could get around this by setting the field to skip canary analysis ( #380 ), but then we would not be able to make use of the analysis features.

Canary YAML:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: mynamespace
spec:
  provider: contour
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 180
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: myapp
  service:
    port: 3000
    portDiscovery: true
  analysis:
    interval: 10s
    threshold: 5
    iterations: 10
    metrics: []
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://flagger-loadtester.fluxcd/
        timeout: 15s
        metadata:
          type: bash
          cmd: "curl -sd 'anon' http://myapp-canary.test:3000/healthcheck"
@Imaskar

Imaskar commented Jun 30, 2020

I'm having the same issue with Flagger on Linkerd.
Steps to reproduce:

  • install some app
  • delete its image from the registry you use, then delete the pod so it gets stuck in Pending
  • create a "fixed" image
  • try to deploy it

As a workaround, I think it's possible to deploy with skipAnalysis: true once and then continue with skipAnalysis: false. Are there any blockers to that?
For now I'm working around such problems by uninstalling, since the app is not yet in prod.
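
For reference, the workaround amounts to toggling a single field on the Canary spec, roughly like this (a sketch based on the spec above; skipAnalysis is the existing Flagger field, set back to false once the fix has been promoted):

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: mynamespace
spec:
  # bypass canary analysis for this one rollout, then revert to false
  skipAnalysis: true
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp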

@stefanprodan
Member

@ewarwick I don't see why you would want to run an analysis if the primary is down; my guess is that you want to deliver a fix as soon as possible, and that's achievable with skipAnalysis: true.

@ewarwick
Author

@stefanprodan We're using Flagger in concert with Flux monitoring an image repo, so the goal is to simply push the updated image and have it roll out as expected, without first having to alter the canary definition and then come back later to un-alter it. Our checks are currently relatively brief and are intended to ensure that the thing we are rolling out is functioning correctly. At least with our organization's change management processes, it would likely be faster to let the canary do its thing than to do the change, push, un-change dance.

It may be that we are misunderstanding something about the intended use case, but my team is confused that rolling out a new version of the application is blocked on the currently running version being 'healthy', especially for the blue/green option. Is there a reason that it behaves this way?

@stefanprodan
Member

Is there a reason that it behaves this way?

Yes, the primary check is needed for Flagger to take over an existing deployment without downtime. I'm considering adding a global flag to Flagger to disable the primary check.

@ewarwick
Author

That would be really helpful, regardless of whether it's global or on the individual Canary resource. Thanks for the explanation!

@Imaskar

Imaskar commented Jun 30, 2020

@stefanprodan could you please elaborate on why the primary has to be healthy? And do you mean that all pods should be healthy, or at least one?

I was thinking that if the primary is not healthy, Flagger could skip the analysis automatically. As @ewarwick said, the CD tool is unaware of the primary's state.

@mdibaiee
Contributor

mdibaiee commented Nov 10, 2021

Hello,
We are facing a similar issue. In our case the primary is not completely unhealthy, but we run deployments with a very high number of pods on Spot instances, so they are volatile (as things can be expected to be in a Kubernetes environment). As a result, we rarely have 100% of the desired primary replicas: pods get killed and re-created continuously. We consider that healthy, but Flagger does not, so it takes much longer to roll out a canary; Flagger waits until a brief stable state is reached before it starts, and during the rollout the primary stays volatile and pods keep getting killed and re-created, which slows the rollout down further.

I would suggest, instead of a global flag to "disable" the primary check, making the primary check configurable: i.e. what is the minimum number of pods that must be considered "healthy" before Flagger proceeds to deploy the canary? In our case the primary has 150 pods, but 100 ready pods is considered healthy and we are happy to proceed with a deployment in that state. This could be a percentage or a static number.

cc @stefanprodan we would be happy to contribute this to Flagger if we get the green light that this is something you would review and want to support.

Thank you!
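
To put numbers on the proposal above: 100 ready pods out of 150 desired is roughly a 67% readiness floor. Expressed as a percentage under analysis, the setting might look something like this (a sketch; the field name anticipates the one that was eventually merged, see the last comment below):

analysis:
  # proceed with the canary once ~67% of primary pods (100 of 150) are ready
  primaryReadyThreshold: 67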

@stefanprodan
Member

@mdibaiee I think it would be OK to have an optional field under analysis that specifies the percentage of primary pods required to be healthy; if this is set to zero, the whole primary status will be ignored.

mdibaiee added commits to personio/flagger that referenced this issue (Nov 10–11, 2021): see fluxcd#639
@mdibaiee
Contributor

mdibaiee commented Nov 23, 2021

It is now possible (since 1.16.0) to configure the threshold of ready pods necessary to consider the primary healthy and proceed with the canary. So in the case of a completely down primary, you can specify a primaryReadyThreshold of zero:

analysis:
  primaryReadyThreshold: 0

You can then revert it to 100, or your own value, after you are done deploying.

See the updated documentation on Canary Analysis.
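
Applied to the Canary spec from the original report, the new field sits alongside the other analysis settings, for example (a sketch; 0 disables the primary readiness check entirely, and the default is 100):

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: mynamespace
spec:
  provider: contour
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 180
  service:
    port: 3000
    portDiscovery: true
  analysis:
    interval: 10s
    threshold: 5
    iterations: 10
    # ignore primary readiness so a fix can roll out while the primary is down
    primaryReadyThreshold: 0
    metrics: []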
