Flagger will not progress if 'primary' is down #639
I'm having the same issue with Flagger on Linkerd.
As a workaround, I think it's possible to deploy with
@ewarwick I don't see why you would want to run an analysis if the primary is down; my guess is that you want to deliver a fix as soon as possible, and that's achievable with
@stefanprodan We're using Flagger in concert with Flux monitoring an image repo, so the goal is to be able to simply push the updated image and have it roll out as expected, without needing to first alter the canary definition and then come back later to un-alter it. Our checks are currently brief and are intended to ensure that the thing we are rolling out is functioning correctly. At least with our organization's change management processes, it would likely be faster for the canary to do its thing than to do the change, push, un-change dance. It may be that we are misunderstanding something about the intended use case, but my team is confused that rolling out a new version of the application is blocked on the currently running version being 'healthy', especially for the blue/green option. Is there a reason it behaves this way?
Yes, the primary check is needed so that Flagger can take over an existing deployment without downtime. I'm considering adding a global flag to Flagger to disable the primary check.
That would be really helpful, regardless of whether it's global or on the individual Canary resource. Thanks for the explanation!
@stefanprodan Could you please elaborate on why the primary has to be healthy? And do you mean all pods should be healthy, or at least one? I was thinking that if the primary is not healthy, Flagger could skip the analysis automatically. As @ewarwick said, the CD tool is unaware of the primary's state.
Hello, instead of a global flag to disable the primary check, I would suggest allowing configuration of the primary check: i.e. what minimum number of pods must be healthy before Flagger proceeds to deploy the canary? In our case, our primary has 150 pods, but having 100 pods is considered healthy and we are happy to proceed with a deployment in that state. This could be a percentage or a static number. cc @stefanprodan, we would be happy to contribute this to Flagger if we get the green light that this is something you would review and want to support. Thank you!
@mdibaiee I think it would be OK to have an optional field under analysis that specifies the percentage of primary pods required to be healthy; if this is set to zero, the whole primary status will be ignored.
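A sketch of what such an optional field on the Canary analysis spec might look like. This is illustrative only: the field name `primaryReadyThreshold` and its placement are assumptions about a proposed API, not a committed one.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: mynamespace
spec:
  analysis:
    interval: 1m
    threshold: 5
    # Hypothetical field: percentage of primary pods that must be ready
    # before Flagger proceeds with the canary. A value of 0 would skip
    # the primary readiness check entirely.
    primaryReadyThreshold: 66
```

In the 150-pod example above, a percentage of roughly 66 would let the deployment proceed with about 100 pods ready.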
see fluxcd#639 Signed-off-by: Mahdi Dibaiee <mahdi.dibaiee@personio.de>
Now it's possible (since 1.16.0) to configure the threshold of ready pods necessary to consider the primary healthy and proceed with the canary. So in the case of a completely down primary, you can lower the threshold, and then revert it back to 100 or your own value after you are done deploying. See the updated documentation on Canary Analysis.
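A minimal sketch of the configuration described above. The field name `primaryReadyThreshold` is taken from the Flagger Canary analysis documentation for recent releases; verify it against the docs for your version.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: mynamespace
spec:
  analysis:
    # Percentage of primary pods that must be ready before Flagger
    # proceeds; the default is 100. Setting it to 0 lets a rollout
    # proceed even when the primary is completely down.
    primaryReadyThreshold: 0
```

Once the fixed version is promoted and the primary is healthy again, restore the threshold to 100 (or your own value) so the downtime protection applies to future rollouts.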
I've got Flagger set up to do a blue/green deployment of our service, using the Contour mesh provider and ingress. Under normal circumstances it works exactly as expected: the canary comes up, runs my testing hook, promotes if it passes, and does not promote if it does not.
The issue happens when the existing 'primary' pods become unhealthy. Flagger hangs and refuses to progress the deployment, repeating, e.g.:
```
{"level":"info","ts":"2020-06-30T14:19:09.297Z","caller":"controller/events.go:28","msg":"myapp-primary.mynamespace not ready: waiting for rollout to finish: 0 of 3 updated replicas are available","canary":"myapp.mynamespace"}
```
forever, regardless of the progressDeadlineSeconds setting.
This makes it impossible to roll out a fix to a broken application, because Flagger waits for the 'primary' pods to become healthy before progressing. I think we could get around this by setting the field to skip canary analysis (#380), but then we would not be able to make use of the analysis features.
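For reference, the skip-analysis workaround mentioned above uses the `skipAnalysis` field on the Canary spec (added via #380). A minimal sketch, with the trade-off noted in the comment:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: mynamespace
spec:
  # Skips the analysis phase and promotes the new version immediately,
  # which sidesteps the primary readiness wait described in this issue,
  # but at the cost of losing webhooks and metric checks.
  skipAnalysis: true
```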
Canary YAML: