-
Notifications
You must be signed in to change notification settings - Fork 742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Infinite loop on canary deployment using Gateway API and EKS #1732
Comments
Can you please try Flagger 1.39, we fixed a drift detection problem for Gateway API |
Thanks @stefanprodan. I have now upgraded but unfortunately still seem to have the same issue:
|
Any updates on this please @stefanprodan ? It does appear to be a similar bug to the one that was fixed in 1.39 in terms of the behaviour. |
We can't replicate this behaviour, I think something else is changing the objects while the analysis runs. Can you post here the HTTPRoutes while this happens, run |
Here are the HTTPRoute object states at various points during the steps: Before starting canary:
Just after starting canary:
Point at which it gets reset back to 100% traffic routing:
The only other thing I could think of that would be changing it other than Flagger is the AWS Gateway API controller itself (https://www.gateway-api-controller.eks.aws.dev/latest/). Although as far as I can tell that is meant to just apply any changes to the HTTPRoute object to the equivalent routing configuration in VPC Lattice. Thanks |
Ok this is clear now, the AWS controller injects an annotation |
Describe the bug
I'm attempting to use Flagger on AWS EKS with the Gateway API making use of the AWS Gateway API Controller. I have followed the instructions in the tutorial at: https://docs.flagger.app/tutorials/gatewayapi-progressive-delivery but when triggering a canary deployment Flagger seems to get stuck in a loop of starting the canary deployment, changing the HTTRoute object weightings (in this case 10% to the canary, 90% to the primary) and then restarting the canary deployment, it never fails the canary after reaching the progress deadline timeout. It doesn't even appear to be getting to the rollout stage as the logs don't indicate the webhook ever running, however the pre-rollout check does run and succeed, but then runs again the next time round the loop. As an experiment I also disabled all metric checks as I don't think it is even getting as far as running them. Looking at the traffic weightings in AWS VPC Lattice I can see it alternating between the 90%/10% split and then briefly goes back up to 100%/0% before going back round the loop.
I have also tried setting skipAnalysis to true, which successfully promoted the canary, so the problem seems to be something to do with the analysis stage itself.
My canary configuration is as follows:
Flagger logs (in debug mode):
Any ideas on what might be going wrong?
To Reproduce
Expected behavior
Canary rollout progresses and succeeds
Additional context
The text was updated successfully, but these errors were encountered: