Issue when updating the number of replicas, no effect on the number of primary pods #574

Closed
fabianpiau opened this issue Apr 30, 2020 · 13 comments · Fixed by #1106
Labels
kind/bug Something isn't working

Comments

@fabianpiau

Hi,

I noticed that when I change the number of replicas to scale up the app, it does not have the desired effect, as if Flagger were interfering and hijacking the new setting for the canary pods only. I did not try scaling down, but I assume the behavior is the same.

PS: I do not use an HPA (I plan to try one out at a later stage).

Scenario to reproduce:

Let's say I have replicaCount set to 2 and I did not change it. When I deployed v1.1 of my app, there were 2 canary pods and 2 primary pods during the analysis. At the end of the promotion, there were 0 canary pods and 2 primary pods. That's fine and was expected.

Now, I increase the number of replicas to 4 and canary deploy a new version v1.2. There were 4 canary pods and 2 primary pods during the analysis. I guess that's fine since I don't use an HPA. But at the end of the promotion, I still have 0 canary pods and only 2 primary pods.

To go further, I then deploy v1.3, keeping the number of replicas at 4. There was 1 canary pod (not sure why not 2 here?) and 2 primary pods during the analysis. At the end of the promotion, I have 0 canary pods and 2 primary pods. So the value of 4 is completely ignored, and the behavior is quite different; I can't explain why there was 1 canary pod instead of 2.

As a last test, I disabled Flagger and tried the same scenario again (i.e. replicas set from 2 to 4), and it worked: the new setting was taken into account, ending up with 4 pods of my app.

I was able to reproduce the exact same scenario on my local Kubernetes cluster as well as on an AWS sandbox cluster.

Flagger version used (the latest one): 1.0.0 RC4

Can you help?

@fabianpiau
Author

I am posting a second message with some extra information on this.

I tried to use the canary release with an HPA, but it does not change the behavior; the scenario I described above is still reproducible.

The only impact of the HPA was that it spins up fewer canary pods.

It's strange that this has never been raised before, unless I have a misconfiguration somewhere; apart from the scaling issue, the canary deployment works well.

@stefanprodan
Member

Have you added the HPA reference to the canary spec?

@fabianpiau
Author

fabianpiau commented Apr 30, 2020

Yes, I did (FYI, avd is the app name).

autoscaling.yaml

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: "avd"
  name: "avd"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: "avd"
  minReplicas: 1
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 50

canary.yaml

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: avd
  namespace: poc-flagger
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: avd
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 600
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: avd
  service:
    # service port number
    port: 8080
    # container port number or name (optional)
    targetPort: 8080
    # Istio gateways (optional)
    gateways:
      - istio-system/wildcard-istio-gateway
    # Istio virtual service host names (optional)
    hosts:
      - avd.poc-flagger.svc.cluster.local
      - avd.istio-gateway.backend.k8s.us-west-2.hcom-sandbox-aws.aws.hcom
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
      - name: request-success-rate
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        thresholdRange:
          min: 99
        interval: 30s
      - name: request-duration
        # maximum req duration P99
        # milliseconds
        thresholdRange:
          max: 500
        interval: 30s

@fabianpiau
Author

Actually, on second thought, I think that when an HPA is enabled, the replicaCount is not taken into account anymore. That may explain why my number of replicas does not scale up to 4 (Kubernetes is not using it anymore and just looks at the actual CPU usage).

But it does not explain why it does not work when I don't use an HPA 🤔

@stefanprodan
Member

You should remove the replicaCount from your deployment when using an HPA. As for the non-HPA setup, it does look like a bug indeed.
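
For example (a sketch, not your actual manifest; the image tag is a placeholder, only the app label and port are taken from the configs above), the Deployment would simply omit the replicas field and let the HPA own the count:

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: avd
  namespace: poc-flagger
  labels:
    app: avd
spec:
  # no replicas field here: the HPA referenced by the Canary manages the count
  selector:
    matchLabels:
      app: avd
  template:
    metadata:
      labels:
        app: avd
    spec:
      containers:
        - name: avd
          image: avd:1.2 # placeholder image tag
          ports:
            - containerPort: 8080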

@stefanprodan
Member

For now, if you don't want to use autoscaling, you can use a dummy HPA with minReplicas = maxReplicas.
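
For example, based on the avd HPA above (a minimal sketch; 4 here is just the fixed replica count you want):

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: "avd"
  name: "avd"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: "avd"
  # min == max pins the replica count and effectively disables autoscaling
  minReplicas: 4
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        # has no effect on the replica count since min == max
        targetAverageUtilization: 50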

@fabianpiau
Author

I tried to scale up and down using the same value for minReplicas and maxReplicas, and this workaround worked 👍

I will leave the issue open so you can investigate why it does not work without a "dummy HPA".

Thanks for your support!

@stefanprodan added the kind/bug label on Apr 30, 2020
@ridhoq

ridhoq commented Nov 4, 2020

Hello there, we just ran into this issue as well on v1.0.0. Is there any progress on this?

@oavdonin

Same for me; it seems that Flagger doesn't track changes to spec.replicas when an HPA is not used.

@Alpacius

Alpacius commented Dec 28, 2020

> I tried to scale up and down using the same value for minReplicas and maxReplicas, and this workaround worked 👍
>
> I will leave the issue open so you can investigate why it does not work without a "dummy HPA".
>
> Thanks for your support!

This may be caused by how Flagger detects changes in the spec:

// HasTargetChanged returns true if the canary deployment pod spec has changed
func (c *DeploymentController) HasTargetChanged(cd *flaggerv1.Canary) (bool, error) {
	targetName := cd.Spec.TargetRef.Name
	canary, err := c.kubeClient.AppsV1().Deployments(cd.Namespace).Get(context.TODO(), targetName, metav1.GetOptions{})
	if err != nil {
		return false, fmt.Errorf("deployment %s.%s get query error: %w", targetName, cd.Namespace, err)
	}

	return hasSpecChanged(cd, canary.Spec.Template)
}

For Deployments, only changes to the pod template are recognized; a change to spec.replicas is simply ignored.
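
A hypothetical sketch of the kind of check that is missing, in the style of the snippet above and assuming the same imports and receiver (this is not the actual fix that landed in #1106; the -primary naming and the helper name are assumptions for illustration):

// hasReplicasChanged is a hypothetical helper: it reports whether the target
// deployment's replica count has drifted from the primary deployment's.
func (c *DeploymentController) hasReplicasChanged(cd *flaggerv1.Canary) (bool, error) {
	targetName := cd.Spec.TargetRef.Name
	primaryName := fmt.Sprintf("%s-primary", targetName)

	target, err := c.kubeClient.AppsV1().Deployments(cd.Namespace).Get(context.TODO(), targetName, metav1.GetOptions{})
	if err != nil {
		return false, fmt.Errorf("deployment %s.%s get query error: %w", targetName, cd.Namespace, err)
	}
	primary, err := c.kubeClient.AppsV1().Deployments(cd.Namespace).Get(context.TODO(), primaryName, metav1.GetOptions{})
	if err != nil {
		return false, fmt.Errorf("deployment %s.%s get query error: %w", primaryName, cd.Namespace, err)
	}

	// a nil replicas field defaults to 1 in Kubernetes
	replicasOrDefault := func(r *int32) int32 {
		if r == nil {
			return 1
		}
		return *r
	}

	return replicasOrDefault(target.Spec.Replicas) != replicasOrDefault(primary.Spec.Replicas), nil
}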

@segevmatuti1

Hey @stefanprodan,
any updates regarding this issue?
Thank you!

@eloo

eloo commented Feb 10, 2022

Just stumbled over this issue.
It seems that this problem is still present and the number of replicas cannot be adjusted when an HPA is not used.

@eloo

eloo commented Feb 14, 2022

Awesome that this was fixed so fast.

@somtochiama thanks!
