
Scaledobject fallback not working as expected when prometheus trigger is failing #4249

Closed
lwebbz opened this issue Feb 17, 2023 · 14 comments · Fixed by #4263
Labels: bug (Something isn't working)


lwebbz commented Feb 17, 2023

Report

When using Prometheus as a trigger for a ScaledObject, I've run into unexpected behaviour: the number of replicas of a deployment oscillates between minReplicaCount and fallback.replicas even though the trigger is consistently in the failing state. Furthermore, keda_scaler_metrics_value consistently returns 0 instead of fallback.replicas while the trigger is failing. This differs from, for example, the cron scaler, which falls back to the correct number of replicas specified by fallback.replicas when given an invalid timezone.

Here are the manifests I've been using to test this. Note that spec.triggers.metadata.query is invalid, since the PromQL rate function requires an argument and in this example I haven't provided one. This consistently produces a 400 Bad Request, causing the scaler to fail.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: keda-scaledobject-test
spec:
  scaleTargetRef:
    name: keda-fallback-deployment-test
  minReplicaCount: 1
  maxReplicaCount: 20
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
    - type: prometheus
      metadata:
        metricName: test-fallback-metric
        query: rate()
        serverAddress: http://thanos-prometheus-thanos-querier.telemetry:10902/metrics
        threshold: '1'
        namespace: 'telemetry'
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keda-fallback-deployment-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: keda-fallback-test
  template:
    metadata:
      labels:
        app: keda-fallback-test
    spec:
      containers:
      - name: keda-fallback-test
        image: nginx
        resources:
          requests:
            memory: 150Mi
            cpu: 150m
          limits:
            memory: 1000Mi
            cpu: 250m

Expected Behavior

The deployment scales to fallback.replicas, i.e. 5 pods, and then stays at 5 replicas.

Actual Behavior

  1. The deployment scales to 5 pods, i.e. fallback.replicas.
  2. Within the next 3 minutes the deployment scales down to 1 pod, i.e. minReplicaCount.
  3. The cycle repeats after ~30 seconds.
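A note on the arithmetic behind the expected behaviour (this is my understanding of the mechanism, not taken from the KEDA source, so treat it as an assumption): for an external metric with an AverageValue target, the HPA computes ceil(metricValue / target), clamped to the replica bounds. For fallback to hold the deployment at fallback.replicas, the metrics adapter therefore has to report fallback.replicas × threshold; a reported value of 0 resolves to minReplicaCount, which matches the scale-down seen above.

```python
import math

def hpa_desired_replicas(metric_value: float, target_avg: float, min_replicas: int = 1) -> int:
    # HPA math for an external metric with an AverageValue target:
    # desired = ceil(value / target), clamped to minReplicaCount.
    return max(min_replicas, math.ceil(metric_value / target_avg))

def fallback_metric_value(fallback_replicas: int, target_avg: float) -> float:
    # What the adapter would need to report for the HPA to land
    # exactly on fallback.replicas.
    return fallback_replicas * target_avg

# With the manifest above (fallback.replicas=5, threshold='1'):
assert hpa_desired_replicas(fallback_metric_value(5, 1.0), 1.0) == 5
# The observed behaviour: the adapter reports 0, so the HPA
# scales to minReplicaCount instead.
assert hpa_desired_replicas(0.0, 1.0) == 1
```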

Steps to Reproduce the Problem

  1. Deploy the ScaledObject and Deployment from the report section above, changing the Prometheus serverAddress to your own server address.
  2. Observe the number of pods in the deployment fluctuate.

Logs from KEDA operator

Deployment state 1:

➜  ~ k get deployments -n keda
NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
keda-fallback-deployment-test     5/5     5            5           4d1h
keda-operator                     1/1     1            1           302d
keda-operator-metrics-apiserver   1/1     1            1           302d

Deployment state 2:

➜  ~ k get deployments -n keda
NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
keda-fallback-deployment-test     1/1     1            1           4d1h
keda-operator                     1/1     1            1           302d
keda-operator-metrics-apiserver   1/1     1            1           302d

KEDAScalerFailed event, which is expected because I've used a bad query:

 ~ k get events -n keda
63s         Warning   KEDAScalerFailed          scaledobject/keda-scaledobject-test                       prometheus query api returned error. status: 400 response: {"status":"error","errorType":"bad_data","error":"1:1: parse error: expected 1 argument(s) in call to \"rate\", got 0"}

This metric returns 0 instead of fallback.replicas (5):

➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/keda/s0-prometheus-test-fallback-metric?labelSelector=scaledobject.keda.sh%2Fname%3Dkeda-scaledobject-test"

{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-prometheus-test-fallback-metric","metricLabels":null,"timestamp":"2023-02-17T11:25:58Z","value":"0"}]}

You can see that the status of the scaler is Failing:

➜  ~ k get scaledobjects.keda.sh -n keda keda-scaledobject-test -o yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"keda.sh/v1alpha1","kind":"ScaledObject","metadata":{"annotations":{},"name":"keda-scaledobject-test","namespace":"keda"},"spec":{"fallback":{"failureThreshold":3,"replicas":5},"maxReplicaCount":20,"minReplicaCount":1,"scaleTargetRef":{"name":"keda-fallback-deployment-test"},"triggers":[{"metadata":{"metricName":"test-fallback-metric","namespace":"telemetry","query":"rate()","serverAddress":"http://thanos-prometheus-thanos-querier.telemetry:10902/metrics","threshold":"1"},"type":"prometheus"}]}}
  creationTimestamp: "2023-02-13T10:22:35Z"
  finalizers:
  - finalizer.keda.sh
  generation: 73
  labels:
    scaledobject.keda.sh/name: keda-scaledobject-test
  name: keda-scaledobject-test
  namespace: keda
  resourceVersion: "463687104"
  uid: c3e59326-71e3-4c49-b1a9-06291cb50509
spec:
  fallback:
    failureThreshold: 3
    replicas: 5
  maxReplicaCount: 20
  minReplicaCount: 1
  scaleTargetRef:
    name: keda-fallback-deployment-test
  triggers:
  - metadata:
      metricName: test-fallback-metric
      namespace: telemetry
      query: rate()
      serverAddress: http://thanos-prometheus-thanos-querier.telemetry:10902/metrics
      threshold: "1"
    type: prometheus
status:
  conditions:
  - message: ScaledObject is defined correctly and is ready for scaling
    reason: ScaledObjectReady
    status: "True"
    type: Ready
  - message: Scaling is not performed because triggers are not active
    reason: ScalerNotActive
    status: "False"
    type: Active
  - message: At least one trigger is falling back on this scaled object
    reason: FallbackExists
    status: "True"
    type: Fallback
  externalMetricNames:
  - s0-prometheus-test-fallback-metric
  health:
    s0-prometheus-test-fallback-metric:
      numberOfFailures: 33
      status: Failing
  hpaName: keda-hpa-keda-scaledobject-test
  lastActiveTime: "2023-02-17T11:00:54Z"
  originalReplicaCount: 10
  scaleTargetGVKR:
    group: apps
    kind: Deployment
    resource: deployments
    version: v1
  scaleTargetKind: apps/v1.Deployment

The HPA's desired replica count is 1 instead of 5:

~ k describe hpa -n keda keda-hpa-keda-scaledobject-test
Name:                                                           keda-hpa-keda-scaledobject-test
Namespace:                                                      keda
Labels:                                                         app.kubernetes.io/managed-by=keda-operator
                                                                app.kubernetes.io/name=keda-hpa-keda-scaledobject-test
                                                                app.kubernetes.io/part-of=keda-scaledobject-test
                                                                app.kubernetes.io/version=2.8.2
                                                                scaledobject.keda.sh/name=keda-scaledobject-test
Annotations:                                                    <none>
CreationTimestamp:                                              Mon, 13 Feb 2023 10:22:35 +0000
Reference:                                                      Deployment/keda-fallback-deployment-test
Metrics:                                                        ( current / target )
  "s0-prometheus-test-fallback-metric" (target average value):  0 / 1
Min replicas:                                                   1
Max replicas:                                                   20
Deployment pods:                                                5 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    SucceededRescale  the HPA controller was able to update the target scale to 1
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s0-prometheus-test-fallback-metric(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: keda-scaledobject-test,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type     Reason                   Age                   From                       Message
  ----     ------                   ----                  ----                       -------
  Normal   SuccessfulRescale        105s (x1889 over 4d)  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

KEDA Version

2.8.2

Kubernetes Version

< 1.23

Platform

Amazon Web Services

Scaler Details

Prometheus

Anything else?

No response

lwebbz added the bug (Something isn't working) label on Feb 17, 2023
JorTurFer (Member) commented:

Hello,
I think the problem is related to ignoring null values:

[screenshot: documentation for the Prometheus scaler's `ignoreNullValues` parameter]

With a successful connection (even if the query itself is not successful) and invalid values, this parameter can change the returned value to 0 without any error. I think that's why you see 0 instead of the fallback value.

Could you try again setting ignoreNullValues: "false" in trigger metadata and share the result?
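For reference, and assuming the parameter name matches the Prometheus scaler docs, the trigger from the manifest above with `ignoreNullValues` added would look like:

```yaml
triggers:
  - type: prometheus
    metadata:
      metricName: test-fallback-metric
      query: rate()
      serverAddress: http://thanos-prometheus-thanos-querier.telemetry:10902/metrics
      threshold: '1'
      namespace: 'telemetry'
      ignoreNullValues: "false"
```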


lwebbz commented Feb 17, 2023

Hi Jorge,
Unfortunately I'm seeing the same thing. You can see that ignoreNullValues is now set to false on the ScaledObject, but the number of pods still fluctuates between 1 and 5.

➜  ~ k describe scaledobjects.keda.sh -n keda keda-scaledobject-test
Name:         keda-scaledobject-test
Namespace:    keda
Labels:       scaledobject.keda.sh/name=keda-scaledobject-test
Annotations:  <none>
API Version:  keda.sh/v1alpha1
Kind:         ScaledObject
Metadata:
  Creation Timestamp:  2023-02-17T13:38:33Z
  Finalizers:
    finalizer.keda.sh
  Generation:  1
  Managed Fields:
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"finalizer.keda.sh":
        f:labels:
          .:
          f:scaledobject.keda.sh/name:
    Manager:      keda
    Operation:    Update
    Time:         2023-02-17T13:38:33Z
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:fallback:
          .:
          f:failureThreshold:
          f:replicas:
        f:maxReplicaCount:
        f:minReplicaCount:
        f:scaleTargetRef:
          .:
          f:name:
        f:triggers:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2023-02-17T13:38:33Z
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:externalMetricNames:
        f:hpaName:
        f:originalReplicaCount:
        f:scaleTargetGVKR:
          .:
          f:group:
          f:kind:
          f:resource:
          f:version:
        f:scaleTargetKind:
    Manager:      keda
    Operation:    Update
    Subresource:  status
    Time:         2023-02-17T13:39:33Z
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:health:
          .:
          f:s0-prometheus-test-fallback-metric:
            .:
            f:numberOfFailures:
            f:status:
    Manager:         keda-adapter
    Operation:       Update
    Subresource:     status
    Time:            2023-02-17T13:39:34Z
  Resource Version:  463945549
  UID:               1c9505d7-28a2-4449-a28c-5c92f9fda8c9
Spec:
  Fallback:
    Failure Threshold:  3
    Replicas:           5
  Max Replica Count:    20
  Min Replica Count:    1
  Scale Target Ref:
    Name:  keda-fallback-deployment-test
  Triggers:
    Metadata:
      Ignore Null Values:  false
      Metric Name:         test-fallback-metric
      Namespace:           telemetry
      Query:               rate()
      Server Address:      http://thanos-prometheus-thanos-querier.telemetry:10902/metrics
      Threshold:           1
    Type:                  prometheus
Status:
  Conditions:
    Message:  ScaledObject is defined correctly and is ready for scaling
    Reason:   ScaledObjectReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Message:  At least one trigger is falling back on this scaled object
    Reason:   FallbackExists
    Status:   True
    Type:     Fallback
  External Metric Names:
    s0-prometheus-test-fallback-metric
  Health:
    s0-prometheus-test-fallback-metric:
      Number Of Failures:  115
      Status:              Failing
  Hpa Name:                keda-hpa-keda-scaledobject-test
  Original Replica Count:  7
  Scale Target GVKR:
    Group:            apps
    Kind:             Deployment
    Resource:         deployments
    Version:          v1
  Scale Target Kind:  apps/v1.Deployment
Events:
  Type     Reason              Age                   From           Message
  ----     ------              ----                  ----           -------
  Normal   KEDAScalersStarted  28m                   keda-operator  Started scalers watch
  Normal   ScaledObjectReady   28m (x2 over 28m)     keda-operator  ScaledObject is ready for scaling
  Warning  KEDAScalerFailed    8m30s (x41 over 28m)  keda-operator  prometheus query api returned error. status: 400 response: {"status":"error","errorType":"bad_data","error":"1:1: parse error: expected 1 argument(s) in call to \"rate\", got 0"}
  Normal   KEDAScalersStarted  6m21s                 keda-operator  Started scalers watch
  Warning  KEDAScalerFailed    21s (x13 over 6m21s)  keda-operator  prometheus query api returned error. status: 400 response: {"status":"error","errorType":"bad_data","error":"1:1: parse error: expected 1 argument(s) in call to \"rate\", got 0"}

This is what the hpa looks like

➜  ~ kubectl describe hpa -n keda keda-hpa-keda-scaledobject-test
Name:                                                           keda-hpa-keda-scaledobject-test
Namespace:                                                      keda
Labels:                                                         app.kubernetes.io/managed-by=keda-operator
                                                                app.kubernetes.io/name=keda-hpa-keda-scaledobject-test
                                                                app.kubernetes.io/part-of=keda-scaledobject-test
                                                                app.kubernetes.io/version=2.8.2
                                                                scaledobject.keda.sh/name=keda-scaledobject-test
Annotations:                                                    <none>
CreationTimestamp:                                              Fri, 17 Feb 2023 13:38:33 +0000
Reference:                                                      Deployment/keda-fallback-deployment-test
Metrics:                                                        ( current / target )
  "s0-prometheus-test-fallback-metric" (target average value):  0 / 1
Min replicas:                                                   1
Max replicas:                                                   20
Deployment pods:                                                5 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    SucceededRescale  the HPA controller was able to update the target scale to 1
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s0-prometheus-test-fallback-metric(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: keda-scaledobject-test,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type     Reason                        Age                  From                       Message
  ----     ------                        ----                 ----                       -------
  Warning  FailedGetExternalMetric       32m (x3 over 33m)    horizontal-pod-autoscaler  unable to get external metric keda/s0-prometheus-test-fallback-metric/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: keda-scaledobject-test,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s0-prometheus-test-fallback-metric
  Warning  FailedComputeMetricsReplicas  32m (x3 over 33m)    horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get s0-prometheus-test-fallback-metric external metric: unable to get external metric keda/s0-prometheus-test-fallback-metric/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: keda-scaledobject-test,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s0-prometheus-test-fallback-metric
  Normal   SuccessfulRescale             3m9s (x51 over 28m)  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

And the metric being returned

➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/keda/s0-prometheus-test-fallback-metric?labelSelector=scaledobject.keda.sh%2Fname%3Dkeda-scaledobject-test"
{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-prometheus-test-fallback-metric","metricLabels":null,"timestamp":"2023-02-17T13:54:41Z","value":"0"}]}

JorTurFer (Member) commented:

Could you share operator logs?


lwebbz commented Feb 17, 2023

Sure, here are the operator logs:
operatorLogs.txt

JorTurFer (Member) commented:

This is weird. It looks like something is scaling your deployment in parallel. Do you have any other HPA or ScaledObject scaling the workload? Maybe one with CPU?


lwebbz commented Feb 17, 2023

Which line are you looking at? There are some other ScaledObjects that the operator is picking up, but the ScaledObject keda-scaledobject-test only has a single Prometheus trigger, and the only HPA associated with the deployment it references is the KEDA-generated keda-hpa-keda-scaledobject-test.

JorTurFer (Member) commented:

Yes, I thought so, but I see this line every 10 seconds:

2023-02-17T14:24:12Z	INFO	scaleexecutor	Successfully set ScaleTarget replicas count to ScaledObject fallback.replicas	{"scaledobject.Name": "keda-scaledobject-test", "scaledObject.Namespace": "keda", "scaleTarget.Name": "keda-fallback-deployment-test", "Original Replicas Count": 1, "New Replicas Count": 5}

Basically, every 10 seconds I can see the operator scaling out to the fallback, but I can't see any line saying that it scaled back in. I'll try to replicate the issue in my own environment.
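To make the tug-of-war concrete, here is a purely conceptual sketch of two control loops fighting over the same Deployment (`operator_tick`, `hpa_tick`, and the tick cadence are invented for illustration; this is not KEDA code): the operator forces the fallback replica count, while the HPA, fed a metric value of 0, keeps pulling the deployment back to minReplicaCount.

```python
import math

def operator_tick(fallback_replicas: int = 5) -> int:
    # KEDA operator: the trigger has failed past failureThreshold,
    # so it forces the scale target to fallback.replicas.
    return fallback_replicas

def hpa_tick(metric_value: float = 0.0, target: float = 1.0, min_replicas: int = 1) -> int:
    # HPA: sees a metric value of 0, so it rescales to minReplicaCount.
    return max(min_replicas, math.ceil(metric_value / target))

history = []
for _ in range(3):
    history.append(operator_tick())  # operator scales out to 5
    history.append(hpa_tick())       # HPA drags it back down to 1

assert history == [5, 1, 5, 1, 5, 1]  # the observed oscillation
```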


lwebbz commented Feb 17, 2023

Ah I get you, thanks

JorTurFer self-assigned this on Feb 20, 2023
JorTurFer (Member) commented:

I have been able to reproduce it. Thanks for reporting the issue!


lwebbz commented Feb 21, 2023

Amazing! Any idea what's causing the issue?

JorTurFer (Member) commented:

We have an idea, but it's complex; we are debugging the code to find the root cause.

zroubalik (Member) commented:

We have found the problem; it should be fixed in the next release. Thanks for reporting!


lwebbz commented Feb 22, 2023

Thanks for solving the problem! Do you know when the next release will be shipped?

JorTurFer (Member) commented:

It'll be in ~2 weeks.

github-project-automation moved this from To Triage to Ready To Ship in Roadmap - KEDA Core on Feb 22, 2023