
Scaledobject fallback not working as expected when prometheus trigger is failing #4249

Closed
lwebbz opened this issue Feb 17, 2023 · 14 comments · Fixed by #4263
Labels: bug (Something isn't working)


lwebbz commented Feb 17, 2023

Report

When using Prometheus as a trigger for a ScaledObject, I've run into unexpected behaviour: the number of replicas of a deployment oscillates between minReplicaCount and fallback.replicas even though the trigger is consistently in the failing state. Furthermore, keda_scaler_metrics_value consistently returns 0 instead of fallback.replicas while the trigger is failing. This differs from, for example, the cron scaler, which falls back to the correct number of replicas specified by fallback.replicas when given an invalid timezone.

Here are the manifests I've been using to test this. Note that spec.triggers.metadata.query is invalid, since the PromQL rate function requires an argument and in this example I haven't provided one. This consistently produces a 400 Bad Request, causing the scaler to fail.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: keda-scaledobject-test
spec:
  scaleTargetRef:
    name: keda-fallback-deployment-test
  minReplicaCount: 1
  maxReplicaCount: 20
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
    - type: prometheus
      metadata:
        metricName: test-fallback-metric
        query: rate()
        serverAddress: http://thanos-prometheus-thanos-querier.telemetry:10902/metrics
        threshold: '1'
        namespace: 'telemetry'
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keda-fallback-deployment-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: keda-fallback-test
  template:
    metadata:
      labels:
        app: keda-fallback-test
    spec:
      containers:
      - name: keda-fallback-test
        image: nginx
        resources:
          requests:
            memory: 150Mi
            cpu: 150m
          limits:
            memory: 1000Mi
            cpu: 250m

Expected Behavior

The deployment scales to fallback.replicas, i.e. 5 pods, and then stays at 5 replicas.

Actual Behavior

  1. The deployment scales to 5 pods, i.e. fallback.replicas.
  2. Within the next 3 minutes the deployment scales down to 1 pod, i.e. minReplicaCount.
  3. The cycle repeats after ~30 seconds.
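A note on the arithmetic behind the expected behaviour (this is my understanding of the mechanism, not taken from the KEDA source, so treat it as an assumption): for an external metric with an AverageValue target, the HPA computes ceil(metricValue / target), clamped to the replica bounds. For fallback to hold the deployment at fallback.replicas, the metrics adapter therefore has to report fallback.replicas × threshold; a reported value of 0 resolves to minReplicaCount, which matches the scale-down seen above.

```python
import math

def hpa_desired_replicas(metric_value: float, target_avg: float, min_replicas: int = 1) -> int:
    # HPA math for an external metric with an AverageValue target:
    # desired = ceil(value / target), clamped to minReplicaCount.
    return max(min_replicas, math.ceil(metric_value / target_avg))

def fallback_metric_value(fallback_replicas: int, target_avg: float) -> float:
    # What the adapter would need to report for the HPA to land
    # exactly on fallback.replicas.
    return fallback_replicas * target_avg

# With the manifest above (fallback.replicas=5, threshold='1'):
assert hpa_desired_replicas(fallback_metric_value(5, 1.0), 1.0) == 5
# The observed behaviour: the adapter reports 0, so the HPA
# scales to minReplicaCount instead.
assert hpa_desired_replicas(0.0, 1.0) == 1
```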

Steps to Reproduce the Problem

  1. Deploy the ScaledObject and Deployment from the report section above, changing the Prometheus serverAddress to your own server address.
  2. Observe the number of pods in the deployment fluctuate.

Logs from KEDA operator

Deployment state 1:

➜  ~ k get deployments -n keda
NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
keda-fallback-deployment-test     5/5     5            5           4d1h
keda-operator                     1/1     1            1           302d
keda-operator-metrics-apiserver   1/1     1            1           302d

Deployment state 2:

➜  ~ k get deployments -n keda
NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
keda-fallback-deployment-test     1/1     1            1           4d1h
keda-operator                     1/1     1            1           302d
keda-operator-metrics-apiserver   1/1     1            1           302d

KEDAScalerFailed event, which is expected because I've used a bad query:

 ~ k get events -n keda
63s         Warning   KEDAScalerFailed          scaledobject/keda-scaledobject-test                       prometheus query api returned error. status: 400 response: {"status":"error","errorType":"bad_data","error":"1:1: parse error: expected 1 argument(s) in call to \"rate\", got 0"}

This metric returns 0 instead of fallback.replicas (5):

➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/keda/s0-prometheus-test-fallback-metric?labelSelector=scaledobject.keda.sh%2Fname%3Dkeda-scaledobject-test"

{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-prometheus-test-fallback-metric","metricLabels":null,"timestamp":"2023-02-17T11:25:58Z","value":"0"}]}

You can see that the status of the scaler is Failing:

➜  ~ k get scaledobjects.keda.sh -n keda keda-scaledobject-test -o yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"keda.sh/v1alpha1","kind":"ScaledObject","metadata":{"annotations":{},"name":"keda-scaledobject-test","namespace":"keda"},"spec":{"fallback":{"failureThreshold":3,"replicas":5},"maxReplicaCount":20,"minReplicaCount":1,"scaleTargetRef":{"name":"keda-fallback-deployment-test"},"triggers":[{"metadata":{"metricName":"test-fallback-metric","namespace":"telemetry","query":"rate()","serverAddress":"http://thanos-prometheus-thanos-querier.telemetry:10902/metrics","threshold":"1"},"type":"prometheus"}]}}
  creationTimestamp: "2023-02-13T10:22:35Z"
  finalizers:
  - finalizer.keda.sh
  generation: 73
  labels:
    scaledobject.keda.sh/name: keda-scaledobject-test
  name: keda-scaledobject-test
  namespace: keda
  resourceVersion: "463687104"
  uid: c3e59326-71e3-4c49-b1a9-06291cb50509
spec:
  fallback:
    failureThreshold: 3
    replicas: 5
  maxReplicaCount: 20
  minReplicaCount: 1
  scaleTargetRef:
    name: keda-fallback-deployment-test
  triggers:
  - metadata:
      metricName: test-fallback-metric
      namespace: telemetry
      query: rate()
      serverAddress: http://thanos-prometheus-thanos-querier.telemetry:10902/metrics
      threshold: "1"
    type: prometheus
status:
  conditions:
  - message: ScaledObject is defined correctly and is ready for scaling
    reason: ScaledObjectReady
    status: "True"
    type: Ready
  - message: Scaling is not performed because triggers are not active
    reason: ScalerNotActive
    status: "False"
    type: Active
  - message: At least one trigger is falling back on this scaled object
    reason: FallbackExists
    status: "True"
    type: Fallback
  externalMetricNames:
  - s0-prometheus-test-fallback-metric
  health:
    s0-prometheus-test-fallback-metric:
      numberOfFailures: 33
      status: Failing
  hpaName: keda-hpa-keda-scaledobject-test
  lastActiveTime: "2023-02-17T11:00:54Z"
  originalReplicaCount: 10
  scaleTargetGVKR:
    group: apps
    kind: Deployment
    resource: deployments
    version: v1
  scaleTargetKind: apps/v1.Deployment

The HPA's desired replica count is 1 instead of 5:

~ k describe hpa -n keda keda-hpa-keda-scaledobject-test
Name:                                                           keda-hpa-keda-scaledobject-test
Namespace:                                                      keda
Labels:                                                         app.kubernetes.io/managed-by=keda-operator
                                                                app.kubernetes.io/name=keda-hpa-keda-scaledobject-test
                                                                app.kubernetes.io/part-of=keda-scaledobject-test
                                                                app.kubernetes.io/version=2.8.2
                                                                scaledobject.keda.sh/name=keda-scaledobject-test
Annotations:                                                    <none>
CreationTimestamp:                                              Mon, 13 Feb 2023 10:22:35 +0000
Reference:                                                      Deployment/keda-fallback-deployment-test
Metrics:                                                        ( current / target )
  "s0-prometheus-test-fallback-metric" (target average value):  0 / 1
Min replicas:                                                   1
Max replicas:                                                   20
Deployment pods:                                                5 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    SucceededRescale  the HPA controller was able to update the target scale to 1
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s0-prometheus-test-fallback-metric(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: keda-scaledobject-test,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type     Reason                   Age                   From                       Message
  ----     ------                   ----                  ----                       -------
  Normal   SuccessfulRescale        105s (x1889 over 4d)  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

KEDA Version

2.8.2

Kubernetes Version

< 1.23

Platform

Amazon Web Services

Scaler Details

Prometheus

Anything else?

No response

lwebbz added the bug (Something isn't working) label on Feb 17, 2023
JorTurFer (Member) commented:

Hello,
I think the problem is related to ignoring null values:

[screenshot: documentation for the Prometheus scaler's `ignoreNullValues` parameter]

With a successful connection (even if the query itself is not successful) and invalid values, this parameter can change the returned value to 0 without any error. I think that's why you see 0 instead of the fallback value.

Could you try again setting ignoreNullValues: "false" in trigger metadata and share the result?
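For reference, and assuming the parameter name matches the Prometheus scaler docs, the trigger from the manifest above with `ignoreNullValues` added would look like:

```yaml
triggers:
  - type: prometheus
    metadata:
      metricName: test-fallback-metric
      query: rate()
      serverAddress: http://thanos-prometheus-thanos-querier.telemetry:10902/metrics
      threshold: '1'
      namespace: 'telemetry'
      ignoreNullValues: "false"
```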


lwebbz commented Feb 17, 2023

Hi Jorge,
Unfortunately I'm seeing the same thing. You can see that ignoreNullValues is now set to false on the ScaledObject, but the number of pods still fluctuates between 1 and 5.

➜  ~ k describe scaledobjects.keda.sh -n keda keda-scaledobject-test
Name:         keda-scaledobject-test
Namespace:    keda
Labels:       scaledobject.keda.sh/name=keda-scaledobject-test
Annotations:  <none>
API Version:  keda.sh/v1alpha1
Kind:         ScaledObject
Metadata:
  Creation Timestamp:  2023-02-17T13:38:33Z
  Finalizers:
    finalizer.keda.sh
  Generation:  1
  Managed Fields:
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"finalizer.keda.sh":
        f:labels:
          .:
          f:scaledobject.keda.sh/name:
    Manager:      keda
    Operation:    Update
    Time:         2023-02-17T13:38:33Z
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:fallback:
          .:
          f:failureThreshold:
          f:replicas:
        f:maxReplicaCount:
        f:minReplicaCount:
        f:scaleTargetRef:
          .:
          f:name:
        f:triggers:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2023-02-17T13:38:33Z
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:externalMetricNames:
        f:hpaName:
        f:originalReplicaCount:
        f:scaleTargetGVKR:
          .:
          f:group:
          f:kind:
          f:resource:
          f:version:
        f:scaleTargetKind:
    Manager:      keda
    Operation:    Update
    Subresource:  status
    Time:         2023-02-17T13:39:33Z
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:health:
          .:
          f:s0-prometheus-test-fallback-metric:
            .:
            f:numberOfFailures:
            f:status:
    Manager:         keda-adapter
    Operation:       Update
    Subresource:     status
    Time:            2023-02-17T13:39:34Z
  Resource Version:  463945549
  UID:               1c9505d7-28a2-4449-a28c-5c92f9fda8c9
Spec:
  Fallback:
    Failure Threshold:  3
    Replicas:           5
  Max Replica Count:    20
  Min Replica Count:    1
  Scale Target Ref:
    Name:  keda-fallback-deployment-test
  Triggers:
    Metadata:
      Ignore Null Values:  false
      Metric Name:         test-fallback-metric
      Namespace:           telemetry
      Query:               rate()
      Server Address:      http://thanos-prometheus-thanos-querier.telemetry:10902/metrics
      Threshold:           1
    Type:                  prometheus
Status:
  Conditions:
    Message:  ScaledObject is defined correctly and is ready for scaling
    Reason:   ScaledObjectReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Message:  At least one trigger is falling back on this scaled object
    Reason:   FallbackExists
    Status:   True
    Type:     Fallback
  External Metric Names:
    s0-prometheus-test-fallback-metric
  Health:
    s0-prometheus-test-fallback-metric:
      Number Of Failures:  115
      Status:              Failing
  Hpa Name:                keda-hpa-keda-scaledobject-test
  Original Replica Count:  7
  Scale Target GVKR:
    Group:            apps
    Kind:             Deployment
    Resource:         deployments
    Version:          v1
  Scale Target Kind:  apps/v1.Deployment
Events:
  Type     Reason              Age                   From           Message
  ----     ------              ----                  ----           -------
  Normal   KEDAScalersStarted  28m                   keda-operator  Started scalers watch
  Normal   ScaledObjectReady   28m (x2 over 28m)     keda-operator  ScaledObject is ready for scaling
  Warning  KEDAScalerFailed    8m30s (x41 over 28m)  keda-operator  prometheus query api returned error. status: 400 response: {"status":"error","errorType":"bad_data","error":"1:1: parse error: expected 1 argument(s) in call to \"rate\", got 0"}
  Normal   KEDAScalersStarted  6m21s                 keda-operator  Started scalers watch
  Warning  KEDAScalerFailed    21s (x13 over 6m21s)  keda-operator  prometheus query api returned error. status: 400 response: {"status":"error","errorType":"bad_data","error":"1:1: parse error: expected 1 argument(s) in call to \"rate\", got 0"}

This is what the hpa looks like

➜  ~ kubectl describe hpa -n keda keda-hpa-keda-scaledobject-test
Name:                                                           keda-hpa-keda-scaledobject-test
Namespace:                                                      keda
Labels:                                                         app.kubernetes.io/managed-by=keda-operator
                                                                app.kubernetes.io/name=keda-hpa-keda-scaledobject-test
                                                                app.kubernetes.io/part-of=keda-scaledobject-test
                                                                app.kubernetes.io/version=2.8.2
                                                                scaledobject.keda.sh/name=keda-scaledobject-test
Annotations:                                                    <none>
CreationTimestamp:                                              Fri, 17 Feb 2023 13:38:33 +0000
Reference:                                                      Deployment/keda-fallback-deployment-test
Metrics:                                                        ( current / target )
  "s0-prometheus-test-fallback-metric" (target average value):  0 / 1
Min replicas:                                                   1
Max replicas:                                                   20
Deployment pods:                                                5 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    SucceededRescale  the HPA controller was able to update the target scale to 1
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s0-prometheus-test-fallback-metric(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: keda-scaledobject-test,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type     Reason                        Age                  From                       Message
  ----     ------                        ----                 ----                       -------
  Warning  FailedGetExternalMetric       32m (x3 over 33m)    horizontal-pod-autoscaler  unable to get external metric keda/s0-prometheus-test-fallback-metric/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: keda-scaledobject-test,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s0-prometheus-test-fallback-metric
  Warning  FailedComputeMetricsReplicas  32m (x3 over 33m)    horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get s0-prometheus-test-fallback-metric external metric: unable to get external metric keda/s0-prometheus-test-fallback-metric/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: keda-scaledobject-test,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s0-prometheus-test-fallback-metric
  Normal   SuccessfulRescale             3m9s (x51 over 28m)  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

And the metric being returned

➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/keda/s0-prometheus-test-fallback-metric?labelSelector=scaledobject.keda.sh%2Fname%3Dkeda-scaledobject-test"
{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-prometheus-test-fallback-metric","metricLabels":null,"timestamp":"2023-02-17T13:54:41Z","value":"0"}]}

JorTurFer (Member) commented:

Could you share operator logs?


lwebbz commented Feb 17, 2023

Sure, here are the operator logs:
operatorLogs.txt

JorTurFer (Member) commented:

This is weird. It looks like something is scaling your deployment in parallel. Do you have any other HPA or ScaledObject scaling the workload? Maybe one with CPU?


lwebbz commented Feb 17, 2023

Which line are you looking at? There are some other ScaledObjects that the operator is picking up, but the ScaledObject keda-scaledobject-test only has a single Prometheus trigger, and the only HPA associated with the deployment it references is the KEDA-generated keda-hpa-keda-scaledobject-test.

JorTurFer (Member) commented:

Yes, I thought so, but I see this line every 10 seconds:

2023-02-17T14:24:12Z	INFO	scaleexecutor	Successfully set ScaleTarget replicas count to ScaledObject fallback.replicas	{"scaledobject.Name": "keda-scaledobject-test", "scaledObject.Namespace": "keda", "scaleTarget.Name": "keda-fallback-deployment-test", "Original Replicas Count": 1, "New Replicas Count": 5}

Basically, every 10 seconds I can see the operator scaling out to the fallback, but I can't see any line saying that it scaled back in. I'll try to replicate the issue in my own environment.
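To make the tug-of-war concrete, here is a purely conceptual sketch of two control loops fighting over the same Deployment (`operator_tick`, `hpa_tick`, and the tick cadence are invented for illustration; this is not KEDA code): the operator forces the fallback replica count, while the HPA, fed a metric value of 0, keeps pulling the deployment back to minReplicaCount.

```python
import math

def operator_tick(fallback_replicas: int = 5) -> int:
    # KEDA operator: the trigger has failed past failureThreshold,
    # so it forces the scale target to fallback.replicas.
    return fallback_replicas

def hpa_tick(metric_value: float = 0.0, target: float = 1.0, min_replicas: int = 1) -> int:
    # HPA: sees a metric value of 0, so it rescales to minReplicaCount.
    return max(min_replicas, math.ceil(metric_value / target))

history = []
for _ in range(3):
    history.append(operator_tick())  # operator scales out to 5
    history.append(hpa_tick())       # HPA drags it back down to 1

assert history == [5, 1, 5, 1, 5, 1]  # the observed oscillation
```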


lwebbz commented Feb 17, 2023

Ah I get you, thanks

JorTurFer self-assigned this on Feb 20, 2023
JorTurFer (Member) commented:

I have been able to reproduce it. Thanks for reporting the issue!


lwebbz commented Feb 21, 2023

Amazing! Any idea what's causing the issue?

JorTurFer (Member) commented:

We have an idea, but it's complex; we are debugging the code to find the root cause.

zroubalik (Member) commented:

We have found the problem; it should be fixed in the next release. Thanks for reporting!


lwebbz commented Feb 22, 2023

Thanks for solving the problem! Do you know when the next release will be shipped?

JorTurFer (Member) commented:

It'll be in ~2 weeks.

github-project-automation moved this from To Triage to Ready To Ship in Roadmap - KEDA Core on Feb 22, 2023