
All externalMetricNames with the same name break if the data source for one of them is unavailable #2599

Closed
IvanDechovsky opened this issue Feb 7, 2022 · 13 comments
Labels
bug Something isn't working

Comments

@IvanDechovsky

Report

If you configure multiple ScaledObjects with the same metricName like so:

  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests

but use a different serverAddress for each, and one of those data sources becomes unavailable, all of the ScaledObjects with that same externalMetricName break with the following error:

  Warning  FailedComputeMetricsReplicas  22m (x2 over 22m)  horizontal-pod-autoscaler  invalid metrics (2 invalid out of 2), first error is: failed to get s0-prometheus-nginx_ingress_controller_requests external metric: unable to get external metric default/s0-prometheus-nginx_ingress_controller_requests/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: real,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get s0-prometheus-nginx_ingress_controller_requests.external.metrics.k8s.io)
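
To confirm which generated metric actually fails, the external metrics API can also be queried directly. A minimal check, reusing the namespace, metric name, and label selector from the error above:

# Ask the metrics adapter for the metric served to the "real" ScaledObject
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/s0-prometheus-nginx_ingress_controller_requests?labelSelector=scaledobject.keda.sh%2Fname%3Dreal"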

Expected Behavior

Only the ScaledObject with the unavailable serverAddress should report the metrics-unavailable error.

Actual Behavior

All ScaledObjects with the same externalMetricName report the same metrics-unavailable error, regardless of the availability of their own serverAddress.

Steps to Reproduce the Problem

  1. Create two ScaledObjects like so:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: real
  namespace: default
spec:
  maxReplicaCount: 5
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    name: test
  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests
      query: |
        sum(
            rate(
                nginx_ingress_controller_requests{
                    exported_namespace="default",
                    exported_service="test",
                }[2m]
            )
        )
      serverAddress: http://kube-prometheus-stack-thanos-query-frontend.kube-system:9090/
      threshold: "7"
    type: prometheus
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fake
  namespace: default
spec:
  maxReplicaCount: 5
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    name: test
  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests
      query: |
        sum(
            rate(
                nginx_ingress_controller_requests{
                    exported_namespace="default",
                    exported_service="test",
                }[2m]
            )
        )
      serverAddress: http://foo.bar:9090/
      threshold: "7"
    type: prometheus
  2. Describe the HPAs created for both ScaledObjects:
kubectl describe hpa keda-hpa-real
kubectl describe hpa keda-hpa-fake
  3. You should see the same error message in both HPAs:
  Warning  FailedComputeMetricsReplicas  44m (x2 over 44m)  horizontal-pod-autoscaler  invalid metrics (2 invalid out of 2), first error is: failed to get s0-prometheus-nginx_ingress_controller_requests external metric: unable to get external metric default/s0-prometheus-nginx_ingress_controller_requests/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: real,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get s0-prometheus-nginx_ingress_controller_requests.external.metrics.k8s.io)
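
You can also list what the metrics adapter has registered, to check whether both ScaledObjects end up behind the same external metric name (jq is only used here for readability and is assumed to be installed):

kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 | jq .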

Logs from KEDA operator

keda-operator-metrics-apiserver-7fc4c7974d-vs655 keda-operator-metrics-apiserver E0207 11:06:49.308417       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"No matching metrics found for s1-prometheus-nginx_ingress_controller_requests"}: No matching metrics found for s1-prometheus-nginx_ingress_controller_requests
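
(Logs gathered with something like the following, assuming KEDA is installed in the keda namespace:)

kubectl logs -n keda deployment/keda-operator-metrics-apiserver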

KEDA Version

2.6.0

Kubernetes Version

1.21

Platform

Amazon Web Services

Scaler Details

No response

Anything else?

No response

@IvanDechovsky IvanDechovsky added the bug Something isn't working label Feb 7, 2022
@zroubalik
Member

@IvanDechovsky thanks for reporting this issue. We have identified some problems with metric names and indexes in 2.6.0, even though this seems like a different issue. Are you able to try to reproduce the problem with the images mentioned here: #2592 (comment)?

Thanks! Feel free to ping me on Slack for details.

@IvanDechovsky
Author

Hi, thanks for the quick response. Just switching to the images does not fix the issue for existing metric names, but it does fix it if they are re-created or simply renamed (to the same name). I guess that triggers them to be re-indexed and the problem goes away.

Do you know when we can expect 2.6.1?

Do you know if there is a way to fix the problem for existing metrics without asking our dev teams to rename their metricNames or recreate their ScaledObjects?

@JorTurFer
Member

JorTurFer commented Feb 7, 2022

Restarting the pod should be enough, but you could also try deleting the HPA manually. The KEDA operator will recreate it based on the ScaledObject.
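
Roughly, assuming the default keda-hpa-<scaledobject-name> naming and a KEDA install in the keda namespace:

# Delete the HPA; the KEDA operator recreates it from the ScaledObject
kubectl delete hpa keda-hpa-real -n default

# Or restart the KEDA pods
kubectl rollout restart deployment/keda-operator -n keda
kubectl rollout restart deployment/keda-operator-metrics-apiserver -n keda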

@IvanDechovsky
Author

I may have been too quick to call it resolved. Upon further testing, I can still reproduce the problem with the indexFix images. In the current version, a restart of KEDA also "occasionally" removes the issue for a certain period of time.
The externalMetricNames with 2.6.0 and with indexFix stay the same, so the problem with non-unique metric names still persists.

@zroubalik
Member

@IvanDechovsky thanks for the confirmation, I will take a look at it.

@zroubalik zroubalik added this to the v2.6.1 milestone Feb 7, 2022
@zroubalik
Member

@IvanDechovsky I haven't been able to reproduce the problem 🤔 I have 2 SOs with the same metricName, took down one service, the respective SO stopped working, and the other one kept working correctly 🤷‍♂️

@JorTurFer
Member

Hey @IvanDechovsky
I have just noticed that you are creating 2 SOs for the same deployment. This creates race conditions because, in the same way as with plain HPAs, only 1 SO can scale a given workload (without any extra HPA other than the one created by KEDA).
Could this be your problem?
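
(For reference, if you really need both queries to drive the same workload, the supported pattern is a single ScaledObject with two triggers. A rough sketch based on your original manifests, with a distinct metricName on the second trigger purely for readability:)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: real
  namespace: default
spec:
  maxReplicaCount: 5
  minReplicaCount: 2
  scaleTargetRef:
    name: test
  triggers:
  - type: prometheus
    metadata:
      metricName: nginx_ingress_controller_requests
      serverAddress: http://kube-prometheus-stack-thanos-query-frontend.kube-system:9090/
      query: |
        sum(rate(nginx_ingress_controller_requests{exported_namespace="default", exported_service="test"}[2m]))
      threshold: "7"
  - type: prometheus
    metadata:
      metricName: nginx_ingress_controller_requests_fallback
      serverAddress: http://foo.bar:9090/
      query: |
        sum(rate(nginx_ingress_controller_requests{exported_namespace="default", exported_service="test"}[2m]))
      threshold: "7"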

@IvanDechovsky
Author

IvanDechovsky commented Feb 7, 2022

Hey @zroubalik ,
I can consistently reproduce it on 2.6.0 and the indexFix images. However, the key is to restart the keda-operator + keda-operator-metrics-apiserver pods. Just as a restart of those pods can temporarily fix the issue, it can also break working ScaledObjects.

@JorTurFer I just tested with different deployments and the result is the same.

keda-hpa-fake   Deployment/thanos-query            <unknown>/7 (avg), <unknown>/1 (avg)   2         50        2          82s
keda-hpa-real   Deployment/thanos-query-frontend   0/7 (avg), 500m/1 (avg)                2         50        2          82s

On initial inspection, it correctly detects the "real" HPA with the working data source. However, if you describe it, you see the same error. Again, on the initial apply of the SOs things work fine. Once the KEDA pods are restarted, it fails.

  Warning  FailedComputeMetricsReplicas  45s   horizontal-pod-autoscaler  invalid metrics (2 invalid out of 2), first error is: failed to get s0-prometheus-nginx_ingress_controller_requests external metric: unable to get external metric thanos/s0-prometheus-nginx_ingress_controller_requests/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: real,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get s0-prometheus-nginx_ingress_controller_requests.external.metrics.k8s.io)

@JorTurFer
Member

JorTurFer commented Feb 7, 2022

Could you share your ScaledObjects literally as they are (only removing any secrets)?
I'm trying to reproduce the problem but I can't. This is my configuration:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaledobject
  namespace: issue-2599
spec:
  cooldownPeriod: 10
  maxReplicaCount: 5
  minReplicaCount: 0
  pollingInterval: 5
  scaleTargetRef:
    name: keda-test-app
  triggers:
    - metadata:
        metricName: http_requests_total
        query: >-
          sum(avg by (mode) (rate(node_cpu_seconds_total{job="node-exporter",
          mode=~"idle|iowait|steal"}[2m]))) * 10
        serverAddress: http://prometheus-operated.issue-2599.svc:9090
        threshold: '3'
      type: prometheus
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaledobject-2
  namespace: issue-2599
spec:
  cooldownPeriod: 10
  maxReplicaCount: 5
  minReplicaCount: 0
  pollingInterval: 5
  scaleTargetRef:
    name: keda-test-app-2
  triggers:
    - metadata:
        metricName: http_requests_total
        query: >-
          sum(avg by (mode) (rate(node_cpu_seconds_total{job="node-exporter",
          mode=~"idle|iowait|steal"}[2m]))) * 10
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        threshold: '3'
      type: prometheus

@IvanDechovsky
Author

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: real
  namespace: thanos
spec:
  maxReplicaCount: 50
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    name: thanos-query-frontend
  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests
      query: |
        sum(
            rate(
                nginx_ingress_controller_requests{
                    exported_namespace="thanos",
                    exported_service="thanos-query-frontend",
                }[2m]
            )
        )
      serverAddress: http://kube-prometheus-stack-thanos-query-frontend.kube-system:9090/
      threshold: "7"
    type: prometheus
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fake
  namespace: thanos
spec:
  maxReplicaCount: 50
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    name: thanos-query
  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests
      query: |
        sum(
            rate(
                nginx_ingress_controller_requests{
                    exported_namespace="fake",
                    exported_service="fake",
                }[2m]
            )
        )
      serverAddress: http://fake-foo-bar-thanos.kube-system:9090/
      threshold: "7"
    type: prometheus

@tomkerkhove tomkerkhove moved this to Backlog in Roadmap - KEDA Core Feb 10, 2022
@zroubalik zroubalik modified the milestones: v2.6.1, v2.7.0 Feb 10, 2022
@tomkerkhove tomkerkhove moved this from To Do to Proposed in Roadmap - KEDA Core Feb 14, 2022
@stale

stale bot commented Apr 11, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Apr 11, 2022
@zroubalik zroubalik removed the stale All issues that are marked as stale due to inactivity label Apr 11, 2022
@tomkerkhove
Member

Are you able to test our current main version to see if this is still an issue @IvanDechovsky?

@JorTurFer JorTurFer moved this from Proposed to Pending End-User Feedback in Roadmap - KEDA Core May 4, 2022
@tomkerkhove tomkerkhove removed this from the v2.7.0 milestone May 5, 2022
@IvanDechovsky
Author

IvanDechovsky commented May 17, 2022

Sorry for the delay, but I'm happy to report the issue has been resolved in 2.7.0! Thank you for the support.

Repository owner moved this from Pending End-User Feedback to Ready To Ship in Roadmap - KEDA Core May 17, 2022
@tomkerkhove tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core Aug 10, 2022