
All externalMetricNames with the same name break if the data source for one of them is unavailable #2599

Closed
IvanDechovsky opened this issue Feb 7, 2022 · 13 comments
Labels
bug Something isn't working

Comments

@IvanDechovsky

Report

If you configure multiple ScaledObjects with the same metricName like so:

  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests

but use a different serverAddress for each, and one of those data sources becomes unavailable, all of the ScaledObjects with that same externalMetricName break with the following error:

  Warning  FailedComputeMetricsReplicas  22m (x2 over 22m)  horizontal-pod-autoscaler  invalid metrics (2 invalid out of 2), first error is: failed to get s0-prometheus-nginx_ingress_controller_requests external metric: unable to get external metric default/s0-prometheus-nginx_ingress_controller_requests/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: real,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get s0-prometheus-nginx_ingress_controller_requests.external.metrics.k8s.io)
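
To confirm which generated metric actually fails, the external metrics API can also be queried directly. A minimal check, reusing the namespace, metric name, and label selector from the error above:

# Ask the metrics adapter for the metric served to the "real" ScaledObject
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/s0-prometheus-nginx_ingress_controller_requests?labelSelector=scaledobject.keda.sh%2Fname%3Dreal"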

Expected Behavior

Only the ScaledObject with the unavailable serverAddress should report the metrics-unavailable error.

Actual Behavior

All ScaledObjects with the same externalMetricName report the same metrics-unavailable error, regardless of the availability of their own serverAddress.

Steps to Reproduce the Problem

  1. Create two ScaledObjects like so:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: real
  namespace: default
spec:
  maxReplicaCount: 5
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    name: test
  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests
      query: |
        sum(
            rate(
                nginx_ingress_controller_requests{
                    exported_namespace="default",
                    exported_service="test",
                }[2m]
            )
        )
      serverAddress: http://kube-prometheus-stack-thanos-query-frontend.kube-system:9090/
      threshold: "7"
    type: prometheus
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fake
  namespace: default
spec:
  maxReplicaCount: 5
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    name: test
  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests
      query: |
        sum(
            rate(
                nginx_ingress_controller_requests{
                    exported_namespace="default",
                    exported_service="test",
                }[2m]
            )
        )
      serverAddress: http://foo.bar:9090/
      threshold: "7"
    type: prometheus
  2. Describe the HPAs created for both ScaledObjects:
kubectl describe hpa keda-hpa-real
kubectl describe hpa keda-hpa-fake
  3. You should see the same error message in both HPAs:
  Warning  FailedComputeMetricsReplicas  44m (x2 over 44m)  horizontal-pod-autoscaler  invalid metrics (2 invalid out of 2), first error is: failed to get s0-prometheus-nginx_ingress_controller_requests external metric: unable to get external metric default/s0-prometheus-nginx_ingress_controller_requests/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: real,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get s0-prometheus-nginx_ingress_controller_requests.external.metrics.k8s.io)
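
You can also list what the metrics adapter has registered, to check whether both ScaledObjects end up behind the same external metric name (jq is only used here for readability and is assumed to be installed):

kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 | jq .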

Logs from KEDA operator

keda-operator-metrics-apiserver-7fc4c7974d-vs655 keda-operator-metrics-apiserver E0207 11:06:49.308417       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"No matching metrics found for s1-prometheus-nginx_ingress_controller_requests"}: No matching metrics found for s1-prometheus-nginx_ingress_controller_requests
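
(Logs gathered with something like the following, assuming KEDA is installed in the keda namespace:)

kubectl logs -n keda deployment/keda-operator-metrics-apiserver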

KEDA Version

2.6.0

Kubernetes Version

1.21

Platform

Amazon Web Services

Scaler Details

No response

Anything else?

No response

@IvanDechovsky IvanDechovsky added the bug Something isn't working label Feb 7, 2022
@zroubalik
Member

@IvanDechovsky thanks for reporting this issue. We have identified some problems with metric names and indexes in 2.6.0, even though this seems like a different issue. Are you able to try to reproduce the problem with the images mentioned here: #2592 (comment)?

Thanks! Feel free to ping me on Slack for details.

@IvanDechovsky
Author

Hi, thanks for the quick response. Just switching to the images does not fix the issue for existing metric names, but it does fix it if they are re-created or simply renamed (to the same name). I guess that triggers them to be re-indexed and the problem goes away.

Do you know when we can expect 2.6.1?

Do you know if there is a way to fix the problem for existing metrics without asking our dev teams to rename their metricNames or recreate their ScaledObjects?

@JorTurFer
Member

JorTurFer commented Feb 7, 2022

Restarting the pod should be enough, but you could also try deleting the HPA manually. The KEDA operator will recreate it based on the ScaledObject.
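
Roughly, assuming the default keda-hpa-<scaledobject-name> naming and a KEDA install in the keda namespace:

# Delete the HPA; the KEDA operator recreates it from the ScaledObject
kubectl delete hpa keda-hpa-real -n default

# Or restart the KEDA pods
kubectl rollout restart deployment/keda-operator -n keda
kubectl rollout restart deployment/keda-operator-metrics-apiserver -n keda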

@IvanDechovsky
Author

I may have been too quick to call it resolved. Upon further testing, I can still reproduce the problem with the indexFix images. In the current version, a restart of KEDA also "occasionally" removes the issue for a certain period of time.
The externalMetricNames with 2.6.0 and with indexFix stay the same, so the problem with non-unique metric names still persists.

@zroubalik
Member

@IvanDechovsky thanks for the confirmation, I will take a look at it.

@zroubalik zroubalik added this to the v2.6.1 milestone Feb 7, 2022
@zroubalik
Member

@IvanDechovsky I haven't been able to reproduce the problem 🤔 I have 2 SOs with the same metricName, took down one service, the respective SO stopped working, and the other one kept working correctly 🤷‍♂️

@JorTurFer
Member

Hey @IvanDechovsky
I have just noticed that you are creating 2 SOs for the same deployment. This creates race conditions because, in the same way as with plain HPAs, only 1 SO can scale a given workload (without any extra HPA other than the one created by KEDA).
Could this be your problem?
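
(For reference, if you really need both queries to drive the same workload, the supported pattern is a single ScaledObject with two triggers. A rough sketch based on your original manifests, with a distinct metricName on the second trigger purely for readability:)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: real
  namespace: default
spec:
  maxReplicaCount: 5
  minReplicaCount: 2
  scaleTargetRef:
    name: test
  triggers:
  - type: prometheus
    metadata:
      metricName: nginx_ingress_controller_requests
      serverAddress: http://kube-prometheus-stack-thanos-query-frontend.kube-system:9090/
      query: |
        sum(rate(nginx_ingress_controller_requests{exported_namespace="default", exported_service="test"}[2m]))
      threshold: "7"
  - type: prometheus
    metadata:
      metricName: nginx_ingress_controller_requests_fallback
      serverAddress: http://foo.bar:9090/
      query: |
        sum(rate(nginx_ingress_controller_requests{exported_namespace="default", exported_service="test"}[2m]))
      threshold: "7"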

@IvanDechovsky
Author

IvanDechovsky commented Feb 7, 2022

Hey @zroubalik ,
I can consistently reproduce it on 2.6.0 and the indexFix images. However, the key is to restart the keda-operator + keda-operator-metrics-apiserver pods. Just as a restart of those pods can temporarily fix the issue, it can also break working ScaledObjects.

@JorTurFer I just tested with different deployments and the result is the same.

keda-hpa-fake   Deployment/thanos-query            <unknown>/7 (avg), <unknown>/1 (avg)   2         50        2          82s
keda-hpa-real   Deployment/thanos-query-frontend   0/7 (avg), 500m/1 (avg)                2         50        2          82s

On initial inspection, it correctly detects the "real" HPA with the working data source. However, if you describe it, you see the same error. Again, on the initial apply of the SOs things work fine. Once the KEDA pods are restarted, it fails.

  Warning  FailedComputeMetricsReplicas  45s   horizontal-pod-autoscaler  invalid metrics (2 invalid out of 2), first error is: failed to get s0-prometheus-nginx_ingress_controller_requests external metric: unable to get external metric thanos/s0-prometheus-nginx_ingress_controller_requests/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: real,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get s0-prometheus-nginx_ingress_controller_requests.external.metrics.k8s.io)

@JorTurFer
Member

JorTurFer commented Feb 7, 2022

Could you share your ScaledObjects literally as they are (only removing any secrets)?
I'm trying to reproduce the problem but I can't. This is my configuration:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaledobject
  namespace: issue-2599
spec:
  cooldownPeriod: 10
  maxReplicaCount: 5
  minReplicaCount: 0
  pollingInterval: 5
  scaleTargetRef:
    name: keda-test-app
  triggers:
    - metadata:
        metricName: http_requests_total
        query: >-
          sum(avg by (mode) (rate(node_cpu_seconds_total{job="node-exporter",
          mode=~"idle|iowait|steal"}[2m]))) * 10
        serverAddress: http://prometheus-operated.issue-2599.svc:9090
        threshold: '3'
      type: prometheus
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaledobject-2
  namespace: issue-2599
spec:
  cooldownPeriod: 10
  maxReplicaCount: 5
  minReplicaCount: 0
  pollingInterval: 5
  scaleTargetRef:
    name: keda-test-app-2
  triggers:
    - metadata:
        metricName: http_requests_total
        query: >-
          sum(avg by (mode) (rate(node_cpu_seconds_total{job="node-exporter",
          mode=~"idle|iowait|steal"}[2m]))) * 10
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        threshold: '3'
      type: prometheus

@IvanDechovsky
Author

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: real
  namespace: thanos
spec:
  maxReplicaCount: 50
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    name: thanos-query-frontend
  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests
      query: |
        sum(
            rate(
                nginx_ingress_controller_requests{
                    exported_namespace="thanos",
                    exported_service="thanos-query-frontend",
                }[2m]
            )
        )
      serverAddress: http://kube-prometheus-stack-thanos-query-frontend.kube-system:9090/
      threshold: "7"
    type: prometheus
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fake
  namespace: thanos
spec:
  maxReplicaCount: 50
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    name: thanos-query
  triggers:
  - metadata:
      metricName: nginx_ingress_controller_requests
      query: |
        sum(
            rate(
                nginx_ingress_controller_requests{
                    exported_namespace="fake",
                    exported_service="fake",
                }[2m]
            )
        )
      serverAddress: http://fake-foo-bar-thanos.kube-system:9090/
      threshold: "7"
    type: prometheus

@tomkerkhove tomkerkhove moved this to Backlog in Roadmap - KEDA Core Feb 10, 2022
@zroubalik zroubalik modified the milestones: v2.6.1, v2.7.0 Feb 10, 2022
@tomkerkhove tomkerkhove moved this from To Do to Proposed in Roadmap - KEDA Core Feb 14, 2022
@stale

stale bot commented Apr 11, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Apr 11, 2022
@zroubalik zroubalik removed the stale All issues that are marked as stale due to inactivity label Apr 11, 2022
@tomkerkhove
Member

Are you able to test our current main version to see if this is still an issue @IvanDechovsky?

@JorTurFer JorTurFer moved this from Proposed to Pending End-User Feedback in Roadmap - KEDA Core May 4, 2022
@tomkerkhove tomkerkhove removed this from the v2.7.0 milestone May 5, 2022
@IvanDechovsky
Author

IvanDechovsky commented May 17, 2022

Sorry for the delay, but I'm happy to report the issue has been resolved in 2.7.0! Thank you for the support.

Repository owner moved this from Pending End-User Feedback to Ready To Ship in Roadmap - KEDA Core May 17, 2022
@tomkerkhove tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core Aug 10, 2022