
Continuous update of HPA objects #1531

Closed
surki opened this issue Jan 25, 2021 · 6 comments · Fixed by #1541
Labels: bug Something isn't working

@surki
Contributor

surki commented Jan 25, 2021

Even when there are no changes to the ScaledObject or HPA, keda-operator continuously updates the HPA whenever it "finds" a difference between the current HPA in the apiserver and the new HPA computed from the ScaledObject (in here). We have hundreds of ScaledObjects, which leads to far too many unnecessary updates hitting the apiserver.

With debug logs enabled, the difference seems to boil down to the order of the metrics listed in the metrics array (presumably equality.Semantic.DeepDerivative expects slices to be in the same order? see the sketch after the diff below).

(Note that the Resource type comes first in new whereas it comes last in current. If we rearrange the triggers in the ScaledObject YAML, the issue goes away; see Steps to Reproduce the Problem.)

--- current.json        2021-01-25 15:06:19.664240010 +0530
+++ new.json    2021-01-25 15:16:54.423692540 +0530
@@ -8,6 +8,16 @@
   "maxReplicas": 5,
   "metrics": [
     {
+      "type": "Resource",
+      "resource": {
+        "name": "cpu",
+        "target": {
+          "type": "Utilization",
+          "averageUtilization": 60
+        }
+      }
+    },
+    {
       "type": "External",
       "external": {
         "metric": {
@@ -23,16 +33,6 @@
           "averageValue": "70"
         }
       }
-    },
-    {
-      "type": "Resource",
-      "resource": {
-        "name": "cpu",
-        "target": {
-          "type": "Utilization",
-          "averageUtilization": 60
-        }
-      }
     }
   ],
   "behavior": {

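For illustration only, here is a minimal standalone sketch of the suspected cause (the package paths and metric values are assumptions for the example, not taken from the KEDA codebase): equality.Semantic.DeepDerivative from k8s.io/apimachinery compares slices element by element, so the same two metrics in a different order are reported as unequal.

// Hypothetical example, not KEDA code: demonstrates that DeepDerivative is
// order-sensitive for slices of MetricSpec.
package main

import (
	"fmt"

	autoscalingv2beta2 "k8s.io/api/autoscaling/v2beta2"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

func main() {
	util := int32(60)
	cpu := autoscalingv2beta2.MetricSpec{
		Type: autoscalingv2beta2.ResourceMetricSourceType,
		Resource: &autoscalingv2beta2.ResourceMetricSource{
			Name: corev1.ResourceCPU,
			Target: autoscalingv2beta2.MetricTarget{
				Type:               autoscalingv2beta2.UtilizationMetricType,
				AverageUtilization: &util,
			},
		},
	}
	external := autoscalingv2beta2.MetricSpec{
		Type: autoscalingv2beta2.ExternalMetricSourceType,
		External: &autoscalingv2beta2.ExternalMetricSource{
			Metric: autoscalingv2beta2.MetricIdentifier{Name: "prometheus-foo"},
		},
	}

	currentMetrics := []autoscalingv2beta2.MetricSpec{external, cpu} // order stored in the apiserver
	newMetrics := []autoscalingv2beta2.MetricSpec{cpu, external}     // order computed from the ScaledObject

	// Same metrics, different order: DeepDerivative reports them as different,
	// so the operator keeps "fixing" the HPA on every reconcile.
	fmt.Println(equality.Semantic.DeepDerivative(newMetrics, currentMetrics)) // false
}
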
Expected Behavior

No unnecessary updates

Actual Behavior

Unnecessary updates

Steps to Reproduce the Problem

  1. Have a ScaledObject like the one below (note the trigger order: the Resource (cpu) scaler first, then the Prometheus scaler):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: envoy-blue
  namespace: foo
spec:
  cooldownPeriod: 180
  maxReplicaCount: 5
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: envoy-blue
  triggers:
  - metadata:
      type: Utilization
      value: "60"
    type: cpu
  - metadata:
      metricName: container_cpu_usage_seconds_total
      query: sum((sum(rate(container_cpu_usage_seconds_total{region="us-east-1", namespace="foo", pod=~"envoy-blue.*", container="envoy"}[3m])) by (pod,container))/(sum(kube_pod_container_resource_requests_cpu_cores{region="us-east-1", namespace="foo", pod=~"envoy-blue.*", container="envoy"}) by (pod,container))) * 100.0
      serverAddress: http://foo/query
      threshold: "70"
    type: prometheus
  2. Apply this YAML and verify that the ScaledObject and HPA are created
  3. Tail the keda-operator logs and notice that it continuously updates the HPA object, logging Updated HPA according to ScaledObject
  4. In the YAML from step 1, rearrange the triggers so that the Prometheus scaler comes first and the Resource scaler second, then apply the change
  5. Tail the keda-operator logs again; the issue should be resolved now

Logs from KEDA operator

2021-01-25T11:00:34.457Z        INFO    controllers.ScaledObject        Reconciling ScaledObject        {"ScaledObject.Namespace": "foo", "ScaledObject.Name": "envoy-green"}
2021-01-25T11:00:34.457Z        DEBUG   controllers.ScaledObject        Parsed Group, Version, Kind, Resource   {"ScaledObject.Namespace": "foo", "ScaledObject.Name": "envoy-green", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
2021-01-25T11:00:34.483Z        DEBUG   controllers.ScaledObject        Found difference in the HPA spec accordint to ScaledObject      {"ScaledObject.Namespace": "foo", "ScaledObject.Name": "envoy-green", "currentHPA": {"scaleTargetRef":{"kind":"Deployment","name":"envoy-green","apiVersion":"apps/v1"},"minReplicas":1,"maxReplicas":5,"metrics":[{"type":"External","external":{"metric":{"name":"prometheus-http---foo-container_cpu_usage_seconds_total","selector":{"matchLabels":{"scaledObjectName":"envoy-green"}}},"target":{"type":"AverageValue","averageValue":"70"}}},{"type":"Resource","resource":{"name":"cpu","target":{"type":"Utilization","averageUtilization":60}}}],"behavior":{"scaleUp":{"stabilizationWindowSeconds":0,"selectPolicy":"Max","policies":[{"type":"Pods","value":4,"periodSeconds":15},{"type":"Percent","value":100,"periodSeconds":15}]},"scaleDown":{"stabilizationWindowSeconds":180,"selectPolicy":"Max","policies":[{"type":"Percent","value":100,"periodSeconds":15}]}}}, "newHPA": {"scaleTargetRef":{"kind":"Deployment","name":"envoy-green","apiVersion":"apps/v1"},"minReplicas":1,"maxReplicas":5,"metrics":[{"type":"Resource","resource":{"name":"cpu","target":{"type":"Utilization","averageUtilization":60}}},{"type":"External","external":{"metric":{"name":"prometheus-http---foo-query-staging-freshworks-edge-container_cpu_usage_seconds_total","selector":{"matchLabels":{"scaledObjectName":"envoy-green"}}},"target":{"type":"AverageValue","averageValue":"70"}}}],"behavior":{"scaleDown":{"stabilizationWindowSeconds":180,"policies":[{"type":"Percent","value":100,"periodSeconds":15}]}}}}
2021-01-25T11:00:34.492Z        INFO    controllers.ScaledObject        Updated HPA according to ScaledObject   {"ScaledObject.Namespace": "foo", "ScaledObject.Name": "envoy-green", "HPA.Namespace": "foo", "HPA.Name": "keda-hpa-envoy-green"}
2021-01-25T11:00:34.492Z        DEBUG   controllers.ScaledObject        ScaledObject is defined correctly and is ready for scaling      {"ScaledObject.Namespace": "foo", "ScaledObject.Name": "envoy-green"}
2021-01-25T11:00:34.505Z        DEBUG   controller      Successfully Reconciled {"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "envoy-green", "namespace": "foo"}

Specifications

  • KEDA Version: 2.0.0
  • Platform & Version: Linux x64
  • Kubernetes Version: AWS EKS 1.18.9
  • Scaler(s): CPU and Prometheus
surki added the bug (Something isn't working) label on Jan 25, 2021
@zroubalik
Member

zroubalik commented Jan 25, 2021

Thanks a lot @surki for the detailed analysis! Agree, the current behavior is not good :(

To fix this, we can probably always append the Resource metrics at the end of the generated metrics list, in here:

func (r *ScaledObjectReconciler) getScaledObjectMetricSpecs(logger logr.Logger, scaledObject *kedav1alpha1.ScaledObject) ([]autoscalingv2beta2.MetricSpec, error) {
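
A minimal sketch of that idea (the helper name is hypothetical, not actual KEDA code): partition the generated specs so that Resource metrics always land at the end, regardless of the trigger order in the ScaledObject:

// Hypothetical helper: keep non-Resource (i.e. External) metrics first and
// move Resource metrics to the end, preserving the relative order within
// each group so the generated list has a stable shape.
func moveResourceMetricsToEnd(specs []autoscalingv2beta2.MetricSpec) []autoscalingv2beta2.MetricSpec {
	others := make([]autoscalingv2beta2.MetricSpec, 0, len(specs))
	resources := make([]autoscalingv2beta2.MetricSpec, 0, len(specs))
	for _, spec := range specs {
		if spec.Type == autoscalingv2beta2.ResourceMetricSourceType {
			resources = append(resources, spec)
		} else {
			others = append(others, spec)
		}
	}
	return append(others, resources...)
}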

@surki
Contributor Author

surki commented Jan 25, 2021

Hmm, do we not support other HPA metric types? Wondering if we should do something like this:

diff --git a/controllers/hpa.go b/controllers/hpa.go
index 3354889..beaf570 100644
--- a/controllers/hpa.go
+++ b/controllers/hpa.go
@@ -3,6 +3,7 @@ package controllers
 import (
        "context"
        "fmt"
+       "sort"

        "github.com/go-logr/logr"
        version "github.com/kedacore/keda/v2/version"
@@ -161,6 +162,10 @@ func (r *ScaledObjectReconciler) getScaledObjectMetricSpecs(logger logr.Logger,
                scaler.Close()
        }

+       sort.Slice(scaledObjectMetricSpecs, func(i, j int) bool {
+               return scaledObjectMetricSpecs[i].Type < scaledObjectMetricSpecs[j].Type
+       })
+
        // store External.MetricNames,Resource.MetricsNames used by scalers defined in the ScaledObject
        status := scaledObject.Status.DeepCopy()
        status.ExternalMetricNames = externalMetricNames
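
(For context: since KEDA only generates the External and Resource metric types, sorting by Type gives a stable order with External before Resource, so the spec computed on each reconcile should always match the one previously written to the apiserver.)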

@zroubalik
Member

@surki in KEDA we use the Resource metric type for the CPU/Memory scalers and External for all the other scalers; no other types are used. Yeah, this could probably do the job :)

Would you mind sending a PR for this?

@zroubalik
Member

@surki we plan to release KEDA 2.1 tomorrow, so please let me know if you are willing to send a PR for this; if not, I'll do it :)

@surki
Contributor Author

surki commented Jan 26, 2021 via email

@zroubalik
Member

@surki no worries, thanks for the heads up!
