Service Bus Scaler has issues with the minReplicaCount parameter #4541

Closed
eugen-nw opened this issue May 15, 2023 · 16 comments
Labels
bug Something isn't working

Comments

@eugen-nw

Report

This parameter is documented as below. Yet it does not seem to be operational.

The min number of jobs that is created by default. This can be useful to avoid bootstrapping time of new jobs. If minReplicaCount is greater than maxReplicaCount, minReplicaCount will be set to maxReplicaCount.

Expected Behavior

  1. If I set it to 2 for example, there will be 2 Jobs always available.
  2. If I set it in the script the Jobs scaling will work. For the example above, if the count of Messages in the Queue exceeds 2, new Jobs will be started.

Actual Behavior

After setting minReplicaCount: 2 in the Job's script and deploying it, I noticed:

  1. No new Containers were created automatically. I was expecting to see 2 of them.
  2. Sending Messages to the Queue did not create any Containers. I'd have expected to see one Container being created for each Message I sent to the Queue.

Steps to Reproduce the Problem

Below is the YAML script that I'm experimenting with:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: aks-aci-genericdev-runner-linux
  labels:
    app: aks-aci-genericdev-runner-linux
    deploymentName: aks-aci-boldiq-genericdev-runner-linux
spec:
  jobTargetRef:
    template:
      spec:
        containers:  # this section is identical as for a "kind: Deployment"
        - name: boldiq-genericdev-runner-linux
          image: <removed>
          imagePullPolicy: Always
          resources:
            requests:
              memory: 6G
              cpu: 2
            limits:
              memory: 6G
              cpu: 2
          env:
          - name: KEDA_SERVICEBUS_CONNECTIONSTRING_LINUX
            value: <removed>
        tolerations:
          - key: virtual-kubelet.io/provider
            operator: Exists
          - key: azure.com/aci
            effect: NoSchedule
        imagePullSecrets:
          - name: docker-registry-secret-linux
        nodeName: virtual-node-aci-linux
  pollingInterval: 1  # 1 second polling for max. responsiveness
  minReplicaCount: 2  # keeping two instances running permanently in order to improve low loads' performance
  triggers:
  - type: azure-servicebus
#    metricType: Value // The default AverageValue with messageCount: '1' starts up a new Container for each Message in the Queue.  We want that for responsiveness.
    metadata:
      queueName: requests
      connectionFromEnv: KEDA_SERVICEBUS_CONNECTIONSTRING_LINUX
      messageCount: '1'

AKS 1.25.6
KEDA 2.10.2
The Containers run on the virtual-node-aci-linux virtual node.

Logs from KEDA operator

example

KEDA Version

2.10.1

Kubernetes Version

1.25

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

No response

@eugen-nw eugen-nw added the bug Something isn't working label May 15, 2023
@JorTurFer
Member

Hello,
Could you share keda-operator logs in debug please? You can set it using helm: https://github.com/kedacore/charts/blob/main/keda/values.yaml#L198

In debug mode, you will see the value of the metric on each iteration, the current count of active jobs, and the number of required jobs.

@eugen-nw
Author

Let's hope I did not do something too silly. Please correct my course if necessary.

I ran this command:
helm upgrade -f https://github.com/kedacore/charts/blob/main/keda/values.yaml#L198 keda kedacore/keda --namespace keda

And got this error:

Error: failed to parse https://github.com/kedacore/charts/blob/main/keda/values.yaml#L198: error converting YAML to JSON: yaml: line 28: mapping values are not allowed in this context

@JorTurFer
Member

I didn't know that your command was possible O.O
Try this instead: helm upgrade --set logging.operator.level=debug keda kedacore/keda --namespace keda

@eugen-nw
Author

eugen-nw commented May 18, 2023

Thank you. Neither did I, I'm just learning Helm as I go. The --set command completed successfully.

However, that value was already present in the output of the helm get all keda -n keda > keda.yaml command:
image

It appears that the keda-operator-.... pod is non-functional. Earlier I've seen this, now it has the CrashLoopBackOff status:
image

I ran kubectl logs -n keda keda-operator-db56bccc8-vnjqw, the output is below:

2023-05-18T22:25:44Z    INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
2023-05-18T22:25:44Z    DEBUG   setup   setting up cert rotation
2023-05-18T22:25:44Z    INFO    setup   Starting manager
2023-05-18T22:25:44Z    INFO    setup   KEDA Version: 2.10.1
2023-05-18T22:25:44Z    INFO    setup   Git Commit: 8adb70e97a08a4690613eef4c4f00391e44e1603
2023-05-18T22:25:44Z    INFO    setup   Go Version: go1.19.7
2023-05-18T22:25:44Z    INFO    setup   Go OS/Arch: linux/amd64
2023-05-18T22:25:44Z    INFO    setup   Running on Kubernetes 1.25      {"version": "v1.25.6"}
I0518 22:25:44.204296       1 leaderelection.go:248] attempting to acquire leader lease keda/operator.keda.sh...
2023-05-18T22:25:44Z    INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
2023-05-18T22:25:44Z    INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
I0518 22:26:01.302501       1 leaderelection.go:258] successfully acquired lease keda/operator.keda.sh
2023-05-18T22:26:01Z    DEBUG   events  keda-operator-db56bccc8-vnjqw_12f276b8-89fe-441d-a569-c0521635f640 became leader        {"type": "Normal", "object": {"kind":"Lease","namespace":"keda","name":"operator.keda.sh","uid":"206ed27f-3b4f-4ebf-8c15-d53ab82cc0c0","apiVersion":"coordination.k8s.io/v1","resourceVersion":"30945"}, "reason": "LeaderElection"}
2023-05-18T22:26:01Z    INFO    Starting EventSource    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2023-05-18T22:26:01Z    INFO    Starting EventSource    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
2023-05-18T22:26:01Z    INFO    Starting Controller     {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2023-05-18T22:26:01Z    INFO    Starting EventSource    {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2023-05-18T22:26:01Z    INFO    Starting Controller     {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2023-05-18T22:26:01Z    INFO    Starting EventSource    {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2023-05-18T22:26:01Z    INFO    Starting Controller     {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2023-05-18T22:26:01Z    INFO    cert-rotation   starting cert rotator controller
2023-05-18T22:26:01Z    INFO    Starting EventSource    {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2023-05-18T22:26:01Z    INFO    Starting Controller     {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2023-05-18T22:26:01Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2023-05-18T22:26:01Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2023-05-18T22:26:01Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2023-05-18T22:26:01Z    INFO    Starting Controller     {"controller": "cert-rotator"}
2023-05-18T22:26:01Z    INFO    Starting workers        {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
2023-05-18T22:26:01Z    INFO    Reconciling ScaledJob   {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"aks-aci-genericdev-runner-linux","namespace":"default"}, "namespace": "default", "name": "aks-aci-genericdev-runner-linux", "reconcileID": "09fae947-d2f4-4aa8-b0bf-85478d97e2c7"}
2023-05-18T22:26:01Z    INFO    cert-rotation   no cert refresh needed
2023-05-18T22:26:01Z    INFO    Starting workers        {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
2023-05-18T22:26:01Z    INFO    cert-rotation   certs are ready in /certs
2023-05-18T22:26:01Z    INFO    Starting workers        {"controller": "cert-rotator", "worker count": 1}
2023-05-18T22:26:01Z    INFO    cert-rotation   Ensuring CA cert        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2023-05-18T22:26:01Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2023-05-18T22:26:01Z    DEBUG   Starting a new ScaleLoop        {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"aks-aci-genericdev-runner-linux","namespace":"default"}, "namespace": "default", "name": "aks-aci-genericdev-runner-linux", "reconcileID": "09fae947-d2f4-4aa8-b0bf-85478d97e2c7"}
2023-05-18T22:26:01Z    INFO    Starting workers        {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
2023-05-18T22:26:01Z    INFO    Starting workers        {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
2023-05-18T22:26:01Z    INFO    Initializing Scaling logic according to ScaledJob Specification {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"aks-aci-genericdev-runner-linux","namespace":"default"}, "namespace": "default", "name": "aks-aci-genericdev-runner-linux", "reconcileID": "09fae947-d2f4-4aa8-b0bf-85478d97e2c7"}
2023-05-18T22:26:01Z    DEBUG   ScaledJob is defined correctly and is ready to scaling  {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"aks-aci-genericdev-runner-linux","namespace":"default"}, "namespace": "default", "name": "aks-aci-genericdev-runner-linux", "reconcileID": "09fae947-d2f4-4aa8-b0bf-85478d97e2c7"}
2023-05-18T22:26:01Z    DEBUG   scale_handler   Watching with pollingInterval   {"type": "ScaledJob", "namespace": "default", "name": "aks-aci-genericdev-runner-linux", "PollingInterval": "1s"}
2023-05-18T22:26:01Z    DEBUG   events  Started scalers watch   {"type": "Normal", "object": {"kind":"ScaledJob","namespace":"default","name":"aks-aci-genericdev-runner-linux","uid":"1ba89476-1c93-4fc6-ae6e-da9fed530dbf","apiVersion":"keda.sh/v1alpha1","resourceVersion":"29896"}, "reason": "KEDAScalersStarted"}
2023-05-18T22:26:01Z    DEBUG   scalers_cache   Scaler Metric value     {"ScaledJob": "aks-aci-genericdev-runner-linux", "Scaler": "cache.ScalerBuilder:", "isTriggerActive": true, "s0-azure-servicebus-requests": 1, "targetAverageValue": 1}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2f31b34]

goroutine 420 [running]:
github.com/kedacore/keda/v2/apis/keda/v1alpha1.ScaledJob.MinReplicaCount(...)
        /workspace/apis/keda/v1alpha1/scaledjob_types.go:126
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).IsScaledJobActive(0x0?, {0x434cea8, 0xc000ce0080}, 0xc000029800)
        /workspace/pkg/scaling/cache/scalers_cache.go:196 +0x1f4
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers(0xc00064f880, {0x434cea8, 0xc000ce0080}, {0x3a84400?, 0xc000029800?}, {0x4338d38, 0xc001307600})
        /workspace/pkg/scaling/scale_handler.go:253 +0x9fa
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop(0xc000ce5800?, {0x434cea8, 0xc000ce0080}, 0xc0004e6b40, {0x3a84400, 0xc000029800}, {0x4338d38, 0xc001307600})
        /workspace/pkg/scaling/scale_handler.go:167 +0x351
created by github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).HandleScalableObject
        /workspace/pkg/scaling/scale_handler.go:125 +0x7ed

@eugen-nw
Author

eugen-nw commented May 18, 2023

Line 126 of the keda/apis/keda/v1alpha1/scaledjob_types.go file is below:

image

I have no Go programming experience, but I realized that the exception occurred because I set minReplicaCount in the deployment YAML but omitted maxReplicaCount, assuming that it would default to 100. Apparently it defaulted to nil. I fixed the deployment YAML and the Job now works properly.

image

May I suggest initializing s.Spec.MinReplicaCount and s.Spec.MaxReplicaCount to their default values of 0 and 100 when they are not present in the deployment YAML? That would eliminate some nil checks in the code.
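
To illustrate the suggestion, here is a minimal Go sketch of nil-safe accessors with defaults. It is not KEDA's actual code: the ScaledJobSpec type, the MinReplicas and MaxReplicas method names, and the constants are hypothetical stand-ins. The defaults of 0 and 100 and the clamping of min to max are taken from the descriptions in this thread.

```go
package main

import "fmt"

// ScaledJobSpec is a hypothetical stand-in for the relevant part of a
// ScaledJob spec. A pointer field is nil when its key is omitted in the YAML.
type ScaledJobSpec struct {
	MinReplicaCount *int32
	MaxReplicaCount *int32
}

// Assumed defaults, as mentioned in the comment above (0 and 100).
const (
	defaultMinReplicaCount int32 = 0
	defaultMaxReplicaCount int32 = 100
)

// MaxReplicas returns the configured maximum, or the default when unset.
func (s ScaledJobSpec) MaxReplicas() int32 {
	if s.MaxReplicaCount != nil {
		return *s.MaxReplicaCount
	}
	return defaultMaxReplicaCount
}

// MinReplicas returns the configured minimum, or the default when unset.
// A minimum greater than the maximum is clamped to the maximum.
func (s ScaledJobSpec) MinReplicas() int32 {
	if s.MinReplicaCount == nil {
		return defaultMinReplicaCount
	}
	if limit := s.MaxReplicas(); *s.MinReplicaCount > limit {
		return limit
	}
	return *s.MinReplicaCount
}

func main() {
	two := int32(2)
	// minReplicaCount is set, maxReplicaCount is omitted (nil).
	spec := ScaledJobSpec{MinReplicaCount: &two}
	fmt.Println(spec.MinReplicas(), spec.MaxReplicas())
}
```

With accessors like these, a spec that sets only minReplicaCount never dereferences a nil maxReplicaCount.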

@eugen-nw
Author

Now I'm experiencing a different scale-out behavior that's a bit unexpected. As per the script below, I already have 2 Jobs running. If I send 3 Messages to the Queue, I get 3 new Jobs started, for a total of 5 Containers. I'd expect to get only 1 new Job started as I already have 2 of them running.

image

@eugen-nw
Author

eugen-nw commented May 19, 2023

I checked the documentation at https://keda.sh/docs/2.9/concepts/scaling-jobs/ and my scaling expectations just above are correct. This latter issue is tracked by #4554

image
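
The scale-out expectation described above can be written as a small Go sketch. It only restates the arithmetic I expected (jobs to start = pending Messages minus Jobs already running, never negative, capped by the maximum). The jobsToStart function is a hypothetical helper, not KEDA's implementation, and it assumes messageCount is '1' so that one Message maps to one Job.

```go
package main

import "fmt"

// jobsToStart returns how many new Jobs are expected for the given queue
// length, assuming one Job per Message and counting Jobs already running.
// Hypothetical helper for illustration only.
func jobsToStart(queueLength, runningJobs, maxReplicas int) int {
	target := queueLength
	if target > maxReplicas {
		target = maxReplicas // never aim above the configured maximum
	}
	n := target - runningJobs
	if n < 0 {
		return 0 // enough Jobs are already running
	}
	return n
}

func main() {
	// 3 Messages in the Queue, 2 Jobs already running, maximum of 100.
	fmt.Println(jobsToStart(3, 2, 100))
}
```

For the case reported here (3 Messages, 2 running Jobs) this yields 1 new Job, rather than the 3 that were actually started.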

@JorTurFer
Member

Line 126 of the keda/apis/keda/v1alpha1/scaledjob_types.go file is below:

image

I have no Go programming experience, but I realized that the exception occurred because I set minReplicaCount in the deployment YAML but omitted maxReplicaCount, assuming that it would default to 100. Apparently it defaulted to nil. I fixed the deployment YAML and the Job now works properly.

image

May I suggest initializing s.Spec.MinReplicaCount and s.Spec.MaxReplicaCount to their default values of 0 and 100 when they are not present in the deployment YAML? That would eliminate some nil checks in the code.

Nice catch!

@JorTurFer
Member

I have opened a PR with the fix: https://github.com/kedacore/keda/pull/4565/files

@JorTurFer
Member

Is this issue a duplicate of #4554?

@eugen-nw
Author

This is not a duplicate of #4554

@JorTurFer
Member

Could you share the logs with both parameters (min and max) set?
The logs you shared last time ended in a panic (btw, the fix is already merged), so the information that I want to check is missing.

@eugen-nw
Author

I'll get to this on Tuesday, May 30th. We have a long weekend in the USA.

@eugen-nw
Author

This is fixed now, right? I should close then. And it's not a duplicate of #4554

@github-project-automation github-project-automation bot moved this from To Do to Ready To Ship in Roadmap - KEDA Core May 30, 2023
@JorTurFer
Member

No no, the merged fix only prevents the crash when you set minReplicaCount without setting maxReplicaCount. Was your problem related to that?

@eugen-nw
Author

Yes, my problem was that I set minReplicaCount without setting maxReplicaCount, and scale-out stopped working because of the nil dereference crash. I saw that fix, thanks very much. So this problem is completely addressed.
