
ScaledJob deployments regard Containers in the "ProviderFailed" state as if they were in the "Running" state #4866

Closed
eugen-nw opened this issue Aug 7, 2023 · 8 comments
Labels: bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@eugen-nw

eugen-nw commented Aug 7, 2023

Report

I have a deployment configured to run 12 Containers permanently. Right now there is no Message in the Queue, and I see 12 Containers in spite of the fact that 8 of them will do absolutely nothing for us:

(screenshot: pod list showing the 12 Containers)

Expected Behavior

KEDA should ignore Containers in the "ProviderFailed" state and always provide us with the configured minReplicaCount of Containers that are in either the "Pending" or "Running" state. The reason for setting minReplicaCount to 12 is that we need 12 Containers always on hold, not only 4 of them. Please fix KEDA to always provide minReplicaCount usable Containers.
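As an illustration only, here is a rough Go sketch (not KEDA's actual code; the helper name is made up) of the counting described above: only Pods in the Pending or Running phase count as usable replicas, so Pods stuck in a provider-failed state would not satisfy minReplicaCount.

import corev1 "k8s.io/api/core/v1"

// usablePodCount is a hypothetical helper: it counts only the Pods that are
// Pending or Running, i.e. the Pods that can still pick up work.
func usablePodCount(pods []corev1.Pod) int {
    count := 0
    for _, p := range pods {
        if p.Status.Phase == corev1.PodPending || p.Status.Phase == corev1.PodRunning {
            count++
        }
    }
    return count
}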

Actual Behavior

Kubernetes shows the minReplicaCount: 12 configured Containers, but 8 of them are in the non-functional "ProviderFailed" state.

Steps to Reproduce the Problem

Logs from KEDA operator

example

KEDA Version

2.10.1

Kubernetes Version

1.25

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

These are Windows Containers instantiated in the Azure Container Instances service using the Virtual Kubelet installed using the virtual-kubelet-azure-aci-1.5.1 Helm chart.

eugen-nw added the bug (Something isn't working) label on Aug 7, 2023
@JorTurFer
Member

Hello
Could you share your ScaledJob?

@eugen-nw
Author

Certainly, please find it below.

# Container deployment as Job + KEDA scaleout setup script.
#
# install by running "kubectl apply -f <name of this file>"
# remove by running "kubectl delete -f <name of this file>".

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: boldiq-external-solver-runner-nj-test
  labels:
    app: boldiq-external-solver-runner-nj-test
    deploymentName: boldiq-external-solver-runner-nj-test
spec:
  jobTargetRef:
    template:
      spec:
        containers:  # this section is identical to the one in a "kind: Deployment"
        - image: externalsolverrunnersnetjets.azurecr.io/boldiq-external-solver-runner:#{Build.BuildNumber}#
          imagePullPolicy: Always
          name: boldiq-external-solver-runner-nj-test
          resources:
            requests:
              memory: 13G
              cpu: 2
            limits:
              memory: 13G
              cpu: 2
          env:
          - name: KEDA_SERVICEBUS_CONNECTIONSTRING
            value: "(removed)"
          - name: ApplicationName
            value: "ExternalSolverRunner"
          - name: AppInsightsInstrumentationKey
            value: "(removed)"
          - name: ServiceBusIncomingConnection
            value: "(removed)"
          - name: BoldIQServicesUri
            value: "(removed)"
          - name: HeaderApiKey
            value: "(removed)"
          - name: HeaderApiKeyValue
            value: "(removed)"
          - name: MinimumBytesOfAvailableMemory
            value: "10737418241"  # 10 GB + 1, so the logs shows that the value came from the environment
          - name: StorageConnectionString
            value: "(removed)"
        nodeSelector:
          kubernetes.io/os: windows
        tolerations:
        - key: virtual-kubelet.io/provider
          operator: Exists
        - key: azure.com/aci
          effect: NoSchedule
        imagePullSecrets:
          - name: docker-registry-secret
        nodeName: virtual-kubelet
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  pollingInterval: 1  # 1 second polling for max. responsiveness
  minReplicaCount: 12  # keeping these running permanently in order to improve low loads' performance
  maxReplicaCount: 80
  triggers:
  - type: azure-servicebus
#    metricType: Value  # The default AverageValue with messageCount: '1' starts a new Container for each Message in the Queue; we want that for responsiveness.
    metadata:
      queueName: requestqueue
      connectionFromEnv: KEDA_SERVICEBUS_CONNECTIONSTRING
      messageCount: '1'

@JorTurFer
Member

I'd say that your ScaledJob should work, but I'm not familiar with the ProviderFailed state, and maybe it's not correctly processed. I'll take a look.

JorTurFer self-assigned this on Aug 14, 2023
@eugen-nw
Author

Those Pods got into the ProviderFailed state because the Azure subscription had not been configured to provide the desired count of Containers. At any rate, those Pods were in a failed state where they were doing nothing for us.

@JorTurFer
Member

Sorry for the long delay. I was on vacation :)
I'm checking the code, and it depends on the job status, not the pod status. We probably need to add this condition to the not-running criteria. Could you share the status of the jobs (not the pods, the jobs)? Currently we use the job status to decide whether a job has finished, based on whether its status is Complete or Failed. If your jobs are in a different state, we need to add the new state to that condition somehow.
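To make that concrete, here is a minimal sketch (not the actual KEDA source) of a finished-job check that looks only at the Job's own status conditions, as described above. A Job whose Pods are stuck in ProviderFailed may carry neither condition, so a check like this would keep treating it as still running.

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
)

// isJobFinished reports whether the Job carries a Complete or Failed
// condition whose status is true.
func isJobFinished(job *batchv1.Job) bool {
    for _, c := range job.Status.Conditions {
        if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
            return true
        }
    }
    return false
}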

@zroubalik
Member

Agree with @JorTurFer on this


stale bot commented Nov 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale (All issues that are marked as stale due to inactivity) label on Nov 12, 2023

stale bot commented Nov 19, 2023

This issue has been automatically closed due to inactivity.

stale bot closed this as completed on Nov 19, 2023