
ScaledJob deployments regard Containers in the "ProviderFailed" state as if they were in the "Running" state #4866

Closed
eugen-nw opened this issue Aug 7, 2023 · 8 comments
Labels: bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@eugen-nw

eugen-nw commented Aug 7, 2023

Report

I have a deployment configured to run 12 Containers permanently. Right now there is no Message in the Queue, and I see 12 Containers in spite of the fact that 8 of them will do absolutely nothing for us:

(screenshot: pod list showing the 12 Containers)

Expected Behavior

KEDA should ignore Containers in the "ProviderFailed" state and always provide us with the configured minReplicaCount of Containers that are in either the "Pending" or "Running" state. The reason for setting minReplicaCount to 12 is that we need 12 Containers always on hold, not only 4 of them. Please fix KEDA to always provide minReplicaCount usable Containers.
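As an illustration only, here is a rough Go sketch (not KEDA's actual code; the helper name is made up) of the counting described above: only Pods in the Pending or Running phase count as usable replicas, so Pods stuck in a provider-failed state would not satisfy minReplicaCount.

import corev1 "k8s.io/api/core/v1"

// usablePodCount is a hypothetical helper: it counts only the Pods that are
// Pending or Running, i.e. the Pods that can still pick up work.
func usablePodCount(pods []corev1.Pod) int {
    count := 0
    for _, p := range pods {
        if p.Status.Phase == corev1.PodPending || p.Status.Phase == corev1.PodRunning {
            count++
        }
    }
    return count
}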

Actual Behavior

Kubernetes shows the minReplicaCount: 12 configured Containers, but 8 of them are in the non-functional "ProviderFailed" state.

Steps to Reproduce the Problem

Logs from KEDA operator

example

KEDA Version

2.10.1

Kubernetes Version

1.25

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

These are Windows Containers instantiated in the Azure Container Instances service using the Virtual Kubelet installed using the virtual-kubelet-azure-aci-1.5.1 Helm chart.

eugen-nw added the bug (Something isn't working) label on Aug 7, 2023
@JorTurFer
Member

Hello
Could you share your ScaledJob?

@eugen-nw
Author

Certainly, please find it below.

# Container deployment as Job + KEDA scaleout setup script.
#
# install by running "kubectl apply -f <name of this file>"
# remove by running "kubectl delete -f <name of this file>".

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: boldiq-external-solver-runner-nj-test
  labels:
    app: boldiq-external-solver-runner-nj-test
    deploymentName: boldiq-external-solver-runner-nj-test
spec:
  jobTargetRef:
    template:
      spec:
        containers:  # this section is identical to the one in a "kind: Deployment"
        - image: externalsolverrunnersnetjets.azurecr.io/boldiq-external-solver-runner:#{Build.BuildNumber}#
          imagePullPolicy: Always
          name: boldiq-external-solver-runner-nj-test
          resources:
            requests:
              memory: 13G
              cpu: 2
            limits:
              memory: 13G
              cpu: 2
          env:
          - name: KEDA_SERVICEBUS_CONNECTIONSTRING
            value: "(removed)"
          - name: ApplicationName
            value: "ExternalSolverRunner"
          - name: AppInsightsInstrumentationKey
            value: "(removed)"
          - name: ServiceBusIncomingConnection
            value: "(removed)"
          - name: BoldIQServicesUri
            value: "(removed)"
          - name: HeaderApiKey
            value: "(removed)"
          - name: HeaderApiKeyValue
            value: "(removed)"
          - name: MinimumBytesOfAvailableMemory
            value: "10737418241"  # 10 GB + 1, so the logs shows that the value came from the environment
          - name: StorageConnectionString
            value: "(removed)"
        nodeSelector:
          kubernetes.io/os: windows
        tolerations:
        - key: virtual-kubelet.io/provider
          operator: Exists
        - key: azure.com/aci
          effect: NoSchedule
        imagePullSecrets:
          - name: docker-registry-secret
        nodeName: virtual-kubelet
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  pollingInterval: 1  # 1 second polling for max. responsiveness
  minReplicaCount: 12  # keeping these running permanently in order to improve low loads' performance
  maxReplicaCount: 80
  triggers:
  - type: azure-servicebus
#    metricType: Value  # The default AverageValue with messageCount: '1' starts a new Container for each Message in the Queue; we want that for responsiveness.
    metadata:
      queueName: requestqueue
      connectionFromEnv: KEDA_SERVICEBUS_CONNECTIONSTRING
      messageCount: '1'

@JorTurFer
Member

I'd say that your ScaledJob should work, but I'm not familiar with the ProviderFailed state, and maybe it's not correctly processed. I'll take a look.

JorTurFer self-assigned this on Aug 14, 2023
@eugen-nw
Author

Those Pods got into the ProviderFailed state because the Azure subscription had not been configured to provide the desired count of Containers. At any rate, those Pods were in a failed state where they were doing nothing for us.

@JorTurFer
Member

Sorry for the long delay. I was on vacation :)
I'm checking the code, and it depends on the job status, not the pod status. We probably need to add this condition to the not-running criteria. Could you share the status of the jobs (not the pods, the jobs)? Currently we use the job status to decide whether a job has finished, based on whether its status is Complete or Failed. If your jobs are in a different state, we need to add the new state to that condition somehow.
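To make that concrete, here is a minimal sketch (not the actual KEDA source) of a finished-job check that looks only at the Job's own status conditions, as described above. A Job whose Pods are stuck in ProviderFailed may carry neither condition, so a check like this would keep treating it as still running.

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
)

// isJobFinished reports whether the Job carries a Complete or Failed
// condition whose status is true.
func isJobFinished(job *batchv1.Job) bool {
    for _, c := range job.Status.Conditions {
        if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
            return true
        }
    }
    return false
}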

@zroubalik
Member

Agree with @JorTurFer on this


stale bot commented Nov 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale (All issues that are marked as stale due to inactivity) label on Nov 12, 2023

stale bot commented Nov 19, 2023

This issue has been automatically closed due to inactivity.

stale bot closed this as completed on Nov 19, 2023