Autoscaling not working #824

Closed
barelyreal opened this issue Aug 28, 2019 · 5 comments
@barelyreal

I've been noticing lately that my models no longer autoscale (they stay at the configured minimum number of pods even when CPU is maxed out). The HPA config seems to be properly generated. Could it have something to do with the resource limits on the sidecar?

Here's an example deployment YAML that isn't working. The Seldon version is 0.4.0.

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: lp-usic-mdl
  namespace: seldon
spec:
  annotations:
    project_name: lp-usic-mdl
    deployment_version: "${project.version}"
    seldon.io/rest-read-timeout: "30000"
    seldon.io/rest-connection-timeout: "30000"
    seldon.io/grpc-read-timeout: "30000"
  name: lp-usic
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: "${lp.docker.image.full.name}"
          imagePullPolicy: Always
          name: usic-classifier
          resources:
            limits:
              cpu: "1"
              memory: "6Gi"
            requests:
              cpu: "1"
              memory: "4Gi"
          env:
            - name: SELDON_LOG_LEVEL
              value: "INFO"
            - name: LOGGER_LEVEL
              value: "INFO"
            - name: PYTORCH_NUM_THREADS
              value: "1"
            - name: MAX_WORKER_THREADS
              value: "1"
          livenessProbe:
            initialDelaySeconds: 600
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
            tcpSocket:
              port: "http"
          readinessProbe:
            initialDelaySeconds: 600
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
            tcpSocket:
              port: "http"
        terminationGracePeriodSeconds: 20
      hpaSpec:
        minReplicas: 4
        maxReplicas: 10
        metrics:
        - type: "Resource"
          resource:
            name: cpu
            targetAverageUtilization: 60

    graph:
      children: []
      name: usic-classifier
      endpoint:
        type: REST
      type: MODEL
    name: mdl
    replicas: 4
    annotations:
      predictor_version: "${project.version}"
    svcOrchSpec:
      env:
        - name: SELDON_LOG_LEVEL
          value: "INFO"
      resources:
        limits:
          cpu: "1"
          memory: "2Gi"
        requests:
          cpu: "500m"
          memory: "1Gi"
@ukclivecox ukclivecox self-assigned this Aug 28, 2019
@ukclivecox ukclivecox added the bug label Aug 28, 2019
@ukclivecox ukclivecox added this to the 1.0.x milestone Aug 28, 2019
@ukclivecox
Contributor

Can you try this example with 0.4.1-SNAPSHOT (i.e., built from a clone of seldon-core)? I retested it on a GKE cluster and it works. If it also works for you, we would need to look more closely at your SeldonDeployment and why it's different.
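
Roughly something like the following to install the operator from a checkout (the chart path and Helm 2 flags are from memory and may differ for your setup):

# Install the operator chart straight from the working tree (Helm 2 syntax)
git clone https://github.com/SeldonIO/seldon-core.git
cd seldon-core
helm install helm-charts/seldon-core-operator --name seldon-core --namespace seldon-system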

@ukclivecox
Contributor

But yes - it may be due to your resource limits.

@barelyreal
Author

Do you have recommended settings for the svcOrchSpec resource limits?

@ukclivecox
Contributor

The svcOrchSpec resource limits should not stop autoscaling from happening. The custom limits you set will depend on the load you expect for the model you deploy.

Can you test the example notebook on your cluster: https://docs.seldon.io/projects/seldon-core/en/latest/examples/autoscaling_example.html
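
If it helps, a crude way to exercise the same thing outside the notebook is to watch the HPA while generating load (the endpoint path is illustrative for an Ambassador-style install; adjust to however your gateway exposes the model):

# Watch the HPA react while load is applied
kubectl get hpa -n seldon -w

# In another shell, hammer the predictions endpoint until CPU passes the 60% target
while true; do
  curl -s -X POST http://<ingress-host>/seldon/seldon/lp-usic-mdl/api/v0.1/predictions \
    -H 'Content-Type: application/json' \
    -d '{"data":{"ndarray":[[1.0,2.0,3.0]]}}' > /dev/null
done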

@ukclivecox
Contributor

Please reopen if the issue still exists.
