Autoscaling not working #824

Closed
barelyreal opened this issue Aug 28, 2019 · 5 comments
@barelyreal

I've been noticing lately that my models no longer autoscale (they stay at the configured minimum number of pods even when CPU is maxed out). The HPA config seems to be properly generated. Could it have something to do with the resource limits on the sidecar?

Here's an example deployment YAML that isn't working. The Seldon version is 0.4.0.

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: lp-usic-mdl
  namespace: seldon
spec:
  annotations:
    project_name: lp-usic-mdl
    deployment_version: "${project.version}"
    seldon.io/rest-read-timeout: "30000"
    seldon.io/rest-connection-timeout: "30000"
    seldon.io/grpc-read-timeout: "30000"
  name: lp-usic
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: "${lp.docker.image.full.name}"
          imagePullPolicy: Always
          name: usic-classifier
          resources:
            limits:
              cpu: "1"
              memory: "6Gi"
            requests:
              cpu: "1"
              memory: "4Gi"
          env:
            - name: SELDON_LOG_LEVEL
              value: "INFO"
            - name: LOGGER_LEVEL
              value: "INFO"
            - name: PYTORCH_NUM_THREADS
              value: "1"
            - name: MAX_WORKER_THREADS
              value: "1"
          livenessProbe:
            initialDelaySeconds: 600
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
            tcpSocket:
              port: "http"
          readinessProbe:
            initialDelaySeconds: 600
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
            tcpSocket:
              port: "http"
        terminationGracePeriodSeconds: 20
      hpaSpec:
        minReplicas: 4
        maxReplicas: 10
        metrics:
        - type: "Resource"
          resource:
            name: cpu
            targetAverageUtilization: 60

    graph:
      children: []
      name: usic-classifier
      endpoint:
        type: REST
      type: MODEL
    name: mdl
    replicas: 4
    annotations:
      predictor_version: "${project.version}"
    svcOrchSpec:
      env:
        - name: SELDON_LOG_LEVEL
          value: "INFO"
      resources:
        limits:
          cpu: "1"
          memory: "2Gi"
        requests:
          cpu: "500m"
          memory: "1Gi"
@ukclivecox ukclivecox self-assigned this Aug 28, 2019
@ukclivecox ukclivecox added the bug label Aug 28, 2019
@ukclivecox ukclivecox added this to the 1.0.x milestone Aug 28, 2019
@ukclivecox
Contributor

Can you try this example with 0.4.1-SNAPSHOT (i.e., built from a clone of seldon-core)? I retested it on a GKE cluster and it works. If it also works for you, we would need to look more closely at your SeldonDeployment and why it's different.
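
Roughly something like the following to install the operator from a checkout (the chart path and Helm 2 flags are from memory and may differ for your setup):

# Install the operator chart straight from the working tree (Helm 2 syntax)
git clone https://github.com/SeldonIO/seldon-core.git
cd seldon-core
helm install helm-charts/seldon-core-operator --name seldon-core --namespace seldon-system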

@ukclivecox
Contributor

But yes - it may be due to your resource limits.

@barelyreal
Author

Do you have recommended settings for the svcOrchSpec resource limits?

@ukclivecox
Contributor

The svcOrchSpec resource limits should not stop autoscaling from happening. The custom limits you set will depend on the load you expect for the model you deploy.

Can you test the example notebook on your cluster: https://docs.seldon.io/projects/seldon-core/en/latest/examples/autoscaling_example.html
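
If it helps, a crude way to exercise the same thing outside the notebook is to watch the HPA while generating load (the endpoint path is illustrative for an Ambassador-style install; adjust to however your gateway exposes the model):

# Watch the HPA react while load is applied
kubectl get hpa -n seldon -w

# In another shell, hammer the predictions endpoint until CPU passes the 60% target
while true; do
  curl -s -X POST http://<ingress-host>/seldon/seldon/lp-usic-mdl/api/v0.1/predictions \
    -H 'Content-Type: application/json' \
    -d '{"data":{"ndarray":[[1.0,2.0,3.0]]}}' > /dev/null
done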

@ukclivecox
Contributor

Please reopen if the issue still exists.
