SeldonDeployment stuck on creating when an environment variable is a reference #1211

Closed
bbarn3y opened this issue Dec 6, 2019 · 7 comments · Fixed by #1214

@bbarn3y

bbarn3y commented Dec 6, 2019

We have a SeldonDeployment that we want to use with Jaeger (https://www.jaegertracing.io/), and that involves setting certain environment variables.

From what we can see, our issue stems from environment variables that are references. Namely, we want to specify the "JAEGER_AGENT_HOST" variable, whose value should come from the Kubernetes status, like so:

- env:
  - name: JAEGER_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP

However, if we do this, then the SeldonDeployment's status is stuck on Creating:

status:
  serviceStatus:
    seldon-d8f6b0ae87be7b12b26309f8931d246b:
      httpEndpoint: seldon-d8f6b0ae87be7b12b26309f8931d246b.namespace:9000
      svcName: seldon-d8f6b0ae87be7b12b26309f8931d246b
    seldon-e4bb6ba3b22cf76a2235e23da7d732df:
      httpEndpoint: seldon-e4bb6ba3b22cf76a2235e23da7d732df.namespace:9000
      svcName: seldon-e4bb6ba3b22cf76a2235e23da7d732df
  state: Creating

If we change the environment variable's definition to:

- name: JAEGER_AGENT_HOST
  value: "192.168.0.1"

then status changes to Available.

Our main problem is that the model service does not start at all while the status is stuck at Creating. The weird thing is that even though the SeldonDeployment's status is Creating, the underlying Deployment and pods start successfully.

We use seldon core operator version 0.4.0 and our model image starts from "seldonio/seldon-core-s2i-python3:0.13".
Any help on how to solve this would be appreciated.

@ukclivecox
Contributor

Do you add any extra Volumes to your Pod when it gets stuck, or is the env the only change?

You could check the Seldon manager logs to see whether it thinks it needs to keep reconciling the Deployment, which could be why this is stuck.

@bbarn3y
Author

bbarn3y commented Dec 6, 2019

Only the envs change. For Jaeger, though, we actually have to specify it twice: once in "spec.predictors.componentSpecs.spec.containers.env", and also in "spec.predictors.svcOrchSpec.env". If the environment variable is specified in either of them in the aforementioned way, the SeldonDeployment gets stuck in Creating.

I think you are right: I found an error log with the message "Reconciler error":

{"level":"error","ts":1575647169.2801647,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"seldondeployment-controller","request":"namespace/simplemodelm","error":"Operation cannot be fulfilled on seldondeployments.machinelearning.seldon.io \"simplemodelm\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/seldonio/seldon-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/seldonio/seldon-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/seldonio/seldon-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/seldonio/seldon-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/seldonio/seldon-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/seldonio/seldon-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/seldonio/seldon-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/seldonio/seldon-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/seldonio/seldon-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/seldonio/seldon-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/seldonio/seldon-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/seldonio/seldon-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

The relevant error seems to be "Operation cannot be fulfilled on seldondeployments.machinelearning.seldon.io "simplemodelm": the object has been modified; please apply your changes to the latest version and try again"

@ukclivecox
Contributor

ukclivecox commented Dec 6, 2019

I think that error is transient and may not be the cause. But if you see the Operator trying to create the Deployment multiple times, that could be the problem. In the past this has been caused by defaults that k8s adds, after which the Operator thinks the Deployment has changed.

It could be the apiVersion field of the fieldRef, described in https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.15/#objectfieldselector-v1-core
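
A minimal sketch of the mismatch suspected here, assuming the API server fills in the documented default of v1 for the fieldRef's apiVersion when it stores the Deployment. The two fragments are illustrative, not output captured from the cluster in this report:

```yaml
# As written in the SeldonDeployment (no apiVersion on the fieldRef):
- name: JAEGER_AGENT_HOST
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP

# As the same entry would read back from the stored Deployment,
# assuming Kubernetes applies the default apiVersion:
- name: JAEGER_AGENT_HOST
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.hostIP
```

If the Operator compares the two forms field by field, they differ, which would be consistent with it deciding on every pass that the Deployment has changed.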

@ukclivecox ukclivecox added the bug label Dec 6, 2019
@ukclivecox ukclivecox added this to the 1.0 milestone Dec 6, 2019
@ukclivecox ukclivecox self-assigned this Dec 6, 2019
@ukclivecox
Contributor

Can you test by adding the apiVersion, set to its default of v1, and see if that fixes it?

@bbarn3y
Author

bbarn3y commented Dec 6, 2019

Sorry, maybe I misunderstood. I tried modifying the SeldonDeployment's apiVersion to v1 or machinelearning.seldon.io/v1, but I got the following error:

The edited file had a syntax error: unable to recognize "edited-file": no matches for kind "SeldonDeployment" in version "machinelearning.seldon.io/v1"

In our yaml it's set to "machinelearning.seldon.io/v1alpha2", or am I supposed to put apiVersion somewhere else as well?

This is the whole yaml:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  creationTimestamp: "2019-12-06T15:19:12Z"
  generation: 8
  labels:
    app: seldon
    namespace/component: namespace-modeldeployment
    namespace/deploymenttype: namespace-single-model
    namespace/templateversion: "0.1"
  name: simplemodelm
  namespace: namespace
  resourceVersion: "11167588"
  selfLink: /apis/machinelearning.seldon.io/v1alpha2/namespaces/namespace/seldondeployments/simplemodelm
  uid: c1db7a15-183b-11ea-94e4-00155d28dd27
spec:
  annotations:
    deployment_version: v1
    project_name: simplemodelm
  name: simplemodelm
  predictors:
  - annotations:
      predictor_version: v1
    componentSpecs:
    - spec:
        containers:
        - env:
          - name: TRACING
            value: "0"
          - name: JAEGER_AGENT_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: JAEGER_AGENT_PORT
            value: "5775"
          - name: JAEGER_SAMPLER_TYPE
            value: const
          - name: JAEGER_SAMPLER_PARAM
            value: "1"
          - name: JAEGER_EXTRA_TAGS
            value: data, json, form, args, values
          image: 10.102.239.242:5000/simplemodelm:0.1.9
          livenessProbe:
            failureThreshold: 6
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: http
            timeoutSeconds: 1
          name: simplemodelm
          readinessProbe:
            failureThreshold: 6
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: http
            timeoutSeconds: 1
          volumeMounts: []
        imagePullSecrets:
        - name: namespace-model-service
        terminationGracePeriodSeconds: 1
        volumes: []
    graph:
      children: []
      endpoint:
        service_host: localhost
        service_port: 9000
        type: REST
      implementation: UNKNOWN_IMPLEMENTATION
      name: simplemodelm
      type: MODEL
    labels:
      version: simplemodelm
    name: simplemodelm
    replicas: 1
    svcOrchSpec:
      env:
      - name: TRACING
        value: "0"
      - name: JAEGER_AGENT_PORT
        value: "5775"
      - name: JAEGER_SAMPLER_TYPE
        value: const
      - name: JAEGER_SAMPLER_PARAM
        value: "1"
      - name: JAEGER_EXTRA_TAGS
        value: data, json, form, args, values
status:
  deploymentStatus:
    simplemodelm-simplemodelm-b2876dc:
      availableReplicas: 1
      replicas: 1
  serviceStatus:
    seldon-d8f6b0ae87be7b12b26309f8931d246b:
      httpEndpoint: seldon-d8f6b0ae87be7b12b26309f8931d246b.namespace:9000
      svcName: seldon-d8f6b0ae87be7b12b26309f8931d246b
    simplemodelm-simplemodelm-simplemodelm:
      grpcEndpoint: simplemodelm-simplemodelm-simplemodelm.namespace:5001
      httpEndpoint: simplemodelm-simplemodelm-simplemodelm.namespace:8000
      svcName: simplemodelm-simplemodelm-simplemodelm
  state: Creating

@ukclivecox
Contributor

No. Here is an example I tested that works:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: seldon-model
spec:
  name: test-deployment
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          imagePullPolicy: IfNotPresent
          name: classifier
          env:
          - name: JAEGER_AGENT_HOST
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: status.hostIP
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    name: example
    replicas: 1

We'll look into fixing the bug.
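
Applied to the env entry from the report above, the same workaround would presumably look like the sketch below. Only the apiVersion lines are new. The svcOrchSpec entry is an assumption based on the reporter's note that the variable has to be set in both places; it was not part of the tested example.

```yaml
predictors:
- componentSpecs:
  - spec:
      containers:
      - name: simplemodelm
        env:
        - name: JAEGER_AGENT_HOST
          valueFrom:
            fieldRef:
              apiVersion: v1        # explicit default, matching what k8s would add
              fieldPath: status.hostIP
  svcOrchSpec:
    env:
    - name: JAEGER_AGENT_HOST
      valueFrom:
        fieldRef:
          apiVersion: v1            # assumed to be needed here as well
          fieldPath: status.hostIP
```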

@bbarn3y
Author

bbarn3y commented Dec 6, 2019

Yep, that fixed it, thanks!
